
I have a character vector named text2 with five elements. It is a sample of an actual dataset with over 1,800 rows and multiple columns.

I have reviewed other code solutions on Stack Overflow and could not find a match.

Input

text2 <- c(
  "Ian Desmond hits an inside-the-park home run (8) on a line drive down the right-field line. Brendan Rodgers scores. Tony Wolters scores.",
  "Ian Desmond lines out sharply to center fielder Jason Heyward.",
  "Ian Desmond hits a grand slam (9) to right center field. Charlie Blackmon scores. Trevor Story scores. David Dahl scores.",
  "Ian Desmond homers (12) on a fly ball to center field. Daniel Murphy scores.",
  "Wild pitch by pitcher Jake Faria. Sam Hilliard scores."
)

Output

I want to know which elements in text2 contain both "Wild pitch" and "scores". I would like both the count and the element numbers. For example, in text2 only one element (the last one) is a match, so the output would contain both the count (1) and the element number (5).

Code tried

library(stringr)
str_detect(text2, "Wild pitch|scores")


4 Answers

2

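You're on the right track. However, str_detect(text2, "Wild pitch|scores") tells you whether each element contains "Wild pitch" OR "scores", because | is regex alternation. To require both, run one str_detect per phrase and combine the results with &. This gives you your desired output: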

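library(stringr)

# TRUE only where both phrases occur in the same element
ind <- str_detect(text2, "Wild pitch") & str_detect(text2, "scores")
count <- sum(ind)   # number of matching elements
count
# [1] 1
pos <- which(ind)   # positions of matching elements
pos
# [1] 5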
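The same idea extends to any number of required phrases. Below is a minimal base R sketch, assuming a hypothetical helper named contains_all (it is not part of stringr or base R) and assuming the phrases should be matched as literal text. The three-element sample_text vector is a shortened copy of the question's data, used only to keep the example self-contained.

```r
# Hypothetical helper: TRUE for each element of x that contains every phrase.
# fixed = TRUE treats each phrase as literal text, not as a regular expression.
contains_all <- function(x, phrases) {
  hits <- lapply(phrases, function(p) grepl(p, x, fixed = TRUE))
  Reduce(`&`, hits)  # element-wise AND across all phrases
}

# Shortened sample taken from the question's text2
sample_text <- c(
  "Ian Desmond lines out sharply to center fielder Jason Heyward.",
  "Ian Desmond homers (12) on a fly ball to center field. Daniel Murphy scores.",
  "Wild pitch by pitcher Jake Faria. Sam Hilliard scores."
)

ind <- contains_all(sample_text, c("Wild pitch", "scores"))
sum(ind)    # count of matching elements
which(ind)  # positions of matching elements
```

For a data frame, the text column would be passed in place of sample_text, for example contains_all(df$description, ...), where df and description are placeholder names.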


1

A dplyr solution in a single pipeline:

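library(dplyr)
library(tidyr)    # for unnest()
library(stringr)  # for str_detect()

text2 %>%
  as_tibble() %>%
  # WP and S are logical columns: does each element contain the phrase?
  mutate(WP = str_detect(text2, "Wild pitch"),
         S = str_detect(text2, "scores")) %>%
  # count the rows where both are TRUE and record their positions
  summarise(count = sum(WP == T & S == T),
            position = list(which(WP == T & S == T))) %>%
  unnest(cols = c(position))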

Which gives:

# A tibble: 1 x 2
  count position
  <int>    <int>
1     1        5

4 Comments

When I ran your code I got this error: Error in unnest(., cols = c(position)) : could not find function "unnest"
That's because unnest() is in the tidyr package; I edited my post ;)
In summarise(count=sum(WP==T & S==T)), where are the values of 'T' coming from? What are WP and S being made equal to?
T means TRUE, so count=sum(WP==T & S==T) counts the elements that contain both "Wild pitch" and "scores".
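To illustrate the point about T, here is a small sketch with made-up logical vectors standing in for the WP and S columns. It shows that comparing a logical vector to TRUE is redundant, and that T is an ordinary variable that can be reassigned, whereas TRUE is a reserved word.

```r
# Made-up logical vectors standing in for the WP and S columns
WP <- c(FALSE, TRUE)
S  <- c(TRUE, TRUE)

# Logical vectors can be combined directly; comparing to TRUE adds nothing.
same <- identical(WP == TRUE & S == TRUE, WP & S)
same

# T is an ordinary variable and can be reassigned.
T <- FALSE
shadowed <- sum(WP == T & S == T)  # no longer counts what was intended
shadowed

rm(T)  # remove the reassigned T so that T refers to base R's value again
```

Writing sum(WP & S) avoids both the redundant comparison and the risk of a reassigned T.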
0

You can use a single pattern that covers both orders:

pattern <- 'Wild pitch.*scores|scores.*Wild pitch'

To find the positions, you can use grep:

grep(pattern, text2)
#[1] 5

For the count, you can take the length of the grep result:

length(grep(pattern, text2))
#[1] 1

#Can also use grepl with sum
sum(grepl(pattern, text2))
#[1] 1


0

A one-liner solution with positive lookahead:

# perl = TRUE is required for lookahead; res holds the count, then the position(s)
res <- c(length(grep("(?=Wild pitch).*scores", text2, perl = TRUE)),
         grep("(?=Wild pitch).*scores", text2, perl = TRUE))

res
# [1] 1 5

If "Wild pitch" and "scores" can occur in either order, then use this pattern, in which each lookahead scans the whole string from the start:

"^(?=.*Wild pitch)(?=.*scores)"
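An order-independent check can be sketched with two lookaheads anchored at the start of the string, each of which scans ahead for one phrase. The example below is a minimal sketch in base R; the variable names are placeholders, and the second sentence in sample_text is invented purely to show the reversed order.

```r
# Two anchored lookaheads: each must find its phrase somewhere in the string.
pattern_any_order <- "^(?=.*Wild pitch)(?=.*scores)"

sample_text <- c(
  "Wild pitch by pitcher Jake Faria. Sam Hilliard scores.",
  "Sam Hilliard scores on a Wild pitch.",  # invented sentence, reversed order
  "Ian Desmond homers (12) on a fly ball to center field. Daniel Murphy scores."
)

# perl = TRUE is needed because lookahead is PCRE syntax
hits <- grepl(pattern_any_order, sample_text, perl = TRUE)
sum(hits)    # count of matching elements
which(hits)  # positions of matching elements
```

Here the first two elements match and the third does not, since it contains "scores" but not "Wild pitch".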

