Find in R elements in same text vector that contain two substrings [duplicate]

Question

I have a text vector with five elements named text2. It is a sample of an actual dataset with over 1,800 rows and multiple columns.

I have reviewed other code solutions on Stack Overflow and could not find a match.

Input

text2 <- c("Ian Desmond hits an inside-the-park home run (8) on a line drive down the right-field line. Brendan Rodgers scores. Tony Wolters scores." , "Ian Desmond lines out sharply to center fielder Jason Heyward.", "Ian Desmond hits a grand slam (9) to right center field. Charlie Blackmon scores. Trevor Story scores. David Dahl scores.", "Ian Desmond homers (12) on a fly ball to center field. Daniel Murphy scores.", "Wild pitch by pitcher Jake Faria. Sam Hilliard scores.")

Output

I want to know which elements in text2 contain both "Wild pitch" and "scores." I would like both the count and the element numbers. For example, in text2 only one element (the last one) is a match. Thus, the output would contain both the count (1) and the element number (5).

Code tried

str_detect(text2, ("Wild pitch|scores"))

kath · Accepted Answer · 2020-07-04 13:54:15Z

2

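You're on the right track. However, str_detect(text2, ("Wild pitch|scores")) tells you whether each element contains "Wild pitch" OR "scores", because | is regex alternation. To require both, run one str_detect per substring and combine the results with &. This gives you your desired output: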

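library(stringr)  # provides str_detect()

# TRUE where an element contains both substrings
ind <- str_detect(text2, "Wild pitch") & str_detect(text2, "scores")
count <- sum(ind)  # number of matching elements
count
# 1
pos <- which(ind)  # positions of the matching elements
pos
# 5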
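The same idea can be sketched in base R without loading any packages. This is only an illustration: `plays` is an abbreviated stand-in for text2, and `match_count` and `match_pos` are names chosen here, not from the question. Passing fixed = TRUE makes grepl treat each pattern as a literal string, which matters if a substring contains regex metacharacters such as "." or "(".

```r
# Abbreviated stand-in for text2 from the question (same match structure).
plays <- c(
  "Ian Desmond hits an inside-the-park home run (8). Tony Wolters scores.",
  "Ian Desmond lines out sharply to center fielder Jason Heyward.",
  "Ian Desmond hits a grand slam (9). Trevor Story scores.",
  "Ian Desmond homers (12). Daniel Murphy scores.",
  "Wild pitch by pitcher Jake Faria. Sam Hilliard scores."
)

# One logical vector per substring, combined element-wise with &.
both <- grepl("Wild pitch", plays, fixed = TRUE) &
  grepl("scores", plays, fixed = TRUE)

match_count <- sum(both)    # number of matching elements
match_pos   <- which(both)  # their positions in the vector

match_count
# [1] 1
match_pos
# [1] 5
```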

answered Jul 4, 2020 at 13:54

kath


Maël · Accepted Answer · 2020-07-04 14:38:18Z

1

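A dplyr solution in a single pipeline:

library(dplyr)
library(tidyr)    # provides unnest()
library(stringr)  # provides str_detect()

text2 %>% 
  as_tibble() %>% 
  mutate(WP = str_detect(text2, "Wild pitch"),   # TRUE if "Wild pitch" is present
         S = str_detect(text2, "scores")) %>%    # TRUE if "scores" is present
  summarise(count = sum(WP==T & S==T),
            position = list(which(WP==T & S==T))) %>% 
  unnest(cols = c(position))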

Which gives:

# A tibble: 1 x 2
  count position
  <int>    <int>
1     1        5
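Since the real dataset is described as having multiple columns, here is a rough base R sketch of the same check applied to one column of a data frame. The data frame `plays_df` and its column names `game_id` and `description` are made up for illustration; substitute the actual column that holds the play text.

```r
# Hypothetical data frame: `game_id` and `description` are invented names,
# and the rows are abbreviated sentences from the question.
plays_df <- data.frame(
  game_id = c(101, 101, 102, 102, 103),
  description = c(
    "Ian Desmond hits an inside-the-park home run (8). Tony Wolters scores.",
    "Ian Desmond lines out sharply to center fielder Jason Heyward.",
    "Ian Desmond hits a grand slam (9). Trevor Story scores.",
    "Ian Desmond homers (12). Daniel Murphy scores.",
    "Wild pitch by pitcher Jake Faria. Sam Hilliard scores."
  ),
  stringsAsFactors = FALSE
)

# TRUE for rows whose description contains both substrings.
hit <- grepl("Wild pitch", plays_df$description, fixed = TRUE) &
  grepl("scores", plays_df$description, fixed = TRUE)

row_count <- sum(hit)      # how many rows match
row_pos   <- which(hit)    # which row numbers match

# Keep every column of the matching rows.
matched_rows <- plays_df[hit, , drop = FALSE]
matched_rows
```

Indexing with the logical vector keeps the other columns alongside the matching text, which is usually what is wanted when the text is one field among several.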

edited Jul 4, 2020 at 14:38

answered Jul 4, 2020 at 14:22

Maël


4 Comments

Metsfan Over a year ago

When I ran your code I got this error: Error in unnest(., cols = c(position)) : could not find function "unnest"

Maël Over a year ago

That's because unnest() is in the tidyr package; I edited my post ;)

Metsfan Over a year ago

In summarise(count=sum(WP==T & S==T) where are the values of 'T' coming from? What are WP and S being made equal to?

Maël Over a year ago

T is shorthand for TRUE. WP and S are the logical columns created in mutate(), so count=sum(WP==T & S==T) counts the elements that contain both "Wild pitch" and "scores".
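A small sketch of the point about T, using made-up logical vectors `wp` and `s` in place of the WP and S columns. It shows that comparing a logical vector with TRUE is redundant, and that T is an ordinary variable that can be reassigned, whereas TRUE is a reserved word.

```r
# Made-up stand-ins for the WP and S columns.
wp <- c(FALSE, TRUE, TRUE)
s  <- c(TRUE, FALSE, TRUE)

# Comparing a logical vector with TRUE returns the same vector,
# so `wp == TRUE & s == TRUE` is equivalent to `wp & s`.
same_result <- identical(wp == TRUE & s == TRUE, wp & s)
same_result
# [1] TRUE

# T can be overwritten, which silently changes the meaning of `== T`.
T <- FALSE
shadowed_count <- sum(wp == T & s == T)  # now counts elements where both are FALSE
plain_count    <- sum(wp & s)            # unaffected by the reassignment
rm(T)                                    # drop the local T so base R's T is visible again

shadowed_count
# [1] 0
plain_count
# [1] 1
```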

Ronak Shah · Accepted Answer · 2020-07-04 14:27:09Z

0

You can use a pattern that spells out both possible orders:

pattern <- 'Wild pitch.*scores|scores.*Wild pitch'

To find the positions, you can use grep:

grep(pattern, text2)
#[1] 5

For the count, take the length of the grep result:

length(grep(pattern, text2))
#[1] 1
#Can also use grepl with sum
#sum(grepl(pattern, text2))
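With an alternation pattern, every ordering has to be written out, which grows quickly beyond two substrings. One possible way around that is to test each substring separately and AND the results. The helper name `contains_all` below is hypothetical (it is not a base R function), and `plays` is a short sample built from sentences in the question.

```r
# Hypothetical helper: TRUE for each element of x that contains every substring.
contains_all <- function(x, substrings) {
  # One logical vector per substring (literal matching via fixed = TRUE) ...
  hits <- lapply(substrings, function(s) grepl(s, x, fixed = TRUE))
  # ... then AND them together element-wise.
  Reduce(`&`, hits)
}

plays <- c(
  "Ian Desmond homers (12) on a fly ball to center field. Daniel Murphy scores.",
  "Wild pitch by pitcher Jake Faria. Sam Hilliard scores.",
  "Ian Desmond lines out sharply to center fielder Jason Heyward."
)

matches <- contains_all(plays, c("Wild pitch", "scores", "Hilliard"))

which(matches)
# [1] 2
sum(matches)
# [1] 1
```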

answered Jul 4, 2020 at 14:27

Ronak Shah


Chris Ruehlemann · Accepted Answer · 2020-07-04 15:48:45Z

0

A solution with a positive lookahead, returning the count followed by the position:

res <- c(length(grep("(?=Wild pitch).*scores", text2, perl = TRUE)), 
         grep("(?=Wild pitch).*scores", text2, perl = TRUE))

res
# [1] 1 5

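The pattern above only matches when "Wild pitch" comes before "scores". If the order in which "Wild pitch" and "scores" occur can vary, anchor two lookaheads at the start of the string so that each one scans the whole element:

"^(?=.*Wild pitch)(?=.*scores)"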

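A quick check of an order-independent lookahead, assuming the common form ^(?=.*A)(?=.*B): each lookahead starts at the beginning of the string and succeeds only if its substring appears somewhere in it. The test vector `x` is a sketch; its second and fourth sentences are made up for illustration.

```r
# Both lookaheads are anchored at the start, so the order of the substrings
# inside the string does not matter. Lookaheads require perl = TRUE.
pattern <- "^(?=.*Wild pitch)(?=.*scores)"

x <- c(
  "Wild pitch by pitcher Jake Faria. Sam Hilliard scores.",                        # "Wild pitch" first
  "Sam Hilliard scores on a Wild pitch by Jake Faria.",                            # made up: "scores" first
  "Ian Desmond homers (12) on a fly ball to center field. Daniel Murphy scores.",  # "scores" only
  "Wild pitch by pitcher Jake Faria."                                              # made up: "Wild pitch" only
)

both <- grepl(pattern, x, perl = TRUE)
both
# [1]  TRUE  TRUE FALSE FALSE

# Count followed by positions, in the same shape as `res` in the answer.
c(sum(both), which(both))
# [1] 2 1 2
```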
edited Jul 4, 2020 at 15:48

answered Jul 4, 2020 at 14:59

Chris Ruehlemann
