I want to parse a character vector in R, in a data frame constructed similarly to the one below:
a <- c("abc def. ghi jkl mno pqr", "stu vwx.", "yza bcd. efg hij mno klm", " nop qrs.", "tuv wxy.")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
df <- as.data.frame(cbind(a, b))
df$a <- as.character(df$a)
df$b <- as.logical(df$b)
df
a b
1 abc def. ghi jkl mno pqr TRUE
2 stu vwx. FALSE
3 yza bcd. efg hij mno klm TRUE
4 nop qrs. FALSE
5 tuv wxy. FALSE
> str(df)
'data.frame': 5 obs. of 2 variables:
$ a: chr "abc def. ghi jkl mno pqr" "stu vwx." "yza bcd. efg hij mno klm" " nop qrs." ...
$ b: logi TRUE FALSE TRUE FALSE FALSE
I want to create a new variable, c, that returns NA in all cases where df$b == FALSE and, in all cases where df$b == TRUE, returns the two words that appear immediately prior to mno. As it happens, in all cases, these two desired words are sandwiched between mno and a period (.). I would ultimately like df$c to look like:
> c
[1] "ghi jkl" NA "efg hij" NA
[5] NA
> str(c)
chr [1:5] "ghi jkl" NA "efg hij" NA NA
I've been able to extract the words between two keywords using:
df$c <- ifelse(df$b == TRUE, str_sub(str_extract(df$a, "(?<=\\bdef).+?.(\\bmno)")), NA)
df
a b
1 abc def. ghi jkl mno pqr TRUE
2 stu vwx. FALSE
3 yzab cdef. ghi jkl mno mnop TRUE
4 qrs tuv. FALSE
5 wxy zab. FALSE
c
1 . ghi jkl mno
2 <NA>
3 <NA>
4 <NA>
5 <NA>
But it does not work with punctuation:
df$c <- ifelse(df$b == TRUE, str_sub(str_extract(df$a, "(?<=\\b.).+?.(\\bmno)"), end = -5L), NA)
df
a b
1 abc def. ghi jkl mno pqr TRUE
2 stu vwx. FALSE
3 yzab cdef. ghi jkl mno mnop TRUE
4 qrs tuv. FALSE
5 wxy zab. FALSE
c
1 bc def. ghi jkl
2 <NA>
3 zab cdef. ghi jkl
4 <NA>
5 <NA>
I am somewhat new to R, and don't fully understand regular expressions. How do I extract just the two words between the period (.) and mno?
Thanks for your help!
EDIT
I've also tried to count words backwards from mno using gsub with:
> df$c <- ifelse(df$b == TRUE, gsub("(\\w+\\s)*(\\w+)\\smno.*","\\1\\2", df$a), NA)
> df
a b
1 abc def. ghi jkl mno pqr TRUE
2 stu vwx. FALSE
3 yzab cdef. ghi jkl mno mnop TRUE
4 qrs tuv. FALSE
5 wxy zab. FALSE
c
1 abc def. ghi jkl
2 <NA>
3 yzab cdef. jkl mno
4 <NA>
5 <NA>
Though this has worked for me in the past, here it seems to just return everything before mno. I've also been able to trim my results in the past using start = and end =, but here I would need to count words, rather than characters, to use that approach. Is there a way to trim my results by counting words instead of characters?
If b is TRUE if and only if there's a regex match, there's:
transform(df, c = sapply(regmatches(a, regexec(".*?(\\w+ \\w+) mno .*", a)), `[`, 2))
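The one-liner above relies on b being TRUE exactly when the pattern matches. If that can't be guaranteed, a minimal base R sketch that gates on b explicitly might look like the following. The helper names two_before_regex and two_before_words are hypothetical (they are not from the question or from any package). The first uses a lookahead so the match stops right before mno; the second answers the "count words instead of characters" part by splitting on whitespace.

```r
# Example data from the question.
a <- c("abc def. ghi jkl mno pqr", "stu vwx.", "yza bcd. efg hij mno klm",
       " nop qrs.", "tuv wxy.")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
df <- data.frame(a, b, stringsAsFactors = FALSE)

# Hypothetical helper 1: lookahead regex.
# Matches two words (\w+ \w+) that are immediately followed by whitespace
# and the whole word `keyword`. Because \w does not match ".", the match
# cannot reach back across the period.
two_before_regex <- function(x, keyword = "mno") {
  pattern <- paste0("\\w+\\s+\\w+(?=\\s+", keyword, "\\b)")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  hit <- !is.na(m) & m > -1L
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)            # returns matched strings only
  out
}

# Hypothetical helper 2: count words instead of characters.
# Splits on whitespace, finds the first token equal to `keyword`, and keeps
# the n tokens before it. Note: this does not check for the period, so
# punctuation attached to a token (e.g. "def.") is kept as-is.
two_before_words <- function(x, keyword = "mno", n = 2) {
  vapply(strsplit(x, "\\s+"), function(words) {
    i <- match(keyword, words)
    if (is.na(i) || i <= n) return(NA_character_)
    paste(words[(i - n):(i - 1)], collapse = " ")
  }, character(1))
}

# Gate on b explicitly: NA wherever b is FALSE, whatever the text contains.
df$c <- ifelse(df$b, two_before_regex(df$a), NA_character_)
df$c
# [1] "ghi jkl" NA        "efg hij" NA        NA
```

With the sample data both helpers give the same result. The regex version is stricter about what counts as a word, while the word-counting version is easier to adapt if you later need three or four words (change n).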