
I want to parse a character vector in R, stored in a data frame constructed similarly to the one below:

a <- c("abc def. ghi jkl mno pqr", "stu vwx.", "yza bcd. efg hij mno klm", " nop qrs.", "tuv wxy.")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
df <- data.frame(a, b, stringsAsFactors = FALSE)

df
                         a     b
1 abc def. ghi jkl mno pqr  TRUE
2                 stu vwx. FALSE
3 yza bcd. efg hij mno klm  TRUE
4                 nop qrs. FALSE
5                 tuv wxy. FALSE
> str(df)
'data.frame':   5 obs. of  2 variables:
 $ a: chr  "abc def. ghi jkl mno pqr" "stu vwx." "yza bcd. efg hij mno klm" " nop qrs." ...
 $ b: logi  TRUE FALSE TRUE FALSE FALSE

I want to create a new variable, c, that returns NA in all cases where df$b == FALSE and, in all cases where df$b == TRUE, returns the two words that appear immediately prior to mno. As it happens, in all cases, these two desired words are sandwiched between mno and a period (.). I would ultimately like df$c to look like:

> c
[1] "ghi jkl" NA        "efg hij" NA       
[5] NA       
> str(c)
 chr [1:5] "ghi jkl" NA "efg hij" NA NA

I've been able to extract the words between two keywords using:

df$c <- ifelse(df$b == TRUE, str_sub(str_extract(df$a, "(?<=\\bdef).+?.(\\bmno)")), NA)

df
                            a     b
1    abc def. ghi jkl mno pqr  TRUE
2                    stu vwx. FALSE
3 yzab cdef. ghi jkl mno mnop  TRUE
4                    qrs tuv. FALSE
5                    wxy zab. FALSE
              c
1 . ghi jkl mno
2          <NA>
3          <NA>
4          <NA>
5          <NA>

But it does not work with punctuation:

df$c <- ifelse(df$b == TRUE, str_sub(str_extract(df$a, "(?<=\\b.).+?.(\\bmno)"), end = -5L), NA)

df
                            a     b
1    abc def. ghi jkl mno pqr  TRUE
2                    stu vwx. FALSE
3 yzab cdef. ghi jkl mno mnop  TRUE
4                    qrs tuv. FALSE
5                    wxy zab. FALSE
                  c
1   bc def. ghi jkl
2              <NA>
3 zab cdef. ghi jkl
4              <NA>
5              <NA>

I am somewhat new to R and don't fully understand regular expressions. How do I extract just the two words between . and mno?

Thanks for your help!

EDIT

I've also tried to count words backwards from mno using gsub with:

> df$c <- ifelse(df$b == TRUE, gsub("(\\w+\\s)*(\\w+)\\smno.*","\\1\\2", df$a), NA)
> df
                            a     b
1    abc def. ghi jkl mno pqr  TRUE
2                    stu vwx. FALSE
3 yzab cdef. ghi jkl mno mnop  TRUE
4                    qrs tuv. FALSE
5                    wxy zab. FALSE
                   c
1   abc def. ghi jkl
2               <NA>
3 yzab cdef. jkl mno
4               <NA>
5               <NA>

Though this has worked for me in the past, here it seems to just return everything before mno. I've also been able to trim my results in the past using start = and end =, but here I would need to count words, as opposed to characters, to utilize that approach. Is there a way to trim my results by counting words instead of by counting characters?

  • If it's the case that b is true iff there's a regex match, there's transform(df, c = sapply(regmatches(a, regexec(".*?(\\w+ \\w+) mno .*", a)), `[`, 2)) Commented Aug 29, 2017 at 18:38

1 Answer

We can use sub: match any leading characters (.*), capture two words separated by one or more whitespace characters as a group, followed by whitespace and 'mno', and then replace the whole string with the backreference to that group. Wrapping this in ifelse sets the rows where b is FALSE to NA.

df$c <-  with(df, ifelse(b, sub(".*\\b(\\w+\\s+\\w+)\\s+mno\\b.*", "\\1", a), NA))

df$c
#[1] "ghi jkl" NA        "efg hij" NA        NA    
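
To make the pattern easier to read, here is a minimal sketch that rebuilds the same regex from commented pieces and runs it on the sample data from the question. The variable names pattern and res are only illustrative, and the data frame is built with data.frame() directly rather than with cbind().

```r
# Sample data from the question
a <- c("abc def. ghi jkl mno pqr", "stu vwx.", "yza bcd. efg hij mno klm",
       " nop qrs.", "tuv wxy.")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
df <- data.frame(a, b, stringsAsFactors = FALSE)

# The pattern used above, assembled piece by piece
pattern <- paste0(
  ".*",              # any leading text (greedy)
  "\\b",             # word boundary, so the capture starts at the start of a word
  "(\\w+\\s+\\w+)",  # group 1: two words separated by whitespace
  "\\s+mno\\b",      # whitespace, then 'mno' as a whole word
  ".*"               # any trailing text
)

# Replace the whole string with group 1 where b is TRUE, NA elsewhere
res <- with(df, ifelse(b, sub(pattern, "\\1", a), NA))
res
```

Note that sub returns the input unchanged when the pattern does not match, so a row with b == TRUE but no 'mno' would keep its full original string rather than become NA.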

Or as @Frank mentioned, replace can be used as well

with(df, replace(sub(".*\\b(\\w+\\s+\\w+)\\s+mno\\b.*", "\\1", a), !b, NA))
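
The edit to the question also asks whether results can be trimmed by counting words instead of characters. One possible sketch, not part of the original answer, is to split each string on whitespace and take the n words before the keyword. The helper name words_before and its arguments keyword and n are hypothetical, introduced only for illustration.

```r
# Sample data from the question
a <- c("abc def. ghi jkl mno pqr", "stu vwx.", "yza bcd. efg hij mno klm",
       " nop qrs.", "tuv wxy.")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
df <- data.frame(a, b, stringsAsFactors = FALSE)

# Hypothetical helper: return the n whitespace-separated words that come
# immediately before the first exact occurrence of `keyword`, or NA if the
# keyword is missing or has fewer than n words in front of it.
words_before <- function(x, keyword = "mno", n = 2) {
  vapply(strsplit(trimws(x), "\\s+"), function(words) {
    i <- match(keyword, words)            # position of the keyword among the words
    if (is.na(i) || i <= n) return(NA_character_)
    paste(words[(i - n):(i - 1)], collapse = " ")
  }, character(1))
}

df$c <- ifelse(df$b, words_before(df$a), NA)
df$c
```

Because the split is on whitespace only, punctuation stays attached to its word (for example "def." keeps its period), and the keyword must appear as a standalone token for match to find it.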

2 Comments

Thanks for your help @Frank and @akun. I appreciate your insights. However, I seem to be getting different results than you did: > df$c <- with(df, ifelse(b, sub(".*\\b((\\w+\\s*){2})\\smno\\b.*", "\\1", a), NA)) > df$c [1] "jkl" NA "hij" NA NA . Any idea why I'm getting only the second of the two words you are? @Frank's solution worked on my sample data set; but, on my real data set, I got the following error: Error: unexpected '=' in "transform(superbowl, superbowl$Fumbler =" where df == superbowl and df$c == superbowl$Fumbler. Thanks for the help!
@DataProphets Not sure what happened. I had edited my first option into a more compact form (as far as I remember it was working; did you change your example in any way?). I have reverted to the earlier regex, which works for the example.
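
A likely explanation for both problems in the first comment, sketched on one row of the sample data. In the compact pattern, \\s* may match zero characters, so the two repetitions of (\\w+\\s*) can be satisfied by a single word split in two ("jk" then "l"), and the greedy leading .* takes everything before it. The transform error looks like a parse error: argument names in transform must be bare column names, not superbowl$Fumbler. The superbowl data frame below is a hypothetical stand-in with assumed columns a and b, since the real data is not shown.

```r
x <- "abc def. ghi jkl mno pqr"

# Compact pattern from the comment: returns only the second word
compact <- sub(".*\\b((\\w+\\s*){2})\\smno\\b.*", "\\1", x)

# Reverted pattern: \\s+ forces real whitespace between the two words
reverted <- sub(".*\\b(\\w+\\s+\\w+)\\s+mno\\b.*", "\\1", x)

compact   # "jkl"
reverted  # "ghi jkl"

# Hypothetical stand-in for the real data set
superbowl <- data.frame(a = x, b = TRUE, stringsAsFactors = FALSE)

# Name the new column directly: `Fumbler = ...`, not `superbowl$Fumbler = ...`
superbowl <- transform(
  superbowl,
  Fumbler = ifelse(b, sub(".*\\b(\\w+\\s+\\w+)\\s+mno\\b.*", "\\1", a), NA)
)
```

An equivalent alternative is plain assignment, superbowl$Fumbler <- ..., which avoids transform altogether.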
