Parsing Character String in R

Question

I want to parse a character vector in R in a dataframe constructed similarly to the one below:

a <- c("abc def. ghi jkl mno pqr", "stu vwx.", "yza bcd. efg hij mno klm", " nop qrs.", "tuv wxy.")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
df <- as.data.frame(cbind(a, b))
df$a <- as.character(df$a)
df$b <- as.logical(df$b)

df
                         a     b
1 abc def. ghi jkl mno pqr  TRUE
2                 stu vwx. FALSE
3 yza bcd. efg hij mno klm  TRUE
4                 nop qrs. FALSE
5                 tuv wxy. FALSE
> str(df)
'data.frame':   5 obs. of  2 variables:
 $ a: chr  "abc def. ghi jkl mno pqr" "stu vwx." "yza bcd. efg hij mno klm" " nop qrs." ...
 $ b: logi  TRUE FALSE TRUE FALSE FALSE

I want to create a new variable, c, that returns NA in all cases where df$b == FALSE and, in all cases where df$b == TRUE, returns the two words that appear immediately prior to mno. As it happens, in all cases, these two desired words are sandwiched between mno and a period (.). I would ultimately like df$c to look like:

> c
[1] "ghi jkl" NA        "efg hij" NA       
[5] NA       
> str(c)
 chr [1:5] "ghi jkl" NA "efg hij" NA NA

I've been able to extract the words between two keywords using:

library(stringr)  # provides str_extract() and str_sub()
df$c <- ifelse(df$b == TRUE, str_sub(str_extract(df$a, "(?<=\\bdef).+?.(\\bmno)")), NA)

df
                            a     b
1    abc def. ghi jkl mno pqr  TRUE
2                    stu vwx. FALSE
3 yzab cdef. ghi jkl mno mnop  TRUE
4                    qrs tuv. FALSE
5                    wxy zab. FALSE
              c
1 . ghi jkl mno
2          <NA>
3          <NA>
4          <NA>
5          <NA>

But it does not work with punctuation:

df$c <- ifelse(df$b == TRUE, str_sub(str_extract(df$a, "(?<=\\b.).+?.(\\bmno)"), end = -5L), NA)

df
                            a     b
1    abc def. ghi jkl mno pqr  TRUE
2                    stu vwx. FALSE
3 yzab cdef. ghi jkl mno mnop  TRUE
4                    qrs tuv. FALSE
5                    wxy zab. FALSE
                  c
1   bc def. ghi jkl
2              <NA>
3 zab cdef. ghi jkl
4              <NA>
5              <NA>

I am somewhat new to R, and don't fully understand regular expressions. How do I extract just the two words between . and mno?

Thanks for your help!

EDIT

I've also tried to count words backwards from mno using gsub with:

> df$c <- ifelse(df$b == TRUE, gsub("(\\w+\\s)*(\\w+)\\smno.*","\\1\\2", df$a), NA)
> df
                            a     b
1    abc def. ghi jkl mno pqr  TRUE
2                    stu vwx. FALSE
3 yzab cdef. ghi jkl mno mnop  TRUE
4                    qrs tuv. FALSE
5                    wxy zab. FALSE
                   c
1   abc def. ghi jkl
2               <NA>
3 yzab cdef. jkl mno
4               <NA>
5               <NA>

Though this has worked for me in the past, here it seems to just return everything before mno. I've also been able to trim my results in the past using start = and end =, but here I would need to count words, as opposed to characters, to utilize that approach. Is there a way to trim my results by counting words instead of by counting characters?

If it's the case that b is true iff there's a regex match, there's transform(df, c = sapply(regmatches(a, regexec(".*?(\\w+ \\w+) mno .*", a)), `[`, 2))
– Frank, Commented Aug 29, 2017 at 18:38

akrun · Accepted Answer · 2017-08-31 03:54:35Z

3

We can use sub to match any leading characters (.*), capture a word, one or more whitespace characters, and a second word as a group, followed by whitespace and 'mno', and then replace the whole match with the backreference to that group. Wrapping this in ifelse sets the rows where b is FALSE to NA.

df$c <-  with(df, ifelse(b, sub(".*\\b(\\w+\\s+\\w+)\\s+mno\\b.*", "\\1", a), NA))

df$c
#[1] "ghi jkl" NA        "efg hij" NA        NA
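
One caveat worth checking on real data: sub() returns its input unchanged when the pattern does not match, so a row where b is TRUE but 'mno' is absent would keep the whole string rather than become NA. A minimal sketch of a guard, assuming the sample vector a from the question (the names pattern and extracted are only illustrative):

```r
a <- c("abc def. ghi jkl mno pqr", "stu vwx.", "yza bcd. efg hij mno klm",
       " nop qrs.", "tuv wxy.")

# Same pattern as above: two words followed by whitespace and the word 'mno'
pattern <- ".*\\b(\\w+\\s+\\w+)\\s+mno\\b.*"

# With no match, sub() hands back the original string untouched
no_match <- sub(pattern, "\\1", "stu vwx.")

# Use grepl() as the test so that non-matching rows become NA
# even if the logical column b is missing or unreliable
extracted <- ifelse(grepl(pattern, a), sub(pattern, "\\1", a), NA)
```

On the sample data the grepl() test and column b select the same rows, so this only matters when the two can disagree.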

Or as @Frank mentioned, replace can be used as well

with(df, replace(sub(".*\\b(\\w+\\s+\\w+)\\s+mno\\b.*", "\\1", a), !b, NA))
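
On the question of trimming by words rather than by characters, one possible base R approach is to split each string into words and index backwards from the keyword. This is only a sketch, not part of the answer above: words_before is a hypothetical helper, and because it drops punctuation it would also count words on the far side of a period.

```r
a <- c("abc def. ghi jkl mno pqr", "stu vwx.", "yza bcd. efg hij mno klm",
       " nop qrs.", "tuv wxy.")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

# Hypothetical helper: the n words immediately before the first
# exact occurrence of `keyword` in each string, or NA if unavailable
words_before <- function(x, keyword, n = 2) {
  vapply(x, function(s) {
    # Split into runs of word characters; punctuation and spaces are dropped
    words <- regmatches(s, gregexpr("\\w+", s))[[1]]
    pos <- match(keyword, words)  # index of the first exact match, or NA
    if (is.na(pos) || pos <= n) return(NA_character_)
    paste(words[(pos - n):(pos - 1)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

res <- ifelse(b, words_before(a, "mno"), NA)
```

Changing n gives a different number of words without having to touch a regex.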

edited Aug 31, 2017 at 3:54

answered Aug 29, 2017 at 18:33

akrun


2 Comments

DataProphets Over a year ago

Thanks for your help @Frank and @akrun. I appreciate your insights. However, I seem to be getting different results than you did:

> df$c <- with(df, ifelse(b, sub(".*\\b((\\w+\\s*){2})\\smno\\b.*", "\\1", a), NA))
> df$c
[1] "jkl" NA    "hij" NA    NA

Any idea why I'm getting only the second of the two words you are? @Frank's solution worked on my sample data set, but on my real data set I got the following error: Error: unexpected '=' in "transform(superbowl, superbowl$Fumbler =" where df == superbowl and df$c == superbowl$Fumbler. Thanks for the help!

akrun Over a year ago

@DataProphets Not sure what happened. I had edited my first option to a more compact one (if I remember correctly, it was working; did you change your example in any way?). I have reverted to the earlier regex, which works for the example.
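
A plausible explanation for the one-word result, offered as a sketch rather than a confirmed diagnosis: in the compact pattern, \\s* allows zero spaces, so the greedy leading .* can consume "ghi " and the group (\\w+\\s*){2} can still match by splitting "jkl" into "jk" and "l". Requiring \\s+ between the two words rules that out. The variable names below are illustrative, and perl = TRUE is an assumption made so that the backtracking description applies; the output reported in the comment above shows the same result with the default engine.

```r
a <- c("abc def. ghi jkl mno pqr", "stu vwx.", "yza bcd. efg hij mno klm",
       " nop qrs.", "tuv wxy.")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

compact  <- ".*\\b((\\w+\\s*){2})\\smno\\b.*"   # zero spaces allowed between "words"
explicit <- ".*\\b(\\w+\\s+\\w+)\\s+mno\\b.*"   # at least one space required

one_word  <- sub(compact,  "\\1", a[1], perl = TRUE)
#[1] "jkl"
two_words <- sub(explicit, "\\1", a[1], perl = TRUE)
#[1] "ghi jkl"

# In transform(), the new column is given as a bare name (c = ...),
# not as df$c = ..., which is a syntax error in an argument list
df <- data.frame(a, b, stringsAsFactors = FALSE)
df <- transform(df, c = ifelse(b, sub(explicit, "\\1", a), NA))
```

By the same rule, the call on the real data would presumably need Fumbler = ... rather than superbowl$Fumbler = ... inside transform().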
