Consider the following reproducible dataset which I created on the basis of the Donald Trump-Tweets dataset (which can be found here):

library(tibble)

df <- tibble(target = c(rep("jeb-bush", 2), rep("jeb-bush-supporters", 2),
                        "jeb-staffer", rep("the-media", 5)),
             tweet_id = seq(1, 10, 1))

It consists of two columns, the target group of the tweets and the tweet_id:

# A tibble: 10 x 2
   target              tweet_id
   <chr>                  <dbl>
 1 jeb-bush                   1
 2 jeb-bush                   2
 3 jeb-bush-supporters        3
 4 jeb-bush-supporters        4
 5 jeb-staffer                5
 6 the-media                  6
 7 the-media                  7
 8 the-media                  8
 9 the-media                  9
10 the-media                 10

Goal:

Whenever an element in target starts with jeb, I want to extract the string pattern after the -. And whenever there are multiple - in an element which starts with jeb, I want to extract the string pattern after the LAST - (which in this example dataset would only be the case for jeb-bush-supporters). For every element that doesn't start with jeb, I just want to create the string other. In the end, it should look like this:

# A tibble: 10 x 3
   target              tweet_id new_var   
   <chr>                  <dbl> <chr>     
 1 jeb-bush                   1 bush      
 2 jeb-bush                   2 bush      
 3 jeb-bush-supporters        3 supporters
 4 jeb-bush-supporters        4 supporters
 5 jeb-staffer                5 staffer   
 6 the-media                  6 other     
 7 the-media                  7 other     
 8 the-media                  8 other     
 9 the-media                  9 other     
10 the-media                 10 other    

What I have tried:

I have actually managed to create the desired output with the following code:

df %>% 
    mutate(new_var = case_when(
        str_detect(target, "^jeb-[a-z]+$") ~
            str_extract(target, "(?<=[a-z]{3}-)[a-z]+"),
        str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
            str_extract(target, "(?<=[a-z]{3}-[a-z]{4}-)[a-z]+"),
        TRUE ~ "other"
    ))

But the problem is this:

In the second str_extract statement, I have to specify the exact number of letters in the positive lookbehind ([a-z]{4}). Otherwise R complains that the lookbehind needs a "bounded maximum length". But what if I don't know the exact pattern length, or if it varies from element to element?

Alternatively, I tried to work with capture groups instead of with "Look Arounds". Therefore, I tried to include str_match to define what I WANT to extract instead of what I DON'T want to extract:

df %>% 
    mutate(new_var = case_when(str_detect(target, "^jeb-[a-z]+$") ~
                             str_match(target, "jeb-([a-z]+)"),
                           str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
                             str_match(target, "jeb-[a-z]+-([a-z]+)"),
                           TRUE ~ "other"))

But then I receive this error message:

Error: Problem with `mutate()` input `new_var`.
x `str_detect(target, "^jeb-[a-z]+$") ~ str_match(target, "jeb-([a-z]+)")`, `str_detect(target, "^jeb-[a-z]+-[a-z]+") ~ str_match(target, 
    "jeb-[a-z]{4}-([a-z]+)")` must be length 10 or one, not 20.
i Input `new_var` is `case_when(...)`.

Question:

Ultimately, I want to know if there is a concise way of extracting specific string patterns in a case_when-statement. How would I work around the problem that I stated here, when I wouldn't be able to use "Look Arounds" (because I can't define a bounded maximum length) nor capture groups (because str_match would return a vector of length 20 and not of the original size 10 or one)?

1 Answer

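One option is to test, inside case_when, whether target starts (^) with the substring 'jeb-'. If it does, extract the run of characters that are not a - ([^-]+) at the end ($) of the string; otherwise (TRUE) return 'other'. Because [^-]+$ is anchored to the end of the string, it always picks up the part after the last -, however many hyphens precede it.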

library(dplyr)
library(stringr)
df %>% 
    mutate(new_var = case_when(
        str_detect(target, '^jeb-') ~ str_extract(target, '[^-]+$'),
        TRUE ~ 'other'
    ))

-output

# A tibble: 10 x 3
#   target              tweet_id new_var   
#   <chr>                  <dbl> <chr>     
# 1 jeb-bush                   1 bush      
# 2 jeb-bush                   2 bush      
# 3 jeb-bush-supporters        3 supporters
# 4 jeb-bush-supporters        4 supporters
# 5 jeb-staffer                5 staffer   
# 6 the-media                  6 other     
# 7 the-media                  7 other     
# 8 the-media                  8 other     
# 9 the-media                  9 other     
#10 the-media                 10 other    
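
For comparison, the same logic can be sketched in base R without dplyr or stringr. This is only an illustration of the technique, not part of the approach above; the vector `target` and the result `new_var` are example names that mirror the columns in the question.

```r
# Example values mirroring the `target` column in the question
target <- c("jeb-bush", "jeb-bush-supporters", "jeb-staffer", "the-media")

# grepl() tests for the "jeb-" prefix; sub() strips everything up to and
# including the last "-" (the greedy ".*" runs as far as the final hyphen).
new_var <- ifelse(grepl("^jeb-", target),
                  sub(".*-", "", target),
                  "other")

new_var
# "bush" "supporters" "staffer" "other"
```

Note that ifelse() evaluates both branches for the whole vector and then picks per element, much like case_when() does.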

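We can also simplify this with str_match and coalesce: str_match returns NA for rows that do not match the pattern, and coalesce replaces those NA values with 'other'.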

df %>% 
   mutate(new_var = coalesce(str_match(target, '^jeb-.*?([^-]+)$')[,2], 'other')) 
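
The `[,2]` matters because str_match() returns a character matrix rather than a vector: column 1 holds the full match and column 2 the first capture group, with NA in rows that do not match. That is presumably also why the case_when() attempt in the question reported length 20 (10 rows times 2 columns). A small sketch on a shortened example vector (the names `m`, `captured` and `new_var` are just for illustration):

```r
library(stringr)
library(dplyr)   # for coalesce()

target <- c("jeb-bush", "jeb-bush-supporters", "jeb-staffer", "the-media")

m <- str_match(target, "^jeb-.*?([^-]+)$")
dim(m)      # 4 rows (one per input string), 2 columns
length(m)   # 8, i.e. rows * columns, not the length of `target`

captured <- m[, 2]                      # first capture group; NA where no match
new_var  <- coalesce(captured, "other") # fill the NA rows with "other"
new_var
# "bush" "supporters" "staffer" "other"
```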

2 Comments

Thank you! One question: What is the purpose of the ? in your regular expression in the coalesce() function?
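@N1loon It makes the preceding quantifier lazy: .*? matches as few characters as possible, which leaves the whole last segment for the capture group. You may check here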
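
To make the difference concrete, here is a minimal sketch comparing the greedy and the lazy version of the pattern on one string from the question. The variable names `x`, `greedy` and `lazy` are only for illustration.

```r
library(stringr)

x <- "jeb-bush-supporters"

# Greedy: ".*" grabs as much as possible and leaves only the minimum
# (a single character) for the capture group.
greedy <- str_match(x, "^jeb-.*([^-]+)$")[, 2]

# Lazy: ".*?" grabs as little as possible, so the capture group receives
# the whole last hyphen-free segment.
lazy <- str_match(x, "^jeb-.*?([^-]+)$")[, 2]

greedy  # "s"
lazy    # "supporters"
```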
