I am trying to extract part of a column in a data frame using regular expressions. I am running into two problems: grep returns the whole value, not just the matched part, and str_extract doesn't seem to work in a vectorized way.

Here is what I'm trying. I would like df$match to show alpha.alpha. where the pattern exists and NA otherwise. How can I show only the matched part?

Also, how can I replace [a-zA-Z] in R regex? Can I use a character class or a POSIX code like [:alpha:]?

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)

df$match <- grepl("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match

#TRUE FALSE  TRUE FALSE

v2grep <- grep("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2, value = TRUE)

df$match[df$match == TRUE] <- v2grep
df$match[df$match == FALSE] <- NA

df

#v1  v2      match
#1   _a.b._  _a.b._
#2   <NA>    <NA>
#3   _C.D._  _C.D._
#4   _ef_    <NA>

What I want:

#v1  v2      match
#1   _a.b._  a.b.
#2   <NA>    <NA>
#3   _C.D._  C.D.
#4   _ef_    <NA>

3 Answers

4 Approaches...

Here are two approaches in base R, plus one using rm_default(extract = TRUE) from the qdapRegex package (which I maintain) and one using the stringi package.

unlist(sapply(regmatches(df[["v2"]], gregexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df[["v2"]])), function(x){
        ifelse(identical(character(0), x), NA, x)
    })
)

## [1] "a.b." NA     "C.D." NA 

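pat <- "(.*?)([a-zA-Z]\\.[a-zA-Z]\\.)(.*?)$"
# Work on a copy so the original column stays intact for the other approaches
x <- df[["v2"]]
x[!grepl(pat, x)] <- NA   # no match (or NA input) becomes NA
gsub(pat, "\\2", x)       # keep only the second capture group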

## [1] "a.b." NA     "C.D." NA

library(qdapRegex)
unlist(rm_default(df[["v2"]], pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", extract = TRUE))

## [1] "a.b." NA     "C.D." NA 

library(stringi)
stri_extract_first_regex(df[["v2"]], "[a-zA-Z]\\.[a-zA-Z]\\.")

## [1] "a.b." NA     "C.D." NA 

4 Comments

I like the stringi solution. This is really a rich package; there's a lot one can do with it after taking the time to study it!
The stringi seems to do just what I need easily. Will it also allow me to separate out parts of a pattern, like you do with gsub(pat, "\\2", df[["v2"]]) in your second solution?
Yes but you need a different pattern and possibly function depending on what you are after.
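As a rough illustration of the "different function" mentioned above: one option, assuming the stringi package is installed, is stri_match_first_regex, which returns a character matrix with the full match in the first column and one column per capture group. The grouping in the pattern below is only an example and not part of the original answer.

```r
library(stringi)  # assumes the stringi package is installed

v2 <- c("_a.b._", NA, "_C.D._", "_ef_")

# Illustrative grouping: group 1 = first letter, group 2 = second letter
m <- stri_match_first_regex(v2, "([a-zA-Z])\\.([a-zA-Z])\\.")

m[, 1]  # full match, same result as stri_extract_first_regex
m[, 2]  # first capture group
m[, 3]  # second capture group
```

Rows for NA input and for strings without a match are filled with NA, so the result lines up with the original vector.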
Please see my new question
A base R solution using regmatches and regexpr, which returns -1 when no match is found (and NA for NA input):

r <- regexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match <- NA
df$match[which(r != -1)] <- regmatches(df$v2, r)

#  v1     v2 match
#1  1 _a.b._  a.b.
#2  2   <NA>  <NA>
#3  3 _C.D._  C.D.
#4  4   _ef_  <NA>
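On the side question about replacing [a-zA-Z]: a POSIX class can be used, but it has to sit inside a bracket expression, so it is written [[:alpha:]] rather than [:alpha:]. Here is a minimal sketch of the same regexpr/regmatches approach with that class. The names pat, hit and match_col are just illustrative, and note that per ?regex the POSIX classes depend on the locale, so [[:alpha:]] may match more than the ASCII letters.

```r
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")

# POSIX class inside a bracket expression: [[:alpha:]], not [:alpha:]
pat <- "[[:alpha:]]\\.[[:alpha:]]\\."

r <- regexpr(pat, v2)            # -1 for no match, NA for NA input
hit <- !is.na(r) & r != -1       # positions that actually matched

match_col <- rep(NA_character_, length(v2))
match_col[hit] <- regmatches(v2, r)  # regmatches returns the matches only
match_col
```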

One possible solution using both grepl and sub:

# First, remove unwanted characters around pattern when detected
df$match <- sub(pattern = ".*([a-zA-Z]\\.[a-zA-Z]\\.).*", 
                replacement = "\\1", x = df$v2)
# Second, check if pattern is present; otherwise set to NA
df$match <- ifelse(grepl(pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", x = df$match),
                   yes = df$match, no = NA)

Results

df

#   v1     v2 match
# 1  1 _a.b._  a.b.
# 2  2   <NA>  <NA>
# 3  3 _C.D._  C.D.
# 4  4   _ef_  <NA>

Data

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)

2 Comments

Can you explain what the "\\1" does? Does it refer to the first grouped () section in the pattern? It seems straightforward here, but when I experiment in something more complicated it doesn't seem to work exactly that way. Is there a good resource for that kind of notation?
Yes, the \\1 refers to the first (and only, this time) grouped pattern (inside parentheses). When you have several groups of parentheses, and some nested in others, it becomes a little bit more tricky, but still pretty straightforward. For sources, ?regex is a good start (and has other links at the end).
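To make the numbering concrete, here is a small sketch with several groups, one of them enclosing the others. Groups are numbered by the position of their opening parenthesis, left to right. The extra parentheses in this pattern are added for illustration only and are not part of the answer above.

```r
x <- "_a.b._"

# Group 1 is the outer pair; groups 2 and 3 are nested inside it
pat <- ".*(([a-zA-Z])\\.([a-zA-Z])\\.).*"

whole  <- sub(pat, "\\1", x)  # outer group: "a.b."
first  <- sub(pat, "\\2", x)  # first inner group: "a"
second <- sub(pat, "\\3", x)  # second inner group: "b"

# Backreferences can also be combined or reordered in the replacement
swapped <- sub(pat, "\\3.\\2.", x)  # "b.a."
```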
