Replace a data frame column based on regex

Question

I am trying to extract part of a column in a data frame using regular expressions. Problems I am running into include the facts that grep returns the whole value, not just the matched part, and that str_extract doesn't seem to work in a vectorized way.

Here is what I'm trying. I would like df$match to show alpha.alpha. where the pattern exists and NA otherwise. How can I show only the matched part?

Also, how can I replace [a-zA-Z] in R regex? Can I use a character class or a POSIX code like [:alpha:]?

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)

df$match <- grepl("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match

#TRUE FALSE  TRUE FALSE

v2grep <- grep("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2, value = TRUE)

df$match[df$match == TRUE] <- v2grep
df$match[df$match == FALSE] <- NA

df

#v1  v2      match
#1   _a.b._  _a.b._
#2   <NA>    <NA>
#3   _C.D._  _C.D._
#4   _ef_    <NA>

What I want:

#v1  v2      match
#1   _a.b._  a.b.
#2   <NA>    <NA>
#3   _C.D._  C.D.
#4   _ef_    <NA>

Tyler Rinker · Accepted Answer · 2015-04-09 04:02:17Z

4

4 Approaches...

Here are two approaches in base R, plus one using rm_default(extract = TRUE) from the qdapRegex package (which I maintain) and one using the stringi package.

unlist(sapply(regmatches(df[["v2"]], gregexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df[["v2"]])), function(x){
        ifelse(identical(character(0), x), NA, x)
    })
)

## [1] "a.b." NA     "C.D." NA 
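As a rough illustration of why the first approach needs the character(0) check: regmatches() combined with gregexpr() returns a list with one element per input string, and elements with no match (or NA input) come back empty. The names hits and first below are only illustrative, and vapply() is swapped in here for the sapply()/ifelse() combination; it is not part of the original answer.

```r
v2  <- c("_a.b._", NA, "_C.D._", "_ef_")
pat <- "[a-zA-Z]\\.[a-zA-Z]\\."

# One list element per input string; empty where nothing matched
hits <- regmatches(v2, gregexpr(pat, v2))
lengths(hits)
# [1] 1 0 1 0

# Take the first match of each element, or NA when the element is empty
first <- vapply(hits,
                function(h) if (length(h) > 0) h[1] else NA_character_,
                character(1))
first
# [1] "a.b." NA     "C.D." NA
```

vapply() declares the return type up front, so the result is always a character vector even if every element turns out to be empty.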

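# Work on a copy so that df[["v2"]] itself is left untouched
pat <- "(.*?)([a-zA-Z]\\.[a-zA-Z]\\.)(.*?)$"
x <- df[["v2"]]
x[!grepl(pat, x)] <- NA   # no match (or NA input) becomes NA
gsub(pat, "\\2", x)       # keep only the second capture group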

## [1] "a.b." NA     "C.D." NA

library(qdapRegex)
unlist(rm_default(df[["v2"]], pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", extract = TRUE))

## [1] "a.b." NA     "C.D." NA 

library(stringi)
stri_extract_first_regex(df[["v2"]], "[a-zA-Z]\\.[a-zA-Z]\\.")

## [1] "a.b." NA     "C.D." NA

edited Apr 9, 2015 at 4:02

answered Apr 9, 2015 at 3:14

Tyler Rinker


4 Comments

Dominic Comtois Over a year ago

I like the stringi solution. This is really a rich package, there's a lot one can do with it when taking time to study it!

Kevin M Over a year ago

The stringi seems to do just what I need easily. Will it also allow me to separate out parts of a pattern, like you do with gsub(pat, "\\2", df[["v2"]]) in your second solution?

Tyler Rinker Over a year ago

Yes but you need a different pattern and possibly function depending on what you are after.
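A minimal sketch of what "a different pattern and possibly function" could look like, assuming the goal is to pull out each letter separately. The pattern with two capture groups and the helper get_group() are assumptions made for illustration, not something given in the thread. The base R part uses regexec(); the stringi part uses stri_match_first_regex() and only runs if the package is installed.

```r
v2  <- c("_a.b._", NA, "_C.D._", "_ef_")
pat <- "([a-zA-Z])\\.([a-zA-Z])\\."   # one capture group per letter

# Base R: regexec() records the whole match plus each capture group
parts <- regmatches(v2, regexec(pat, v2))

# Hypothetical helper: pull out group i, using NA where there was no match
get_group <- function(parts, i) {
  vapply(parts,
         function(p) if (length(p) > i) p[i + 1L] else NA_character_,
         character(1))
}

first_letter  <- get_group(parts, 1)
second_letter <- get_group(parts, 2)
first_letter
# [1] "a" NA  "C" NA
second_letter
# [1] "b" NA  "D" NA

# stringi equivalent, run only when the package is available
if (requireNamespace("stringi", quietly = TRUE)) {
  # Character matrix: column 1 is the full match, columns 2 and 3 the groups
  m <- stringi::stri_match_first_regex(v2, pat)
  print(m)
}
```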

Kevin M Over a year ago

Please see my new question

thelatemail · Accepted Answer · 2015-04-09 03:48:52Z

4

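Base R solution using regmatches and regexpr. regexpr returns -1 where no match is found (and NA for NA input), and regmatches returns only the matched pieces, so they are assigned back by position: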

r <- regexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match <- NA
df$match[which(r != -1)] <- regmatches(df$v2, r)

#  v1     v2 match
#1  1 _a.b._  a.b.
#2  2   <NA>  <NA>
#3  3 _C.D._  C.D.
#4  4   _ef_  <NA>
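The same idea can be wrapped in a small function. This is only a sketch: extract_first() is a hypothetical name, not a base R function. It also uses the POSIX class [[:alpha:]] in place of [a-zA-Z], which touches on the second part of the question. Note the double brackets: [:alpha:] is only valid inside a bracket expression.

```r
# Hypothetical helper: first match per element, NA where there is none
extract_first <- function(x, pattern) {
  m   <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m != -1L    # NA input gives NA, a non-match gives -1
  out[hit] <- regmatches(x, m)   # regmatches() returns matched elements only
  out
}

v2  <- c("_a.b._", NA, "_C.D._", "_ef_")
res <- extract_first(v2, "[[:alpha:]]\\.[[:alpha:]]\\.")
res
# [1] "a.b." NA     "C.D." NA
```

Keep in mind that [[:alpha:]] depends on the locale and may match accented letters that [a-zA-Z] does not.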

edited Apr 9, 2015 at 3:48

answered Apr 9, 2015 at 3:40

thelatemail


Dominic Comtois · Accepted Answer · 2015-04-09 03:57:13Z

3

One possible solution using both grepl and sub:

# First, remove unwanted characters around pattern when detected
df$match <- sub(pattern = ".*([a-zA-Z]\\.[a-zA-Z]\\.).*", 
                replacement = "\\1", x = df$v2)
# Second, check if pattern is present; otherwise set to NA
df$match <- ifelse(grepl(pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", x = df$match),
                   yes = df$match, no = NA)

Results

df

#   v1     v2 match
# 1  1 _a.b._  a.b.
# 2  2   <NA>  <NA>
# 3  3 _C.D._  C.D.
# 4  4   _ef_  <NA>

Data

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)
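To see why the second step is needed, it may help to look at the intermediate result. sub() returns its input unchanged when the pattern does not match, so "_ef_" survives the first step and has to be turned into NA afterwards. The names step1 and step2 are only illustrative.

```r
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")

# Step 1: strip everything around the pattern; non-matches pass through as-is
step1 <- sub(".*([a-zA-Z]\\.[a-zA-Z]\\.).*", "\\1", v2)
step1
# [1] "a.b." NA     "C.D." "_ef_"

# Step 2: grepl() is FALSE for NA input and for "_ef_", so both become NA
step2 <- ifelse(grepl("[a-zA-Z]\\.[a-zA-Z]\\.", step1), step1, NA)
step2
# [1] "a.b." NA     "C.D." NA
```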

edited Apr 9, 2015 at 3:57

answered Apr 9, 2015 at 3:21

Dominic Comtois


2 Comments

Kevin M Over a year ago

Can you explain what the "\\1" does? Does it refer to the first grouped () section in the pattern? It seems straightforward here, but when I experiment in something more complicated it doesn't seem to work exactly that way. Is there a good resource for that kind of notation?

Dominic Comtois Over a year ago

Yes, the \\1 refers to the first (and only, this time) grouped pattern (inside parentheses). When you have several groups of parentheses, and some nested in others, it becomes a little bit more tricky, but still pretty straightforward. For sources, ?regex is a good start (and has other links at the end).
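A short sketch of how the numbering works with several groups, some of them nested. Groups are numbered by the position of their opening parenthesis, counting from the left. The date string and pattern are made up for illustration.

```r
x   <- "2015-04-09"
#        1 2            3            4
pat <- "(([0-9]{4})-([0-9]{2}))-([0-9]{2})"

sub(pat, "\\1", x)   # outer group: year and month
# [1] "2015-04"
sub(pat, "\\2", x)   # first nested group: year
# [1] "2015"
sub(pat, "\\3", x)   # second nested group: month
# [1] "04"
sub(pat, "\\4", x)   # last group: day
# [1] "09"

# Several backreferences can be combined in one replacement
reordered <- sub(pat, "\\4/\\3/\\2", x)
reordered
# [1] "09/04/2015"
```

If a backreference seems to return the wrong piece in a more complicated pattern, counting the opening parentheses from the left is usually the quickest check.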
