What R function to use for regex capture groups?

Question

I am doing some text wrangling in R, and for a specific extraction I need to use a capture group. For some reason the base/stringr functions I am familiar with don't seem to support capture groups:

str_extract("abcd123asdc", pattern = "([0-9]{3}).+$") 
# Returns: "123asdc"

stri_extract(str = "abcd123asdc", regex = "([0-9]{3}).+$")
# Returns: "123asdc"

grep(x = "abcd123asdc", pattern = "([0-9]{3}).+$", value = TRUE)
# Returns: "abcd123asdc"

The usual googling for "R capture group regex" doesn't give any useful hits for solutions to this problem. Am I missing something, or are capture groups not implemented in R?

EDIT: After trying the solution suggested in the comments, I found that it works on a small example but fails for my situation.

Note that this text is from the Enron emails dataset, so it doesn't contain sensitive information.

txt <- "Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Re: test
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Leah Van Arsdall
X-cc: 
X-bcc: 
X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\sent mail   
X-Origin: Allen-P
X-FileName: pallen.nsf

test successful.  way to go!!!"

sub("X-FileName:.+\n\n([\\W\\w]+)$", "\\1", txt)
# Returns all of "txt", not the capture group

Since we only have a single capture group, shouldn't the "\1" capture it? I tested the regex with an online regex tester and it should be working. I also tried both \n and \\n for the newlines. Any ideas?

@hwnd: The actual regex isn't so easy to explicitly match, but this example was easier to type over.
– BallzofFury, Commented May 14, 2017 at 20:38
@BallzofFury: In your input text, \P and \N are unknown escape sequences, the backslash must be doubled.
– Wiktor Stribiżew, Commented May 14, 2017 at 21:32

Community · Accepted Answer · 2020-06-20 09:12:55Z

Getting the job done

You may always extract capture groups with stringr using str_match or str_match_all:

> library(stringr)
> result <- str_match(txt, "X-FileName:.+\n\n(?s)(.+)$")
> result[,2]
[1] "test successful.  way to go!!!"

Pattern details:

X-FileName: - a literal substring
.+ - any 1+ chars other than line break (since in ICU regex, a dot does not match a line break char)
\n\n - 2 newline symbols
(?s) - an inline DOTALL modifier (now, . that occurs to the right will match a line break char)
(.+) - Group 1 capturing any 1+ chars (incl. line breaks) up to
$ - the end of string.
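
To connect this back to the small example in the question, here is a minimal sketch (assuming the stringr package is installed; the variable names are only illustrative) contrasting str_extract, which returns the full match, with str_match, which returns a matrix whose second column holds the first capture group:

```r
library(stringr)

x <- "abcd123asdc"

# str_extract() returns only the full match, so the group is not separated out.
full <- str_extract(x, "([0-9]{3}).+$")

# str_match() returns a character matrix: column 1 holds the full match,
# column 2 the first capture group, column 3 the second, and so on.
m <- str_match(x, "([0-9]{3}).+$")
digits <- m[, 2]

print(full)    # "123asdc"
print(digits)  # "123"
```

Indexing with [, 2] rather than [2] keeps the code working when x is a vector of several strings, since each input string gets its own row.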

Or you may use base R regmatches with regexec:

> result <- regmatches(txt, regexec("X-FileName:[^\n]+\n\n(.+)$", txt))
> result[[1]][2]
[1] "test successful.  way to go!!!"

Here, the default TRE regex engine is used, in which . matches any character including a line break char, so the rest of the X-FileName: line is matched with [^\n]+ rather than .+ (recent versions of R also accept perl = TRUE in regexec, but it is not needed here). The pattern X-FileName:[^\n]+\n\n(.+)$ breaks down as follows:

X-FileName: - a literal string
[^\n]+ - 1+ chars other than newline
\n\n - 2 newlines
(.+) - any 1+ chars (including line break chars), as many as possible, up to
$ - the end of string.
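
The same base R approach applied to the short string from the question may make the indexing clearer. This is only a sketch; the second capture group and the variable names are added here for illustration:

```r
x <- "abcd123asdc"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into substrings.
m <- regmatches(x, regexec("([0-9]{3})(.+)$", x))

# The result is a list with one element per input string. Within each element,
# position 1 is the full match and positions 2, 3, ... are the capture groups.
full   <- m[[1]][1]
digits <- m[[1]][2]
rest   <- m[[1]][3]

print(full)    # "123asdc"
print(digits)  # "123"
print(rest)    # "asdc"
```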

A sub option can also be considered:

sub(".*X-FileName:[^\n]+\n\n", "", txt)
[1] "test successful.  way to go!!!"

Here, .* matches any 0+ chars, as many as possible (the whole string), then backtracks to find the X-FileName: substring; [^\n]+ matches 1+ chars other than a newline, and \n\n matches 2 newlines. Everything matched is replaced with an empty string, so only the message body remains.
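
If you prefer to keep a capture group with sub, the pattern has to consume the entire string so that the replacement "\\1" leaves only the group. The sketch below uses the short string from the question (the pattern and variable names are illustrative, not taken from the question). It also shows that sub returns its input unchanged when the pattern does not match at all, which is one way to end up with the whole text back:

```r
x <- "abcd123asdc"

# The pattern covers the whole string; only the first group survives.
digits <- sub("^[^0-9]*([0-9]{3}).*$", "\\1", x)
print(digits)  # "123"

# When nothing matches, sub() performs no replacement and returns the input.
unchanged <- sub("^[^0-9]*([0-9]{3}).*$", "\\1", "no digits here")
print(unchanged)  # "no digits here"
```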

Comparing performance

Taking into account hwnd's comment, I added a TRE regex based sub option above, and it seems to be the fastest of all four options suggested, with str_match being almost as fast as that sub code:

library(microbenchmark)
library(stringr)

# Each function uses its own argument rather than the global txt
f1 <- function(text) { return(str_match(text, "X-FileName:.+\n\n(?s)(.+)$")[,2]) }
f2 <- function(text) { return(regmatches(text, regexec("X-FileName:[^\n]+\n\n(.+)$", text))[[1]][2]) }
f3 <- function(text) { return(sub('(?s).*X-FileName:[^\n]+\\R+', '', text, perl=TRUE)) }
f4 <- function(text) { return(sub('.*X-FileName:[^\n]+\n\n', '', text)) }

> test <- microbenchmark( f1(txt), f2(txt), f3(txt), f4(txt), times = 500000 )
> test
Unit: microseconds
    expr    min     lq     mean median     uq       max neval  cld
 f1(txt) 21.130 24.451 28.08150 27.168 28.677 53796.565 5e+05  b  
 f2(txt) 29.280 32.903 37.46800 35.318 37.431 54556.635 5e+05   c 
 f3(txt) 57.655 59.466 63.36906 60.674 61.881  1651.448 5e+05    d
 f4(txt) 22.036 23.545 25.56820 24.451 25.356  1660.504 5e+05 a

edited Jun 20, 2020 at 9:12

answered May 14, 2017 at 20:54

Wiktor Stribiżew

3 Comments

hwnd Over a year ago

Or sub('(?s).*X-FileName:[^\n]+\\R+', '', txt, perl=TRUE) ... codebunk.com/pb/852138641

Wiktor Stribiżew Over a year ago

@hwnd: It turns out a similar TRE regex with sub performs better.

BallzofFury Over a year ago

Thanks so much, this works great! It wasn't clear to me that the capture group was returned in the second index of the result, and that's what caused my confusion. I think your comment will help a lot of people!