I am doing some text wrangling in R, and for a specific extraction I need to use a capture group. For some reason the base/stringr functions I am familiar with don't seem to support capture groups:

str_extract("abcd123asdc", pattern = "([0-9]{3}).+$") 
# Returns: "123asdc"

stri_extract(str = "abcd123asdc", regex = "([0-9]{3}).+$")
# Returns: "123asdc"

grep(x = "abcd123asdc", pattern = "([0-9]{3}).+$", value = TRUE)
# Returns: "abcd123asdc"

The usual googling for "R capture group regex" doesn't give any useful hits for solutions to this problem. Am I missing something, or are capture groups not implemented in R?

EDIT: After trying the solution suggested in the comments, which works on a small example, I found that it fails for my actual situation.

Note that this text is from the Enron email dataset, so it doesn't contain sensitive information.

txt <- "Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Re: test
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Leah Van Arsdall
X-cc: 
X-bcc: 
X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\sent mail   
X-Origin: Allen-P
X-FileName: pallen.nsf

test successful.  way to go!!!"

sub("X-FileName:.+\n\n([\\W\\w]+)$", "\\1", txt)
# Returns all of "txt", not the capture group

Since we only have a single capture group, shouldn't the "\1" capture it? I tested the regex with an online regex tester and it should be working. I also tried both \n and \\n for the newlines. Any ideas?

  • sub(".*([0-9]{3}.+$)", "\\1", "abcd123asdc") perhaps Commented May 14, 2017 at 20:36
  • @hwnd: The actual regex isn't so easy to explicitly match, but this example was easier to type over. Commented May 14, 2017 at 20:38
  • @David Arenburg: Awesome, that seems to work! Commented May 14, 2017 at 20:39
  • See also gregexpr Commented May 14, 2017 at 20:43
  • @BallzofFury: In your input text, \P and \N are unknown escape sequences, the backslash must be doubled. Commented May 14, 2017 at 21:32

1 Answer

Getting the job done

You may always extract capture groups with stringr using str_match or str_match_all:

> result <- str_match(txt, "X-FileName:.+\n\n(?s)(.+)$")
> result[,2]
[1] "test successful.  way to go!!!"

Pattern details:

  • X-FileName: - a literal substring
  • .+ - any 1+ chars other than line break (since in ICU regex, a dot does not match a line break char)
  • \n\n - 2 newline symbols
  • (?s) - an inline DOTALL modifier (now, . that occurs to the right will match a line break char)
  • (.+) - Group 1 capturing any 1+ chars (incl. line breaks) up to
  • $ - the end of string.
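For the simpler string from the question, a minimal sketch of what str_match returns may make the indexing clearer: column 1 holds the full match and column 2 holds capture group 1. The variable names x and m are illustrative only and not part of the original code.

```r
library(stringr)

x <- "abcd123asdc"

# str_match() returns a character matrix: one row per input string,
# column 1 = full match, columns 2+ = capture groups in order
m <- str_match(x, "([0-9]{3}).+$")

m[, 1]  # full match: "123asdc"
m[, 2]  # capture group 1: "123"
```

This is why result[,2] is used above: str_extract only ever returns what would be column 1 here.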

Or you may use base R regmatches with regexec:

> result <- regmatches(txt, regexec("X-FileName:[^\n]+\n\n(.+)$", txt))
> result[[1]][2]
[1] "test successful.  way to go!!!"

Here, the default TRE regex engine is used (regexec is called without perl = TRUE), so . will match any character including a line break char; thus, the pattern looks like X-FileName:[^\n]+\n\n(.+)$:

  • X-FileName: - a literal string
  • [^\n]+ - 1+ chars other than newline
  • \n\n - 2 newlines
  • (.+) - any 1+ chars (including line break chars), as many as possible, up to
  • $ - the end of string.
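As a rough base R sketch on the short string from the question, the layout of the regmatches result and the dot/newline difference between the two engines can be checked as follows. The names res, tre_dot, pcre_dot and pcre_dots are illustrative only.

```r
x <- "abcd123asdc"

# regexec() records the positions of the full match and of every group;
# regmatches() turns them into a list with one character vector per input
res <- regmatches(x, regexec("([0-9]{3}).+$", x))

res[[1]][1]  # full match: "123asdc"
res[[1]][2]  # capture group 1: "123"

# Dot vs. newline: the default (TRE) engine lets . match "\n",
# PCRE (perl = TRUE) does not unless the (?s) modifier is set
tre_dot   <- grepl("a.b", "a\nb")                   # TRUE
pcre_dot  <- grepl("a.b", "a\nb", perl = TRUE)      # FALSE
pcre_dots <- grepl("(?s)a.b", "a\nb", perl = TRUE)  # TRUE
```

The first element of each list entry is always the full match, so the groups start at index 2.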

A sub option can also be considered:

sub(".*X-FileName:[^\n]+\n\n", "", txt)
[1] "test successful.  way to go!!!"

Here, .* matches any 0+ chars, as many as possible (the whole string), then backtracks to find the X-FileName: substring, [^\n]+ matches 1+ chars other than a newline, and then \n\n matches 2 newlines.
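If a backreference is preferred over deleting a prefix, keep in mind that sub only replaces the part of the string the pattern matched. A small sketch on the short string from the question, with illustrative variable names, shows the three cases. The last one is one possible explanation for getting the whole input back, since a pattern that does not match leaves the string untouched.

```r
x <- "abcd123asdc"

# The pattern consumes the whole string, so the replacement "\\1"
# leaves only what capture group 1 matched
only_group <- sub("^[^0-9]*([0-9]{3}).*$", "\\1", x)
only_group  # "123"

# Without the leading part, the text before the match is kept
partial <- sub("([0-9]{3}).*$", "\\1", x)
partial  # "abcd123"

# No match at all: sub() returns the input unchanged
unchanged <- sub("^[^0-9]*([0-9]{3}).*$", "\\1", "no digits here")
unchanged  # "no digits here"
```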

Comparing performance

Taking into account hwnd's comment, I added a TRE regex based sub option above. It seems to be the fastest of all 4 options suggested, with str_match being almost as fast as the sub code above:

library(microbenchmark)
library(stringr)  # provides str_match

# Each function works on its own argument (text) rather than the global txt
f1 <- function(text) { return(str_match(text, "X-FileName:.+\n\n(?s)(.+)$")[,2]) }
f2 <- function(text) { return(regmatches(text, regexec("X-FileName:[^\n]+\n\n(.+)$", text))[[1]][2]) }
f3 <- function(text) { return(sub('(?s).*X-FileName:[^\n]+\\R+', '', text, perl=TRUE)) }
f4 <- function(text) { return(sub('.*X-FileName:[^\n]+\n\n', '', text)) }

> test <- microbenchmark( f1(txt), f2(txt), f3(txt), f4(txt), times = 500000 )
> test
Unit: microseconds
    expr    min     lq     mean median     uq       max neval  cld
 f1(txt) 21.130 24.451 28.08150 27.168 28.677 53796.565 5e+05  b  
 f2(txt) 29.280 32.903 37.46800 35.318 37.431 54556.635 5e+05   c 
 f3(txt) 57.655 59.466 63.36906 60.674 61.881  1651.448 5e+05    d
 f4(txt) 22.036 23.545 25.56820 24.451 25.356  1660.504 5e+05 a   

3 Comments

Or sub('(?s).*X-FileName:[^\n]+\\R+', '', txt, perl=TRUE) ... codebunk.com/pb/852138641
@hwnd: It turns out a similar TRE regex with sub performs better.
Thanks so much, this works great! It wasn't clear to me that the capture group was returned in the second index of the result, and that's what caused my confusion. I think your comment will help a lot of people!
