I am doing some text wrangling in R, and for a specific extraction I need to use a capture group. For some reason the base/stringr functions I am familiar with don't seem to support capture groups:
str_extract("abcd123asdc", pattern = "([0-9]{3}).+$")
# Returns: "123asdc"
stri_extract(str = "abcd123asdc", regex = "([0-9]{3}).+$")
# Returns: "123asdc"
grep(x = "abcd123asdc", pattern = "([0-9]{3}).+$", value = TRUE)
# Returns: "abcd123asdc"
The usual googling for "R capture group regex" doesn't give any useful hits for solutions to this problem. Am I missing something, or are capture groups not implemented in R?
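For reference, a minimal sketch of how a capture group can be pulled out with base R alone, assuming the same input string as above. The variable names (x, m, full_match, group_1, digits) are illustrative only. The extraction functions shown earlier return the whole match by design. regexec() together with regmatches() also exposes the individual groups, and sub() with a backreference is another option when the pattern covers the whole string. With stringr, str_match() should serve the same purpose, returning the groups as extra matrix columns.

```r
x <- "abcd123asdc"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into strings.
m <- regmatches(x, regexec("([0-9]{3}).+$", x))
full_match <- m[[1]][1]  # the whole match
group_1    <- m[[1]][2]  # the first capture group only

# Alternative: match the entire string and keep just the group.
digits <- sub("^[^0-9]*([0-9]{3}).*$", "\\1", x)

print(c(full_match, group_1, digits))
```

Note that the sub() variant only isolates the group because the pattern spans the whole input; anything left unmatched would be kept in the result.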
EDIT: The solution suggested in the comments works on a small example, but it fails in my situation.
Note that this text is from the Enron email dataset, so it doesn't contain sensitive information.
txt <- "Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Re: test
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Leah Van Arsdall
X-cc:
X-bcc:
X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

test successful. way to go!!!"
sub("X-FileName:.+\n\n([\\W\\w]+)$", "\\1", txt)
# Returns all of "txt", not the capture group
Since we only have a single capture group, shouldn't the "\1" capture it? I tested the regex with an online regex tester and it should be working. I also tried both \n and \\n for the newlines. Any ideas?
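A possible explanation, offered as a sketch rather than a definitive diagnosis: sub() replaces only the part of the string that the pattern matches, so any headers before "X-FileName:" stay in the result unless the pattern consumes them too. In addition, R's default regex engine likely does not interpret shorthand classes such as \w inside a bracket expression, which would make the pattern fail to match and leave the input unchanged. The example below assumes perl = TRUE and uses an abbreviated stand-in message (msg) rather than the full email; it also assumes a blank line separates the headers from the body.

```r
# Abbreviated stand-in for the email; "\n\n" separates headers from body.
msg <- "Subject: Re: test\nX-Origin: Allen-P\nX-FileName: pallen.nsf\n\ntest successful. way to go!!!"

# (?s) lets "." match newlines under PCRE. The leading ".*" consumes the
# headers, so replacing the whole match with group 1 leaves only the body.
body <- sub("(?s).*X-FileName:[^\n]*\n\n(.*)$", "\\1", msg, perl = TRUE)

print(body)
```

Because the pattern now spans the entire string, the replacement "\\1" is all that remains after substitution.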
sub(".*([0-9]{3}.+$)", "\\1", "abcd123asdc") perhaps
gregexpr
\P and \N are unknown escape sequences, the backslash must be doubled.