I am trying to use the stringr package to extract the part of a string that lies between two particular patterns.

For example, I have:

my.string <- "nanaqwertybaba"
left.border  <- "nana"
right.border <- "baba"

Using the str_extract(string, pattern) function (where pattern is a POSIX regular expression), I would like to get:

"qwerty"

The solutions I found through Google did not work.

4 Answers

14

In base R you can use gsub. The parentheses in the pattern create numbered capturing groups, and the replacement "\\2" keeps only the second group, i.e. the text between the borders. The . matches any character, and the * means zero or more of the preceding element.

gsub(pattern = "(.*nana)(.*)(baba.*)",
     replacement = "\\2",
     x = "xxxnanaRisnicebabayyy")
# "Risnice"
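
One caveat with the pattern above: both .* parts are greedy, so when the borders occur more than once, the captured text is the part between the last left border and the last right border. A minimal sketch of a lazy variant follows. extract_between is a hypothetical helper name (not part of base R), and the sketch assumes the borders contain no regex metacharacters. If nothing matches, sub returns the input unchanged.

```r
# Hypothetical helper (not part of base R): returns the text between the first
# `left` and the next `right`. Assumes neither border contains regex metacharacters.
extract_between <- function(x, left, right) {
  # .*? is lazy, so each part stops at the first border it meets
  pattern <- paste0("^.*?", left, "(.*?)", right, ".*$")
  sub(pattern, "\\1", x, perl = TRUE)
}

# Greedy pattern from the answer vs. the lazy helper on a string with repeated borders
greedy <- gsub("(.*nana)(.*)(baba.*)", "\\2", "nanaAbabaBnanaCbaba")
lazy   <- extract_between("nanaAbabaBnanaCbaba", "nana", "baba")
greedy  # "C"
lazy    # "A"
```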

2 Comments

Well, the point is that I do not know that "qwerty" sits there, so there is no way I can use it in the regex pattern!
@Marciszka: you can replace "qwerty" in this example with a regular expression as well, e.g. gsub(pattern = "(.*nana)([[:alpha:]]+)(baba.*)", "\\2", x=my.string) for at least one letter.
9

I do not know whether or how this is possible with the functions provided by stringr, but you can also use base R's regexpr and substring:

pattern <- paste0("(?<=", left.border, ")[a-z]+(?=", right.border, ")")
# "(?<=nana)[a-z]+(?=baba)"

rx <- regexpr(pattern, text=my.string, perl=TRUE)
# [1] 5
# attr(,"match.length")
# [1] 6

substring(my.string, rx, rx+attr(rx, "match.length")-1)
# [1] "qwerty"
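
As a possible shortcut, base R's regmatches can pull the matched text out of a regexpr result directly, which avoids the manual substring arithmetic. The sketch below also wraps the borders in \Q...\E so that they are matched literally under perl = TRUE. between is a hypothetical helper name, not something from the answer above, and it assumes the borders do not themselves contain the sequence \E.

```r
# Hypothetical helper (not from the answer above): text between the first
# `left` and the next `right`, with both borders treated as literal strings.
between <- function(x, left, right) {
  # \Q...\E quotes the borders so characters like [ or . are matched literally
  pattern <- paste0("(?<=\\Q", left, "\\E).*?(?=\\Q", right, "\\E)")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)  # regmatches returns only the matched elements
  out
}

res1 <- between(c("nanaqwertybaba", "no borders here"), "nana", "baba")
res2 <- between("[x]value[/x]", "[x]", "[/x]")
res1  # "qwerty" NA
res2  # "value"
```

Elements without a match come back as NA, so the result always has the same length as the input vector.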

1 Comment

Thank you, sigbb! I have adjusted it a little so that it (1) matches all characters between left.border and right.border and (2) matches only up to the first occurrence of right.border. Now I have: rx <- regexpr(paste0("(?<=", left.border, ")(.*?)+(?=", right.border, ")"), text = my.string, perl = TRUE). Big thank you to you!
7

I would use str_match from stringr. From the stringr documentation: "str_match extracts capture groups formed by () from the first match. It returns a character matrix with one column for the complete match and one column for each group."

str_match(my.string, paste(left.border, '(.+)', right.border, sep=''))[,2]

The code above builds the regular expression with paste, concatenating the left border, the capture group (.+), which captures one or more characters, and the right border. Setting sep='' ensures that nothing is inserted between the pieces.

A single match is assumed, so [,2] selects the second column of the matrix returned by str_match, i.e. the first capture group.
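
Since the question asks about str_extract specifically, here is a minimal sketch of one way it could be done with lookarounds, which stringr's regex engine supports. It assumes the borders contain no regex metacharacters. It also shows a lazy variant of the capture group, (.+?), which stops at the first right border rather than the last.

```r
library(stringr)

my.string    <- "nanaqwertybaba"
left.border  <- "nana"
right.border <- "baba"

# Lookbehind/lookahead keep the borders out of the match; .*? is lazy
pattern <- paste0("(?<=", left.border, ").*?(?=", right.border, ")")
res_extract <- str_extract(my.string, pattern)

# Same idea with a lazy capture group and str_match
res_match <- str_match(my.string, paste0(left.border, "(.+?)", right.border))[, 2]

res_extract  # "qwerty"
res_match    # "qwerty"
```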


0

You can use the package unglue:

library(unglue)
my.string <- "nanaqwertybaba"
unglue_vec(my.string, "nana{res}baba")
#> [1] "qwerty"
