Regular expression in R. Grouping & Capturing

Question

I'm trying to use regular expressions in R with the stringr library. I was studying the str_match and str_replace functions. I don't understand why they give different results when I use parentheses for grouping:

library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"

a<-str_match("MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838",perl(s))
b<-str_replace("MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838",perl(s), "\\2")

a[3]
#[1] " PIAZZALE "
b
#[1] " SS"

It works correctly. The second one replaces the whole string with the contents of group 2. I don't see what your problem with this is.
– Avinash Raj, Commented Feb 24, 2015 at 12:24
I think his problem is that he expects the output of str_match[3] and of str_replace here to be equivalent.
– JonM, Commented Feb 24, 2015 at 12:28

Community · Accepted Answer · 2020-06-20 09:12:55Z

Try using just the expression s instead of perl(s):

library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"

a<-str_match("MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838",s)
b<-str_replace("MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838",s, "\\2")

a[3]
#[1] " PIAZZALE "
b
#[1] " PIAZZALE "

I've had a look in the documentation for this library: http://cran.r-project.org/web/packages/stringr/stringr.pdf

It suggests that while str_replace accepts POSIX patterns by default and Perl patterns if they are supplied, str_match accepts only POSIX-style patterns and will treat the pattern as POSIX even when it is given a Perl pattern. The two calls returned different values because they were using different expression engines. str_detect can use Perl expressions and returns either TRUE or FALSE. Could you potentially use str_detect instead of str_match?

The difference between POSIX and perl that causes this:

The POSIX engine does not recognise lazy (non-greedy) quantifiers.

Your expression

(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})

would be treated by the POSIX engine as if it were the Perl pattern

(.+)( PIAZZALE | SS)(.+)([0-9]{5})

Here the first quantified group .+ matches as much as it can (the full string) before backtracking so that the rest of the expression can be evaluated. The match succeeds once the first .+ has backed off from the end of the string far enough that it consumes only MONT SS DPR (plus two of the following spaces), leaving PIAZZALE for the second capture group, a[3].
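A minimal base R sketch of the lazy versus greedy difference, assuming a current R installation. It does not reproduce the stringr engine switch itself: both calls use sub() with perl = TRUE, so the only thing that changes is the quantifier. The variable names are illustrative.

```r
x <- "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838"

# The same pattern twice: lazy quantifiers, then greedy ones.
lazy   <- "(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
greedy <- "(.+)( PIAZZALE | SS)(.+)([0-9]{5})"

# perl = TRUE selects the PCRE engine for both calls, and "\\2"
# replaces the whole match with the contents of capture group 2.
lazy_result   <- sub(lazy,   "\\2", x, perl = TRUE)
greedy_result <- sub(greedy, "\\2", x, perl = TRUE)

lazy_result
#[1] " SS"
greedy_result
#[1] " PIAZZALE "
```

The lazy first group stops at the earliest point where the alternation can match, which is " SS". The greedy first group backs off from the end of the string and stops at the last such point, which is " PIAZZALE ".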

Simplified Explanation of Engine Inner Workings

Here is a simplified explanation of how the different engines are processing your string. All of your quantifiers/alternation are directly wrapped in capture groups so the numbered quantifiers in the following examples are also your capture groups:

Perl:

Quantifier 1: "M"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MO"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MON"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " "
Quantifier 4: FAILED - MUST BACKTRACK

Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " D"
Quantifier 4: FAILED - MUST BACKTRACK

...

Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " DPR   PIAZZALE CADORNA, 1A RICCIONE   "
Quantifier 4: "47838"

SUCCESS

POSIX:

Quantifier 1: "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   4783"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   478"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47"
Quantifier 2: FAILED - MUST BACKTRACK

...

Quantifier 1: "MONT SS DPR   P"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR   "
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR  "
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE   47838"
Quantifier 4: FAILED - MUST BACKTRACK

...

Quantifier 1: "MONT SS DPR  "
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE   "
Quantifier 4: "47838"

SUCCESS
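The final state of each trace can be checked with a short base R sketch. It assumes R 3.3 or later (where regexec() accepts perl = TRUE) and uses an explicitly greedy pattern to stand in for the POSIX reading described above. The helper name capture_groups is hypothetical.

```r
x <- "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838"

lazy   <- "(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
greedy <- "(.+)( PIAZZALE | SS)(.+)([0-9]{5})"

# Hypothetical helper: return only the capture groups of the first match.
capture_groups <- function(pattern, text) {
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  m[-1]  # drop element 1, which is the full match
}

lazy_groups   <- capture_groups(lazy, x)
greedy_groups <- capture_groups(greedy, x)

lazy_groups
#[1] "MONT"
#[2] " SS"
#[3] " DPR   PIAZZALE CADORNA, 1A RICCIONE   "
#[4] "47838"
greedy_groups
#[1] "MONT SS DPR  "
#[2] " PIAZZALE "
#[3] "CADORNA, 1A RICCIONE   "
#[4] "47838"
```

The printed layout is tidied into one element per line for readability. The four values in each vector correspond to Quantifiers 1 to 4 in the last step of the Perl and POSIX traces respectively.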

edited Jun 20, 2020 at 9:12

answered Feb 24, 2015 at 12:24

JonM

4 Comments

dax90 Over a year ago

Thank you. But I'd like to understand why it's different

JonM Over a year ago

@dax90 I've had a look in the documentation for this library: cran.r-project.org/web/packages/stringr/stringr.pdf. It suggests that while str_replace can accept Perl patterns, str_match can only accept POSIX-style patterns and will treat the pattern as POSIX even when it is given a Perl pattern. The two calls returned different values because they were using different expression engines. str_detect can use Perl expressions and returns either TRUE or FALSE. Could you potentially use str_detect instead of str_match?

dax90 Over a year ago

That's a good answer. But I still don't understand how the Perl engine differs from POSIX when evaluating my s object. Why is b<-str_replace("string", s, "\\2") different from b<-str_replace("string", perl(s), "\\2")? What changes in the interpretation of s?

dax90 Over a year ago

Thank you so much! That's very helpful and complete!
