I'm trying to use regular expressions in R with the stringr library. I was studying the str_match and str_replace functions, and I don't understand why they give different results when I use parentheses for grouping:

library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"

a<-str_match("MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838",perl(s))
b<-str_replace("MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838",perl(s), "\\2")

a[3]
#[1] " PIAZZALE "
b
#[1] " SS"
  • It works correctly. The second call replaces the whole string with the contents of capture group 2. I don't see what the problem is. Commented Feb 24, 2015 at 12:24
  • I think his problem is that the output of a[3] from str_match and the output of str_replace should be equivalent here. Commented Feb 24, 2015 at 12:28
  • Yes. I expect the outputs to be equal. Commented Feb 24, 2015 at 12:55

1 Answer

Try using just the expression s instead of perl(s):

library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"

a<-str_match("MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838",s)
b<-str_replace("MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838",s, "\\2")

a[3]
#[1] " PIAZZALE "
b
#[1] " PIAZZALE "

I've had a look in the documentation for this library: http://cran.r-project.org/web/packages/stringr/stringr.pdf

It suggests that while str_replace accepts POSIX patterns by default and Perl patterns when they are wrapped in perl(), str_match only accepts POSIX-style patterns and treats a Perl pattern as POSIX. The two calls returned different values because they were using different regex engines. str_detect can use Perl expressions and returns either TRUE or FALSE. Could you use str_detect instead of str_match?


The difference between POSIX and Perl that causes this:

The POSIX engine does not recognise lazy (non-greedy) quantifiers.

Your expression

(.+?)( PIAZZALE | SS)(.+?)([0-9]{5}) 

would be treated by the POSIX engine as if it were the Perl expression

(.+)( PIAZZALE | SS)(.+)([0-9]{5})

Here the first quantified group (.+) matches as much as it can (the full string) and then backtracks, giving up one character at a time, until the rest of the expression can match. It succeeds once (.+) has come all the way back from the end of the string to "MONT SS DPR  ", which leaves " PIAZZALE " (not " SS") for the second capture group, a[3].
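To see the greedy/lazy difference in isolation, here is a minimal sketch that runs both forms of the expression through a single engine. It uses base R's sub() with perl = TRUE rather than stringr, so it does not depend on which stringr version is installed; the variable names are only for illustration.

```r
x <- "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838"

# Same expression twice: once with lazy quantifiers, once with greedy ones.
lazy   <- "(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
greedy <- "(.+)( PIAZZALE | SS)(.+)([0-9]{5})"

# Both patterns match the whole string, so replacing the match with "\\2"
# leaves only the text captured by group 2.
group2_lazy   <- sub(lazy,   "\\2", x, perl = TRUE)
group2_greedy <- sub(greedy, "\\2", x, perl = TRUE)

print(group2_lazy)    # " SS"
print(group2_greedy)  # " PIAZZALE "
```

The only thing that changes between the two calls is the ? after each .+, which is enough to move group 2 from " SS" to " PIAZZALE ".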

Simplified Explanation of Engine Inner Workings

Here is a simplified explanation of how the different engines are processing your string. All of your quantifiers/alternation are directly wrapped in capture groups so the numbered quantifiers in the following examples are also your capture groups:

Perl:

Quantifier 1: "M"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MO"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MON"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " "
Quantifier 4: FAILED - MUST BACKTRACK

Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " D"
Quantifier 4: FAILED - MUST BACKTRACK

...

Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " DPR   PIAZZALE CADORNA, 1A RICCIONE   "
Quantifier 4: "47838"

SUCCESS

POSIX:

Quantifier 1: "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   4783"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   478"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47"
Quantifier 2: FAILED - MUST BACKTRACK

...

Quantifier 1: "MONT SS DPR   P"
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR   "
Quantifier 2: FAILED - MUST BACKTRACK

Quantifier 1: "MONT SS DPR  "
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE   47838"
Quantifier 4: FAILED - MUST BACKTRACK

...

Quantifier 1: "MONT SS DPR  "
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE   "
Quantifier 4: "47838"

SUCCESS
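
The two traces above differ only in the order in which candidates for the first group are tried. The following is a rough simulation of that ordering, not how either engine is actually implemented: first_split(), rest_pattern and alt_pattern are hypothetical names made up for this sketch, and the check on the remainder of the string simply reuses base R with perl = TRUE.

```r
x <- "MONT SS DPR   PIAZZALE CADORNA, 1A RICCIONE   47838"

# What must follow group 1: the alternation, some text, then five digits.
alt_pattern  <- "^( PIAZZALE | SS)"
rest_pattern <- "^( PIAZZALE | SS)(.+)([0-9]{5})"

# Try candidate lengths for group 1 in the given order and return the first
# split for which the remainder of the string can still match.
first_split <- function(x, lengths) {
  for (i in lengths) {
    rest <- substr(x, i + 1, nchar(x))
    if (grepl(rest_pattern, rest, perl = TRUE)) {
      group2 <- regmatches(rest, regexpr(alt_pattern, rest, perl = TRUE))
      return(list(group1 = substr(x, 1, i), group2 = group2))
    }
  }
  NULL
}

n <- nchar(x)
lazy_split   <- first_split(x, seq_len(n - 1))       # shortest group 1 first
greedy_split <- first_split(x, rev(seq_len(n - 1)))  # longest group 1 first

print(lazy_split$group2)    # " SS"
print(greedy_split$group2)  # " PIAZZALE "
```

Trying short prefixes first stops at "MONT", while trying long prefixes first stops at "MONT SS DPR  ", which is the same pair of outcomes as in the two traces.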

4 Comments

Thank you. But I'd like to understand why it's different.
@dax90 I've had a look in the documentation for this library: cran.r-project.org/web/packages/stringr/stringr.pdf It suggests that while str_replace can accept Perl patterns, str_match only accepts POSIX-style patterns and treats a Perl pattern as POSIX. The two calls returned different values because they were using different regex engines. str_detect can use Perl expressions and returns either TRUE or FALSE. Could you use str_detect instead of str_match?
That's a good answer. But I still don't understand how the Perl engine differs from POSIX when evaluating my s object. Why is b<-str_replace("string", s, "\\2") different from b<-str_replace("string", perl(s), "\\2")? What changes in the interpretation of s?
Thank you so much! That's very helpful and complete!
