Extract strings between only one known string in R

Question

I want to extract the text between two delimiters. One delimiter is a carriage return; the other is a separator that appears in several similar variants:

dput(head(decisions$Title))
c("Zinaida Shumilina et al. v. Belarus                    \r\n                    
CCPR/C/120/D/2142/2012", 
"K.E.R. vs. Canada                    \r\n                    
CCPR/C/120/D/2196/2012", 
"Lounis Khelifati v Algeria                    \r\n                    
CCPR/C/120/D/2267/2013", 
"Hibaq Said Hash v. Denmark                    \r\n                    
CCPR/C/120/D/2470/2014", 
"Anton Batanov v. Russian Federation                    \r\n                    
CCPR/C/120/D/2532/2015", 
"S. Z. v. Denmark                    \r\n                    
CCPR/C/120/D/2625/2015"
)

I essentially want to extract the country names between "v." and the carriage return \r. However, "v." is sometimes written as "v", "vs.", "vs" or "v:".

Based on the answer from a related SO question, I tried the following:

res <- str_match(decisions$Title, "(v\\.|vs\\.|v)(.*?)\\r")
res[,3]

Unfortunately, this doesn't catch all variations, and in some cases it returns data such as "ruz Tahirovich Nasyrlayev v. Turkmenistan" when trying to extract the country name from "Navruz Tahirovich Nasyrlayev v. Turkmenistan CCPR/C/117/D/2219/2012".

Is there another way to achieve this?

Wiktor Stribiżew · Accepted Answer · 2018-01-09 17:01:32Z

6

You may use

trimws(str_match(decisions$Title, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])

See the regex demo. Note that trimws will remove redundant leading and trailing whitespace chars.

Pattern details

\\b - a word boundary
v - a v char
(?:s?\\.|:)? - an optional non-capturing group that matches either an optional s followed by a . char, or a : char
\\s* - 0+ whitespace chars
(.*) - Group 1: any 0+ chars other than line break chars. You do not have to worry about whether . matches a CR symbol or not (in the TRE regex flavor used in sub, . also matches LF symbols), because trimws will cut the leading/trailing whitespace anyway.
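To see the pieces of the pattern at work without stringr, here is a minimal base R sketch. The sample titles are made up for illustration (they are not the question's data), and regexec/regmatches with perl = TRUE stand in for str_match:

```r
# Hypothetical sample titles, one per separator variant handled by the pattern.
titles <- c(
  "A. B. v. Belarus    \r\n    CCPR/C/1",
  "C. D. vs. Canada    \r\n    CCPR/C/2",
  "E. F. v Algeria    \r\n    CCPR/C/3",
  "G. H. v: Denmark    \r\n    CCPR/C/4"
)

# Same pattern as in the answer; perl = TRUE selects the PCRE engine.
pattern <- "\\bv(?:s?\\.|:)?\\s*(.*)"

# regexec() locates the whole match and each capture group;
# regmatches() turns those positions into strings.
m <- regmatches(titles, regexec(pattern, titles, perl = TRUE))

# Element 1 of each result is the whole match, element 2 is Group 1.
countries <- trimws(vapply(m, function(g) g[2], character(1)))
print(countries)
# [1] "Belarus" "Canada"  "Algeria" "Denmark"
```

Whether or not the engine lets . consume the \r, trimws strips it together with the padding spaces, so the result is the same.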

Tested in R:

> df<-c("Zinaida Shumilina et al. v. Belarus                    \r\n                    
+ CCPR/C/120/D/2142/2012", 
+ "K.E.R. vs. Canada                    \r\n                    
+ CCPR/C/120/D/2196/2012", 
+ "Lounis Khelifati v Algeria                    \r\n                    
+ CCPR/C/120/D/2267/2013", 
+ "Hibaq Said Hash v. Denmark                    \r\n                    
+ CCPR/C/120/D/2470/2014", 
+ "Anton Batanov v. Russian Federation                    \r\n                    
+ CCPR/C/120/D/2532/2015", 
+ "S. Z. v. Denmark                    \r\n                    
+ CCPR/C/120/D/2625/2015"
+ )

> trimws(str_match(df, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])
[1] "Belarus"            "Canada"             "Algeria"           
[4] "Denmark"            "Russian Federation" "Denmark"           
>
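Two inputs that the sample data does not exercise may be worth checking: a bare "vs" separator (listed in the question) and a name that starts with a lowercase "v". The sketch below uses made-up titles and base R; the stricter pattern "\\bvs?[.:]?\\s+(.*)" is only a suggested variant, and extract_group1 is an illustrative helper, not part of any package:

```r
# Hypothetical titles: a bare "vs" separator and a surname starting with "v".
titles <- c(
  "K. L. vs Canada    \r\n    CCPR/C/5",
  "J. van der Berg v. Netherlands    \r\n    CCPR/C/6"
)

# Illustrative helper: trimmed Group 1 of `pattern` for each element of x.
extract_group1 <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  trimws(vapply(m, function(g) g[2], character(1)))
}

# Pattern from the answer: the optional group needs "." or ":", so a bare
# "vs" leaves the "s" in the capture, and "van" matches before "v.".
original <- extract_group1("\\bv(?:s?\\.|:)?\\s*(.*)", titles)

# Suggested variant: optional "s", optional "." or ":", then at least one
# whitespace char, so "v" must be a separator of its own.
stricter <- extract_group1("\\bvs?[.:]?\\s+(.*)", titles)

print(original)
# [1] "s Canada"                   "an der Berg v. Netherlands"
print(stricter)
# [1] "Canada"      "Netherlands"
```

Requiring \\s+ after the separator is the design choice that keeps words such as "van" from being treated as the separator.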

edited Jan 9, 2018 at 17:01

answered Jan 9, 2018 at 16:55

Wiktor Stribiżew


2 Comments

mundos Over a year ago

Thanks. This did it. How would I check for instances where the "v" character is also uppercase, "V"?

Wiktor Stribiżew Over a year ago

@RayS. Either make the whole expression case insensitive (e.g. using the inline modifier (?i): "(?i)\\bv(?:s?\\.|:)?\\s*(.*)") or use a character class: "\\b[vV](?:s?\\.|:)?\\s*(.*)".
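As a quick check of the two options in this comment, here is a small base R sketch. The titles are hypothetical and extract_group1 is an illustrative helper built on regexec/regmatches rather than str_match:

```r
# Hypothetical titles: one with an uppercase "V.", one with a lowercase "v.".
titles <- c(
  "A. B. V. Belarus    \r\n    CCPR/C/1",
  "C. D. v. Canada    \r\n    CCPR/C/2"
)

# Illustrative helper: trimmed Group 1 of `pattern` for each element of x.
extract_group1 <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  trimws(vapply(m, function(g) g[2], character(1)))
}

# Option 1: inline modifier, the whole pattern ignores case.
inline_flag <- extract_group1("(?i)\\bv(?:s?\\.|:)?\\s*(.*)", titles)

# Option 2: character class, only the "v" ignores case.
char_class <- extract_group1("\\b[vV](?:s?\\.|:)?\\s*(.*)", titles)

print(inline_flag)
# [1] "Belarus" "Canada"
print(char_class)
# [1] "Belarus" "Canada"
```

The two differ for an all-caps "VS.": (?i) also lets the s match "S", while the character-class version would need [sS] as well. With either option, a name beginning with "V" (for example "Viktor") would match before the separator, so the data should be checked for such names.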

akrun · Accepted Answer · 2018-01-09 16:56:36Z

4

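We can use sub to match any characters (.*) up to a word boundary (\\b) followed by 'v', then zero or more s and zero or more . characters, then one or more whitespace characters (\\s+). The second group captures the characters that are not a \r ([^\r]+), and the rest of the pattern matches the carriage return and everything after it. In the replacement, use the backreference to the second captured group (\\2), then remove the trailing spaces with trimws. Here v1 is assumed to be the input character vector (decisions$Title in the question).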

trimws(sub(".*\\bv(s*\\.*)\\s+([^\r]+)\\s*\r.*", "\\2", v1))
#[1] "Belarus"            "Canada"             "Algeria"   
#[4] "Denmark"            "Russian Federation" "Denmark"
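For readers who want the pattern taken apart, the sketch below rebuilds the same regex piece by piece with paste0 and applies it to a few made-up titles (v1 here is hypothetical sample data, not the question's vector):

```r
# Hypothetical input vector in the same shape as the question's titles.
v1 <- c(
  "A. B. v. Belarus    \r\n    CCPR/C/1",
  "C. D. vs. Canada    \r\n    CCPR/C/2",
  "E. F. v Russian Federation    \r\n    CCPR/C/3"
)

# The same pattern as above, split into commented parts.
pattern <- paste0(
  ".*",        # everything before the separator
  "\\bv",      # a word boundary, then "v"
  "(s*\\.*)",  # group 1: zero or more "s", then zero or more "."
  "\\s+",      # one or more whitespace chars
  "([^\r]+)",  # group 2: one or more chars that are not a carriage return
  "\\s*\r",    # optional whitespace, then the carriage return
  ".*"         # the rest of the string
)

# The whole string is matched, so replacing it with \\2 keeps only group 2;
# trimws() then drops the padding spaces captured with the country name.
result <- trimws(sub(pattern, "\\2", v1))
print(result)
# [1] "Belarus"            "Canada"             "Russian Federation"
```

Because the pattern must consume the entire string for sub to return only the country, it relies on . also matching the line feed in the default (TRE) engine.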

edited Jan 9, 2018 at 16:56

answered Jan 9, 2018 at 16:47

akrun


1 Comment

Scipione Sarlo Over a year ago

Could you explain what your regex does?

Esteban PS · Accepted Answer · 2018-01-09 16:55:59Z

0

You can also include a word boundary before "v":

str_match(decisions$Title, "(\\b)(v\\.|vs\\.|v)(.*?)\\r")
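The effect of the word boundary can be seen on the title quoted in the question. This is a base R sketch using regexec/regmatches in place of str_match; the padding and "\r\n" in the sample string are an assumption about how that title is stored, and get_groups is an illustrative helper:

```r
# Title from the question; the spaces and "\r\n" are assumed, modelled on
# the other titles shown in the dput() output.
title <- "Navruz Tahirovich Nasyrlayev v. Turkmenistan    \r\n    CCPR/C/117/D/2219/2012"

# Illustrative helper: whole match followed by each capture group.
get_groups <- function(pattern, x) {
  regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
}

without_boundary <- get_groups("(v\\.|vs\\.|v)(.*?)\\r", title)
with_boundary <- get_groups("(\\b)(v\\.|vs\\.|v)(.*?)\\r", title)

# Without the boundary, the first "v" found is the one inside "Navruz".
country_without <- trimws(without_boundary[3])

# With the boundary wrapped in its own group, the country is group 3,
# which is element 4 here (column 4 of a str_match result).
country_with <- trimws(with_boundary[4])

print(country_without)
# [1] "ruz Tahirovich Nasyrlayev v. Turkmenistan"
print(country_with)
# [1] "Turkmenistan"
```

Writing the boundary as a bare \\b instead of (\\b) would keep the country in group 2, as in the original attempt.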

answered Jan 9, 2018 at 16:55

Esteban PS

