I want to extract the substring that sits between two delimiters. One delimiter is a carriage return; the other is one of several nearly identical character sequences:

dput(head(decisions$Title))
c("Zinaida Shumilina et al. v. Belarus                    \r\n                    
CCPR/C/120/D/2142/2012", 
"K.E.R. vs. Canada                    \r\n                    
CCPR/C/120/D/2196/2012", 
"Lounis Khelifati v Algeria                    \r\n                    
CCPR/C/120/D/2267/2013", 
"Hibaq Said Hash v. Denmark                    \r\n                    
CCPR/C/120/D/2470/2014", 
"Anton Batanov v. Russian Federation                    \r\n                    
CCPR/C/120/D/2532/2015", 
"S. Z. v. Denmark                    \r\n                    
CCPR/C/120/D/2625/2015"
)

I essentially want to extract the country names between "v." and the carriage return \r. However, "v." is sometimes "v", "vs.", "vs" and "v:".

Based on the answer from a related SO question, I tried the following:

res <- str_match(decisions$Title, "(v\\.|vs\\.|v)(.*?)\\r")
res[,3]

Unfortunately, this doesn't get all variations, or in some cases it returns data such as "ruz Tahirovich Nasyrlayev v. Turkmenistan" when trying to extract the country name from "Navruz Tahirovich Nasyrlayev v. Turkmenistan CCPR/C/117/D/2219/2012".

Is there another way to achieve this?

3 Answers

You may use

trimws(str_match(decisions$Title, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])

Note that trimws will remove redundant leading and trailing whitespace characters.

Pattern details

  • \\b - a word boundary
  • v - a v char
  • (?:s?\\.|:)? - an optional non-capturing group that matches either an optional s followed by a literal ., or a : char
  • \\s* - 0+ whitespace chars
  • (.*) - Group 1: any 0+ chars other than line break chars. You do not have to worry about whether . matches a CR symbol or not (in the TRE regex flavor used in sub, . also matches LF symbols), because trimws will strip the leading/trailing whitespace anyway.
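
To check how the pattern behaves on each separator the question lists, here is a small base-R sketch. It uses regexec(..., perl = TRUE) in place of str_match so that it runs without stringr, and the titles and the helper name extract_group1 are made up for illustration. It also tries one possible adjustment, "\\bvs?[.:]?\\s*(.*)", which is an assumption of this sketch and not part of the answer above.

```r
# Hypothetical titles, one per separator variant mentioned in the question.
titles <- c(
  "A. B. v. Belarus      \r\n      CCPR/C/1",
  "C. D. v Algeria      \r\n      CCPR/C/2",
  "E. F. vs. Canada      \r\n      CCPR/C/3",
  "G. H. vs Denmark      \r\n      CCPR/C/4",
  "I. J. v: Turkmenistan      \r\n      CCPR/C/5"
)

# Hypothetical helper: capture group 1 of the first match, trimmed
# (NA when the pattern does not match at all).
extract_group1 <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  trimws(vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
                character(1)))
}

# Pattern from the answer above.
original <- extract_group1("\\bv(?:s?\\.|:)?\\s*(.*)", titles)

# Assumed variant: optional s, then an optional . or : after it.
adjusted <- extract_group1("\\bvs?[.:]?\\s*(.*)", titles)

print(original)
print(adjusted)
```

On these inputs the answer's pattern returns "s Denmark" for the bare "vs" title, because the optional group only consumes the s when a dot follows it. The adjusted variant returns "Denmark", and both agree on the other four titles.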

Tested in R:

> df<-c("Zinaida Shumilina et al. v. Belarus                    \r\n                    
+ CCPR/C/120/D/2142/2012", 
+ "K.E.R. vs. Canada                    \r\n                    
+ CCPR/C/120/D/2196/2012", 
+ "Lounis Khelifati v Algeria                    \r\n                    
+ CCPR/C/120/D/2267/2013", 
+ "Hibaq Said Hash v. Denmark                    \r\n                    
+ CCPR/C/120/D/2470/2014", 
+ "Anton Batanov v. Russian Federation                    \r\n                    
+ CCPR/C/120/D/2532/2015", 
+ "S. Z. v. Denmark                    \r\n                    
+ CCPR/C/120/D/2625/2015"
+ )

> trimws(str_match(df, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])
[1] "Belarus"            "Canada"             "Algeria"           
[4] "Denmark"            "Russian Federation" "Denmark"           
> 

2 Comments

Thanks. This did it. How would I check for instances where the "v" character is also uppercase, "V"?
@RayS. Either make the whole expression case insensitive (e.g. with the inline modifier (?i): "(?i)\\bv(?:s?\\.|:)?\\s*(.*)") or use a character class: "\\b[vV](?:s?\\.|:)?\\s*(.*)".
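
The two suggestions in this comment can be compared with a short base-R sketch. The titles and the helper first_group are hypothetical, and regexec(..., perl = TRUE) stands in for str_match. The character-class version only makes the v case insensitive, so an uppercase "VS." still leaks "S." into the capture, whereas (?i) covers the whole separator.

```r
# Hypothetical titles with uppercase separators.
titles <- c(
  "A. B. V. Belarus      \r\n      CCPR/C/1",
  "C. D. VS. Canada      \r\n      CCPR/C/2"
)

# Hypothetical helper: capture group 1 of the first match, trimmed.
first_group <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  trimws(vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
                character(1)))
}

# Whole expression case insensitive via the inline modifier.
inline_flag <- first_group("(?i)\\bv(?:s?\\.|:)?\\s*(.*)", titles)

# Only the leading v is case insensitive.
char_class <- first_group("\\b[vV](?:s?\\.|:)?\\s*(.*)", titles)

print(inline_flag)
print(char_class)
```

One trade-off to keep in mind: with (?i) the pattern also matches at any earlier word that starts with "V", so a title beginning with a name such as "Viktor" would be split at the wrong place.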

We can use sub to match any characters (.*) up to a word boundary (\\b) followed by 'v', then zero or more s and zero or more . characters (s*\\.*), one or more whitespace characters (\\s+), and then capture the characters that are not a \r ([^\r]+), followed by the \r and the rest of the string. In the replacement, use the backreference to the second captured group (\\2) and remove the trailing spaces with trimws. Here v1 stands for the character vector of titles.

trimws(sub(".*\\bv(s*\\.*)\\s+([^\r]+)\\s*\r.*", "\\2", v1))
#[1] "Belarus"            "Canada"             "Algeria"   
#[4] "Denmark"            "Russian Federation" "Denmark"           

1 Comment

Could you explain what your regex does?

You can also include a word boundary before "v", so that a "v" inside a name is not matched:

str_match(decisions$Title, "(\\b)(v\\.|vs\\.|v)(.*?)\\r")
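
As a rough illustration of what the word boundary changes, the sketch below runs both patterns on the title quoted in the question, using base R's regexec(..., perl = TRUE) in place of str_match. The trailing spaces and "\r\n" in the sample string are assumed to follow the same layout as the other titles. Note that the extra (\\b) group shifts the country to the third capture group, which would be column 4 of the str_match result.

```r
# Title from the question; the whitespace and "\r\n" layout is assumed.
title <- "Navruz Tahirovich Nasyrlayev v. Turkmenistan      \r\n      CCPR/C/117/D/2219/2012"

# Element 1 is the full match; elements 2.. are the capture groups.
without_b <- regmatches(
  title, regexec("(v\\.|vs\\.|v)(.*?)\\r", title, perl = TRUE))[[1]]
with_b <- regmatches(
  title, regexec("(\\b)(v\\.|vs\\.|v)(.*?)\\r", title, perl = TRUE))[[1]]

# Without \\b the first "v" found is the one inside "Navruz".
country_without_b <- trimws(without_b[3])

# With \\b the match starts at the standalone "v.".
country_with_b <- trimws(with_b[4])

print(country_without_b)
print(country_with_b)
```

This reproduces the "ruz Tahirovich Nasyrlayev v. Turkmenistan" result from the question for the first pattern and gives "Turkmenistan" for the second. The alternation still does not consume the "s" of a bare "vs" or the colon in "v:", so those variants need a broader separator pattern.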
