From the string

s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

I want to extract the text that follows each word enclosed in | symbols.

My approach:

words <- list("tree","house","street","car")

for(word in words){
   expression <- paste0("^.*\\|",word,"\\|\\s*(.+?)\\s*\\|.*$")
   print(sub(expression, "\\1", s))
}

This works fine for all but the last word, car. For that word it returns the entire string s instead. How can I modify the regex so that, for the last element of the words list, it prints dolore magna aliqua.?

Edit: Previously the words in the list were a, b, c, d. Solutions written for that specific case do not generalize well.

  • For getting the regex right, I'd recommend taking a look at regex.inginf.units.it if you're not very comfortable with it. Commented Aug 17, 2020 at 11:50
  • I always find using sub in these cases confusing, since you have to specify what you DON'T want to keep instead of (the more natural) what you DO want to keep. I'd advise using stringi::stri_extract_all, for example: stringi::stri_extract_all(regex = "(?<=\\|[abcd]\\| )([^\\|]+)", s). This uses a lookbehind to match the |a|, |b|, |c| and |d| without capturing it. Commented Aug 17, 2020 at 11:54
  • Thanks. Suppose the expressions I am looking for are not a,b,c,d but instead tree,house,street,car. How would I do it? Commented Aug 17, 2020 at 12:06

3 Answers

Try this:

library(stringi)

s <- '|a| Lorem ipsum dolor sit amet, |b| consectetur adipiscing elit, 
|c| sed do eiusmod tempor incididunt ut labore et |d| dolore magna aliqua.'

stri_split_regex(s, '\\|[:alpha:]\\|')

[[1]]
[1] ""                                                " Lorem ipsum dolor sit amet, "                  
[3] " consectetur adipiscing elit, \n"                " sed do eiusmod tempor incididunt ut labore et "
[5] " dolore magna aliqua."     
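
With multi-letter tags such as |tree| or |street|, a class that matches a single letter no longer finds the delimiters. The following is a minimal base R sketch of the same splitting idea, assuming the tags consist only of letters and using the string from the question; the names parts and texts are illustrative only.

```r
# The question's string, including the line break after "elit, "
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

# Split on |letters| plus any whitespace that follows the closing pipe
parts <- strsplit(s, "\\|[[:alpha:]]+\\|\\s*")[[1]]

# Drop the empty piece before the first tag and trim leftover whitespace
texts <- trimws(parts[nzchar(parts)])
print(texts)
```

Splitting keeps only the text between tags, so this fits when the tag names themselves are not needed.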
1 Comment

or stringr::str_split( s, pattern = "\\|[a-z]\\| ")

You can try this pattern

library(stringr)
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

str_extract_all(s, regex("(?<=\\|)\\w+(?=\\|)"))
#[1] "tree"   "house"  "street" "car" 
  • (?<=\\|): lookbehind, a position immediately preceded by |; \\| is an escaped (literal) |
  • \\w+: one or more word characters
  • (?=\\|): lookahead, a position immediately followed by |
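
The pattern above returns the tags themselves. A lookbehind can also be pointed at the text that follows each tag. The sketch below uses base R with perl = TRUE and assumes that every closing | is followed by a space while every opening | is followed directly by a letter, as in the question's string; the names m and after_tags are illustrative.

```r
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

# (?<=\\| ) anchors each match right after a closing pipe and a space;
# [^|]+ then runs up to the next pipe or to the end of the string
m <- regmatches(s, gregexpr("(?<=\\| )[^|]+", s, perl = TRUE))[[1]]

# Remove the whitespace left before the next tag
after_tags <- trimws(m)
print(after_tags)
```

Because [^|]+ also stops at the end of the string, the segment after the last tag is picked up without a closing |.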

I suggest extracting all the words with corresponding values using stringr::str_match_all:

s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
words1 <- list("tree","house","street","car")
library(stringr)
expression <- paste0("\\|(", paste(words1, collapse="|"),")\\|\\s*([^|]*)")
result <- str_match_all(s, expression)
lapply(result, function(x) x[,-1])

Output:

[[1]]
     [,1]     [,2]                                            
[1,] "tree"   "Lorem ipsum dolor sit amet, "                  
[2,] "house"  "consectetur adipiscing elit, \n"               
[3,] "street" "sed do eiusmod tempor incididunt ut labore et "
[4,] "car"    "dolore magna aliqua."    

The regex is

\|(tree|house|street|car)\|\s*([^|]*)

Regex details:

  • \| - a | char
  • (tree|house|street|car) - Group 1: one of the words
  • \| - a | char
  • \s* - 0 or more whitespace chars
  • ([^|]*) - Group 2: any 0 or more chars other than |.
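
The same negated character class can repair the loop from the question. The pattern there requires a literal | after the captured text, and nothing of the kind follows the last segment, so sub finds no match for car and returns s unchanged. A minimal sketch, assuming the words contain no regex metacharacters; extract_after is a hypothetical helper, not a function from any package.

```r
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
words <- c("tree", "house", "street", "car")

# extract_after() is a hypothetical helper written for this example
extract_after <- function(s, word) {
  # (?s) lets . cross the line break; [^|]* stops at the next pipe or at the end
  pattern <- paste0("(?s)^.*\\|", word, "\\|\\s*([^|]*).*$")
  trimws(sub(pattern, "\\1", s, perl = TRUE))
}

# One extracted segment per word, named by the word
result <- vapply(words, function(w) extract_after(s, w), character(1))
print(result)
```

This keeps the one-word-at-a-time lookup from the question, which may be convenient when only some of the tags are of interest.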
