Basically, this is my input:

"a ~ b c d*e !r x"
"a ~ b c"
"a ~ b c d1 !r y"
"a ~ b c D !r z"
"a~b c d*e!r z"

and I would like this as my result:

"b c d*e"
"b c"
"b c d1"
"b c D"
"b c d*e"

The input represents (mixed) models that are built up of three groups: the dependent part (before the ~), the fixed part, and the random part (after !r). I thought this would be easy enough with capture groups (example). The difficulty is that the random part is not always present.

I tried different things, as you can see below, and of course it is possible to do this in two steps. However, I would like a (robust) regex one-liner, which I feel should be possible. I also used these sources for inspiration: non-capturing groups, string replacing and string removal.

library(stringr)
txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

# Different tries with capture groups
str_replace(txt, "^.*~ (.*) !r.*$", "\\1")
> [1] "b c d*e"       "a ~ b c"       "b c d1"        "b c D"        
> [5] "a~b c d*e!r z"
str_replace(txt, "^(.*~ )(.*)( !r.*)$", "\\2")
> [1] "b c d*e"       "a ~ b c"       "b c d1"        "b c D"        
> [5] "a~b c d*e!r z"
str_replace(txt, "^(.*~)(.*)(!r.*|\n)$", "\\1\\2")
> [1] "a ~ b c d*e " "a ~ b c"      "a ~ b c d1 "  "a ~ b c D "  
> [5] "a~b c d*e"
str_replace(txt, "^(.*) ~ (.*)!r.*($)", "\\2")
> [1] "b c d*e "      "a ~ b c"       "b c d1 "       "b c D "       
> [5] "a~b c d*e!r z"
str_replace(txt, "^.* ~ (.*)(!r.*|\n)$", "\\1")
> [1] "b c d*e "      "a ~ b c"       "b c d1 "       "b c D "       
> [5] "a~b c d*e!r z"


# Multiple steps
step1 <- str_replace(txt, "^.*~\\s*", "")
step2 <- str_replace(step1, "\\s*!r.*$", "")
step2
> "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"

EDIT: After posting I kept playing around and found something that worked for my particular case.

# My (probably non-robust) solution/monstrosity
str_replace(txt, "(^.*~\\s*(.*)\\s*!r.*$|^.*~\\s*(.*)$)", "\\2\\3")
> "b c d*e " "b c"      "b c d1 "  "b c D "   "b c d*e"
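
The trailing space in the output above comes from the greedy (.*), which swallows the whitespace before !r. A minimal sketch of a single capture group that avoids this, assuming a lazy quantifier and PCRE via base R's sub(..., perl = TRUE) are acceptable (the variable name fixed is only illustrative):

```r
txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

# The lazy (.*?) captures as little as possible, so the optional whitespace
# and the optional "!r ..." tail stay outside the capture group.
fixed <- sub("^[^~]*~\\s*(.*?)\\s*(!r\\b.*)?$", "\\1", txt, perl = TRUE)
print(fixed)
```

Because the tail group is optional and the pattern is anchored with $, strings without a random part are handled by the same expression.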

3 Answers

I suggest removing everything from the start of the string up to and including the first tilde (with optional trailing whitespace), and everything from the first !r (matched as a whole word, with optional leading whitespace) to the end:

gsub("^[^~]+~\\s*|\\s*!r\\b.*", "", txt)

See the regex demo

Details

  • ^ - start of string
  • [^~]+ - 1+ chars other than ~
  • ~ - a ~ char
  • \\s* - 0+ whitespaces
  • | - or
  • \\s* - 0+ whitespaces
  • !r - !r substring
  • \\b - word boundary
  • .* - the rest of the string.
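
To see what each alternative actually consumes, the matched pieces can be listed instead of removed. This is only an illustrative sketch using base R's gregexpr() and regmatches(); the names pattern and removed are not part of the answer:

```r
txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

pattern <- "^[^~]+~\\s*|\\s*!r\\b.*"

# gregexpr() finds every match of either alternative; regmatches() returns
# the matched substrings, i.e. the pieces that gsub() would delete.
removed <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))

print(removed[[1]])  # pieces matched in "a ~ b c d*e !r x"
print(removed[[2]])  # pieces matched in "a ~ b c" (no random part)
```

The first alternative can only match once per string because it is anchored with ^, so any further match must come from the !r branch.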

R demo:

txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")
gsub("^[^~]+~\\s*|\\s*!r\\b.*", "", txt)
## => [1] "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"

1 Comment

I ended up using this for my final solution. Hence, this was chosen as answer.

What about str_extract() using a positive lookbehind?

str_extract(st, "(?<=~)[^!]+") %>% trimws()
[1] "b c d*e" "b c"     "b c d1"  "b c D"   "b c d*e"

My try to rephrase in English:

We are looking for something that is preceded by a ~ (?<=~) and is a sequence of one or more characters that are not ! [^!]+. Once we have found something that fits these criteria, we stop searching that string (otherwise use str_extract_all()). Finally, if what we extracted has any spaces at the start or end of the string, we remove them with trimws().
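
The same idea can be sketched in base R with regexpr(..., perl = TRUE). Note that [^!]+ stops at any !, not only at !r. The second pattern below is a variant of my own (not from the answer) that uses a lookahead for !r instead, and the last input string is a made-up case added purely to show the difference:

```r
st <- c("a ~ b c d*e !r x",
        "a ~ b c",
        "a ~ b c d1 !r y",
        "a ~ b c D !r z",
        "a~b c d*e!r z",
        "a ~ b !x c !r z")  # hypothetical input with a "!" in the fixed part

# Lookbehind only: everything after "~" up to the first "!".
res_lookbehind <- trimws(regmatches(st, regexpr("(?<=~)[^!]+", st, perl = TRUE)))

# Lookbehind plus lookahead (assumed variant): stop at "!r" or at end of string.
res_lookahead <- trimws(regmatches(st, regexpr("(?<=~).*?(?=\\s*!r\\b|$)", st, perl = TRUE)))

print(res_lookbehind)
print(res_lookahead)
```

Both versions agree on the five original strings; they differ only on the hypothetical sixth one, where the lookbehind-only pattern is cut short at the first !.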

Data:

st <- c(
  'a ~ b c d*e !r x',
  'a ~ b c',
  'a ~ b c d1 !r y',
  'a ~ b c D !r z',
  'a~b c d*e!r z'
)

EDIT

A few updates already, as the example inputs keep growing. Will not update again.

6 Comments

Interesting, will play around with this. After posting I came up with my own regex monstrosity that seems to work (also works on more cases); str_replace(st, "(^.*~\\s*(.*)\\s*!r.*$|^.*~\\s*(.*)$)", "\\2\\3"). It's a shame that with the first string there is an extra space at the end. Nothing str_trim can't handle, but still...
Do you mind if I throw some more cases at your solution that might "break" it?
Sure, throw them in - but it's good that you asked.
My question is an oversimplification of my actual problem. For example, I used a and b, but in actual fact these can also be b1 or X. So additional cases like st <- c(st, "a ~ b c d1 !r y", "a ~ b c D ! r z") won't give the desired result. EDIT: str_extract(txt, "(?<=~\\s)[a-zA-Z0-9*\\s]+(?=\\b)") improves it already...
@tstev Made another one and a final update. The current one only has a lookbehind which is (?<=~) and means has to be preceded by ~.

This pattern will let you extract with first capturing group the text you want: ~ ?([\w\*\-\+\/ ]+)(!r)?.

First capturing group: [\w\*\-\+\/ ]+ matches any word character \w or *, -, +, / and space one or more times (+). It will be terminated before the second capturing group (!r)?, if present.
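
As a rough sketch of how this pattern could be used in R to pull out the first capturing group, assuming base R's regexec() with perl = TRUE (available since R 3.3.0) and doubled backslashes for the string literal; the names m and fixed_part are only illustrative:

```r
txt <- c("a ~ b c d*e !r x",
         "a ~ b c",
         "a ~ b c d1 !r y",
         "a ~ b c D !r z",
         "a~b c d*e!r z")

# regexec() returns the full match plus each capture group per string.
m <- regmatches(txt, regexec("~ ?([\\w\\*\\-\\+\\/ ]+)(!r)?", txt, perl = TRUE))

# Element 2 of each result is the first capture group. The character class
# includes a space, so a trailing space can be captured and is trimmed here.
fixed_part <- trimws(vapply(m, function(x) x[2], character(1)))
print(fixed_part)
```

Extracting the group this way avoids having to match the text before ~ and after !r, which a replacement-based approach needs.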

Demo

1 Comment

Thanks for the explanation! However I can't seem to get this to work in R with the stringr package. i.e. it didn't remove the characters before the ~ or after the !r so I edited to: str_replace(txt, ".*~ ?([\\w\\*\\-\\+\\/ ]*)(!r.*)?", "\\1") and this seems to work for my cases. Perhaps you meant to use in a different way?
