Suppose I have the following strings:

string <- c(
  "DATE_OF_BIRTH_B1",
  "HEIGHT_BABY2",
  "WEIGHT_BABY_3",
  "OTHER_CONDITION_4",
  "OTHER_OPERATION_5"
)

How can I use regex in gsub() to extract:

  • Everything up to the numeric suffix, excluding any trailing underscore, from the first three strings;
  • Nothing from the last two strings.

In other words, my expected gsub() output is:

"DATE_OF_BIRTH_B", "HEIGHT_BABY", "WEIGHT_BABY"

I managed to use gsub("(.+_B[A-Z]*)_?[0-9]", "\\1", string) to extract the desired substrings from the first three strings, but it failed to exclude the last two strings.

Could anyone help to correct and improve my regex, with a bit of explanation? Many thanks!

  • 2
    Do the strings you want to exclude have something in common? Otherwise I don't see how you could exclude them in general. In your example you could, e.g., just filter for OTHER, but something similar would need to be present in the real data. Commented Oct 21, 2020 at 13:59
  • 1
    It is done using alternation with .+: sub("(.+_B[A-Z]*)_?[0-9]|.+", "\\1", string) Commented Oct 21, 2020 at 17:40

2 Answers


Remove either the whole string, if it starts with OTHER, or just the numeric suffix (with its optional underscore).

gsub("^OTHER.*|_?[0-9]+$", "", string)
#> [1] "DATE_OF_BIRTH_B"
#> [2] "HEIGHT_BABY"    
#> [3] "WEIGHT_BABY"    
#> [4] ""               
#> [5] ""  

Or, if you specifically want capture groups, use a non-greedy capture.

gsub("(OTHER.*)?(.*?)_?[0-9]", "\\2", string)
#> [1] "DATE_OF_BIRTH_B"
#> [2] "HEIGHT_BABY"    
#> [3] "WEIGHT_BABY"    
#> [4] ""               
#> [5] "" 
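
The second pattern relies on the non-greedy `.*?`. Here is a minimal sketch of the difference on one of the sample strings. The names `greedy` and `lazy` are illustrative only, and `perl = TRUE` is an assumption made here so that ordinary backtracking semantics apply:

```r
x <- "WEIGHT_BABY_3"

# Greedy: (.*) grabs as much as it can, including the underscore,
# so the optional _? is left to match nothing.
greedy <- sub("(.*)_?[0-9]", "\\1", x, perl = TRUE)

# Non-greedy: (.*?) stops as early as possible, so _? consumes the underscore.
lazy <- sub("(.*?)_?[0-9]", "\\1", x, perl = TRUE)

greedy
#> [1] "WEIGHT_BABY_"
lazy
#> [1] "WEIGHT_BABY"
```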

1 Comment

Thanks very much for your solutions! Filtering for "OTHER" is a good tip, but @Wiktor Stribiżew's solution may be more universal?

If you expect gsub (or rather sub, which you should use here because you only expect a single replacement per string) to return either the extracted text or an empty string, you need to follow this technique:

sub("...(<what_you_want_to_extract>)...|.+", "\\1", x)

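That is, your regex goes before the | alternation operator, and the alternative after it is .+, which matches one or more of any characters, as many as possible. When your regex does not match, .+ consumes the whole string instead; group 1 did not take part in that match, so \\1 is empty and the string is replaced with "".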

So, in your case, assuming your regex is just what you need and meets all your requirements, you can use

> res <- sub("(.+_B[A-Z]*)_?[0-9]|.+", "\\1", string)
> res
[1] "DATE_OF_BIRTH_B" "HEIGHT_BABY"     "WEIGHT_BABY"     ""                ""      
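
To see what the `|.+` fallback contributes, it may help to run the same pattern with and without it. This is only a sketch on the question's data; the names `no_fallback` and `with_fallback` are made up for illustration:

```r
string <- c("DATE_OF_BIRTH_B1", "HEIGHT_BABY2", "WEIGHT_BABY_3",
            "OTHER_CONDITION_4", "OTHER_OPERATION_5")

# Without the fallback, sub() returns non-matching strings unchanged.
no_fallback <- sub("(.+_B[A-Z]*)_?[0-9]", "\\1", string)
no_fallback[4:5]
#> [1] "OTHER_CONDITION_4" "OTHER_OPERATION_5"

# With the fallback, .+ matches those strings and the empty group 1 replaces them.
with_fallback <- sub("(.+_B[A-Z]*)_?[0-9]|.+", "\\1", string)
with_fallback[4:5]
#> [1] "" ""
```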

If you need to remove empty items, just use

> res[nzchar(res)]
[1] "DATE_OF_BIRTH_B" "HEIGHT_BABY"     "WEIGHT_BABY"
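
If the empty strings are never needed, a possible alternative (not part of the answer above, just a sketch) is to filter with grepl() first and run sub() only on the strings that match. The names `pattern` and `matched` are illustrative:

```r
string <- c("DATE_OF_BIRTH_B1", "HEIGHT_BABY2", "WEIGHT_BABY_3",
            "OTHER_CONDITION_4", "OTHER_OPERATION_5")

pattern <- "(.+_B[A-Z]*)_?[0-9]"

# Keep only the strings the pattern matches, then extract group 1 from them.
matched <- grepl(pattern, string)
res <- sub(pattern, "\\1", string[matched])
res
#> [1] "DATE_OF_BIRTH_B" "HEIGHT_BABY"     "WEIGHT_BABY"
```

This avoids the `|.+` fallback at the cost of applying the pattern twice.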

1 Comment

Excellent! Thanks very much for the tip about not capturing anything in the alternative strings!
