
I have the following two strings:

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

With this regex I have no problem capturing the parts of x:

> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)")
     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]     
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"

What I want is to obtain this from y:

     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"

Applying my current regex to y, I get this instead:

     [,1]                                 [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.combined" "chr1" "625000" "635000" "BB_162" "combined"

How can I generalize my regex so that it can deal with both x and y?

Update

S.Kalbar, your regex gave this:

> stringr::str_match(y,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]       [,7]     
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "combined" "HMSC-ad"
> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]      [,7]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose" NA 

What I'd like to get is this for y:

     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"

And this for x:

     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"
  • For general regex problems it can help to play with your examples on regex 101. Commented Jan 25, 2018 at 2:30

2 Answers


Regex: (\w+):(\d+)-(\d+)\.(\w+)(?:\.\w+)?(?:\.([A-Za-z-]+))

RegEx demo
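
For reference, here is a minimal sketch of how the pattern above might be applied in R with stringr::str_match(). The backslashes are doubled because R string literals need them escaped; the variable names pattern and m are only illustrative. The idea is that the optional group (?:\.\w+)? skips a middle token such as ".combined" when it is present, and backtracks to match nothing when it is not, so the last capture should hold the tissue name in both cases.

```r
library(stringr)

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

# The answer's pattern, escaped for an R string literal.
# (?:\\.\\w+)? optionally consumes one extra dot-separated token
# (e.g. ".combined") before the final captured token.
pattern <- "(\\w+):(\\d+)-(\\d+)\\.(\\w+)(?:\\.\\w+)?(?:\\.([A-Za-z-]+))"

# One row per input string: column 1 is the full match,
# columns 2 to 6 are the five capture groups.
m <- str_match(c(x, y), pattern)
print(m)
```

Note that the final group only allows letters and hyphens, so this sketch assumes the last token never contains digits or underscores.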


2 Comments

@S.Kalbar It seems that the answer is incorrect for x: it gets "Adipos" without the final "e". Also, please give an example in R code.
@S.Kalbar As pointed out in my original post, I am looking for one regex that can handle both x and y.

You could give the regex engine some tokens to split on:

(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+

Broken down, this says:

(?:(?<=\\d)-(?=\\d))  # a dash between numbers
|                     # or
(?:\\.combined\\.)    # .combined. literally
|                     # or
[.:]+                 # one or more of . or :


In R using str_split():

library(stringr)

x <- c("chr1:625000-635000.BB_162.Adipose", "chr1:625000-635000.BB_162.combined.HMSC-ad")
str_split(x, '(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+', simplify = TRUE)

Which yields

     [,1]   [,2]     [,3]     [,4]     [,5]     
[1,] "chr1" "625000" "635000" "BB_162" "Adipose"
[2,] "chr1" "625000" "635000" "BB_162" "HMSC-ad"
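
If named fields are more convenient than a bare matrix, the split result could be labelled and converted along these lines. This is only a sketch: the column names (chrom, start, end, sample, tissue) and the variable names parts and regions are hypothetical labels, not something from the question, and it assumes every input splits into exactly five pieces.

```r
library(stringr)

x <- c("chr1:625000-635000.BB_162.Adipose",
       "chr1:625000-635000.BB_162.combined.HMSC-ad")

# Same split pattern as above
parts <- str_split(x, "(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+",
                   simplify = TRUE)

# Hypothetical column labels; assumes exactly five pieces per string
colnames(parts) <- c("chrom", "start", "end", "sample", "tissue")

# Convert to a data frame and make the coordinates numeric
regions <- as.data.frame(parts, stringsAsFactors = FALSE)
regions$start <- as.integer(regions$start)
regions$end <- as.integer(regions$end)

print(regions)
```

Because ".combined." is matched literally, any other middle token would end up as an extra column and break the five-column assumption.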
