I have strings like this: "X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2" and I would like to match only the numbers 1, 2 and 3 between the underscores, without the underscores themselves. The best solution I could come up with is str_match(sample_names, "_+[1-3]?"). I would really appreciate the help.
4 Answers
The simplest method is to use sub with a backreference:
Data:
d <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
Solution:
sub(".*_(\\d)_.*", "\\1", d)
Here, (\\d) defines the capturing group for a single digit (if the number in question can have more than one digit, use \\d+), which is 'recalled' by the backreference \\1 in sub's replacement argument.
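As a small sketch of the \\d+ variant: the IDs below are made-up examples (not from the question), and the NA guard is my own addition. It is worth knowing that sub returns the input unchanged when the pattern does not match, so a separate check may be useful.

```r
# Hypothetical IDs: one multi-digit number, one single digit, one without a match
ids <- c("X96HE6.10nMBI_12_2", "X96HE6.10nMBI_3_2", "no-underscores")

# \\d+ captures one or more digits between two underscores
res <- sub(".*_(\\d+)_.*", "\\1", ids)
print(res)
#> [1] "12"             "3"              "no-underscores"

# sub() leaves non-matching inputs untouched, so mark them as NA explicitly
res_checked <- ifelse(grepl("_\\d+_", ids), res, NA_character_)
print(res_checked)
#> [1] "12" "3"  NA
```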
Alternatively use str_extract and positive lookaround:
library(stringr)
str_extract(d, "(?<=_)\\d(?=_)")
(?<=_) is positive lookbehind which can be glossed as "If you see _ on the left..."
\\d is the number to be matched
(?=_) is positive lookahead, which can be glossed as "If you see _ on the right..."
Result:
[1] "1" "2" "3"
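To illustrate why the lookarounds matter, here is a rough comparison in base R (using regmatches and regexpr with perl = TRUE rather than stringr, purely so it runs without extra packages). A plain pattern includes the underscores in the match, whereas the lookarounds only assert that they are there.

```r
d <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")

# Plain pattern: the underscores are part of the match
with_underscores <- regmatches(d, regexpr("_\\d_", d, perl = TRUE))
print(with_underscores)
#> [1] "_1_" "_2_" "_3_"

# Lookarounds: the underscores are required but not consumed
digits_only <- regmatches(d, regexpr("(?<=_)\\d(?=_)", d, perl = TRUE))
print(digits_only)
#> [1] "1" "2" "3"
```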
You can use lookarounds. I personally rely heavily on the stringr cheatsheet for this kind of regex, since the syntax is a bit hard to remember. On the RStudio cheatsheets page, look for stringr -> LOOK AROUNDS.
library(tidyverse)
codes <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
codes %>%
str_extract("(?<=_)[:digit:]+(?=_)")
#> [1] "1" "2" "3"
Created on 2020-06-14 by the reprex package (v0.3.0)
No need for any third-party package:
strings <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
pattern <- "(?<=_)(\\d+)(?=_)"
unlist(regmatches(strings, gregexpr(pattern, strings, perl = TRUE)))
Which yields:
[1] "1" "2" "3"
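One caveat, sketched with made-up inputs: gregexpr returns every match per string, and unlist drops strings that have no match, so the result may not line up with the input. Assuming you want exactly one value per input, something like this keeps the lengths aligned (the vapply helper is my own suggestion, not part of the answer above).

```r
# Hypothetical inputs: one match, no match, and two matches
strings <- c("X96HE6.10nMBI_1_2", "no_match_here", "a_10_20_3")
pattern <- "(?<=_)\\d+(?=_)"

# One list element per input string, each holding all of its matches
m <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))
print(lengths(m))
#> [1] 1 0 2

# Keep only the first match per string, NA where nothing matched
first <- vapply(m, function(x) if (length(x)) x[1] else NA_character_, character(1))
print(first)
#> [1] "1"  NA   "10"
```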
2 Comments
(?!$)(?=_) is equivalent to (?=_), because _ is never the end of the string.
str_match(sample_names, "(?<=_)\\d+(?=_)")
Do you want to match only '1', '2' or '3', and only when they are surrounded by underscores, or match any single digit surrounded by underscores, or match any string of digits surrounded by underscores? Please edit to clarify.