Regex for (x,y) and (x,y] in R [duplicate]

Question

I'm using R, and I have a data frame with a column of bins that follow one of these two formats:

(x.xx,y.yy] or (x.xx,y.yy)

The values are all positive numbers with several decimal places.

I want to split them into two columns:

lower upper
x.xx  y.yy

I first filter all NAs out of the bin column (there are a few across multiple data frames):

filter(!is.na(bin))

I'm currently using this regex:

mutate(
lower = as.numeric(sub("^[\\(\\[]([0-9.-]+),", "\\1", bin)),
upper = as.numeric(sub(",([0-9.-]+)[\\)\\]]$", "\\1", bin))
)

but it produces all NAs.

I haven't tried many alternatives. Any help would be appreciated, and thank you in advance.

Here is a test data example:

> test_bins <- c("[0.15,0.273]", "(0.273,0.397]", "(0.397,0.52]", "[0.52,0.643]")

> lower_values <- sapply(test_bins, function(x) as.numeric(sub("^[\\[\\(]([0-9.]+),", "\\1", x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion

> upper_values <- sapply(test_bins, function(x) as.numeric(sub(",([0-9.]+)[\\)\\]]$", "\\1", x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion

> data.frame(test_bins, lower_values, upper_values)
                  test_bins lower_values upper_values
[0.15,0.273]   [0.15,0.273]           NA           NA
(0.273,0.397] (0.273,0.397]           NA           NA
(0.397,0.52]   (0.397,0.52]           NA           NA
[0.52,0.643]   [0.52,0.643]           NA           NA

The first pattern must be ^[([]([0-9.-]+),.* and the second must be .*,([0-9.-]+)[])]$ (see the R demo online).
– Wiktor Stribiżew, Commented Dec 13, 2024 at 17:58
Another option: read.csv(text = c('lower,upper', gsub('[][()]', '', test_bins)))
– rawr, Commented Dec 15, 2024 at 5:37
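
A minimal sketch of the two suggestions in the comments, run against the test_bins vector from the question. sub() replaces only the part of the string that the pattern matches, so a pattern that stops at the comma leaves the other number in the result and as.numeric() returns NA; the added .* consumes the remainder. Writing the bracket characters unescaped inside the set, as in [([] and [])], follows the first comment. The names lower, upper and bins_df are illustrative only.

```r
test_bins <- c("[0.15,0.273]", "(0.273,0.397]", "(0.397,0.52]", "[0.52,0.643]")

# Patterns from the first comment: match the whole string so that only
# the captured number remains after the substitution.
lower <- as.numeric(sub("^[([]([0-9.-]+),.*", "\\1", test_bins))
upper <- as.numeric(sub(".*,([0-9.-]+)[])]$", "\\1", test_bins))

# Approach from the second comment: strip every bracket and parenthesis,
# then let read.csv parse the remaining "lower,upper" pairs.
bins_df <- read.csv(text = c("lower,upper", gsub("[][()]", "", test_bins)))
```

In a dplyr pipeline the same two sub() calls could go inside mutate() with bin in place of test_bins.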

G. Grothendieck · Accepted Answer · 2024-12-13 21:05:41Z

5

1) Read the input into a data frame and apply parse_number to each column. No regular expressions are used.

library(dplyr)
library(purrr)
library(readr)

test_bins %>%
  bind_cols(input = ., {
    read.table(text = ., sep = ",", col.names = c("lower", "upper")) %>%
    map_dfr(parse_number)
  })

giving

# A tibble: 4 × 3
  input         lower upper
  <chr>         <dbl> <dbl>
1 [0.15,0.273]  0.15  0.273
2 (0.273,0.397] 0.273 0.397
3 (0.397,0.52]  0.397 0.52 
4 [0.52,0.643]  0.52  0.643

2) Or, using only base R, where the only regular expression used is \D:

test_bins |>
  list(x = _) |>
  with(cbind(input = x, x |>
    read.table(text = _, sep = ",", col.names = c("lower", "upper"))  |>
    lapply(\(x) as.numeric(trimws(x, whitespace = "\\D"))) |>
    data.frame()
))

giving

          input lower upper
1  [0.15,0.273] 0.150 0.273
2 (0.273,0.397] 0.273 0.397
3  (0.397,0.52] 0.397 0.520
4  [0.52,0.643] 0.520 0.643
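
The base R version relies on trimws() with whitespace = "\\D", which trims non-digit characters from both ends of each string instead of whitespace. Here is a small sketch of that one step in isolation, using a hypothetical vector pieces that mimics the fragments read.table produces after splitting on the comma. Note that \D would also strip a leading minus sign, so this is only safe here because the bins are stated to be positive.

```r
# Hypothetical fragments, as obtained by splitting the bins on the comma.
pieces <- c("[0.15", "0.273]", "(0.397")

# Treat every non-digit character as "whitespace" to trim from both ends;
# the decimal point inside each number is untouched because it is not at an end.
trimmed <- trimws(pieces, whitespace = "\\D")
values <- as.numeric(trimmed)
```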

edited Dec 13, 2024 at 21:05

answered Dec 13, 2024 at 20:46
