
I'm using R, and I have a data frame with a column of bins that follow one of these two formats:

(x.xx,y.yy] or (x.xx,y.yy)

the values are all positive numbers with several decimal places

i want to split them into

lower upper
x.xx  y.yy

I first filter all NAs out of the bin column (there are a few across multiple data frames):

filter(!is.na(bin))

i'm currently using this regex:

mutate(
lower = as.numeric(sub("^[\\(\\[]([0-9.-]+),", "\\1", bin)),
upper = as.numeric(sub(",([0-9.-]+)[\\)\\]]$", "\\1", bin))
)

but it produces all NAs

I haven't tried many alternatives. Any help would be appreciated, and thank you in advance.

here is a test data example:

> test_bins <- c("[0.15,0.273]", "(0.273,0.397]", "(0.397,0.52]", "[0.52,0.643]")

> lower_values <- sapply(test_bins, function(x) as.numeric(sub("^[\\[\\(]([0-9.]+),", "\\1", x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion

> upper_values <- sapply(test_bins, function(x) as.numeric(sub(",([0-9.]+)[\\)\\]]$", "\\1", x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion

> data.frame(test_bins, lower_values, upper_values)
                  test_bins lower_values upper_values
[0.15,0.273]   [0.15,0.273]           NA           NA
(0.273,0.397] (0.273,0.397]           NA           NA
(0.397,0.52]   (0.397,0.52]           NA           NA
[0.52,0.643]   [0.52,0.643]           NA           NA
  • The first one must be ^[([]([0-9.-]+),.*. The second one must be .*,([0-9.-]+)[])]$. See the R demo online. Commented Dec 13, 2024 at 17:58
  • another one read.csv(text = c('lower,upper', gsub('[][()]', '', test_bins))) Commented Dec 15, 2024 at 5:37
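
To spell out the first comment's suggestion: sub() replaces only the part of the string that the pattern matches, so anything left unmatched stays in the result and as.numeric() then returns NA. Extending each pattern with .* so that it consumes the whole string leaves just the captured number. A minimal sketch using the test_bins vector from the question (the variable names partial, lower and upper are only illustrative):

```r
test_bins <- c("[0.15,0.273]", "(0.273,0.397]", "(0.397,0.52]", "[0.52,0.643]")

# Original lower-bound pattern: only "[0.15," is matched and replaced,
# so the unmatched tail survives and the result is "0.150.273]"
partial <- sub("^[\\(\\[]([0-9.-]+),", "\\1", test_bins[1])

# Patterns from the comment: .* swallows the rest of the string,
# so only the captured group remains after the replacement
lower <- as.numeric(sub("^[([]([0-9.-]+),.*", "\\1", test_bins))
upper <- as.numeric(sub(".*,([0-9.-]+)[])]$", "\\1", test_bins))

data.frame(bin = test_bins, lower, upper)
```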

1 Answer


1) Read the input into a data frame and apply parse_number to each column. No regular expressions are used.

library(dplyr)
library(purrr)
library(readr)

test_bins %>%
  bind_cols(input = ., {
    read.table(text = ., sep = ",", col.names = c("lower", "upper")) %>%
    map_dfr(parse_number)
  })

giving

# A tibble: 4 × 3
  input         lower upper
  <chr>         <dbl> <dbl>
1 [0.15,0.273]  0.15  0.273
2 (0.273,0.397] 0.273 0.397
3 (0.397,0.52]  0.397 0.52 
4 [0.52,0.643]  0.52  0.643

2) Or use only base R, where the only regular expression used is \D (the _ placeholder with a named argument needs R 4.2 or later)

test_bins |>
  list(x = _) |>
  with(cbind(input = x, x |>
    read.table(text = _, sep = ",", col.names = c("lower", "upper"))  |>
    lapply(\(x) as.numeric(trimws(x, whitespace = "\\D"))) |>
    data.frame()
))

giving

          input lower upper
1  [0.15,0.273] 0.150 0.273
2 (0.273,0.397] 0.273 0.397
3  (0.397,0.52] 0.397 0.520
4  [0.52,0.643] 0.520 0.643
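
For the data-frame case described in the question, another base R possibility is utils::strcapture(), which pulls both bounds out with a single pattern and converts them to numeric. This is only a sketch: the data frame df and its contents are made up for illustration, and the pattern assumes every non-missing bin looks like the examples above.

```r
# Hypothetical data frame with a bin column that contains one NA
df <- data.frame(bin = c("[0.15,0.273]", "(0.273,0.397]", NA, "(0.397,0.52)"))

# Drop the missing bins first, as in the question
df <- df[!is.na(df$bin), , drop = FALSE]

# One pattern with two capture groups; proto sets the names and types
# of the resulting columns
bounds <- strcapture(
  "^[([]([0-9.-]+),([0-9.-]+)[])]$",
  df$bin,
  proto = data.frame(lower = numeric(), upper = numeric())
)

df <- cbind(df, bounds)
df
```

Bins that do not match the pattern would come back as NA in both columns, which makes malformed entries easy to spot.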