Parsing a string into a nested data.table

Question

I have data in a table in which one cell in every row is a multiline string, formatted a bit like a document with references at the end of it. For example, one of those strings looks like:

item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz

My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:

foo | bar | baz
----+-----+----
1   | 2   | 1

I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:

code | reason
-----+-------
1    | foo
2    | bar
3    | baz

Here's how I tried to do it, using data.table and stringr.

encounter_alerts[, whys := lapply(
  str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
  FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
)]

I am very confused by the error message I get when I try to do this:

Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code",  :
  file not found: 1=foo

I am explicitly using text rather than file so I'm not sure how it's trying to interpret the line of text as a filename!

When I test this with a single row, it seems to work fine:

> fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
   code reason
1:    1    foo
2:    2    bar

What am I doing wrong? Is there a better way to do this?

Thanks!

prosoitos · Accepted Answer · 2018-11-13 17:00:09Z

2

Note: Edited after reading comments

From your comment, I tried to reproduce what I understand your data might look like.

library(tidyverse)

df <- tibble(
  strings = c("item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz",
"item A...2
item B...2
item C...3
item D...1
1=toto
2=foo
3=lala",
"item A...3
item B...3
item C...3
item D...1
1=tutu
3=ttt")
)

Code:

get_ref <- function(string) {
  string %>%
    str_split("\n") %>%
    unlist() %>% 
    str_subset("=") %>%
    str_split_fixed("=", 2) %>%
    as_tibble() %>%
    rename(code = V1, reason = V2)
}

list1 <- map(df$strings, get_ref)

get_value <- function(string) {
  string %>%
      str_split("\n") %>%
      unlist() %>% 
      str_subset("\\.\\.\\.") %>%
      str_replace_all(".*\\.\\.\\.", "") %>%
      as_tibble() %>%
    rename(code = value)
}

list2 <- map(df$strings, get_value)

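get_result <- function(df1, df2) {
  # df1 is the code/reason mapping, df2 holds one row per item code.
  # Note: because df1 is on the left, a code with no matching item
  # still yields one row and is therefore counted once.
  left_join(df1, df2, by = "code") %>%
    count(reason) %>%
    spread(reason, n)
}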

result <- map2_df(list1, list2, get_result)

result[is.na(result)] <- 0

result

Result

# A tibble: 3 x 7
    bar   baz   foo  lala  toto   ttt  tutu
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     2     1     1     0     0     0     0
2     0     0     2     1     1     0     0
3     0     0     0     0     0     3     1
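
For anyone who would rather stay in data.table, as in the question, the same steps (split into lines, separate references from items, join, count) can be sketched roughly as follows. This is only a sketch under assumptions: the `id` column and the names `long`, `refs`, `items`, `counted` and `wide` are made up for illustration, each reference line is assumed to contain exactly one `=`, and the toy data is a shortened copy of the strings above.

```r
library(data.table)

# Toy data: two of the strings from above, plus a hypothetical id column.
dt <- data.table(
  id = 1:2,
  strings = c(
    "item A...1\nitem B...2\nitem C...3\nitem D...2\n1=foo\n2=bar\n3=baz",
    "item A...3\nitem B...3\nitem C...3\nitem D...1\n1=tutu\n3=ttt"
  )
)

# One row per line of text, keeping the id of the row it came from.
long <- dt[, .(line = unlist(strsplit(strings, "\n", fixed = TRUE))), by = id]

# Reference lines look like "1=foo" (assumed: exactly one "=" per line).
refs <- long[grepl("=", line, fixed = TRUE)]
refs[, c("code", "reason") := tstrsplit(line, "=", fixed = TRUE)]

# Item lines end in "...<code>"; keep only the code after the dots.
items <- long[grepl("...", line, fixed = TRUE),
              .(id, code = sub(".*\\.\\.\\.", "", line))]

# Look up the reason for every item, then count reasons per original row.
counted <- refs[items, on = .(id, code)]
wide <- dcast(counted, id ~ reason, fun.aggregate = length, value.var = "code")

wide
```

Here the items are on the "keep everything" side of the join, so an item whose code has no reference line would show up with an `NA` reason rather than being dropped; whether that is what you want depends on your data.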

edited Nov 13, 2018 at 17:00

answered Nov 13, 2018 at 7:16

prosoitos


8 Comments

Nicholas Riley Over a year ago

I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…

prosoitos Over a year ago

So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like

prosoitos Over a year ago

Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?

prosoitos Over a year ago

If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to pmap_df() to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.

prosoitos Over a year ago

I edited my answer to match what I now understand your data might look like


Hunaidkhan · Answer · 2018-11-13 04:05:43Z

0

Using stringr and dplyr, you can do it easily:

library(stringr)
library(dplyr)
v <- as.data.frame(c(  "item A...1",
         "item B...2",
        "item C...3",
         "item D...2"))
colnames(v)<- "items"

matching <- c( "1",
               "2",
               "3")
Mapping <- read.table(text="code     reason
1    foo
                      2  bar
                      3  baz
                      ", header = T)

## Answer 
df1<- v %>%
  mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
str(df1)
str(Mapping)
df1$code <- as.numeric(df1$code )

df1 <- left_join(df1,Mapping)

please have a look

answered Nov 13, 2018 at 4:05

Hunaidkhan


Eric Burden · Answer · 2018-11-13 05:33:28Z

0

There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).

sample_str <- 'item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz'

lines <- stringr::str_split(sample_str, '\n', simplify = T)

extracted_strs <- lines[stringr::str_detect(lines, '^\\d=\\w+$')]

dfs_list <- lapply(extracted_strs, function(x) {
  str_parts <- stringr::str_split(x, '=', simplify = T)
  df_args = list()
  df_args[[str_parts[2]]] = as.integer(str_parts[1])
  df_args[['stringsAsFactors']] = F

  do.call(data.frame, df_args)
})


# Bind the one-column data frames side by side into a single row
df <- do.call(cbind, dfs_list)

answered Nov 13, 2018 at 5:33

Eric Burden


1 Comment

Nicholas Riley Over a year ago

Thanks. This creates a data.frame with a variable for foo, bar and baz. Would this be easier to use later on than what I created as my example?

Nicholas Riley · Answer · 2018-11-15 02:10:11Z

Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, heavily based on the accepted answer. It's a mix of different packages and so forth, which I hope to clean up eventually, but deadlines happen...

get_code_reason_mapping <- function(alert_text) {
  alert_text %>%
    str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
    unlist() %>%
    str_split_fixed("=", 2) %>%
    as.data.table() %>%
    setnames(c('code', 'reason'))
}

encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)

get_why_codes <- function(alert_text) {
  alert_text %>%
    str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
    unlist() %>%
    str_sub(-1) %>%
    as.data.table() %>%
    setnames(c('code'))
}

encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)

get_code_counts <- function(df1, df2) {
  left_join(df1, df2) %>%
    count(reason) %>%
    spread(reason, n)
}

code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)

code_counts[is.na(code_counts)] <- 0

code_counts
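
On the original fread() error, which none of the answers address: a likely explanation (an assumption about how fread() handled `text=` in data.table releases from around that time, not something confirmed in this thread, and possibly different in newer releases) is that a length-one `text` value with no newline in it is passed along like `input`, and so gets tried as a file name. That would fit the symptoms: row 989 has two reference lines and worked, while a row whose only reference line is `1=foo` produced "file not found: 1=foo". A minimal sketch of a workaround, where `parse_refs` is a made-up helper name:

```r
library(data.table)

# Hypothetical helper (not from the thread): parse "code=reason" lines.
parse_refs <- function(ref_lines) {
  # Collapse the lines into one string and end it with a newline, so the
  # string always contains "\n" and is read as data even for a single line.
  fread(
    text = paste0(paste(ref_lines, collapse = "\n"), "\n"),
    sep = "=", header = FALSE, col.names = c("code", "reason")
  )
}

parse_refs("1=foo")                       # the single-line case
parse_refs(c("1=foo", "2=bar", "3=baz"))  # the usual multi-line case
```

Splitting with str_split_fixed(), as in the code above, sidesteps the issue entirely because no file-or-text guessing is involved.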
