3

I have data in a table in which one cell in every row is a multiline string, which is formatted a a bit like a document with references at the end of it. For example, one of those strings looks like:

item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz

My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:

foo | bar | baz
----+-----+----
1   | 2   | 1

I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:

code | reason
-----+-------
1    | foo
2    | bar
3    | baz

Here's how I tried to do it, using data.table and stringr.

encounter_alerts[, whys := lapply(
  str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
  FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
)]

I am very confused by the error message I get when I try to do this:

Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code",  :
  file not found: 1=foo

I am explicitly using text rather than file so I'm not sure how it's trying to interpret the line of text as a filename!

When I test this with a single row, it seems to work fine:

> fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
   code reason
1:    1    foo
2:    2    bar

What am I doing wrong? Is there a better way to do this?

Thanks!

4 Answers 4

2

Note: Edited after reading comments

From your comment, I tried to reproduce what I understand your data might look like.

library(tidyverse)

df <- tibble(
  strings = c("item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz",
"item A...2
item B...2
item C...3
item D...1
1=toto
2=foo
3=lala",
"item A...3
item B...3
item C...3
item D...1
1=tutu
3=ttt")
)

Code:

get_ref <- function(string) {
  string %>%
    str_split("\n") %>%
    unlist() %>% 
    str_subset("=") %>%
    str_split_fixed("=", 2) %>%
    as_tibble() %>%
    rename(code = V1, reason = V2)
}

list1 <- map(df$strings, get_ref)

get_value <- function(string) {
  string %>%
      str_split("\n") %>%
      unlist() %>% 
      str_subset("\\.\\.\\.") %>%
      str_replace_all(".*\\.\\.\\.", "") %>%
      as_tibble() %>%
    rename(code = value)
}

list2 <- map(df$strings, get_value)

get_result <- function(df1, df2) {
  left_join(df1, df2) %>%
    count(reason) %>%
    spread(reason, n)
}

result <- map2_df(list1, list2, get_result)

result[is.na(result)] <- 0

result

Result

# A tibble: 3 x 7
    bar   baz   foo  lala  toto   ttt  tutu
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     2     1     1     0     0     0     0
2     0     0     2     1     1     0     0
3     0     0     0     0     0     3     1
Sign up to request clarification or add additional context in comments.

8 Comments

I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to pmap_df() to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
I edited my answer to match what I now understand your data might look like
|
0

using stringr and dplyr you can do it easily

library(stringr)
library(dplyr)
v <- as.data.frame(c(  "item A...1",
         "item B...2",
        "item C...3",
         "item D...2"))
colnames(v)<- "items"

matching <- c( "1",
               "2",
               "3")
Mapping <- read.table(text="code     reason
1    foo
                      2  bar
                      3  baz
                      ", header = T)

## Answer 
df1<- v %>%
  mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
str(df1)
str(Mapping)
df1$code <- as.numeric(df1$code )

df1 <- left_join(df1,Mapping)

please have a look

Comments

0

There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).

sample_str <- 'item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz'

lines <- stringr::str_split(sample_str, '\n', simplify = T)

extracted_strs <- lines[stringr::str_detect(lines, '^\\d=\\w+$')]

dfs_list <- lapply(extracted_strs, function(x) {
  str_parts <- stringr::str_split(x, '=', simplify = T)
  df_args = list()
  df_args[[str_parts[2]]] = as.integer(str_parts[1])
  df_args[['stringsAsFactors']] = F

  do.call(data.frame, df_args)
})


df <- do.call(cbind, dfs)

1 Comment

Thanks. This creates a data.frame with a variable for foo, bar and bar. Would this be easier to use later on than what I created as my example?
0

Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...

get_code_reason_mapping <- function(alert_text) {
  alert_text %>%
    str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
    unlist() %>%
    str_split_fixed("=", 2) %>%
    as.data.table() %>%
    setnames(c('code', 'reason'))
}

encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)

get_why_codes <- function(alert_text) {
  alert_text %>%
    str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
    unlist() %>%
    str_sub(-1) %>%
    as.data.table() %>%
    setnames(c('code'))
}

encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)

get_code_counts <- function(df1, df2) {
  left_join(df1, df2) %>%
    count(reason) %>%
    spread(reason, n)
}

code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)

code_counts[is.na(code_counts)] <- 0

code_counts

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.