I feel my situation is a typical use case in experiments where the data are logged as text files for human reading rather than machine consumption. Tags are interspersed with the actual data to describe the data that follows; for analysis, the tag values need to be integrated into the data rows to be useful. Below is a made-up example.
TAG1, t1_1
DATA_A, 5, 3, 4, 8
DATA_A, 3, 4, 5, 7
TAG1, t1_2
TAG2, t2_1
DATA_B, 1, 2, 3, 4, 5
DATA_A, 1, 2, 3, 4
The desired parse result is two data frames. One for DATA_A,
X1, X2, X3, X4, TAG1, TAG2
5, 3, 4, 8, t1_1, NA
3, 4, 5, 7, t1_1, NA
1, 2, 3, 4, t1_2, t2_1
and one for DATA_B
X1, X2, X3, X4, X5, TAG1, TAG2
1, 2, 3, 4, 5, t1_2, t2_1
The current method (implemented in Python) checks the file line by line. If a line starts with "T", the corresponding tag variable is updated; if it starts with "DATA", the current tag values are appended to the end of the "DATA" line, and the now completed line is appended to the corresponding CSV file. At the end, the CSV files are read into data frames for analysis.
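Roughly, a minimal sketch of that line-by-line pass (tag names hardcoded for this example, and rows collected in memory instead of being written to CSV files, for brevity):

```python
import csv
from collections import defaultdict

def split_log(path):
    """One pass over the file: remember the latest value of each TAG,
    append the current tag values to every DATA line, and collect the
    completed rows per DATA id."""
    tags = {"TAG1": "NA", "TAG2": "NA"}   # last seen value per tag
    rows = defaultdict(list)              # DATA id -> list of completed rows
    with open(path) as f:
        for line in f:
            fields = [x.strip() for x in line.split(",")]
            if not fields or not fields[0]:
                continue                   # skip empty lines
            if fields[0].startswith("TAG"):
                tags[fields[0]] = fields[1]
            elif fields[0].startswith("DATA"):
                rows[fields[0]].append(
                    fields[1:] + [tags["TAG1"], tags["TAG2"]]
                )
    return rows
```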
I wonder if this data import can be done faster in one step. What I have in mind is
library(tidyverse)

text_frame <- read_lines(clipboard(), skip_empty_rows = TRUE) %>%
  enframe(name = NULL, value = "line")

text_frame %>%
  separate(line, into = c("ID", "value"), extra = "merge", sep = ", ")
which produces
# A tibble: 7 x 2
ID value
<chr> <chr>
1 TAG1 t1_1
2 DATA_A 5, 3, 4, 8
3 DATA_A 3, 4, 5, 7
4 TAG1 t1_2
5 TAG2 t2_1
6 DATA_B 1, 2, 3, 4, 5
7 DATA_A 1, 2, 3, 4
The next step is to create new columns "TAG1" and "TAG2" with the tag values added to each data row. This is where I got stuck. It is like gather() applied to individual rows. How could I do it? Is the general approach reasonable? Any suggestions?
Fast/memory-efficient solutions are welcome, since I need to process hundreds of ~10MB text files (they all have the same structure).
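One direction I have been considering (an untested sketch; it assumes the tag names TAG1 and TAG2 are known up front): put each tag into its own column, carry the last seen value downward with tidyr::fill(), then split by DATA id and expand the value column.

```r
library(tidyverse)

parsed <- text_frame %>%
  separate(line, into = c("ID", "value"), extra = "merge", sep = ", ") %>%
  # one column per tag, holding the value only on the row where it appears
  mutate(
    TAG1 = if_else(ID == "TAG1", value, NA_character_),
    TAG2 = if_else(ID == "TAG2", value, NA_character_)
  ) %>%
  # carry the last seen tag value down to the DATA rows below it
  fill(TAG1, TAG2) %>%
  filter(str_starts(ID, "DATA"))

# split by DATA id and expand value, whose width varies per id
data_frames <- parsed %>%
  split(.$ID) %>%
  map(function(df) {
    n <- max(str_count(df$value, ", ")) + 1
    df %>%
      separate(value, into = paste0("X", seq_len(n)),
               sep = ", ", convert = TRUE) %>%
      select(-ID)
  })
```

But I am not sure this is the idiomatic way, or how it scales to many files.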