I feel my situation is a typical use case in experiments where the data are logged as text files for human reading rather than machine consumption. Tags are interspersed with the actual data to describe the data that follows; for analysis, the tag values need to be integrated into the data rows to be useful. Below is a made-up example.
TAG1, t1_1
DATA_A, 5, 3, 4, 8
DATA_A, 3, 4, 5, 7
TAG1, t1_2
TAG2, t2_1
DATA_B, 1, 2, 3, 4, 5
DATA_A, 1, 2, 3, 4
The desired parse result is two data frames. One for DATA_A,
X1, X2, X3, X4, TAG1, TAG2
5, 3, 4, 8, t1_1, NA
3, 4, 5, 7, t1_1, NA
1, 2, 3, 4, t1_2, t2_1
and one for DATA_B
X1, X2, X3, X4, X5, TAG1, TAG2
1, 2, 3, 4, 5, t1_2, t2_1
The current method (implemented in Python) checks the file line by line. If a line starts with "T", the corresponding tag variable is updated; if it starts with "DATA", the current tag values are appended to the end of the "DATA" line, and the now completed line is appended to the corresponding CSV file. At the end, the CSV files are read into data frames for analysis.
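Roughly, a minimal sketch of that line-by-line pass (tag names hardcoded for this example, and rows collected in memory instead of being written to CSV files, for brevity):

```python
import csv
from collections import defaultdict

def split_log(path):
    """One pass over the file: remember the latest value of each TAG,
    append the current tag values to every DATA line, and collect the
    completed rows per DATA id."""
    tags = {"TAG1": "NA", "TAG2": "NA"}   # last seen value per tag
    rows = defaultdict(list)              # DATA id -> list of completed rows
    with open(path) as f:
        for line in f:
            fields = [x.strip() for x in line.split(",")]
            if not fields or not fields[0]:
                continue                   # skip empty lines
            if fields[0].startswith("TAG"):
                tags[fields[0]] = fields[1]
            elif fields[0].startswith("DATA"):
                rows[fields[0]].append(
                    fields[1:] + [tags["TAG1"], tags["TAG2"]]
                )
    return rows
```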
I wonder if this data import can be done faster in one step. What I have in mind is
library(tidyverse)

text_frame <- read_lines(clipboard(), skip_empty_rows = TRUE) %>%
  enframe(name = NULL, value = "line")

text_frame %>%
  separate(line, into = c("ID", "value"), extra = "merge", sep = ", ")
which produces
# A tibble: 7 x 2
ID value
<chr> <chr>
1 TAG1 t1_1
2 DATA_A 5, 3, 4, 8
3 DATA_A 3, 4, 5, 7
4 TAG1 t1_2
5 TAG2 t2_1
6 DATA_B 1, 2, 3, 4, 5
7 DATA_A 1, 2, 3, 4
The next step is to create new columns "TAG1" and "TAG2" with the tag values added to each data row. This is where I got stuck. It is like gather() applied to individual rows. How could I do it? Is the general approach reasonable? Any suggestions?
Fast/memory-efficient solutions are welcome, since I need to process hundreds of ~10MB text files (they all have the same structure).
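One direction I have been considering (an untested sketch; it assumes the tag names TAG1 and TAG2 are known up front): put each tag into its own column, carry the last seen value downward with tidyr::fill(), then split by DATA id and expand the value column.

```r
library(tidyverse)

parsed <- text_frame %>%
  separate(line, into = c("ID", "value"), extra = "merge", sep = ", ") %>%
  # one column per tag, holding the value only on the row where it appears
  mutate(
    TAG1 = if_else(ID == "TAG1", value, NA_character_),
    TAG2 = if_else(ID == "TAG2", value, NA_character_)
  ) %>%
  # carry the last seen tag value down to the DATA rows below it
  fill(TAG1, TAG2) %>%
  filter(str_starts(ID, "DATA"))

# split by DATA id and expand value, whose width varies per id
data_frames <- parsed %>%
  split(.$ID) %>%
  map(function(df) {
    n <- max(str_count(df$value, ", ")) + 1
    df %>%
      separate(value, into = paste0("X", seq_len(n)),
               sep = ", ", convert = TRUE) %>%
      select(-ID)
  })
```

But I am not sure this is the idiomatic way, or how it scales to many files.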