How to split/parse long strings into tabular data with R data.table/data.frame?

Question

I have an R data.table with a column of strangely formatted data which I need to parse. For each row, there is a column identity which is in the following format:

identity
cat:211:93|dog:616:58|bird:1270:46|fish:2068:31|horse:614:1|cow:3719:1012

Each entry has the format name:total_number:count_number, and entries are separated by |

An example of the data.table is as follows:

library(data.table)

foo = data.table(name = c('Luna', 'Bob', 'Melissa'), 
    number = c(23, 37, 33), 
    identity = c('cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12', 'bird:1270:35|fish:2068:11|horse:614:44|cow:319:21', 'fish:72:41'))

print(foo)
name        number    identity
'Luna'      23        cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12
'Bob'       37        bird:1270:35|fish:2068:11|horse:614:44|cow:319:21
'Melissa'   33        fish:72:41

My problem is how to parse these lines such that each name becomes a new column, and the numbers are calculated as a fraction, count_number/total_number.

The correct format is as follows:

name        number    cat        dog        bird        fish         horse        cow
'Luna'      23        0.2990354  0.1124031  0.02026432  0.02444795   0.001945525  0.03761755
'Bob'       37        NA         NA         0.02755906  0.005319149  0.07166124   0.06583072
'Melissa'   33        NA         NA         NA          0.5694444    NA           NA

How could I parse these rows, given I know the 'names' of the columns beforehand?

I think there should be some way to use data.table::tstrsplit(), e.g.

tstrsplit(foo$identity, "|", fixed=TRUE)

(I'm happy to use a data.frame or dplyr as well.)

chinsoon12 · Accepted Answer · 2018-10-23 07:17:02Z

You can split by |, melt the pieces into a single column, then split by : before calculating the ratio and reshaping to your desired wide format.

library(data.table)
#step 4: reshape into desired wide format
dcast(
    #step 1: split by | and get the elements into a column
    foo[, melt(tstrsplit(identity, "\\|")), by=.(name, number)][,
        #step 2: split by : to get count_number and total_number
        tstrsplit(value, ":"), by=.(name, number)][,
            #step 3: calculate ratio
            ratio := as.numeric(V3) / as.numeric(V2)],
    name + number ~ V1, value.var="ratio")

output:

      name number       bird       cat        cow       dog        fish       horse
1:     Bob     37 0.02755906        NA 0.06583072        NA 0.005319149 0.071661238
2:    Luna     23 0.02026432 0.2990354 0.03761755 0.1124031 0.024447950 0.001945525
3: Melissa     33         NA        NA         NA        NA 0.569444444          NA
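
The same four steps can also be written without melt(). In the code above, melt() is applied to a plain list rather than a data.table, which relied on reshape2 behind the scenes and may warn or fail on newer data.table versions. The following is only a minimal sketch of one alternative, not the original answer's method; the names long, wide, element, animal, total, count and animals are illustrative choices. It also puts the columns in the order asked for in the question, assuming every known name appears at least once in the data.

```r
library(data.table)

foo <- data.table(
  name = c("Luna", "Bob", "Melissa"),
  number = c(23, 37, 33),
  identity = c(
    "cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12",
    "bird:1270:35|fish:2068:11|horse:614:44|cow:319:21",
    "fish:72:41"
  )
)

# step 1: split by | so that each name:total:count element gets its own row
long <- foo[, .(element = unlist(strsplit(identity, "|", fixed = TRUE))),
            by = .(name, number)]

# step 2: split each element by : into three named character columns
long[, c("animal", "total", "count") := tstrsplit(element, ":", fixed = TRUE)]

# step 3: calculate the ratio count_number / total_number
long[, ratio := as.numeric(count) / as.numeric(total)]

# step 4: reshape to wide format; missing combinations become NA
wide <- dcast(long, name + number ~ animal, value.var = "ratio")

# assumption: the column names are known beforehand and all occur in the data
animals <- c("cat", "dog", "bird", "fish", "horse", "cow")
setcolorder(wide, c("name", "number", animals))

print(wide)
```

Naming the intermediate columns makes step 3 easier to read than V2 and V3, at the cost of a few extra lines.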

Addressing OP's comment in a more general way: You have to design a solution to your problem first before coding. Picture in your mind what kind of output you are expecting in each step of your solution. Then let the console be your TA and documentation be your lecturer.

For example, the first step of your solution is to split by |, so you run the following in the console:

foo[, tstrsplit(identity, "|", fixed=TRUE)]

What are you expecting? What do you see? Are name and number missing? Add them in by=:

foo[, tstrsplit(identity, "|", fixed=TRUE), by=.(name, number)]

Then, what do you get? Error? Can you fix it? Maybe read the documentation again? If still unable to solve it, maybe search for it online? Remember what you are trying to achieve with this step: How to get it into a single column? Maybe you find something like below:

foo[, unlist(tstrsplit(identity, "|", fixed=TRUE)), by=.(name, number)]

Then, move on to the next step.
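
As a rough illustration of what "the next step" could look like in the console, the sketch below stores the result of the unlist() call and then splits it by : while checking the intermediate shape. The names step1 and step2 are made up for this example; V1, V2 and V3 are the default column names data.table assigns to unnamed results.

```r
library(data.table)

foo <- data.table(
  name = c("Luna", "Bob", "Melissa"),
  number = c(23, 37, 33),
  identity = c(
    "cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12",
    "bird:1270:35|fish:2068:11|horse:614:44|cow:319:21",
    "fish:72:41"
  )
)

# step 1: one row per element; the unnamed result column is called V1
step1 <- foo[, unlist(tstrsplit(identity, "|", fixed = TRUE)), by = .(name, number)]
print(step1)

# step 2: split V1 by : and look at what comes back before going further
step2 <- step1[, tstrsplit(V1, ":", fixed = TRUE), by = .(name, number)]
print(step2)

# the split pieces are still character, so convert before dividing
str(step2)
```

Checking str() at this point shows why the accepted code wraps V2 and V3 in as.numeric() before computing the ratio.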

Thanks for the help! This was great. So that I can learn more, could you explain what you are doing at each step?
