I have an R data.table with an oddly formatted column that I need to parse. Each row has a column, identity, in the following format:
identity
cat:211:93|dog:616:58|bird:1270:46|fish:2068:31|horse:614:1|cow:3719:1012
Each record has the format name:total_number:count_number, and records are separated by |.
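To make the format concrete, here is a minimal base-R sketch of how I would expect a single identity string to break down. The variable names (x, records, fields, fractions) are placeholders of mine, and the string is a shortened example rather than real data.

```r
# One identity string in the name:total_number:count_number format
x <- "cat:311:93|dog:516:58"

# Split into records on the literal "|", then split each record on ":"
records <- strsplit(x, "|", fixed = TRUE)[[1]]
fields <- strsplit(records, ":", fixed = TRUE)

# Fraction count_number / total_number for each record, labelled by name
fractions <- vapply(
  fields,
  function(f) as.numeric(f[3]) / as.numeric(f[2]),
  numeric(1)
)
names(fractions) <- vapply(fields, function(f) f[1], character(1))

print(fractions)
```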
An example of the data.table is as follows:
library(data.table)
foo = data.table(name = c('Luna', 'Bob', 'Melissa'),
                 number = c(23, 37, 33),
                 identity = c('cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12',
                              'bird:1270:35|fish:2068:11|horse:614:44|cow:319:21',
                              'fish:72:41'))
print(foo)
name       number  identity
'Luna'     23      cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12
'Bob'      37      bird:1270:35|fish:2068:11|horse:614:44|cow:319:21
'Melissa'  33      fish:72:41
My problem is how to parse these strings so that each name inside identity (cat, dog, and so on) becomes a new column whose value is the fraction count_number/total_number. Names that do not appear in a row should be NA.
The desired output is as follows:
name       number  cat        dog        bird        fish         horse        cow
'Luna'     23      0.2990354  0.1124031  0.02026432  0.02444795   0.001945525  0.03761755
'Bob'      37      NA         NA         0.02755906  0.005319149  0.07166124   0.06583072
'Melissa'  33      NA         NA         NA          0.5694444    NA           NA
How can I parse these rows, given that I know the names of the new columns beforehand?
I think there should be some way to use data.table::tstrsplit(), e.g.
tstrsplit(foo$identity, "|", fixed=TRUE)
(I'm happy to use a data.frame or dplyr as well.)
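For reference, here is a rough sketch of one way the tstrsplit() idea might be carried through: split each identity into records, split the records into fields, compute the fraction, then reshape to wide with dcast(). It assumes each (name, number) pair identifies a single row, and that the known column names are held in a vector I have called animals; long and wide are likewise placeholder names. I am not sure this is the idiomatic approach.

```r
library(data.table)

foo <- data.table(
  name = c("Luna", "Bob", "Melissa"),
  number = c(23, 37, 33),
  identity = c(
    "cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12",
    "bird:1270:35|fish:2068:11|horse:614:44|cow:319:21",
    "fish:72:41"
  )
)

# Column names known beforehand (assumption: used only to fix column order)
animals <- c("cat", "dog", "bird", "fish", "horse", "cow")

# Long format: one row per name:total_number:count_number record
long <- foo[
  , .(record = unlist(strsplit(identity, "|", fixed = TRUE))),
  by = .(name, number)
]

# Split each record into its three fields and compute the fraction
long[, c("animal", "total", "count") := tstrsplit(record, ":", fixed = TRUE)]
long[, fraction := as.numeric(count) / as.numeric(total)]

# Wide format: one column per animal, NA where a row has no such record
wide <- dcast(long, name + number ~ animal, value.var = "fraction")

# dcast() sorts rows and columns, so restore the known column order
# and the original row order of foo
setcolorder(wide, c("name", "number", intersect(animals, names(wide))))
wide <- wide[foo[, .(name, number)], on = c("name", "number")]

print(wide)
```

An animal that never occurs in any row would get no column at all from this sketch, so it would have to be added as an all-NA column afterwards.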