1

I have a quite simple question regarding data.table dt, but as always, this package brings me to the brink of despair :D

I have a column name with these contents: e.g. "bin1_position1_ID1" and I want to split these infos into separate columns:

name                        bins   positions   IDs
--------------------       ------------------------
"bin1_position1_ID1"   ->  "bin1" "position1" "ID1"
"bin2_position2_ID2"       "bin2" "position2" "ID2"

I tried it with

dt <- dt[, bins := lapply(.SD, function(x) strsplit(x, "_")[[1]][1]), .SDcols="name"]
(and for the other new columns with [[1]][2] [[1]][3])

However, I end up having a new column bins (so far, so good), but this has the info from row 1 in every row and not the info from the same row then itself (i.e. bin1 in every row).

And I have some columns that have more infos, that I don't want to make to columns. e.g. one column has "bin5_position5_ID5_another5_more5"

Code for testing (see Maëls solution):

library(data.table)

name <- c("bin1_position1_ID1",
          "bin2_position2_ID2",
          "bin3_position3_ID3",
          "bin4_position4_ID4",
          "bin5_position5_ID5_another5_more5")

dt <- data.table(name)

dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
0

1 Answer 1

4

Use tstrsplit with keep = 1:3 to keep only the first three columns:

dt[, c("bins", "positions", "IDs") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
                                name  bin  position  ID
1:                bin1_position1_ID1 bin1 position1 ID1
2:                bin2_position2_ID2 bin2 position2 ID2
3:                bin3_position3_ID3 bin3 position3 ID3
4:                bin4_position4_ID4 bin4 position4 ID4
5: bin5_position5_ID5_another5_more5 bin5 position5 ID5
Sign up to request clarification or add additional context in comments.

4 Comments

Ah, that would work, but I have some columns that have other infos, that I don't want to make to columns. e.g. one column has "bin5_position5_ID5_another5_more5". This raises an error with tstrsplit().
You should edit your question to make it apparent
If I use keep = 3all my columns have the string from the 3rd position. In this case there is "IDn" in the columns bins, position and ID,
My dataset has multiple GBs, but I made a short script and placed it in the original post that produces the same problem.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.