Conditional string split in data.table in R

Question

Based on the question "Split text string in a data.table columns", I was wondering whether there is an efficient way to split a text string conditionally, based on the contents of each row.

Suppose I have the following table:

Comments                  Eaten
001 Testing my computer   No
0026 Testing my fridge    No
Testing my car            Yes

and I would like to have this:

ID   Comments             Eaten
001  Testing my computer  No
0026 Testing my fridge    No
NA   Testing my car       Yes

Where NA is empty.

Is this possible in data.table?

The comment should have an ID, but since the ID is optional, I want to extract it only if the comment starts with a number.

Do you know in advance that there is supposed to be an ID and a comment, or is this supposed to be detected automatically? – David Arenburg, Commented May 3, 2017 at 13:01
It is optional to have an ID, but if the comment starts with a number, then that number should automatically be treated as the ID. – Snowflake, Commented May 3, 2017 at 13:02

talat · Accepted Answer · 2017-05-03 13:34:33Z

7

This could be done using tidyr's extract function which allows you to specify a regex pattern:

tidyr::extract(dt, Comments, c("ID", "Comments"), regex = "^(\\d+)?\\s?(.*)$")
#     ID            Comments Eaten
#1:  001 Testing my computer    No
#2: 0026   Testing my fridge    No
#3:   NA      Testing my car   Yes

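You can add the argument convert = TRUE if you want the extracted columns to be converted to a more suitable type. Note that here this would turn ID into an integer column and drop the leading zeros (001 would become 1).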

Another option using only base R and data.table would be

dt[grepl("^\\d+", Comments),                        # subset: rows that start with an ID
   `:=`(ID = sub("^(\\d+).*", "\\1", Comments),     # extract the ID from Comments
        Comments = sub("^\\d+\\s*", "", Comments))  # delete the ID and the space after it
]                                                   # rows without an ID get ID = NA

Though in this case the tidyr syntax seems a little easier to me. There may also be a way using data.table's tstrsplit function with a fancy lookaround regex.
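
For anyone who wants to try the base R logic without the original table, here is a minimal sketch. The helper name split_id and the sample vector are hypothetical (they are not part of the answer above), and the data is retyped from the question.

```r
# Hypothetical helper (not from the original answer): split an optional
# leading numeric ID off each comment using base R only.
split_id <- function(x) {
  has_id <- grepl("^\\d+", x)                        # TRUE where the comment starts with digits
  id <- rep(NA_character_, length(x))                # default: no ID
  id[has_id] <- sub("^(\\d+).*$", "\\1", x[has_id])  # keep only the leading digits
  text <- sub("^\\d+\\s*", "", x)                    # drop the ID and the whitespace after it
  list(ID = id, Comments = text)
}

# Sample data retyped from the question's table.
comments <- c("001 Testing my computer", "0026 Testing my fridge", "Testing my car")
res <- split_id(comments)
```

With data.table loaded, the result could then be assigned by reference along the lines of dt[, c("ID", "Comments") := split_id(Comments)]. Keeping the IDs as character preserves their leading zeros.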

edited May 3, 2017 at 13:34

answered May 3, 2017 at 13:03


3 Comments

Frank Over a year ago

transpose(regmatches(x, regexec("^(\\d+)? ?(.*)", x))) or similar, I guess. Not tested since OP's data is not copy-pastable...

alexis_laz Over a year ago

R 3.4.0, also, has a strcapture function that could fit here -- strcapture("^(\\d+)?\\s?(.*)$", dt$Comments, data.frame(ID = "", Comment = ""))

talat Over a year ago

@alexis_laz looks interesting but also a bit strange with those empty strings. I haven't upgraded to 3.4.0 yet
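
The strcapture suggestion in the comments above might be sketched as follows, assuming R 3.4.0 or later. The sample vector is retyped from the question, and the prototype uses zero-length character columns instead of empty strings, since only the names and types of the prototype columns matter. The final line is a defensive assumption: it maps an empty ID to NA in case the unmatched optional group comes back as "".

```r
# Sample comments mirroring the question's table (retyped here for illustration).
comments <- c("001 Testing my computer", "0026 Testing my fridge", "Testing my car")

# The prototype supplies the names and types of the output columns.
proto <- data.frame(ID = character(), Comments = character(),
                    stringsAsFactors = FALSE)

# One capture group per prototype column: optional leading digits, then the rest.
parts <- strcapture("^(\\d+)?\\s?(.*)$", comments, proto)

# Normalise a missing ID to NA in case the unmatched group is returned as "".
parts$ID[!is.na(parts$ID) & !nzchar(parts$ID)] <- NA_character_
```

The result is a plain data.frame, so it would still need to be bound back onto the original table, for example with cbind or by assigning the two columns in data.table.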
