Following up on the question "Split text string in a data.table columns", I was wondering whether there is an efficient way to split a text string conditionally, based on the contents of each row.

Suppose I have the following table:

Comments                  Eaten
001 Testing my computer   No
0026 Testing my fridge    No
Testing my car            Yes

and I would like to have this:

ID   Comments             Eaten
001  Testing my computer  No
0026 Testing my fridge    No
NA   Testing my car       Yes

Here NA marks a row with no ID.

Is this possible in data.table?

Each comment should have an ID, but since the ID is optional, I want to extract it only if the comment starts with a number.

  • So do you know in advance that there is supposed to be an ID and a comment, or is this supposed to be detected automatically? Commented May 3, 2017 at 13:01
  • The ID is optional, but if the comment starts with a number, then that number should automatically be treated as the ID. Commented May 3, 2017 at 13:02

1 Answer

This could be done using tidyr's extract function which allows you to specify a regex pattern:

tidyr::extract(dt, Comments, c("ID", "Comments"), regex = "^(\\d+)?\\s?(.*)$")
#     ID            Comments Eaten
#1:  001 Testing my computer    No
#2: 0026   Testing my fridge    No
#3:   NA      Testing my car   Yes

You can add the argument convert = TRUE if you want the extracted columns to be converted to a more sensible type.


Another option using only base R and data.table would be

dt[grepl("^\\d+", Comments),                        # subset: rows whose comment starts with digits
   `:=`(ID = sub("^(\\d+).*", "\\1", Comments),     # extract the leading digits as the ID
        Comments = sub("^\\d+\\s*", "", Comments))  # drop the ID and the whitespace after it
]

Though in this case the tidyr syntax seems a little easier to me. There may also be a way using data.table's tstrsplit function with a fancy lookaround regex.
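
For a copy-pastable check of the data.table approach, here is a minimal sketch. The table construction is my own reconstruction of the question's example (the character column types are an assumption), and the final setcolorder call is only there to match the requested column order.

```r
library(data.table)

# Assumed reconstruction of the question's example table
dt <- data.table(
  Comments = c("001 Testing my computer",
               "0026 Testing my fridge",
               "Testing my car"),
  Eaten    = c("No", "No", "Yes")
)

# Only rows whose comment starts with digits are modified;
# all other rows keep their comment and get ID = NA
dt[grepl("^\\d+", Comments),
   `:=`(ID       = sub("^(\\d+).*", "\\1", Comments),
        Comments = sub("^\\d+\\s*", "", Comments))]

# Put ID first, as in the desired output
setcolorder(dt, c("ID", "Comments", "Eaten"))
print(dt)
```

Because the ID stays a character column, leading zeros such as in "001" are preserved; converting it to a number would drop them.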


3 Comments

transpose(regmatches(x, regexec("^(\\d+)? ?(.*)", x))) or similar, I guess. Not tested since OP's data is not copy-pastable...
R 3.4.0, also, has a strcapture function that could fit here -- strcapture("^(\\d+)?\\s?(.*)$", dt$Comments, data.frame(ID = "", Comment = ""))
@alexis_laz looks interesting but also a bit strange with those empty strings. I haven't upgraded to 3.4.0 yet
