
I am trying to figure out how I can use the tstrsplit() function from data.table to split a text by character position. I am aware of Q1, Q2 & Q3, but these do not address my question.

As an example:

 DT2 <- data.table(a = paste0(LETTERS[1:5],seq(10,15)), b = runif(6))
 DT2
     a         b  
1: A10 0.4153622
2: B11 0.1567381
3: C12 0.5361883
4: D13 0.5920144
5: E14 0.3376648
6: A15 0.5503773

I tried the following which did not work:

DT2[, c("L", "D") := tstrsplit(a, "")][]
DT2[, c("L", "D") := tstrsplit(a, "[A-Z]")][]
DT2[, c("L", "D") := tstrsplit(a, "[0-9]{1}")][]

The expectation:

     a         b    L   D
1: A10 0.4153622    A   10
2: B11 0.1567381    B   11
3: C12 0.5361883    C   12
4: D13 0.5920144    D   13
5: E14 0.3376648    E   14
6: A15 0.5503773    A   15

Any help with an explanation is highly appreciated.

  • Instead of tstrsplit can't you do with c(substr(a, 1, 1), substr(a, 2, 3))? Commented Jul 24, 2017 at 21:00
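A minimal sketch of the position-based approach suggested in this comment, assuming every value is one letter followed by digits (the data is typed out by hand here rather than generated, and `b` is random, so its values will differ from the question's output):

```r
library(data.table)

# Same values in column `a` as in the question.
DT2 <- data.table(a = c("A10", "B11", "C12", "D13", "E14", "A15"), b = runif(6))

# Split by position: character 1 is the letter, characters 2 onward are the digits.
DT2[, c("L", "D") := list(substr(a, 1, 1), substr(a, 2, nchar(a)))][]
```

Using nchar(a) as the end position instead of a fixed 3 keeps the digit part intact if it has more than two digits. Both new columns are character vectors.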

1 Answer

You can split on the regex "(?<=[A-Za-z])(?=[0-9])" if you want to split between letters and digits. It restricts the split to a position that is preceded by a letter and followed by a digit.

The regex has two parts: a lookbehind, (?<=[A-Za-z]), which means "after a letter", and a lookahead, (?=[0-9]), which means "before a digit" (see more about regex lookaround). In R, you need to specify perl=TRUE to use Perl-compatible regular expressions, which these constructs require:

DT2[, c("L", "D") := tstrsplit(a, "(?<=[A-Za-z])(?=[0-9])", perl=TRUE)][]

#     a          b L  D
#1: A10 0.01487372 A 10
#2: B11 0.95035709 B 11
#3: C12 0.49230300 C 12
#4: D13 0.67183871 D 13
#5: E14 0.40076579 E 14
#6: A15 0.27871477 A 15
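To see the split on its own, here is a small sketch on a plain character vector. The vector `x` and the name `parts` are made up for illustration, and the "AB123" entry is an added assumption to show that the pattern does not depend on a fixed position:

```r
library(data.table)

x <- c("A10", "B11", "AB123")  # hypothetical input, not from the question

# Base R: the split point is the zero-width position between a letter and a digit.
strsplit(x, "(?<=[A-Za-z])(?=[0-9])", perl = TRUE)

# tstrsplit() transposes that result into one vector per piece;
# type.convert = TRUE converts the digit piece from character to integer.
parts <- tstrsplit(x, "(?<=[A-Za-z])(?=[0-9])", perl = TRUE, type.convert = TRUE)
str(parts)
```

Without type.convert = TRUE, the D column in the answer above is a character column, so convert it if you need to do arithmetic on it.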

2 Comments

Thanks for the answer. Would you please let me know why you used ?, or in other words, what does (?<=[A-Za-z]) mean? I know regex, but I do not know why the group starts with ?. Furthermore, what does perl = TRUE mean here, as it is not explained/defined in the package?
When ? is the first character in a regex group, it signals there will be extra options for the group. In this case, (?<=[A-Za-z]) means "preceded by [A-Za-z], but don't include this group in the match." Similarly, (?=[0-9]) means "followed by a digit, but don't include this group in the match."
