scala.util.matching.Regex appears to have only a single split() method, which discards the matches and returns only the non-matching segments of the input string:

val str = "Here is some stuff PAT and second token PAT and third token PAT and fourth"
val r = "PAT".r
r.split(str)

res14: Array[String] = Array("Here is some stuff ", " and second token ", " and third token ", " and fourth")

Is there a commonly used approach that keeps the matched delimiters (the tokens) in the returned list?

Note: the splitting patterns I use for actual work are somewhat complicated and certainly not constants like the above example. Therefore, simply inserting alternating constant values (or zipping them) would not suffice.

Update: Here is a more representative regex:

val str = "Here is some stuff PAT and second token PAT and third token or something else and fourth"
val r = "(PAT|something else)".r
r.split(str)

res14: Array[String] = Array("Here is some stuff ", " and second token ", " and third token or ", " and fourth")
  • How complicated is the pattern? If it is not very complicated, a simple val r = "((?<=PAT)|(?=PAT))".r could help. Commented Nov 22, 2015 at 20:22
  • @stribizhev Well, I need to put character classes in there. Your comment is already interesting (so please add it as an answer). I am checking whether it is actually sufficient to completely satisfy the need. Commented Nov 22, 2015 at 20:31
  • @stribizhev I updated the OP. Your suggestion DOES work, even for the expanded scope. Please create an answer, and maybe add a little explanation with it. Commented Nov 22, 2015 at 20:32

1 Answer

For a pattern that is not too complicated and has a bounded maximum width (Java's regex engine, which Scala uses, does not support lookbehinds of indefinite width), you can use a lookbehind/lookahead solution:

val str = "Here is some stuff PAT and second token PAT and third token PAT and fourth"
val r = "((?<=PAT)|(?=PAT))".r
print(r.split(str).toList)

Output of the sample demo: List(Here is some stuff , PAT,  and second token , PAT,  and third token , PAT,  and fourth)

The idea is to match only the empty strings immediately before PAT, with the lookahead (?=PAT), and immediately after it, with the lookbehind (?<=PAT), and to split only there. Unfortunately, there is no handy feature here that splits on a regex with a capturing group and keeps the captured text as elements of the resulting array/list.
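
The same idea can be applied to the updated pattern from the question. The following is only a minimal sketch: aroundDelimiter is a hypothetical helper name, not a library method, and it assumes the delimiter pattern has a bounded maximum width so that the lookbehind is accepted.

```scala
import scala.util.matching.Regex

// Hypothetical helper (not part of the standard library): wraps a delimiter
// pattern in a lookbehind/lookahead pair so that split() cuts on both sides
// of each delimiter instead of consuming it.
// Assumption: `delim` has a bounded maximum width, because java.util.regex
// does not accept lookbehinds of indefinite width.
def aroundDelimiter(delim: String): Regex =
  s"(?<=$delim)|(?=$delim)".r

val str = "Here is some stuff PAT and second token PAT and third token or something else and fourth"

// Both alternatives of the delimiter are kept as separate list elements.
val pieces = aroundDelimiter("PAT|something else").split(str).toList

// Brackets make the leading and trailing spaces of each piece visible.
pieces.foreach(p => println(s"[$p]"))
```

Here pieces should alternate between the surrounding text and the delimiters, with "PAT" and "something else" each appearing as their own element.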

Alternatively, match the delimiters themselves with findAllIn and rebuild the list from the matches and the text between them. Another option is to insert a temporary one-character delimiter before and after each match of the delimiting pattern and then split on that character.
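
For patterns of indefinite width, where the lookbehind variant may be rejected, the matching approach could look roughly like the sketch below. It uses findAllMatchIn (the variant of findAllIn that yields Match objects with offsets). splitKeeping is a hypothetical helper name, and the sketch assumes the delimiter regex never matches the empty string.

```scala
import scala.util.matching.Regex

// Hypothetical helper: splits `input` on `delim` but keeps every matched
// delimiter as its own element, in order of appearance.
// Assumption: `delim` never matches the empty string.
def splitKeeping(delim: Regex, input: String): List[String] = {
  val out = List.newBuilder[String]
  var last = 0 // end offset of the previous match
  for (m <- delim.findAllMatchIn(input)) {
    if (m.start > last) out += input.substring(last, m.start) // text before the delimiter
    out += m.matched                                          // the delimiter itself
    last = m.end
  }
  if (last < input.length) out += input.substring(last)       // trailing text, if any
  out.result()
}

// Delimiters of unbounded width: one or more T's, any amount of whitespace.
val tokens = splitKeeping("PAT+|something\\s+else".r, "a PATTT b something   else c")
// tokens == List("a ", "PATTT", " b ", "something   else", " c")
```

Unlike the lookaround version, this walks the matches once and never needs a lookbehind, so quantifiers such as + or * in the delimiter are not a problem.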
