Java string tokenization: Split on pattern and retain pattern

Question

My question is the Scala (Java) variant of this query on Python.

In particular, I have a string val myStr = "Shall we meet at, let's say, 8:45 AM?". I would like to tokenize it and retain the delimiters (all except whitespace). If my delimiters were only characters, e.g. ., :, ? etc., I could do:

val strArr = myStr.split("((\\s+)|(?=[,.;:?])|(?<=\\b[,.;:?]))")

which yields

[Shall, we, meet, at, ,, let's, say, ,, 8, :, 45, AM, ?]

However, I wish to make the time signature \\d+:\\d+ a delimiter, and would still like to retain it. So, what I'd like is

[Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]

Note:

Adding the disjunct (?=(\\d+:\\d+)) to the split expression does not help.
Outside of a time signature, : is still a delimiter in its own right.

How could I make this happen?

Yes, I've been checking both the approaches mentioned so far. I am just trying it out on the more generic examples that I have.
– N. CHATURV3DI, Commented Aug 29, 2017 at 10:32
Good, just note that my approach matches 1) time substrings as whole words, or 2) any chunks of your delimiter chars, or 3) anything that is not your delimiters and the time strings as whole words. I believe it is comprehensive enough to tokenize strings the way you need.
– Wiktor Stribiżew, Commented Aug 29, 2017 at 10:45

Wiktor Stribiżew · Accepted Answer · 2017-08-29 09:26:16Z

I suggest matching all your tokens rather than splitting the string, because that gives you more control over what you get:

 \b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+

See the regex demo.

The alternatives are ordered from the most specific pattern to the most generic one, because the regex engine tries them left to right at each position.

Details

\b\d{1,2}:\d{2}\b - 1 to 2 digits, :, 2 digits enclosed with word boundaries
| - or
[,.;:?]+ - 1 or more ,, ., ;, :, ? chars
| - or
(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+ - 1 or more chars other than whitespace and our delimiter chars ([^\s,.;:?]), each of which must not be the starting point of a time string (this is what the negative lookahead (?!\b\d{1,2}:\d{2}\b) checks).

Consider this snippet:

val str = "Shall we meet at, let's say, 8:45 AM?"
val rx = """\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+""".r
rx findAllIn str foreach println

Output:

Shall
we
meet
at
,
let's
say
,
8:45
AM
?
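
Since the title asks about Java, here is a rough Java equivalent of the Scala snippet above. It is only a sketch: the class name TimeAwareTokenizer and the tokenize method are made up for illustration, and the regex is the one from this answer with Java string escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
class TimeAwareTokenizer {
    // Same three alternatives as the Scala regex, most specific first.
    private static final Pattern TOKEN = Pattern.compile(
            "\\b\\d{1,2}:\\d{2}\\b"                           // a time such as 8:45
            + "|[,.;:?]+"                                     // a run of delimiter chars
            + "|(?:(?!\\b\\d{1,2}:\\d{2}\\b)[^\\s,.;:?])+");  // any other non-space chunk

    // Collects every match of TOKEN, in order of appearance.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]
        System.out.println(tokenize("Shall we meet at, let's say, 8:45 AM?"));
    }
}
```

Whitespace is never matched by any alternative, so it is skipped between tokens without an explicit filter.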

2 Comments

Wiktor Stribiżew Over a year ago

Note that the word boundaries around \d{1,2}:\d{2} may be removed if you want to just match a sequence of 1-2 digits, :, 2 digits in any context.
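
To illustrate the effect of the word boundaries, here is a small Java sketch that runs the same alternation with and without \b on an input where the digits are glued to a letter. The class name BoundaryDemo, the tokenize helper and the sample input "at x12:30" are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class BoundaryDemo {
    // Builds the answer's three-way alternation around the given time pattern.
    static List<String> tokenize(String timeRegex, String input) {
        Pattern p = Pattern.compile(
                timeRegex + "|[,.;:?]+|(?:(?!" + timeRegex + ")[^\\s,.;:?])+");
        List<String> tokens = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String strict = "\\b\\d{1,2}:\\d{2}\\b"; // time must be a whole word
        String loose = "\\d{1,2}:\\d{2}";        // time may appear in any context

        // With boundaries, "12:30" after a letter is not treated as a time:
        System.out.println(tokenize(strict, "at x12:30")); // [at, x12, :, 30]
        // Without boundaries, the time is split off from the preceding letter:
        System.out.println(tokenize(loose, "at x12:30"));  // [at, x, 12:30]
    }
}
```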

N. CHATURV3DI Over a year ago

My query was just an example from a larger number of use-cases that I had. And your solution extends to arbitrary patterns. E.g. I can specify pattern strings for email addresses, dates, time stamps, etc. And concatenate them together. Thanks.
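
One possible way to do that concatenation is sketched below in Java. Everything here is an assumption for illustration: the class CompositeTokenizer, the build and tokenize helpers, and the deliberately simplistic date and email patterns, which are not robust validators. The delimiter string is assumed to contain only chars that are safe inside a character class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper; generalizes the accepted regex to several "special" patterns.
class CompositeTokenizer {
    // specialPatterns: regexes to keep whole, tried in list order.
    // delimiterChars: single-char delimiters, e.g. ",.;:?".
    static Pattern build(List<String> specialPatterns, String delimiterChars) {
        String special = specialPatterns.stream()
                .map(p -> "(?:" + p + ")")          // isolate each pattern
                .collect(Collectors.joining("|"));
        String delims = "[" + delimiterChars + "]";
        String other = "[^\\s" + delimiterChars + "]";
        // special | run of delimiters | anything else that does not start a special
        return Pattern.compile(
                "(?:" + special + ")|" + delims + "+|(?:(?!" + special + ")" + other + ")+");
    }

    static List<String> tokenize(Pattern p, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern p = build(List.of(
                "\\b\\d{1,2}:\\d{2}\\b",             // time, as in the answer
                "\\b\\d{4}-\\d{2}-\\d{2}\\b",        // ISO-like date (illustrative)
                "[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"), // naive email (illustrative)
                ",.;:?");
        // prints [Mail, bob@example.com, by, 2017-08-29, ,, 10:30, ?]
        System.out.println(tokenize(p, "Mail bob@example.com by 2017-08-29, 10:30?"));
    }
}
```

The order of the special patterns matters when two of them can match at the same position, since the first alternative that matches wins.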

HarryQQ · Answer · 2019-11-07 07:45:42Z

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * StringPatternTokenizer is similar to java.util.StringTokenizer,
 * but it uses a regex string as the token separator.
 * See the #testCase method for detailed usage.
 */
public class StringPatternTokenizer {
    private final Pattern pattern;

    public StringPatternTokenizer(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    public void getTokens(String str, NextToken nextToken) {
        Matcher matcher = pattern.matcher(str);

        int index = 0;
        Result result = null;
        while (matcher.find()) {
            if (matcher.start() > index) {
                result = nextToken.visit(null, str.substring(index, matcher.start()));
            }
            if (result != Result.STOP) {
                index = matcher.end();
                result = nextToken.visit(matcher, null);
            }

            if (result == Result.STOP) {
                return;
            }
        }

        if (index < str.length()) {
            nextToken.visit(null, str.substring(index));
        }
    }

    // public so that implementations of NextToken outside this package can return it
    public enum Result {
        CONTINUE,
        STOP,
    }

    public interface NextToken {
        Result visit(Matcher matcher, String str);
    }

    /***********************************/
    /***** test cases FOR IT ***********/
    /***********************************/

    public void testCase() {

        // As a test, visit each part produced by the tokenizer and
        // replace the variable parts with the given values.
        // Finally, collect the result into the output string.

        String strSource = "My name is {{NAME}}, nice to meet you.";
        String strTarget = "My name is TokenTst, nice to meet you.";

        // separator pattern for: variable names in two curly brackets
        String variableRegex = "\\{\\{([A-Za-z]+)\\}\\}";

        // variable values
        org.json.JSONObject data = new org.json.JSONObject(
                java.util.Collections.singletonMap("NAME", "TokenTst")
        );

        StringBuilder sb = new StringBuilder();
        new StringPatternTokenizer(variableRegex)
                .getTokens(strSource, (matcher, str) -> {
                    sb.append(matcher == null ? str
                            : data.optString(matcher.group(1), ""));
                    return StringPatternTokenizer.Result.CONTINUE;
                });

        // check the result as expected
        org.junit.Assert.assertEquals(strTarget, sb.toString());
    }
}
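
For comparison with the question's example, the sketch below applies the same idea as getTokens above (walk the separator matches and emit both the gaps and the matches) directly to the original string. It is self-contained rather than built on the class above; the class name GapAndMatchTokenizer and the separator regex are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, self-contained variant of the gap-plus-match loop.
class GapAndMatchTokenizer {
    // Group 1: separators to keep (a time or one delimiter char).
    // Whitespace is also a separator, but it falls outside group 1 and is dropped.
    private static final Pattern SEPARATOR =
            Pattern.compile("(\\b\\d{1,2}:\\d{2}\\b|[,.;:?])|\\s+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(input);
        int index = 0;
        while (m.find()) {
            if (m.start() > index) {
                tokens.add(input.substring(index, m.start())); // text between separators
            }
            if (m.group(1) != null) {
                tokens.add(m.group(1));                        // retained separator
            }
            index = m.end();
        }
        if (index < input.length()) {
            tokens.add(input.substring(index));                // trailing text
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]
        System.out.println(tokenize("Shall we meet at, let's say, 8:45 AM?"));
    }
}
```

Unlike the match-everything regex in the other answer, this variant only describes the separators, so the "everything else" alternative with its lookahead is not needed.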

Code-only answers aren't as useful as code-with-explanation. Especially when answering a question this old it is helpful to point out how/why your answer is different from the accepted answer.
