4

My question is the Scala (Java) variant of this query on Python.

In particular, I have a string val myStr = "Shall we meet at, let's say, 8:45 AM?". I would like to tokenize it and retain the delimiters (all except whitespace). If my delimiters were only characters, e.g. ., :, ? etc., I could do:

val strArr = myStr.split("((\\s+)|(?=[,.;:?])|(?<=\\b[,.;:?]))")

which yields

[Shall, we, meet, at, ,, let's, say, ,, 8, :, 45, AM, ?]

However, I wish to make the time signature \\d+:\\d+ a delimiter, and would still like to retain it. So, what I'd like is

[Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]

Note:

  1. Adding the alternative (?=(\\d+:\\d+)) to the split expression does not help.
  2. Outside of the time signature, : is a delimiter in itself.

How could I make this happen?

3 Comments
  • Have you had a chance to check my approach? Commented Aug 29, 2017 at 10:24
  • Yes, I've been checking both the approaches mentioned so far. I am just trying it out on the more generic examples that I have. Commented Aug 29, 2017 at 10:32
  • Good, just note that my approach matches 1) time substrings as whole words, or 2) any chunks of your delimiter chars, or 3) anything that is not your delimiters and the time strings as whole words. I believe it is comprehensive enough to tokenize strings the way you need. Commented Aug 29, 2017 at 10:45

2 Answers

1

I suggest matching all your tokens rather than splitting the string, because that gives you better control over what you get:

 \b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+

See the regex demo.

The alternatives are ordered from the most specific pattern to the most generic one, so the time pattern is tried first and the catch-all last.

Details

  • \b\d{1,2}:\d{2}\b - 1 to 2 digits, :, 2 digits enclosed with word boundaries
  • | - or
  • [,.;:?]+ - 1 or more ,, ., ;, :, ? chars
  • | - or
  • (?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+ - 1 or more chars that are neither delimiter chars nor whitespace ([^\s,.;:?]), each of which is checked so that it is not the starting point of a time string.
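
Since the question asks for a Scala (Java) variant, here is a minimal Java sketch of the same match-all-tokens approach. The class and method names (TimeAwareTokenizer, tokenize) are made up for illustration; only the regex comes from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not part of any library.
class TimeAwareTokenizer {
    // Same three alternatives as in the answer, most specific first.
    private static final Pattern TOKEN = Pattern.compile(
            "\\b\\d{1,2}:\\d{2}\\b"                            // a time such as 8:45
            + "|[,.;:?]+"                                      // a run of delimiter chars
            + "|(?:(?!\\b\\d{1,2}:\\d{2}\\b)[^\\s,.;:?])+");   // anything else, minus whitespace

    // Collects every match in order; whitespace is never matched, so it is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]
        System.out.println(tokenize("Shall we meet at, let's say, 8:45 AM?"));
    }
}
```

Matcher.find() walks the input left to right and tries the alternatives in order at each position, which is why the time pattern has to come before the generic one.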

Consider this snippet:

val str = "Shall we meet at, let's say, 8:45 AM?"
val rx = """\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+""".r
rx findAllIn str foreach println

Output:

Shall
we
meet
at
,
let's
say
,
8:45
AM
?

2 Comments

Note that the word boundaries around \d{1,2}:\d{2} may be removed if you want to just match a sequence of 1-2 digits, :, 2 digits in any context.
My query was just an example from a larger number of use-cases that I had. And your solution extends to arbitrary patterns. E.g. I can specify pattern strings for email addresses, dates, time stamps, etc. And concatenate them together. Thanks.
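
The concatenation idea from the comment above could be sketched roughly as follows in Java. Everything here is an assumption made for illustration: the class name CompositeTokenizer, its constructor parameters, and in particular the deliberately simplified email and date patterns. The structure of the combined regex mirrors the one in the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class: builds the answer's three-part regex from a list of "special" patterns.
class CompositeTokenizer {
    private final Pattern token;

    // specials:   regexes for tokens that must stay intact, most specific first
    // delimiters: body of a character class, e.g. ",.;:?" (the caller escapes it if needed)
    CompositeTokenizer(List<String> specials, String delimiters) {
        String special = "(?:" + String.join("|", specials) + ")";
        String regex = special                                      // 1) any special token
                + "|[" + delimiters + "]+"                          // 2) a run of delimiters
                + "|(?:(?!" + special + ")[^\\s" + delimiters + "])+"; // 3) everything else
        this.token = Pattern.compile(regex);
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = token.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        CompositeTokenizer t = new CompositeTokenizer(Arrays.asList(
                "\\b[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+\\b", // simplified email (assumption)
                "\\b\\d{4}-\\d{2}-\\d{2}\\b",             // ISO-style date (assumption)
                "\\b\\d{1,2}:\\d{2}\\b"),                 // time, as in the answer
                ",.;:?");
        // prints [Mail, bob@example.com, by, 2017-08-29, ,, 8:45, AM, .]
        System.out.println(t.tokenize("Mail bob@example.com by 2017-08-29, 8:45 AM."));
    }
}
```

The special patterns are used twice, once as the first alternative and once inside the negative lookahead, so a generic token can never swallow the start of a special one.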
0
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * StringPatternTokenizer is similar to java.util.StringTokenizer,
 * but it uses a regex string as the token separator.
 * See the testCase method below for detailed usage.
 */
public class StringPatternTokenizer {
    Pattern pattern;

    public StringPatternTokenizer(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    public void getTokens(String str, NextToken nextToken) {
        Matcher matcher = pattern.matcher(str);

        int index = 0;
        Result result = null;
        while (matcher.find()) {
            if (matcher.start() > index) {
                result = nextToken.visit(null, str.substring(index, matcher.start()));
            }
            if (result != Result.STOP) {
                index = matcher.end();
                result = nextToken.visit(matcher, null);
            }

            if (result == Result.STOP) {
                return;
            }
        }

        if (index < str.length()) {
            nextToken.visit(null, str.substring(index));
        }
    }

    // public so that NextToken implementations in other packages can return it
    public enum Result {
        CONTINUE,
        STOP,
    }

    public interface NextToken {
        Result visit(Matcher matcher, String str);
    }

    /***********************************/
    /***** test cases FOR IT ***********/
    /***********************************/

    public void testCase() {

        // As a test, visit each part produced by the tokenizer,
        // replace the variable parts with the given values,
        // and collect the pieces into the target string.

        String strSource = "My name is {{NAME}}, nice to meet you.";
        String strTarget = "My name is TokenTst, nice to meet you.";

        // separator pattern for: variable names in two curly brackets
        String variableRegex = "\\{\\{([A-Za-z]+)\\}\\}";

        // variable values
        org.json.JSONObject data = new org.json.JSONObject(
                java.util.Collections.singletonMap("NAME", "TokenTst")
        );

        StringBuilder sb = new StringBuilder();
        new StringPatternTokenizer(variableRegex)
                .getTokens(strSource, (matcher, str) -> {
                    sb.append(matcher == null ? str
                            : data.optString(matcher.group(1), ""));
                    return StringPatternTokenizer.Result.CONTINUE;
                });

        // check the result as expected
        org.junit.Assert.assertEquals(strTarget, sb.toString());
    }
}

1 Comment

Code-only answers aren't as useful as code-with-explanation. Especially when answering a question this old it is helpful to point out how/why your answer is different from the accepted answer.
