2

How can I split the text below using the split criteria FIRST, NOW, THEN?

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";

Three sentences are expected:

  1. FIRST i go to the homepage
  2. NOW i click on button "NOW CLICK" very quick
  3. THEN i will become a text result.

This code doesn't work because of the button text "NOW CLICK":

String[] textArray = text.split("FIRST|NOW|THEN");
1
  • Aside from dealing with the quoted text, String.split() removes the matched text that is the split delimiter, so you'll never get FIRST i go to the homepage — you would get i go to the homepage Commented Jul 1, 2020 at 22:54
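To make the comment above concrete, here is a minimal sketch of what the plain split from the question returns. The class and method names (PlainSplitDemo, plainSplit) are made up for illustration; only String.split and the regex come from the question.

```java
class PlainSplitDemo {
    // Hypothetical helper: the exact split call from the question.
    static String[] plainSplit(String text) {
        return text.split("FIRST|NOW|THEN");
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
        // Brackets make leading/trailing spaces and empty parts visible.
        for (String part : plainSplit(text)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The array has five elements: an empty leading string (FIRST matches at index 0), then the fragments between the keywords. The keywords themselves are gone, and the quoted button text is cut in two at NOW.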

5 Answers

4

If I understand you correctly, you

  • want to separate your text on keywords FIRST NOW THEN and preserve them in resulting parts
  • but don't want to split on those keywords if they appear inside quotes.

If my guess is correct, then instead of the split method you can use find to iterate over all

  • quotes
  • words which are not inside quotes,
  • whitespaces.

This lets you append every quote and every whitespace run to the result unchanged, and check only the words outside quotes to decide whether you should split on them.

A regex representing such parts can look like Pattern.compile("\"[^\"]*\"|\\S+|\\s+");

IMPORTANT: we need to search for ".." first. Otherwise \\S+ would match "NOW CLICK" as two separate parts, "NOW and CLICK", which would prevent it from being seen as a single quotation. This is why the "[^"]*" regex (which represents quotations) is placed at the start of the subregex1|subregex2|subregex3 series.

This regex will allow us to iterate over text

FIRST i go to the homepage NOW i click on button "NOW CLICK" very quick THEN i will become a text result.

as tokens (each token shown in square brackets):

[FIRST][ ][i][ ][go][ ][to][ ][the][ ][homepage][ ][NOW][ ][i][ ][click][ ][on][ ][button][ ]["NOW CLICK"][ ][very][ ][quick][ ][THEN][ ][i][ ][will][ ][become][ ][a][ ][text][ ][result.]

Notice that "NOW CLICK" is treated as a single token. Because of that, even if it contains a keyword you want to split on, the token itself will never be equal to that keyword (it also contains other characters, such as the quotes, or other words). This prevents it from being treated as a delimiter on which the text should be split.
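The tokenization step on its own can be sketched as follows. The class and method names (TokenizeDemo, tokenize) are illustrative assumptions; the regex is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizeDemo {
    // Alternatives are tried left to right, so a quotation wins over \S+.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");

    // Hypothetical helper: returns every token find() produces, in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Brackets make the whitespace tokens visible.
        for (String token : tokenize("button \"NOW CLICK\" very")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

For the fragment button "NOW CLICK" very this yields five tokens: button, a space, "NOW CLICK" (quotes included), a space, and very. Since the quoted token includes its quotes, List.contains with the bare keyword NOW cannot match it.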

Using this idea we can create code like:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
List<String> keywordsToSplitOn = List.of("FIRST", "NOW", "THEN");

// search for quotes ".." | words | whitespace runs
Pattern p = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");
Matcher m = p.matcher(text);

StringBuilder sb = new StringBuilder();
List<String> result = new ArrayList<>();
while (m.find()) {
    String token = m.group();
    // a keyword outside quotes starts a new part: flush what was collected so far
    if (keywordsToSplitOn.contains(token) && sb.length() != 0) {
        result.add(sb.toString());
        sb.setLength(0); // clear sb
    }
    sb.append(token);
}
if (sb.length() != 0) { // include rest of text after last keyword
    result.add(sb.toString());
}

result.forEach(System.out::println);

Output:

FIRST i go to the homepage 
NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.

4 Comments

Interesting that this is the accepted answer, as it adds so much complexity to the simple question.
@JWoodchuck I don't think the main difference is that "this solution is more convoluted", but rather that it uses different assumptions than the other answers (posted before mine). For instance, your solution is based on the assumption that OP doesn't want to split on NOW when it is followed by CLICK, while mine assumes that OP doesn't want to split on NOW when it is placed inside a quotation. Our solutions may give the same results for OP's current example, but will work differently for other sentences like FIRST select option A. NOW CLICK button B.
That makes sense. Guess it depends on their broader requirements considered with other factors like complexity and introducing a different (non-split) approach.
@JWoodchuck Yes. I assumed that this may be yet another case of the XY problem, where OP wants to do something but describes only one aspect or an overly simplified case and shows a failed attempt (here using split). Often the attempt shown in a question isn't mandatory for OP, even if it is part of what OP is asking about.
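The difference in assumptions discussed in these comments can be checked with a small side-by-side sketch. The class and method names (AssumptionsDemo, splitOutsideQuotes, splitUnlessNowClick) are hypothetical; the first method follows the token loop from this answer, the second uses the lookahead regex from the split-based answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AssumptionsDemo {
    private static final List<String> KEYWORDS = List.of("FIRST", "NOW", "THEN");
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");

    // Assumption A: never split on a keyword that sits inside quotes.
    static List<String> splitOutsideQuotes(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (KEYWORDS.contains(token) && sb.length() != 0) {
                result.add(sb.toString());
                sb.setLength(0);
            }
            sb.append(token);
        }
        if (sb.length() != 0) {
            result.add(sb.toString());
        }
        return result;
    }

    // Assumption B: never split on NOW when it is followed by " CLICK".
    static List<String> splitUnlessNowClick(String text) {
        return Arrays.asList(text.split("(?=FIRST)|(?=NOW(?! CLICK))|(?=THEN)"));
    }

    public static void main(String[] args) {
        String other = "FIRST select option A. NOW CLICK button B.";
        System.out.println(splitOutsideQuotes(other).size());  // parts under assumption A
        System.out.println(splitUnlessNowClick(other).size()); // parts under assumption B
    }
}
```

On the sentence from the comment, the quote-aware version splits before NOW (two parts), while the lookahead version leaves the sentence in one piece because NOW is followed by CLICK. On the example from the question both return the same three parts.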
3

You need to use lookaheads, including a negative lookahead for the NOW CLICK case.

Simply changing the regex in your split method to the following should do it:

String[] textArray = text.split("((?=FIRST)|(?=NOW(?! CLICK))|(?=THEN))");

It may be even better to include a space in each expression to prevent splitting on, e.g., NOWHERE:

String[] textArray = text.split("((?=FIRST )|(?=NOW (?!CLICK))|(?=THEN ))");
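As a rough check of the NOWHERE point, the two regexes can be compared on a made-up sentence. The class and method names (LookaheadSplitDemo, splitLoose, splitWithSpaces) and the test sentence are assumptions for illustration; the regexes are the two shown above without the redundant outer group.

```java
import java.util.Arrays;
import java.util.List;

class LookaheadSplitDemo {
    // First variant: splits before any occurrence of the keyword letters.
    static List<String> splitLoose(String text) {
        return Arrays.asList(text.split("(?=FIRST)|(?=NOW(?! CLICK))|(?=THEN)"));
    }

    // Second variant: the keyword must be followed by a space.
    static List<String> splitWithSpaces(String text) {
        return Arrays.asList(text.split("(?=FIRST )|(?=NOW (?!CLICK))|(?=THEN )"));
    }

    public static void main(String[] args) {
        String sample = "FIRST go NOWHERE THEN stop"; // hypothetical input
        System.out.println(splitLoose(sample));
        System.out.println(splitWithSpaces(sample));
    }
}
```

The first variant cuts before NOWHERE and yields three parts; the second leaves NOWHERE alone and yields two. Neither variant checks what precedes the keyword, so a word such as SNOW followed by a space would still be split in the middle; adding \b before each keyword would be one way to address that.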


1

You may use a Pattern and matcher to split the input using groups:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// lazy groups: each part runs up to the first following keyword
Pattern pattern = Pattern.compile("^(FIRST.*?)(NOW.*?)(THEN.*)$");

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";

Matcher matcher = pattern.matcher(text);

if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}

Output:

FIRST i go to the homepage 
NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.


1

You could match the following regular expression.

/\bFIRST +(?:(?!\bNOW\b)[^\n])+(?<! )|\bNOW +(?:(?!\bTHEN\b)[^\n])+(?<! )|\bTHEN +.*/

Start your engine!

Java's regex engine performs the following operations.

\bFIRST +      : match 'FIRST' preceded by a word boundary,
                 followed by 1+ spaces
(?:            : begin a non-capture group
  (?!\bNOW\b)  : use a negative lookahead to assert that
                 the following chars are not 'NOW'  
  [^\n]        : match any char other than a newline
)              : end non-capture group
+              : execute non-capture group 1+ times
(?<! )         : use negative lookbehind to assert that the
                 previous char is not a space
|              : or
\bNOW +        : match 'NOW' preceded by a word boundary,
                 followed by 1+ spaces
(?:            : begin a non-capture group
  (?!\bTHEN\b) : use a negative lookahead to assert that
                 the following chars are not 'THEN'  
  [^\n]        : match any char other than a newline
)              : end non-capture group
+              : execute non-capture group 1+ times
(?<! )         : use negative lookbehind to assert that the
                 previous char is not a space
|              : or
\bTHEN +.*     : match 'THEN' preceded by a word boundary,
                 followed by 1+ spaces then 0+ chars

This uses a technique called the tempered greedy token solution.
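In Java, the expression above could be applied with a find loop, roughly as sketched here. The class and method names (TemperedTokenDemo, extractParts) are made up for illustration; the regex is the one from this answer with backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedTokenDemo {
    private static final Pattern PARTS = Pattern.compile(
            "\\bFIRST +(?:(?!\\bNOW\\b)[^\\n])+(?<! )"     // FIRST ... up to the next NOW
            + "|\\bNOW +(?:(?!\\bTHEN\\b)[^\\n])+(?<! )"   // NOW ... up to the next THEN
            + "|\\bTHEN +.*");                             // THEN ... to end of line

    // Hypothetical helper: collects each match instead of splitting.
    static List<String> extractParts(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = PARTS.matcher(text);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
        extractParts(text).forEach(System.out::println);
    }
}
```

Because the NOW alternative only stops at THEN, the quoted NOW is consumed as ordinary text. The (?<! ) lookbehind makes each match give back its trailing space, so the parts come out trimmed.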


0

You can use lookaheads:

public static void main(String[] args) {
    String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
    String[] textArray = text.split("(?=FIRST)|(?=\\b NOW \\b)|(?=THEN)");
    
    for(String s: textArray) {
        System.out.println(s);
    }
}

Output:

FIRST i go to the homepage
 NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.

1 Comment

This still splits on the second NOW.
