3

I am using a system where & followed by a certain letter or number represents a color.
Valid characters that can follow & are [A-Fa-fK-Ok-or0-9]

For example I have the string &aThis is a test &bstring that &ehas plenty &4&lof &7colors.

I want to split at every &x while keeping the &x in the strings.
So I use a positive lookahead in my regex:
(?=(&[A-Fa-fK-Ok-or0-9]))
That works fine for single codes; the output is:

&aThis is a test 
&bstring that 
&ehas plenty 
&4
&lof 
&7colors.

The problem is that the spot with two instances of &x right next to each other should not be split; that line should be &4&lof instead.

Does anyone know what regex I can use so that two instances of &x next to each other are matched together? Two instances of the color code should have priority over a single instance.

4
  • then why not add a space between ?=( and &[ in your regex? Commented Jun 21, 2016 at 2:12
  • Because the whole idea is to have them matched together as one piece, not separate. Commented Jun 21, 2016 at 2:26
  • you said "$a $b" is two, and "$a$b" is one, so isn't that the regex should be more like "(?=( &[A-Fa-fK-Ok-or0-9]))", with a space between ( and & ? Then you just need to give special care to the first characters in the string. I think this is the easiest and direct way. Commented Jun 21, 2016 at 3:02
  • Seems to me you can just add a negative lookbehind: (?i)(?=&[a-fk-o0-9])(?<!&[a-fk-o0-9]). In other words, split on any color code that's not preceded by a color code. Commented Jun 21, 2016 at 8:00
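
The negative-lookbehind idea from the last comment can be tried out with a short sketch. The class and method names below are hypothetical, the character class is the one from the question (so it also includes r), and the code assumes Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string.

```java
import java.util.Arrays;
import java.util.List;

public class LookbehindSplit {
    // Character class taken from the question; the name COLOR is an assumption of this sketch.
    static final String COLOR = "&[A-Fa-fK-Ok-or0-9]";

    // Split before a color code, but only if that position is not
    // immediately preceded by another color code.
    static List<String> split(String s) {
        return Arrays.asList(s.split("(?=" + COLOR + ")(?<!" + COLOR + ")"));
    }

    public static void main(String[] args) {
        String s = "&aThis is a test &bstring that &ehas plenty &4&lof &7colors.";
        for (String part : split(s)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The lookbehind has a fixed width of two characters, which Java's regex engine accepts. A run of three or more codes also stays together, because every code after the first is preceded by a code.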

3 Answers

2

Issue Description

The problem is known: you need to tokenize a string that may contain consecutive separators, and each run of consecutive separators must stay in a single item of the resulting list/array.

Splitting with a lookahead alone cannot help here, because an unanchored lookahead is tested at every position inside the string. If your pattern matched any char in the string, you could use the \G operator, but that is not the case. Even adding a + quantifier, as in s0.split("(?=(?:&[A-Fa-fK-Ok-or0-9])+)"), would still return &4 and &lof as separate tokens, because the lookahead also succeeds at the position right before &l.
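
To see why the quantifier does not change anything, here is a small sketch that runs the quantified lookahead from the paragraph above on the question's string. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class QuantifiedLookaheadSplit {
    // The lookahead is checked at every position, so it also succeeds
    // right before "&l", even though "&4" precedes it.
    static List<String> split(String s) {
        return Arrays.asList(s.split("(?=(?:&[A-Fa-fK-Ok-or0-9])+)"));
    }

    public static void main(String[] args) {
        String s0 = "&aThis is a test &bstring that &ehas plenty &4&lof &7colors.";
        for (String part : split(s0)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The result still has six pieces, with &4 and &lof in separate items.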

Solution

Use matching rather than splitting, and use building blocks to keep it readable.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s0 = "This is a text&aThis is a test &bstring that &ehas plenty &4&lof &7colors.";
String colorRx = "&[A-Fa-fK-Ok-or0-9]";                       // one color code
String nonColorRx = "[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*";  // text up to the next color code
Pattern pattern = Pattern.compile("(?:" + colorRx + ")+" + nonColorRx + "|" + nonColorRx);
Matcher m = pattern.matcher(s0);
List<String> res = new ArrayList<>();
while (m.find()) {
    if (!m.group(0).isEmpty()) res.add(m.group(0)); // Add if non-empty!
}
System.out.println(res);
// => [This is a text, &aThis is a test , &bstring that , &ehas plenty , &4&lof , &7colors.]

The regex is

(?:&[A-Fa-fK-Ok-or0-9])+[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*|[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*

See the regex demo here. It is actually based on your initial pattern: first, we match all the consecutive color codes (1 or more), and then we match 0+ characters that do not start a color code (i.e. all text other than the color codes). The [^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)* subpattern is a synonym of (?s)(?:(?!&[A-Fa-fK-Ok-or0-9]).)*, a construct that is quite handy when you need to match a chunk of text other than the one you specify. As that construct is resource consuming (especially in Java), the unrolled version is preferable.
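
As a rough check of the claim that the two subpatterns are synonyms, the sketch below applies both at the start of the same input and returns what each one consumes. The class, field, and method names are hypothetical, and agreement on a few samples is only an illustration, not a proof.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonColorEquivalence {
    // Unrolled form used in the answer.
    static final Pattern UNROLLED =
            Pattern.compile("[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*");
    // Lookahead-per-character form the answer calls a synonym.
    static final Pattern TEMPERED =
            Pattern.compile("(?s)(?:(?!&[A-Fa-fK-Ok-or0-9]).)*");

    // Returns the text the pattern consumes from the start of s.
    // Both patterns can match an empty string, so lookingAt() always succeeds.
    static String prefix(Pattern p, String s) {
        Matcher m = p.matcher(s);
        m.lookingAt();
        return m.group();
    }

    public static void main(String[] args) {
        String s = "Tom & Jerry &zfoo &aX";
        System.out.println("[" + prefix(UNROLLED, s) + "]");
        System.out.println("[" + prefix(TEMPERED, s) + "]");
    }
}
```

Both forms stop right before &a and consume the lone & and the invalid &z as ordinary text.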

So, the pattern "(?:" + colorRx + ")+" + nonColorRx + "|" + nonColorRx matches 1+ colorRx subpatterns followed by an optional nonColorRx chunk, OR (|) a nonColorRx chunk on its own (for text before the first color code). Since nonColorRx can match an empty string, the !m.group(0).isEmpty() check keeps empty strings out of the resulting list.
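
The same pattern can be wrapped in a helper to try it on input with a literal & that is not a color code. This is a minimal sketch: the ColorTokenizer class and the tokenize method are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColorTokenizer {
    static final String COLOR_RX = "&[A-Fa-fK-Ok-or0-9]";
    static final String NON_COLOR_RX = "[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*";
    static final Pattern TOKEN =
            Pattern.compile("(?:" + COLOR_RX + ")+" + NON_COLOR_RX + "|" + NON_COLOR_RX);

    // Collects every non-empty match: a run of color codes plus the text
    // after it, or a leading chunk of text with no color code in front.
    static List<String> tokenize(String s) {
        List<String> res = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                res.add(m.group());
            }
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Tom & Jerry &4&lbold &x end"));
    }
}
```

Here the lone & and the invalid &x stay inside their tokens, while &4&l starts a single new token.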



0

Something like this will work.

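It uses the String#split method to split on & and collects the resulting pieces into an ArrayList (colorLines), joining a one-character piece (a color code on its own) with the piece that follows it.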

String mainStr = "&aThis is a test &bstring that &ehas plenty &4&lof &7colors";
String [] arr = mainStr.split("&");

List<String> colorLines = new ArrayList<String>();

String lastColor = "";
for (String s : arr)
{
    s = s.trim();
    if (s.length() > 0)
    {
        if (s.length() == 1)
        {
            lastColor += s;
        }
        else
        {
            colorLines.add(lastColor.length() > 0 ? lastColor + s : s);
            lastColor = "";
        }
    }
}

for (String s : colorLines)
{
    System.out.println(s);
}

Output (the & characters are consumed by the split, and each piece is trimmed):

aThis is a test
bstring that
ehas plenty
4lof
7colors


0

I tried:

{

      String line = "&aThis is a test &bstring that &ehas plenty &4&lof &7colors.";
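      // Split on a space followed by '&'. The original pattern " &(a-z)*(0-9)*" behaves the same
      // on this input: (a-z) and (0-9) are groups matching the literal text "a-z" and "0-9",
      // not character classes, so they never match anything here.
      String pattern = " &";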

      String strs[] = line.split(pattern, 0);
      for (int i=0; i<strs.length; i++){
          if (i!=0){
              System.out.println("&"+strs[i]);
          } else {
              System.out.println(strs[i]);
          }
      }

}

and the output is:

&aThis is a test
&bstring that
&ehas plenty
&4&lof
&7colors.

We can add the & at the beginning of all the substrings to get the result you are looking for.
