Java, regex to split a string on a delimiter with constraints

Question

I have a malformed Base64 string in Java.

It is not entirely malformed, but the string sometimes contains several Base64-encoded values concatenated together.

I want to split the string into the individual values, and I think regex is the best way to achieve this.

There are two cases:

If there is only one Base64 value in the string, it either
- does not contain any padding char =
- contains padding char(s) (one or two) only at the end
If there is more than one Base64 value in the string, it will
- contain padding char(s) (one or two) not at the end (or not only at the end)

Now I want to get a String[] which holds the individual Base64 strings.

So the regex must not split if there is no padding char, or if the padding is only at the end. But it has to split if there is padding in the middle (and there can be one or two padding chars).

Test snippet:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

/*
TEST CASES:
output array shall contain one item only
TG9y
TG9yZW0=
TG9yZQ==

output array shall contain two items
TG9yZW0=TG9y
TG9yZW0=TG9yZW0=
TG9yZW0=TG9yZQ==

TG9yZQ==TG9y
TG9yZQ==TG9yZW0=
TG9yZQ==TG9yZQ==

output array shall contain three items
TG9yZW0=TG9yZW0=TG9y
TG9yZQ==TG9yZW0=TG9y
...
*/

public class MyClass {
  public static void main(String[] args) {

    // Put one of the test cases above here.
    String buffer = "";

    try {
      // decode(String) throws IllegalArgumentException on invalid input,
      // e.g. when padding appears in the middle of the string.
      byte[] decodedString = Base64.getDecoder().decode(buffer);
      System.out.println(new String(decodedString, StandardCharsets.UTF_8));
    } catch (IllegalArgumentException e) {
      e.printStackTrace();
      System.err.println("Buffer: " + buffer);
    }
  }
}

I'm not sure whether regex is fully capable of this, or whether it is the best way to do it.

What if the first item is TG9y? How do you know where it ends?
– Andreas, Commented Apr 3, 2020 at 10:40
To split after 1 or more = signs, use a combination of zero-width positive lookbehind and zero-width negative lookahead: String[] arr = buffer.split("(?<==)(?!=)")
– Andreas, Commented Apr 3, 2020 at 10:47
@Andreas: "What if the first item is TG9y?" In that case it does not need to be split. The Base64 parser decodes concatenated Base64 values as long as there is no padding character between them.
– Daniel, Commented Apr 3, 2020 at 10:50
So if the original input was { "Lor", "Lore", "Lorem" }, resulting in combined Base64 text TG9yTG9yZQ==TG9yZW0=, you're ok with that decoding back to { "LorLore", "Lorem" }, losing the separation of the first two inputs?
– Andreas, Commented Apr 3, 2020 at 12:00

Andreas · Accepted Answer · 2020-04-03 12:07:28Z

As mentioned in a comment, you can split the string after an = sign that isn't followed by another = sign, by combining a (?<=X) zero-width positive lookbehind with a (?!X) zero-width negative lookahead:

String[] arr = input.split("(?<==)(?!=)");
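
To see why the negative lookahead matters, here is a small sketch that compares the pattern above with a lookbehind-only variant. The class and method names are made up for illustration. With double padding, the lookbehind alone would also split between the two = signs.

```java
import java.util.Arrays;

public class PaddingSplitDemo {
    // Pattern from the answer: split where the previous char is '='
    // and the next char is not '=' (or the input ends).
    static String[] splitAfterPadding(String input) {
        return input.split("(?<==)(?!=)");
    }

    // Lookbehind only, for comparison: this also splits between two '=' chars.
    static String[] splitAfterEveryEquals(String input) {
        return input.split("(?<==)");
    }

    public static void main(String[] args) {
        String input = "TG9yZQ==TG9y";
        System.out.println(Arrays.toString(splitAfterPadding(input)));     // [TG9yZQ==, TG9y]
        System.out.println(Arrays.toString(splitAfterEveryEquals(input))); // [TG9yZQ=, =, TG9y]
    }
}
```

When the padding sits at the very end of the input, the pattern also matches there, but String.split discards the resulting trailing empty string, so a single value comes back as a one-element array.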

Test

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class SplitTest {
    public static void main(String[] args) {
        String[] inputs = {
                "TG9y",
                "TG9yZW0=",
                "TG9yZQ==",
                "TG9yZW0=TG9y",
                "TG9yZW0=TG9yZW0=",
                "TG9yZW0=TG9yZQ==",
                "TG9yZQ==TG9y",
                "TG9yZQ==TG9yZW0=",
                "TG9yZQ==TG9yZQ==",
                "TG9yZW0=TG9yZW0=TG9y",
                "TG9yZQ==TG9yZW0=TG9y",
                "TG9yTG9yZQ==TG9yZW0=" };
        Base64.Decoder decoder = Base64.getDecoder();
        for (String input : inputs) {
            // Split after each run of '=' padding, then decode every piece in place.
            String[] arr = input.split("(?<==)(?!=)");
            for (int i = 0; i < arr.length; i++)
                arr[i] = new String(decoder.decode(arr[i]), StandardCharsets.US_ASCII);
            System.out.println(Arrays.toString(arr));
        }
    }
}

Output

[Lor]
[Lorem]
[Lore]
[Lorem, Lor]
[Lorem, Lorem]
[Lorem, Lore]
[Lore, Lor]
[Lore, Lorem]
[Lore, Lore]
[Lorem, Lorem, Lor]
[Lore, Lorem, Lor]
[LorLore, Lorem]
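
Since the question also asks whether regex is the best tool, here is a sketch of the same rule written as a plain loop, for comparison only. It is not part of the regex approach above, and the class and method names are hypothetical. It cuts after every = that is not followed by another =, then decodes each piece.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class Base64ChunkScanner {
    // Cuts the input after each run of '=' characters, without regex.
    static List<String> splitChunks(String input) {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < input.length(); i++) {
            boolean isPad = input.charAt(i) == '=';
            boolean nextIsPad = i + 1 < input.length() && input.charAt(i + 1) == '=';
            if (isPad && !nextIsPad) {
                chunks.add(input.substring(start, i + 1));
                start = i + 1;
            }
        }
        // Whatever follows the last padding run (or the whole input if unpadded).
        if (start < input.length()) {
            chunks.add(input.substring(start));
        }
        return chunks;
    }

    // Decodes every chunk; assumes ASCII payloads, as in the test above.
    static List<String> decodeChunks(String input) {
        Base64.Decoder decoder = Base64.getDecoder();
        List<String> decoded = new ArrayList<>();
        for (String chunk : splitChunks(input)) {
            decoded.add(new String(decoder.decode(chunk), StandardCharsets.US_ASCII));
        }
        return decoded;
    }

    public static void main(String[] args) {
        System.out.println(decodeChunks("TG9yZQ==TG9yZW0=TG9y")); // [Lore, Lorem, Lor]
    }
}
```

One difference from the split call: for an empty input this sketch returns an empty list, whereas "".split(...) returns an array holding one empty string. Like the regex, it cannot separate unpadded values that were concatenated, so TG9yTG9yZQ== still decodes to LorLore.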
