Multiple matches in single java regexp

Question

Is it possible to match the following in a single regular expression to get the first word, and then a list of the numbers?

this 10 12 3 44 5 66 7 8    # should return "this", "10", "12", ...
another 1 2 3               # should return "another", "1", "2", "3"

EDIT1: My actual data is not this simple; the "digits" are actually more complex patterns. For illustration purposes I've reduced the problem to simple digits, so I do require a regex answer.

The list of numbers is of unknown length on each line, but every item matches a simple pattern.

The following only matches "this" and "10":

([\p{Alpha}]+ )(\d+ ?)+?

Dropping the final ? matches "this" and "8".

I had thought that the final group (\d+ ?)+ would do the digit matching multiple times, but it doesn't and I can't find the syntax to do it, if possible.

I can do it in multiple passes, by only searching for the name and latter numbers separately, but was wondering if it's possible in a single expression? (And if not, is there a reason?)
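
For reference, the behaviour described above can be reproduced with a small Java sketch. In java.util.regex a capturing group holds only one value, so every repetition of (\d+ ?) overwrites the previous capture and only the last iteration is left after the match. The class and method names here (RepeatedGroupDemo, lastCapture) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Hypothetical helper: returns what group 2 holds after the first find(), trimmed.
    static String lastCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(2).trim() : null;
    }

    public static void main(String[] args) {
        String line = "this 10 12 3 44 5 66 7 8";
        // Greedy repetition: the group is overwritten on each pass, the last one wins.
        System.out.println(lastCapture("([\\p{Alpha}]+ )(\\d+ ?)+", line));  // prints 8
        // Lazy repetition: the match stops after one pass, so the first number is kept.
        System.out.println(lastCapture("([\\p{Alpha}]+ )(\\d+ ?)+?", line)); // prints 10
    }
}
```

To collect every number, the matcher has to be advanced repeatedly with find(), rather than relying on a single match with a repeated group.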

EDIT2: As I mentioned in some of the comments, this was a question in Advent of Code (Day 7, 2020). I was looking to find the cleanest solution (who doesn't love a bit of polishing?)

Here's the final solution (Kotlin) that I used. I spent too long trying to do it in one regex, so I posted this question.

val bagExtractor = Regex("""^([\p{Alpha} ]+) bags contain""")
val rulesExtractor = Regex("""([\d]+) ([\p{Alpha} ]+) bag""")

// bagRule is a line from the input
val bag = bagExtractor.find(bagRule)?.destructured!!.let { (n) -> Bag(name = n) }
val contains = rulesExtractor.findAll(bagRule).map { it.destructured.let { (num, bagName) -> Contain(num = num.toInt(), bag = Bag(bagName)) } }.toList()
Rule(bag = bag, contains = contains)

Despite now knowing it can be done in 1 line, I haven't implemented it, as I think it's cleaner in 2.

Looking at this, can't you simply split on spaces? And if not, why? — JvdV
– JvdV, Commented Dec 7, 2020 at 17:18
this is a very simplified version of the actual input, where the final numbers are more complex patterns (actually of the pattern "<number> <word1> <word2> <other bits>") that exhibit the same behaviour, only matching the first or last expression, never the full list of items. — Mark Fisher
– Mark Fisher, Commented Dec 7, 2020 at 17:24
Yes, use String pat = "(\\G(?!^)|\\b\\p{L}+\\b)\\s+(\\d+)";. Group 1 will only be matched when the initial word is matched. You need to use it with matcher.find and some extra code logic. — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Dec 7, 2020 at 18:12
This is wizardry! I tested this at freeformatter.com/java-regex-tester.html#ad-output and as you say, the initial group is slightly askew, but otherwise is pretty good. the matches give "other 1", "2", "3". — Mark Fisher
– Mark Fisher, Commented Dec 7, 2020 at 23:38
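
A minimal sketch of the "matcher.find and some extra code logic" that the \G comment refers to, assuming each input line is processed on its own. The class and method names (ContinuationDemo, parse) are hypothetical. When the \G(?!^) branch is taken, group 1 participates but captures an empty string, so the code checks for emptiness rather than null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContinuationDemo {
    // Pattern quoted in the comment: either continue where the previous match
    // ended (\G, but not at the very start), or match the leading word.
    private static final Pattern PAT =
            Pattern.compile("(\\G(?!^)|\\b\\p{L}+\\b)\\s+(\\d+)");

    static List<String> parse(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAT.matcher(line);
        while (m.find()) {
            // Group 1 is non-empty only on the match that contains the leading word.
            if (!m.group(1).isEmpty()) {
                out.add(m.group(1));
            }
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("this 10 12 3 44 5 66 7 8")); // [this, 10, 12, 3, 44, 5, 66, 7, 8]
        System.out.println(parse("another 1 2 3"));            // [another, 1, 2, 3]
    }
}
```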

Arvind Kumar Avinash · Accepted Answer · 2020-12-07 17:54:17Z

1

I think what you are looking for can be achieved by splitting the string on \s+ unless I am missing something.

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";
        String[] parts = str.split("\\s+");
        System.out.println(Arrays.toString(parts));
    }
}

Output:

[this, 10, 12, 3, 44, 5, 66, 7, 8]

If you want to select just the alphabetical text and the integer text from the string, you can do it as

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";
        Matcher matcher = Pattern.compile("(\\b\\p{Alpha}+\\b)|(\\b\\d+\\b)").matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

this
10
12
3
44
5
66
7
8

or as

import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";

        List<String> list = Pattern.compile("(\\b\\p{Alpha}+\\b)|(\\b\\d+\\b)")
                            .matcher(str)
                            .results()
                            .map(MatchResult::group)
                            .collect(Collectors.toList());

        System.out.println(list);
    }
}

Output:

[this, 10, 12, 3, 44, 5, 66, 7, 8]

edited Dec 7, 2020 at 17:54

answered Dec 7, 2020 at 17:24

Arvind Kumar Avinash

6 Comments

Mark Fisher Over a year ago

I should have commented faster :) No, this isn't possible with the real data, the "digits" in my actual data are more complex structures made up of multiple words, but do follow a pattern I can match

Arvind Kumar Avinash Over a year ago

@MarkFisher - Can you please post here an actual sample (after hiding PII, if any)?

Mark Fisher Over a year ago

The sample data given should be good enough to test on. I have a solution which is to split the example regex I gave into 2 parts and scan the input twice with each regex. That works fine, I just don't understand why the combination of them doesn't work.

Arvind Kumar Avinash Over a year ago

@MarkFisher - I've posted an update. If the input and output are not as per your expectation, feel free to comment with an example input and the expected output.

Mark Fisher Over a year ago

That works! Nice answer. I went with ([\p{Alpha} ]+) bags contain|(\d+) ([\p{Alpha} ]+) bag on the actual input data which is matching everything I need on the line. Cheers.
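
For readers who want to try that alternation, here is a rough Java sketch of how it might be applied to one rule line. The names BagRuleDemo and parse, and the "count x colour" output format, are made up for illustration. In Java, a group that belongs to the alternative that did not match returns null, which is what tells the two cases apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BagRuleDemo {
    // Alternation quoted in the comment above: group 1 is the outer bag,
    // groups 2 and 3 are the count and colour of each contained bag.
    private static final Pattern RULE =
            Pattern.compile("([\\p{Alpha} ]+) bags contain|(\\d+) ([\\p{Alpha} ]+) bag");

    static List<String> parse(String line) {
        List<String> parts = new ArrayList<>();
        Matcher m = RULE.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                parts.add(m.group(1));                       // outer bag colour
            } else {
                parts.add(m.group(2) + " x " + m.group(3));  // contained bag
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [light red, 1 x bright white, 2 x muted yellow]
        System.out.println(parse("light red bags contain 1 bright white bag, 2 muted yellow bags."));
        // prints [dotted black]
        System.out.println(parse("dotted black bags contain no other bags."));
    }
}
```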

rzwitserloot · Accepted Answer · 2020-12-07 17:27:19Z

No. The notion of "find me all of a certain regexp" is just not done with incrementing groups. You're really asking why regexp is what it is? That's... an epic thesis that delves into some ancient computing history and a lot of Larry Wall (author of Perl, which is more or less where the modern regexp flavour came from) interviews, and that seems a bit beyond the scope of SO. They work that way because regexps work that way, and those work that way because they've worked that way for decades and changing them would mess with people's expectations; let's not go any deeper than that.

You can do this with scanners instead:

Scanner s = new Scanner("this 10 12 3 44 5 66 7 8");
assertEquals("this", s.next());
assertEquals(10, s.nextInt());
// etc

or even:

Scanner s = new Scanner("this 10 12 3 44 5 66 7 8");
// backslashes must be doubled inside a Java string literal
assertEquals("this", s.next(Pattern.compile("[\\p{Alpha}]+")));
assertEquals(10, s.nextInt());

s = new Scanner("--00invalid-- 10 12 3 44 5 66 7 8");
// the line below will throw an InputMismatchException
s.next(Pattern.compile("[\\p{Alpha}]+"));

NB: Scanners tokenize: they split the input into a sequence of token, separator, token, separator, etc., then toss the separators and give you the tokens. .next(Pattern) does not mean "keep scanning until you hit something that matches". It just means: grab the next token. If it matches this regexp, great, return it. Otherwise, throw an InputMismatchException.

So, the real magic is in making the scanner tokenize as you want. This is done by using .useDelimiter(), which is also regexp based. Some fancy footwork with positive lookahead and co. can get you far, but it's not infinitely powerful. You didn't expand on the actual structure of your input so I can't say if it'll suffice for your needs.
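
As a rough illustration of the delimiter idea, here is a sketch that assumes input shaped like an Advent of Code 2020 day 7 rule line. The delimiter regex and the names ScannerDelimiterDemo and tokens are my own assumptions, not something taken from the question; the delimiter treats " bags contain " and " bag," / " bags." as separators so that only the interesting parts come back as tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerDelimiterDemo {
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(line)) {
            // Assumed delimiter: " bags contain " or " bag"/" bags" followed by
            // a comma or full stop and an optional space.
            s.useDelimiter(" bags contain | bags?[,.] ?");
            while (s.hasNext()) {
                out.add(s.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [posh crimson, 2 mirrored tan, 1 faded red, 1 striped gray]
        System.out.println(tokens(
                "posh crimson bags contain 2 mirrored tan bags, 1 faded red bag, 1 striped gray bag."));
    }
}
```

Each token after the first still needs splitting into a count and a colour, so this moves the regex work into the delimiter instead of removing it.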

An example of the actual input is posh crimson bags contain 2 mirrored tan bags, 1 faded red bag, 1 striped gray bag. which some may recognise from AOC 2020 day 7 today. I got an answer using 2 regex: ^([\p{Alpha} ]+) bags contain and ([\d]+) ([\p{Alpha} ]+) bag but wanted a single expression to work if possible matching the beginning and then multiple values on the end of the line.

Eritrean · Accepted Answer · 2020-12-07 18:41:55Z

0

Assuming you are talking about this: adventofcode where the inputs are the rules

light red bags contain 1 bright white bag, 2 muted yellow bags.
dark orange bags contain 3 bright white bags, 4 muted yellow bags.
bright white bags contain 1 shiny gold bag.
muted yellow bags contain 2 shiny gold bags, 9 faded blue bags.
shiny gold bags contain 1 dark olive bag, 2 vibrant plum bags.
dark olive bags contain 3 faded blue bags, 4 dotted black bags.
vibrant plum bags contain 5 faded blue bags, 6 dotted black bags.
faded blue bags contain no other bags.
dotted black bags contain no other bags.

Why search for a complicated regular expression when you can easily split on the word contain or on a comma?

String str1 = "light red bags contain 1 bright white bag, 2 muted yellow bags.";
String str2 = "dotted black bags contain no other bags.";
String[] split1 = str1.split("\\scontain\\s|,");
String[] split2 = str2.split("\\scontain\\s|,");

System.out.println(Arrays.toString(split1));
System.out.println(Arrays.toString(split2));

//[light red bags, 1 bright white bag,  2 muted yellow bags.]
//[dotted black bags, no other bags.]

answered Dec 7, 2020 at 18:41

Eritrean

1 Comment

Mark Fisher Over a year ago

Yes, that's the puzzle for today. I solved it fine; I was just trying to find a single regex to cater for the entire line, hence the question. I'm actually using Kotlin, but the regex is the same between the two. I had used split on space and taking 4 words at a time in my first solution but it was hideously long and convoluted, then refactored to regex, removing half the code. I'll post my own solution in the question as it doesn't format well in comments. Thanks for your answer!

WJS · Accepted Answer · 2020-12-07 20:12:04Z

0

You said you had to use a regex, but how about a hybrid solution? Use the regex to verify the format and then split the values on spaces or the delimiter of your choosing. I also return the value in an Optional so you can check its availability before use.

String[] data = { "this 10 12 3 44 5 66 7 8",
        "Bad Data 5 5 5",
        "another 1 2 3" };

for (String text : data) {
    Optional<List<String>> op = parseText(text);
    if (!op.isEmpty()) {
        System.out.println(op.get());
    }
}

Prints

[this, 10, 12, 3, 44, 5, 66, 7, 8]
[another, 1, 2, 3]

static String pattern = "([a-zA-Z]+)(\\s+\\d+)+";
    
public static Optional<List<String>> parseText(String text) {
    if (text.matches(pattern)) {
        return Optional.of(Arrays.stream(text.split("\\s+"))
                .collect(Collectors.toList()));
    }
    return Optional.empty();
}

edited Dec 7, 2020 at 20:12

answered Dec 7, 2020 at 18:16

WJS

2 Comments

Mark Fisher Over a year ago

Thank you for your answer. I was trying not to bog down the question with so much detail that the idea would get lost. The question really was about parsing multiple entries in the input data with regex rather than those specific values, and in retrospect I can understand why some (very good) answers leaned more towards splitting on spaces and similar. It would have helped had I said the input is well formed, so I didn't have to worry about ensuring it matches first before parsing. Tips for me next time I ask a question!

WJS Over a year ago

I understand -- no problems. It wasn't the splitting on spaces that was the issue (at least for me). It was trying to capture a non-repeating group (alphas) followed by some quantity of numbers. But the important thing is that you got an answer you can use.
