
Is it possible to match the following in a single regular expression to get the first word, and then a list of the numbers?

this 10 12 3 44 5 66 7 8    # should return "this", "10", "12", ...
another 1 2 3               # should return "another", "1", "2", "3"

EDIT1: My actual data is not this simple, the digits are actually more complex patterns, but for illustration purposes, I've reduced the problem to simple digits, so I do require a regex answer.

The number of values on each line is not known in advance, but they all match a simple pattern.

The following only matches "this" and "10":

([\p{Alpha}]+ )(\d+ ?)+?

Dropping the final ? matches "this" and "8".

I had thought that the final group (\d+ ?)+ would do the digit matching multiple times, but it doesn't and I can't find the syntax to do it, if possible.

I can do it in multiple passes, by only searching for the name and latter numbers separately, but was wondering if it's possible in a single expression? (And if not, is there a reason?)
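To make the behaviour concrete, here is a minimal Java sketch (the class and method names are illustrative, not part of my solution). As far as I understand it, java.util.regex stores only one capture per group, so each repetition of (\d+ ?) overwrites the previous one: the lazy form stops after the first repetition and the greedy form ends on the last.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupSketch {
    // Hypothetical helper: returns what group 2 holds after the first match,
    // trimmed of the optional trailing space; null if nothing matched.
    static String secondGroup(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(2).trim() : null;
    }

    public static void main(String[] args) {
        String line = "this 10 12 3 44 5 66 7 8";
        // Lazy repetition: one iteration is enough, so group 2 is the first number.
        System.out.println(secondGroup("([\\p{Alpha}]+ )(\\d+ ?)+?", line)); // prints 10
        // Greedy repetition: every iteration overwrites group 2, leaving the last number.
        System.out.println(secondGroup("([\\p{Alpha}]+ )(\\d+ ?)+", line));  // prints 8
    }
}
```

Either way the whole run of numbers is matched; it is only the group that forgets the earlier repetitions.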


EDIT2: As I mentioned in some of the comments, this was a question in Advent of Code (Day 7, 2020). I was looking for the cleanest solution (who doesn't love a bit of polishing?)

Here's the final solution (Kotlin) I used. I spent too long trying to do it in 1 regex, so I posted this question.

val bagExtractor = Regex("""^([\p{Alpha} ]+) bags contain""")
val rulesExtractor = Regex("""([\d]+) ([\p{Alpha} ]+) bag""")

// bagRule is a line from the input
val bag = bagExtractor.find(bagRule)?.destructured!!.let { (n) -> Bag(name = n) }
val contains = rulesExtractor.findAll(bagRule).map { it.destructured.let { (num, bagName) -> Contain(num = num.toInt(), bag = Bag(bagName)) } }.toList()
Rule(bag = bag, contains = contains)

Despite now knowing it can be done in 1 line, I haven't implemented it, as I think it's cleaner in 2.

  • Looking at this, can't you simply split on spaces? And if not, why? Commented Dec 7, 2020 at 17:18
  • this is a very simplified version of the actual input, where the final numbers are more complex patterns (actually of the pattern "<number> <word1> <word2> <other bits>") that exhibit the same behaviour, only matching the first or last expression, never the full list of items. Commented Dec 7, 2020 at 17:24
  • Yes, use String pat = "(\\G(?!^)|\\b\\p{L}+\\b)\\s+(\\d+)";. Group 1 will only be matched when the initial word is matched. You need to use it with matcher.find and some extra code logic. Commented Dec 7, 2020 at 18:12
  • This is wizardry! I tested this at freeformatter.com/java-regex-tester.html#ad-output and, as you say, the initial group is slightly askew, but otherwise it is pretty good. The matches give "other 1", "2", "3". Commented Dec 7, 2020 at 23:38
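
For reference, a sketch of how the \G pattern from the comment above might be driven with Matcher.find. The class and helper names are hypothetical, and the "extra code logic" is just a guess at what was meant: group 1 holds the word on the first match and is empty on the continuation matches, so only non-empty values are kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContinuedMatchSketch {
    // Pattern copied from the comment; \G anchors at the end of the previous match,
    // and (?!^) stops that branch from being used at the very start of the input.
    private static final Pattern PAT =
            Pattern.compile("(\\G(?!^)|\\b\\p{L}+\\b)\\s+(\\d+)");

    // Hypothetical helper: the leading word followed by every number on the line.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAT.matcher(line);
        while (m.find()) {
            if (!m.group(1).isEmpty()) {
                out.add(m.group(1)); // first match only: the initial word
            }
            out.add(m.group(2));     // every match: one number
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("this 10 12 3 44 5 66 7 8")); // [this, 10, 12, 3, 44, 5, 66, 7, 8]
        System.out.println(tokens("another 1 2 3"));            // [another, 1, 2, 3]
    }
}
```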

4 Answers

1

I think what you are looking for can be achieved by splitting the string on \s+ unless I am missing something.

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";
        String[] parts = str.split("\\s+");
        System.out.println(Arrays.toString(parts));
    }
}

Output:

[this, 10, 12, 3, 44, 5, 66, 7, 8]

If you want to select just the alphabetical text and the integer text from the string, you can do it as

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";
        Matcher matcher = Pattern.compile("(\\b\\p{Alpha}+\\b)|(\\b\\d+\\b)").matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

this
10
12
3
44
5
66
7
8

or as

import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";

        List<String> list = Pattern.compile("(\\b\\p{Alpha}+\\b)|(\\b\\d+\\b)")
                            .matcher(str)
                            .results()
                            .map(MatchResult::group)
                            .collect(Collectors.toList());

        System.out.println(list);
    }
}

Output:

[this, 10, 12, 3, 44, 5, 66, 7, 8]

6 Comments

I should have commented faster :) No, this isn't possible with the real data, the "digits" in my actual data are more complex structures made up of multiple words, but do follow a pattern I can match
@MarkFisher - Can you please post here an actual sample (after hiding PII, if any)?
The sample data given should be good enough to test on. I have a solution which is to split the example regex I gave into 2 parts and scan the input twice with each regex. That works fine, I just don't understand why the combination of them doesn't work.
@MarkFisher - I've posted an update. If the input and output are not as per your expectation, feel free to comment with an example input and the expected output.
That works! Nice answer. I went with ([\p{Alpha} ]+) bags contain|(\d+) ([\p{Alpha} ]+) bag on the actual input data which is matching everything I need on the line. Cheers.
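
For anyone landing here later, a rough Java sketch of how the alternation from the last comment could be used on a real rule line. The class, the ParsedRule holder and the parse helper are invented for illustration; only the regex and the sample line come from this thread. Each find() hits exactly one branch, so a null check on group 1 tells the two kinds of match apart.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BagRuleSketch {
    // Branch 1 captures the outer bag; branch 2 captures one "<count> <colour> bag" entry.
    private static final Pattern RULE =
            Pattern.compile("([\\p{Alpha} ]+) bags contain|(\\d+) ([\\p{Alpha} ]+) bag");

    // Hypothetical result holder: outer bag name plus contained bags with their counts.
    static final class ParsedRule {
        final String bag;
        final Map<String, Integer> contents;

        ParsedRule(String bag, Map<String, Integer> contents) {
            this.bag = bag;
            this.contents = contents;
        }
    }

    static ParsedRule parse(String line) {
        String bag = null;
        Map<String, Integer> contents = new LinkedHashMap<>();
        Matcher m = RULE.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                bag = m.group(1);                                        // first branch matched
            } else {
                contents.put(m.group(3), Integer.parseInt(m.group(2))); // second branch matched
            }
        }
        return new ParsedRule(bag, contents);
    }

    public static void main(String[] args) {
        ParsedRule r = parse("posh crimson bags contain 2 mirrored tan bags, 1 faded red bag, 1 striped gray bag.");
        System.out.println(r.bag + " -> " + r.contents);
        // posh crimson -> {mirrored tan=2, faded red=1, striped gray=1}
    }
}
```

A line such as "faded blue bags contain no other bags." yields the outer bag and an empty map, because the second branch needs a leading number.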
0

No. "Find me every occurrence of a certain pattern" is just not done with an incrementing set of groups: a capturing group inside a repetition keeps only the text it matched last. If you're really asking why regexps are what they are, that's an epic thesis that delves into some ancient computing history and a lot of interviews with Larry Wall (author of Perl, which is more or less where the modern regexp syntax was popularized), and it seems a bit beyond the scope of SO. They work that way because they've worked that way for decades, and changing them would mess with people's expectations; let's not go any deeper than that.

You can do this with scanners instead:

Scanner s = new Scanner("this 10 12 3 44 5 66 7 8");
assertEquals("this", s.next());
assertEquals(10, s.nextInt());
// etc

or even:

Scanner s = new Scanner("this 10 12 3 44 5 66 7 8");
assertEquals("this", s.next(Pattern.compile("[\\p{Alpha}]+")));
assertEquals(10, s.nextInt());

s = new Scanner("--00invalid-- 10 12 3 44 5 66 7 8");
// the line below will throw an InputMismatchException
s.next(Pattern.compile("[\\p{Alpha}]+"));

NB: Scanners tokenize: they split the input into a sequence of token, separator, token, separator, etc., then toss the separators and give you the tokens. .next(Pattern) does not mean "keep scanning until you hit something that matches". It just means: grab the next token; if it matches this regexp, great, return it; otherwise, throw an InputMismatchException.

So the real magic is in making the scanner tokenize as you want. This is done with .useDelimiter(), which is also regexp-based. Some fancy footwork with positive lookahead and the like can get you far, but it's not infinitely powerful. You didn't expand on the actual structure of your input, so I can't say whether it'll suffice for your needs.
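
As a small sketch of the delimiter idea, assume (purely for illustration) that the numbers may be separated by commas as well as whitespace. The class and method names below are made up; useDelimiter, hasNextInt and nextInt are the standard Scanner calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class DelimiterSketch {
    // Hypothetical helper: skips the leading word, then reads every integer token.
    static List<Integer> numbersAfterWord(String line) {
        List<Integer> numbers = new ArrayList<>();
        // Treat any run of commas and/or whitespace as a single separator.
        try (Scanner s = new Scanner(line).useDelimiter("[,\\s]+")) {
            s.next();                      // the leading word, e.g. "this"
            while (s.hasNextInt()) {
                numbers.add(s.nextInt());
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(numbersAfterWord("this 10, 12, 3")); // [10, 12, 3]
    }
}
```

The loop stops at the first token that is not an integer, so anything after it is left unread.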

1 Comment

An example of the actual input is posh crimson bags contain 2 mirrored tan bags, 1 faded red bag, 1 striped gray bag. which some may recognise from AOC 2020 day 7 today. I got an answer using 2 regex: ^([\p{Alpha} ]+) bags contain and ([\d]+) ([\p{Alpha} ]+) bag but wanted a single expression to work if possible matching the beginning and then multiple values on the end of the line.
0

Assuming you are talking about this Advent of Code puzzle, where the inputs are rules such as:

light red bags contain 1 bright white bag, 2 muted yellow bags.
dark orange bags contain 3 bright white bags, 4 muted yellow bags.
bright white bags contain 1 shiny gold bag.
muted yellow bags contain 2 shiny gold bags, 9 faded blue bags.
shiny gold bags contain 1 dark olive bag, 2 vibrant plum bags.
dark olive bags contain 3 faded blue bags, 4 dotted black bags.
vibrant plum bags contain 5 faded blue bags, 6 dotted black bags.
faded blue bags contain no other bags.
dotted black bags contain no other bags.

Why search for a complicated regular expression when you can easily split on the word "contain" or on a comma?

String str1 = "light red bags contain 1 bright white bag, 2 muted yellow bags.";
String str2 = "dotted black bags contain no other bags.";
String[] split1 = str1.split("\\scontain\\s|,");
String[] split2 = str2.split("\\scontain\\s|,");

System.out.println(Arrays.toString(split1));
System.out.println(Arrays.toString(split2));

//[light red bags, 1 bright white bag,  2 muted yellow bags.]
//[dotted black bags, no other bags.]
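
If you then need the counts, one possible follow-up (a sketch only; the class name, the contents helper and the small ITEM pattern are my own assumptions, not part of the puzzle) is to trim each piece after the first and pull the number and colour out of it:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitThenParseSketch {
    // Matches one trimmed piece such as "2 muted yellow bags." or "1 bright white bag".
    private static final Pattern ITEM = Pattern.compile("(\\d+) (.+) bags?\\.?");

    // Hypothetical helper: contained bag colour -> count for one rule line.
    static Map<String, Integer> contents(String rule) {
        String[] parts = rule.split("\\scontain\\s|,");
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {      // parts[0] is the outer bag
            Matcher m = ITEM.matcher(parts[i].trim());
            if (m.matches()) {                        // "no other bags." has no number and is skipped
                result.put(m.group(2), Integer.parseInt(m.group(1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(contents("light red bags contain 1 bright white bag, 2 muted yellow bags."));
        // {bright white=1, muted yellow=2}
        System.out.println(contents("dotted black bags contain no other bags."));
        // {}
    }
}
```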

1 Comment

Yes, that's the puzzle for today. I solved it fine; I was just trying to find a single regex to cater for the entire line, hence the question. I'm actually using Kotlin, but the regex syntax is the same between the two. My first solution split on spaces and took 4 words at a time, but it was hideously long and convoluted, so I refactored to regex, removing half the code. I'll post my own solution in the question as it doesn't format well in comments. Thanks for your answer!
0

You said you had to use a regex, but how about a hybrid solution? Use the regex to verify the format, and then split the values on spaces or the delimiter of your choosing. I also return the value in an Optional so you can check that it is present before use.

String[] data = { "this 10 12 3 44 5 66 7 8",
        "Bad Data 5 5 5",
        "another 1 2 3" };

for (String text : data) {
    Optional<List<String>> op = parseText(text);
    if (!op.isEmpty()) {
        System.out.println(op.get());
    }
}

Prints

[this, 10, 12, 3, 44, 5, 66, 7, 8]
[another, 1, 2, 3]

static String pattern = "([a-zA-Z]+)(\\s+\\d+)+";

public static Optional<List<String>> parseText(String text) {
    if (text.matches(pattern)) {
        return Optional.of(Arrays.stream(text.split("\\s+"))
                .collect(Collectors.toList()));
    }
    return Optional.empty();
}

2 Comments

Thank you for your answer. I was trying not to bog down the question with so much detail that the idea would get lost. The question really was about parsing multiple entries in the input data with a regex rather than those specific values, and in retrospect I can understand why some (very good) answers leaned more towards splitting on spaces and similar. It would have helped had I said the input is well formed, so I didn't have to worry about ensuring it matches first before parsing. Tips for me next time I ask a question!
I understand, no problem. It wasn't the splitting on spaces that was the issue (at least for me). It was trying to capture a non-repeating group (alphas) followed by some quantity of numbers. But the important thing is that you got an answer you can use.
