25

I need to get all substrings matching a regex. I know I could probably build an automaton for it, but I am looking for a simpler solution.
The problem is that Matcher.find() doesn't return all results.

String str = "abaca";
Matcher matcher = Pattern.compile("a.a").matcher(str);
while (matcher.find()) {
   System.out.println(str.substring(matcher.start(),matcher.end()));
}

The result is aba, not aba, aca as I want.
Any ideas?
EDIT: another example: for string=abaa, regex=a.*a I am expecting to get aba, abaa, aa.
P.S. If it cannot be achieved using regular expressions, that is also an answer. I just want to know I'm not reinventing the wheel for something the language already provides.

2
  • I had the same problem; take a look at this: stackoverflow.com/questions/5231482/… Commented Apr 18, 2011 at 15:27
  • 1
    The problem is that the matcher only considers non-overlapping matches. Still, this is an interesting problem. +1 Commented Apr 18, 2011 at 15:28

4 Answers

24

You could do something like this:

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static List<String> getAllMatches(String text, String regex) {
        List<String> matches = new ArrayList<String>();
        Matcher m = Pattern.compile("(?=(" + regex + "))").matcher(text);
        while(m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(getAllMatches("abaca", "a.a"));
        System.out.println(getAllMatches("abaa", "a.*a"));
    }
}

which prints:

[aba, aca]
[abaa, aa]

The only thing is that aba is missing from the last list of matches. This is because of the greedy .* in a.*a: the lookahead is tried once per start position, so it yields at most one match per position, here the longest one. You can't fix this with regex alone. You could instead iterate over all possible substrings and call .matches(regex) on each one:

public static List<String> getAllMatches(String text, String regex) {
    List<String> matches = new ArrayList<String>();
    for(int length = 1; length <= text.length(); length++) {
        for(int index = 0; index <= text.length()-length; index++) {
            String sub = text.substring(index, index + length);
            if(sub.matches(regex)) {
                matches.add(sub);
            }
        }
    }
    return matches;
}

If your text stays relatively small, this will work, but the number of substrings grows quadratically with the length of the text, so for larger strings this may become too computationally intensive.
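One avoidable cost in the loop above is that String.matches compiles the regex again on every call. A minimal sketch, assuming the same "every fully matching substring" semantics are wanted, compiles the pattern once and uses Matcher.region to restrict matching to each candidate range. The class name RegionMatches is made up for illustration, and the results come out ordered by start index rather than by length. It still tests a quadratic number of ranges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionMatches {

    // Returns every non-empty substring of text that matches regex in full,
    // ordered by start index, then by end index.
    static List<String> getAllMatches(String text, String regex) {
        List<String> matches = new ArrayList<>();
        // Compile once and reuse a single Matcher for all candidate ranges.
        Matcher m = Pattern.compile(regex).matcher(text);
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                // region() resets the matcher and limits it to [start, end).
                m.region(start, end);
                if (m.matches()) {
                    matches.add(text.substring(start, end));
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(getAllMatches("abaca", "a.a"));  // [aba, aca]
        System.out.println(getAllMatches("abaa", "a.*a"));  // [aba, abaa, aa]
    }
}
```

With the default region bounds, anchors and lookarounds treat the region edges like the ends of the input, so this should behave like matching each substring on its own.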


2 Comments

this is exactly the problem I wanted to know if there is a simple regex solution for..
No, there is not a simple way using only regex. Note that it was not your entire problem: your first issue was that you couldn't get multiple matches because of "overlapping hits", something my suggestion solves (and Dmitrij Golubev's, I might add).
8

By default, a new match starts at the end of the previous one. If your matches can overlap, you need to specify the start point manually:

int start = 0;
while (matcher.find(start)) { 
    ...
    start = matcher.start() + 1;
}
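A self-contained sketch of the loop above, for anyone who wants to try it. The class and method names are illustrative only. Matcher.find(int) resets the matcher and searches from the given index, and it throws IndexOutOfBoundsException if that index is past the end of the input, so the sketch guards the bound.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingFind {

    // Collects one match per start position by restarting the search
    // one character after the start of the previous match.
    static List<String> findOverlapping(String text, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int start = 0;
        // Guard the index: an empty match at the very end would otherwise
        // push start past text.length() and make find(int) throw.
        while (start <= text.length() && matcher.find(start)) {
            matches.add(matcher.group());
            start = matcher.start() + 1;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findOverlapping("abaca", "a.a"));  // [aba, aca]
        System.out.println(findOverlapping("abaa", "a.*a"));  // [abaa, aa]
    }
}
```

This handles overlapping hits, but it still returns at most one match per start position, so a greedy pattern such as a.*a reports only the longest match from each position.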

4 Comments

this is still not enough if I have string=abaa regex=a.*a , I will still get only one result, and not all of them
@amit: what is the expected output for your example above (string=abaa, regex=a.*a)?
@aix: aba, abaa, aa... of course it does not come down to only these simple examples, this is just one point where the suggested solution fails. :\
@amit: Still, you might want to add this example to the question, since it illustrates an aspect of your expectations that isn't necessarily obvious from the question.
0

Use matcher.find(startingFrom) in your while loop, and increase startingFrom to one more than the start of the previous match: startingFrom = matcher.start()+1;

6 Comments

this is exactly what @axtavt suggested, however, it is not enough for this problem; see the edited question (last example) for why
@amit I only saw the answer (and ensuing discussion) from @axtavt after I posted.
@Rikki: deleting the answer would be appreciated in this case.
@amit Whoops, hit enter by accident! In my test of this code on "abaa" =~ m/a.*a/ I get ("abaa", "aa"): so I get more than one result, but not all three that you want. This is because of regex greediness/laziness. a.*a will eat all the string it can. Trying a.*?a instead (making it lazy) will get you ("aba", "aa"). I was hoping something like (a.*?a)|(a.*a) would work, but it doesn't, so you'll have to match the string for both regexes: a.*a and a.*?a, then deduplicate the results.
@amit Ok, I see now. That wasn't at all clear from the original post. I think your example needs to be less specific, or include more iterations (use aba, abaa, abaaa, abaaaa as examples). Nope, I don't think this is possible with regex. You can either be lazy or greedy but not in between.
0

This is sort of a computationally open-ended problem. The question of all possible matches for a regex can be rephrased as

What are all the possible substrings of a given String that match the given regex?

So what your code really needs to do is (pseudo-code):

for (String substring : allPossibleSubstrings) {
    // matches() succeeds only if the whole substring matches the pattern
    if (PATTERN.matcher(substring).matches()) {
        results.add(substring);
    }
}

Now for a string like abaa this is trivial: allPossible = ["a", "ab", "aba", "abaa", "b", "ba", "baa", "aa"] (each distinct substring listed once). You can also add some intelligence by restricting the size of the substrings to the minimal size that can be matched by the regex. Of course, the number of substrings grows quadratically with the length of the string (n(n+1)/2 for length n), so this gets expensive for large strings.
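The pseudo-code and the minimum-size idea can be sketched roughly as follows. The Java regex API does not report the shortest length a pattern can match, so minLength here is a hypothetical parameter the caller has to supply (2 for a.*a, for example). SubstringScan and distinctMatches are illustrative names as well. A LinkedHashSet keeps each matching substring once, in the order it is first seen.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

class SubstringScan {

    // Tests every substring of at least minLength characters against the
    // pattern and returns the distinct ones that match in full.
    static Set<String> distinctMatches(String text, Pattern pattern, int minLength) {
        Set<String> results = new LinkedHashSet<>();
        // Never consider empty substrings, whatever the caller passes.
        int shortest = Math.max(1, minLength);
        for (int start = 0; start + shortest <= text.length(); start++) {
            for (int end = start + shortest; end <= text.length(); end++) {
                String candidate = text.substring(start, end);
                if (pattern.matcher(candidate).matches()) {
                    results.add(candidate);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Assumption: the caller knows a.*a needs at least two characters.
        System.out.println(distinctMatches("abaa", Pattern.compile("a.*a"), 2));  // [aba, abaa, aa]
    }
}
```

Skipping candidates shorter than minLength only trims the constant factor; the number of candidates is still quadratic in the length of the text.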
