11

I have a working regex in Python that I am trying to convert to Java. There seems to be a subtle difference between the two implementations.

The regex is meant to match another regex. The regex in question is:

/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)

One of the strings it has problems with is: /\s+/;

The regex is not supposed to match the trailing ;. In Python the regex works correctly (it does not match the trailing ;), but in Java the match includes the ;.

The Question(s):

  1. What can I do to get this RegEx working in Java?
  2. Based on what I read here there should be no difference for this RegEx. Is there somewhere a list of differences between the RegEx implementations in Python vs Java?
  • Could you post the Python and Java code for your match? Can you clarify what you mean by "This happens in Python but not in Java"? Which language is matching the string, and which one isn't? Commented May 8, 2012 at 3:34
  • That regex looks like it's supposed to match a JavaScript regex literal, is that right? And it's working in Python but you want to run it in Java? Commented May 8, 2012 at 4:51
  • @happydave - clarified the question above. Commented May 8, 2012 at 13:11
  • @AlanMoore - yes. It is trying to match a JavaScript regex. It is working in Python, it is not working in Java. Commented May 8, 2012 at 13:12

2 Answers 2

13

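Java doesn't parse regular expressions the same way as Python in a small set of cases. In this particular case the nested [ characters were causing problems: inside a character class, Java treats an unescaped [ as the start of a nested character class, while Python treats it as a literal. So in Python you don't need to escape a [ inside a character class, but in Java you do.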

The original RegEx (for Python):

/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)

The fixed RegEx (for Java and Python):

/(\\.|[^\[/\\\n]|\[(\\.|[^\]\\\n])*\])+/([gim]+\b|\B)
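To try the fixed regex from Java, something like the following minimal sketch could be used. The class name JsRegexLiteralDemo and the helper firstLiteral are hypothetical names chosen for illustration, and the second sample input is an assumption, not one of the asker's test strings. Note that every backslash of the regex is doubled inside the Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsRegexLiteralDemo {
    // The fixed pattern from above, written as a Java string literal:
    // each backslash in the regex is doubled.
    static final Pattern JS_REGEX_LITERAL = Pattern.compile(
        "/(\\\\.|[^\\[/\\\\\\n]|\\[(\\\\.|[^\\]\\\\\\n])*\\])+/([gim]+\\b|\\B)");

    // Hypothetical helper: returns the first regex literal found in the
    // input, or null if the pattern does not match anywhere.
    static String firstLiteral(String input) {
        Matcher m = JS_REGEX_LITERAL.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The Java literal "/\\s+/;" is the six-character string /\s+/;
        System.out.println(firstLiteral("/\\s+/;"));
        // Assumed extra sample: a character class containing a slash, plus flags.
        System.out.println(firstLiteral("/[/]+/gi;"));
    }
}
```

With this pattern, group 0 for the first input should stop at the closing slash and leave out the trailing ;.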

11

The obvious difference between Java and Python is that in Java you need to escape a lot of characters: every backslash in the regex has to be doubled inside a Java string literal.

Moreover, you are probably running into a mismatch between the matching methods, not a difference in the actual regex notation:

Given the Java code:

String regex, input; // initialized to something
Matcher matcher = Pattern.compile( regex ).matcher( input );
  • Java's matcher.matches() (also Pattern.matches( regex, input )) matches the entire string. Its closest Python equivalent is re.fullmatch( regex, input ), available since Python 3.4; in older versions the same result can be achieved by using re.match( regex, input ) with a regex that ends with $.
  • Java's matcher.find() and Python's re.search( regex, input ) match any part of the string.
  • Java's matcher.lookingAt() and Python's re.match( regex, input ) match the beginning of the string.
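The three Java methods in the list above can be compared side by side with a small sketch. The class MatchMethodsDemo, its helper names, the \d+ pattern and the sample inputs are all assumptions made for illustration. Each helper creates a fresh Matcher so that one call cannot affect the state seen by another.

```java
import java.util.regex.Pattern;

public class MatchMethodsDemo {
    // Illustrative pattern: one or more digits.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches(): the whole input must match the pattern.
    static boolean wholeString(String s) {
        return DIGITS.matcher(s).matches();
    }

    // lookingAt(): the pattern must match at the start of the input.
    static boolean atStart(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    // find(): the pattern may match anywhere in the input.
    static boolean anywhere(String s) {
        return DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"123", "123abc", "abc123"}) {
            System.out.printf("%-8s matches=%b lookingAt=%b find=%b%n",
                s, wholeString(s), atStart(s), anywhere(s));
        }
    }
}
```

For "123abc", only matches() returns false; for "abc123", only find() returns true.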

For more details also read Java's documentation of Matcher and compare to the Python documentation.

Since you said that isn't the problem, I decided to run a test (http://ideone.com/6w61T). It looks like Java is doing exactly what you need it to: group 0, the entire match, doesn't contain the ;. Your problem is elsewhere.

4 Comments

Yes, the characters are being escaped. Great point about the difference in the methods, but that doesn't seem to be the problem here (I verified it, and it seems something is wrong with the regex). Digging into this a little more in the Matcher documentation (docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html), it seems there is an equivalent for Python's re.match in Java: matcher.lookingAt.
@Vineet Noted and edited. I also tested it, and your problem seems to be elsewhere.
re.fullmatch (available since Python 3.4) may do the same thing as matcher.matches.
Thank you! I have used regexes for decades but hardly ever touch Java. The other day it took me at least an hour to find out that .matches() wants to cover the whole string. And I still desperately wish I understood the reason for this unconventional behaviour.
