Negative lookahead regex not working in Java

Question

The following regex works when tested here, but when I use it in my Java code it doesn't return a match. It uses a negative lookahead to ensure that no blank line (two consecutive newlines) occurs between MAIN LEVEL and Bedrooms. Why won't it work in Java?

regex

^\s*\bMAIN LEVEL\b\n(?:(?!\n\n)[\s\S])*\bBedrooms:\s*(.*)

Java

pattern = Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
match = pattern.matcher(content);
if (match.find())
{
    // Doesn't reach here
    String bed = match.group(1);
    bed = bed.trim();
}

content is just a string read from a text file, which contains the exact text shown in the demo linked above.

File file = new File("C:\\Users\\ME\\Desktop\\content.txt");
content = new Scanner(file).useDelimiter("\\Z").next();

UPDATE:

I changed my code to include a multiline modifier (?m), but it prints out "null".

pattern = Pattern.compile("(?m)^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
match = pattern.matcher(content);
if (match.find())
{   // Still not reaching here
    mainBeds = match.group(1);
    mainBeds = mainBeds.trim();
}
System.out.println(mainBeds);     // Prints null

i.e., Pattern.compile("(?m)^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
– Avinash Raj, Commented Dec 27, 2015 at 4:44
Thanks @AvinashRaj, but it's returning null. I've updated my question to reflect the changes. Any ideas?
– Mathomatic, Commented Dec 27, 2015 at 4:54
If I simply use pattern = Pattern.compile("Bedrooms:\\s(\\d+)"); then it properly prints 666.
– Mathomatic, Commented Dec 27, 2015 at 5:00
Is match a Matcher object? Matcher match = pattern.matcher(content);
– Avinash Raj, Commented Dec 27, 2015 at 5:11

CosmicGiant (edited by CrazyChucky) · Accepted Answer · 2022-12-29 13:25:08Z

5

The problem:

As explained in Alan Moore's answer, the line separators used in your file (\r\n) don't match the ones your pattern specifies (\n):

Original code:
Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");

Note: I explain what the \r and \n represent, and the context and difference between \r\n and \n, in the second item of the "side notes" section.

The solution(s):

Most/all Java versions:
You can use \r?\n to match both formats, and this is sufficient in most cases.
Most/all Java versions:
You can use \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029] to match "Any Unicode linebreak sequence".
Java 8 and later:
You can use the Linebreak Matcher (\R). It is equivalent to the second method (above), and whenever possible (Java 8 or later), this is the recommended method.

Resulting code (3rd method):
Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\R(?:(?!\\R\\R)[\\s\\S])*\\bBedrooms:\\s*(.*)");
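To make the difference between the three methods concrete, here is a minimal sketch that checks each separator pattern against four line-break sequences. The class name LineBreakTokens and its constants are hypothetical names chosen for this illustration; only Pattern.matches comes from the Java API, and the third constant assumes Java 8 or later.

```java
import java.util.regex.Pattern;

class LineBreakTokens {
    // The three alternatives from the list above, written as Java string literals.
    static final String CRLF_OR_LF = "\\r?\\n";
    static final String EXPLICIT =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";
    static final String LINEBREAK_MATCHER = "\\R"; // needs Java 8 or later

    // True if the separator regex matches the whole line-break sequence.
    static boolean matches(String separator, String lineBreak) {
        return Pattern.matches(separator, lineBreak);
    }

    public static void main(String[] args) {
        String[] names = {"CRLF", "LF", "CR", "LINE SEPARATOR"};
        String[] breaks = {"\r\n", "\n", "\r", "\u2028"};
        for (int i = 0; i < breaks.length; i++) {
            System.out.println(names[i]
                + ": first=" + matches(CRLF_OR_LF, breaks[i])
                + ", second=" + matches(EXPLICIT, breaks[i])
                + ", third=" + matches(LINEBREAK_MATCHER, breaks[i]));
        }
    }
}
```

The first method accepts only CRLF and LF, so it reports false for a lone CR and for the Unicode line separator, while the second and third methods accept all four sequences.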

Side notes:

You can replace \\R\\R with \\R{2}, which is more readable.
Different formats of line-breaks exist and are used in different systems because early OSs inherited the "line-break logic" from mechanical typing machines, like typewriters.

The \r in code represents a Carriage-Return, aka CR. The idea behind this is to return the typing cursor to the start of the line.

The \n in code represents a Line-Feed, aka LF. The idea behind this is to move the typing cursor to the next line.

The most common line-break formats are CR-LF (\r\n), used primarily by Windows, and LF (\n), used by most UNIX-like systems. This is why \r?\n is sufficient in most cases, and why you can usually rely on it for files produced on consumer-grade systems.

However, some rarer systems, usually industrial-grade ones such as servers, may use CR, LF-CR, or something else entirely, which is why the second method lists so many characters. If you need the code to be compatible with every system, use the second or, preferably, the third method.

Here is a useful method for testing where your patterns are failing:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

String content = "..."; // Replace "..." with your content.
String patternString = "..."; // Replace "..." with your pattern.
String lastPatternSuccess = "None. You suck at Regex!";
// Try every prefix of the pattern and remember the longest one that still finds a match.
for (int i = 0; i <= patternString.length(); i++) {
  try {
    String patternSubstring = patternString.substring(0, i);
    Pattern pattern = Pattern.compile(patternSubstring);
    Matcher matcher = pattern.matcher(content);
    if (matcher.find()) {
      lastPatternSuccess = i + " - Pattern: " + patternSubstring + " - Match: \n" + matcher.group();
    }
  } catch (PatternSyntaxException ex) {
    // This prefix is not a valid regex on its own (e.g. an unclosed group); skip it.
  }
}
System.out.println(lastPatternSuccess);
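Related to the second side note, it can help to check which line-break format your content actually uses before choosing a pattern. The following is a minimal sketch, not part of the original answer; the class LineEndingReport and its count method are hypothetical names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineEndingReport {
    // CRLF is listed first so that a "\r\n" pair is counted once, not as CR plus LF.
    private static final Pattern LINE_BREAK = Pattern.compile("\\r\\n|\\n|\\r");

    // Returns the counts as {CRLF pairs, lone LF, lone CR}.
    static int[] count(String text) {
        int crlf = 0, lf = 0, cr = 0;
        Matcher m = LINE_BREAK.matcher(text);
        while (m.find()) {
            switch (m.group()) {
                case "\r\n": crlf++; break;
                case "\n":   lf++;   break;
                default:     cr++;   break;
            }
        }
        return new int[] {crlf, lf, cr};
    }

    public static void main(String[] args) {
        // Hypothetical sample mixing all three formats.
        int[] counts = count("a\r\nb\nc\rd");
        System.out.println("CRLF=" + counts[0] + " LF=" + counts[1] + " CR=" + counts[2]);
    }
}
```

If the CRLF count is non-zero, a pattern that only looks for \n will fail in the way described at the top of this answer.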

edited Dec 29, 2022 at 13:25

CrazyChucky


answered Dec 27, 2015 at 5:45

CosmicGiant



5 Comments

Alan Moore Over a year ago

Okay, this does improve on my answer, and that troubleshooting gimmick is pretty cute!

Mathomatic Over a year ago

Thank you, I'll check it out when I get home and report back.

CosmicGiant Over a year ago

@AlanMoore Thank you! =)

Mathomatic Over a year ago

@AlmightyR, I actually had been aware of this requirement for either \r\n or \n depending on the system, but this totally escaped me last night. Regardless, your answer helped me understand the reason for this difference, so thank you. I ended up using \\R and \\R{2}. Quick question: I had to remove the ^ from the beginning of the line for the regex to find a match. This is strange since I copied the demo text directly from the printed content. Could removal of ^ be required due to the following code using a delimiter? content = new Scanner(file).useDelimiter("\\Z").next(); Thx

CosmicGiant Over a year ago

@JimJim That might be the case: "By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence."
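The quoted behavior can be checked with a short sketch. The class CaretDemo, its found helper and the sample input are hypothetical and only illustrate the documentation quoted above; whether this is what happened in the asker's code depends on content that isn't shown here.

```java
import java.util.regex.Pattern;

class CaretDemo {
    // True if the regex, compiled with the given flags, is found anywhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Hypothetical input in which MAIN LEVEL is not on the first line.
        String input = "UPPER LEVEL\r\nMAIN LEVEL\r\nBedrooms: 3";

        // Default mode: ^ only matches at the very start of the input.
        System.out.println(found("^MAIN LEVEL", 0, input));                 // false
        // MULTILINE, set through the flag or inline with (?m): ^ also matches after a line terminator.
        System.out.println(found("^MAIN LEVEL", Pattern.MULTILINE, input)); // true
        System.out.println(found("(?m)^MAIN LEVEL", 0, input));             // true
    }
}
```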

Alan Moore · Accepted Answer · 2015-12-27 05:26:23Z

2

It's the line separators. You're looking for \n, but your file actually uses \r\n. If you're running Java 8, you can change every \\n in your code to \\R (the universal line separator). For Java 7 or earlier, use \\r?\\n.

answered Dec 27, 2015 at 5:26

Alan Moore


3 Comments

Alan Moore Over a year ago

You need to make all the line separators \r\n to see what I'm talking about. The first \n in the regex fails to match the first line separator (after MAIN LEVEL), and if you fix that, the negative lookahead ((?!\\n\\n)) incorrectly matches the blank lines between sections, causing the regex to match the whole string.
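The two failure points described in this comment can be reproduced with a small sketch. The sample text below is hypothetical (the real demo text isn't included here), and the class and constant names are made up for illustration. The corrected pattern uses \r?\n, as suggested for Java 7 or earlier, so the example doesn't rely on \R.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadFailureModes {
    // Pattern from the question's update: every separator is a bare \n.
    static final String ORIGINAL =
        "(?m)^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)";
    // Only the separator after MAIN LEVEL is fixed; the lookahead still expects \n\n.
    static final String FIRST_SEPARATOR_FIXED =
        "(?m)^\\s*\\bMAIN LEVEL\\b\\r?\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)";
    // Both the separator and the lookahead accept \r\n as well as \n.
    static final String BOTH_FIXED =
        "(?m)^\\s*\\bMAIN LEVEL\\b\\r?\\n(?:(?!(?:\\r?\\n){2})[\\s\\S])*\\bBedrooms:\\s*(.*)";

    // Returns the trimmed first capture group, or null when nothing matches.
    static String capture(String regex, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample with Windows (CRLF) line separators.
        String content = "MAIN LEVEL\r\nKitchen: 1\r\nBedrooms: 3\r\n\r\n"
                       + "LOWER LEVEL\r\nBedrooms: 1\r\n";
        System.out.println(capture(ORIGINAL, content));              // null
        System.out.println(capture(FIRST_SEPARATOR_FIXED, content)); // 1 (wrong section)
        System.out.println(capture(BOTH_FIXED, content));            // 3
    }
}
```

With only the first separator fixed, the lookahead never sees \n\n in CRLF text, so the match runs past the blank line and picks up the last Bedrooms entry in the input instead of the one under MAIN LEVEL.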

CosmicGiant Over a year ago

Removed obsolete comments. Answer explains the issue and solution in a clear, concise manner; +1. But I think I can do better, so I'm going to shamelessly copy-&-improve (with due credit).

Mathomatic Over a year ago

Yes, this was the issue, Alan. Thank you. However (as explained in the comment above), I had to remove the leading caret ^ from my Java code for it to work, even though the demo text above, which works properly, was copied from the code's content string. I'm wondering if this has to do with using a delimiter after retrieving the file, like so:

File file = new File("C:\\Users\\ME\\Desktop\\content.txt");
content = new Scanner(file).useDelimiter("\\Z").next();

Thoughts?
