Regexp for string with multiple lines and special structure

Question

I am using Java and want to build two regular expressions that fit two different scenarios:

1:

STARTText blah, blah
\    next line with more text, but the leading backslash
\    next line with more text, but the leading backslash
\    next line with more text, but the leading backslash

The block continues until the first line that no longer starts with a backslash.

2:

Now you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

This block ends with an additional empty line, e.g. after 8978. Additionally, I know that the block of lines starting with digits repeats 10 times and then finishes.

Filtering an individual line is possible, but how do I do it with multiple line breaks in between? For the first block, I also don't really know when or how to end it, or how to search for the backslash. My approach is to have a single closed expression that I could also use for replaceAll().

anubhava · Accepted Answer · 2013-05-31 18:30:29Z

1

Regex 1:

/^STARTText.*?(\r?\n)(?:^\\.*?\1)+/m

Live Demo: http://www.rubular.com/r/G35kIn3hQ4

Regex 2:

/^.*?(\r?\n)(?:^\d{4}\s.*?\1)+/m

Live Demo: http://www.rubular.com/r/TxFbBP1jLJ

EDIT:

Java Demo 1: http://ideone.com/BPNrm6

Regex 1 in Java:

(?m)^STARTText.*?(\\r?\\n)(?:^\\\\.*?\\1)+

Java Demo 2: http://ideone.com/TQB8Gs

Regex 2 in Java:

(?m)^.*?(\\r?\\n)(?:^\\d{4}\\s.*?\\1)+
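
A minimal sketch of how the two Java patterns above might be used, both to extract a block with find() and to replace it with replaceAll(). The class name BlockRegexDemo, the helper firstBlock, and the sample texts are made up for illustration; only the two pattern strings are taken from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockRegexDemo {
    // The two Java patterns from this answer, unchanged.
    static final Pattern BACKSLASH_BLOCK =
        Pattern.compile("(?m)^STARTText.*?(\\r?\\n)(?:^\\\\.*?\\1)+");
    static final Pattern DIGIT_BLOCK =
        Pattern.compile("(?m)^.*?(\\r?\\n)(?:^\\d{4}\\s.*?\\1)+");

    // Returns the first matched block, or null if the pattern is not found.
    static String firstBlock(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; "\\" is one literal backslash.
        String text1 = "intro\n"
            + "STARTText blah, blah\n"
            + "\\    continued\n"
            + "\\    continued again\n"
            + "outro\n";
        System.out.println(firstBlock(BACKSLASH_BLOCK, text1));

        // The same compiled pattern can replace the whole block in one call.
        System.out.println(BACKSLASH_BLOCK.matcher(text1).replaceAll("<block>\n"));

        String text2 = "Now you will see the following links for the items:\n"
            + "1111 first\n"
            + "2565 second\n"
            + "8978 third\n"
            + "\n"
            + "trailing\n";
        System.out.println(firstBlock(DIGIT_BLOCK, text2));
    }
}
```

Note that both patterns require every line of the block, including the last one, to end with the same line break that follows the first line.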

edited May 31, 2013 at 18:30

answered May 31, 2013 at 12:57

anubhava


2 Comments

LeO Over a year ago

Thanks a lot for your effort. Unfortunately the regex doesn't work directly, because Rubular messes up the expression, and without it, it doesn't work in Java. The first regex always gives me java.util.regex.PatternSyntaxException, which is unfortunately also true for the second one. Since I have three other samples, I won't dig too deep to figure out what the problem is. Anyway, thanks.

LeO Over a year ago

Tim + anubhava, both solutions work and I really appreciate them. I am struggling to decide which of the two to accept as the solution. Hard decision...

Community · Accepted Answer · 2017-02-08 14:41:45Z

1

In both cases I'm using a zero-width lookahead assertion like (?=^[^\\]) to find where the block ends, i.e. the next line that no longer starts with what the continuation lines start with.

(?= starts the zero-width lookahead assertion; it requires the value to exist but does not consume it
^[^\\] matches the start of a line followed by any character that is not a \
) closes the assertion
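
To make the "does not consume" part concrete, here is a small sketch that runs only the lookahead over a text and records where it succeeds. The class name LookaheadDemo, the method blockStarts, and the sample input are made up for illustration. Each hit is an empty match, so start() and end() are equal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Offsets of every line that does not start with a backslash.
    static List<Integer> blockStarts(String text) {
        // MULTILINE is needed so that ^ matches at each line start.
        Matcher m = Pattern.compile("(?=^[^\\\\])", Pattern.MULTILINE).matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            // A lookahead-only match is zero-width: nothing is consumed.
            if (m.start() != m.end()) {
                throw new IllegalStateException("expected an empty match");
            }
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Two blocks: "A" plus one continuation line, "C" plus one continuation line.
        String text = "A\n\\b\nC\n\\d\n";
        System.out.println(blockStarts(text));
    }
}
```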

Part 1

This will match all text for part 1 where the first line captured is followed by any number of lines starting with \. The expression needs the MULTILINE and DOTALL flags, as set in the code below.

^([^\\].*?)(?=^[^\\])


    Java Code Example:
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;
    class Module1{
      public static void main(String[] asd){
        // A Java string literal cannot span lines, so the lines are joined with \n
        String sourcestring = "STARTFirstText blah, blah\n"
          + "\\    1next line with more text, but the leading backslash\n"
          + "\\    2next line with more text, but the leading backslash\n"
          + "\\    3next line with more text, but the leading backslash\n"
          + "STARTsecondText blah, blah\n"
          + "\\    4next line with more text, but the leading backslash\n"
          + "\\    5next line with more text, but the leading backslash\n"
          + "\\    6next line with more text, but the leading backslash\n"
          + "foo";
        // MULTILINE lets ^ match at each line start; DOTALL lets . cross line breaks
        Pattern re = Pattern.compile("^([^\\\\].*?)(?=^[^\\\\])",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()){
          for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
            System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
          }
          mIdx++;
        }
      }
    }

    $matches Array:
    (
        [0] => Array
            (
                [0] => STARTFirstText blah, blah
    \    1next line with more text, but the leading backslash
    \    2next line with more text, but the leading backslash
    \    3next line with more text, but the leading backslash

                [1] => STARTsecondText blah, blah
    \    4next line with more text, but the leading backslash
    \    5next line with more text, but the leading backslash
    \    6next line with more text, but the leading backslash

            )

        [1] => Array
            (
                [0] => STARTFirstText blah, blah
    \    1next line with more text, but the leading backslash
    \    2next line with more text, but the leading backslash
    \    3next line with more text, but the leading backslash

                [1] => STARTsecondText blah, blah
    \    4next line with more text, but the leading backslash
    \    5next line with more text, but the leading backslash
    \    6next line with more text, but the leading backslash

            )

    )

Part 2

This will match the first line followed by several lines which start with a number.

^([^\d].*?)(?=^[^\d])


Example

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
    // A Java string literal cannot span lines, so the lines are joined with \n
    String sourcestring = "First you will see the following links for the items:\n"
      + "1111 leading 4 digits and then some text\n"
      + "2565 leading 4 digits and then some text\n"
      + "8978 leading 4 digits and then some text\n"
      + "\n"
      + "Second you will see the following links for the items:\n"
      + "2222 leading 4 digits and then some text\n"
      + "3333 leading 4 digits and then some text\n"
      + "4444 leading 4 digits and then some text";
    // MULTILINE lets ^ match at each line start; DOTALL lets . cross line breaks
    Pattern re = Pattern.compile("^([^\\d].*?)(?=^[^\\d])",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()){
      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

$matches Array:
(
    [0] => Array
        (
            [0] => First you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

            [1] => 

        )

    [1] => Array
        (
            [0] => First you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

            [1] => 

        )

)

edited Feb 8, 2017 at 14:41

CommunityBot


answered May 31, 2013 at 13:41

Ro Yo Mi


4 Comments

LeO Over a year ago

Thanks for your code and results. They work if I am only interested in the question 'Is it included or not?' When I am interested in the result string, the result is quite strange: m.find() finds it, but m.group() shows only the start of the string. Is there a way to find it at the proper position, especially when it's somewhere in the middle of a long text?

Ro Yo Mi Over a year ago

Each found match is captured in the for loop and stored in variable m. To get the character location for a particular match, you'd use m.start() to show where the string starts inside the input text and m.end() where the last character is. This would go inside the for loop. See also javacodegeeks.com/2012/11/…
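
A small sketch of the m.start() / m.end() idea, using the Part 1 expression from this answer. The class name MatchPositionDemo, the method firstSpan, and the sample text are made up for illustration. One detail worth knowing: in java.util.regex.Matcher, end() returns the offset just after the last matched character, so text.substring(m.start(), m.end()) gives back the matched text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPositionDemo {
    // Returns {start, end} of the first block, or null if nothing matches.
    static int[] firstSpan(String text) {
        Pattern re = Pattern.compile("^([^\\\\].*?)(?=^[^\\\\])",
            Pattern.MULTILINE | Pattern.DOTALL);
        Matcher m = re.matcher(text);
        if (!m.find()) {
            return null;
        }
        // start() is inclusive, end() is exclusive.
        return new int[] { m.start(), m.end() };
    }

    public static void main(String[] args) {
        // Hypothetical input: one block of two lines, then a closing line.
        String text = "STARTText blah\n\\ more\nfoo";
        int[] span = firstSpan(text);
        System.out.println("start=" + span[0] + " end=" + span[1]);
        System.out.println(text.substring(span[0], span[1]));
    }
}
```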

LeO Over a year ago

Unfortunately the result doesn't find the string with group. Anyway, thx a lot, and I have two solutions ;)

Tim Pietzcker Over a year ago

@Denomales: Sorry, I meant to write that as a comment to my own post. Got confused with the tiny edit window on my phone...

Tim Pietzcker · Accepted Answer · 2013-05-31 17:25:11Z

1

The first regex:

Pattern regex = Pattern.compile(
    "^          # Start of line\n" +
    "STARTText  # Match this text\n" +
    ".*\\r?\\n  # Match whatever follows on the line plus (CR)LF\n" +
    "(?:        # Match...\n" +
    " ^\\\\     # Start of line, then a backslash\n" +
    " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
    ")*         # Repeat as needed", 
    Pattern.MULTILINE | Pattern.COMMENTS);

The second regex:

Pattern regex = Pattern.compile(
    "(?:        # Match...\n" +
    " ^         # Start of line\n" +
    " \\d{4}\\b # Match exactly four digits\n" +
    " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
    ")+         # Repeat as needed (at least once)", 
    Pattern.MULTILINE | Pattern.COMMENTS);
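
As a rough usage sketch, the second regex can be run in a find() loop to collect every block of four-digit lines. The class name CommentedRegexDemo, the method digitBlocks, and the sample text are made up for illustration; the pattern itself is the one shown above. Because the group is repeated with +, each match contains at least one line, and because every line must end in (CR)LF, a final digit line without a trailing line break is not included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentedRegexDemo {
    // With Pattern.COMMENTS, whitespace and "# ..." inside the pattern are ignored.
    static final Pattern DIGIT_LINES = Pattern.compile(
        "(?:        # Match...\n" +
        " ^         # Start of line\n" +
        " \\d{4}\\b # Match exactly four digits\n" +
        " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
        ")+         # Repeat as needed (at least once)",
        Pattern.MULTILINE | Pattern.COMMENTS);

    // Collects every run of consecutive four-digit lines.
    static List<String> digitBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = DIGIT_LINES.matcher(text);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical input with two digit blocks separated by other lines.
        String text = "Now you will see:\n1111 a\n2565 b\n\nnext:\n8978 c\n";
        for (String block : digitBlocks(text)) {
            System.out.print("block:\n" + block);
        }
    }
}
```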

edited May 31, 2013 at 17:25

answered May 31, 2013 at 12:51

Tim Pietzcker


2 Comments

LeO Over a year ago

Interesting result. The first works very well, but the second seems to have problems. I can perform regex.matcher(myString).find(), but match.group().length equals 0. So the text is found, but not where, even with the sample given. Any clue how to fine-tune it so that match.group() reveals the search result properly?

LeO Over a year ago

Tim + anubhava, both solutions work and I really appreciate them, especially the comments. I am struggling to decide which of the two to accept as the solution. Hard decision...

Viktor Pless · Accepted Answer · 2013-05-31 12:49:13Z

0

Use '\\' for a literal backslash, '\r\n|\r' for one line break, and '\d{4}' for the 4 digits:

.*(\r|r\n)

(your first blahblah)

\\.*(\r|r\n)

(your lines with backslash)

((\d{4}.*(\r|r\n))+(\r|\r\n))+

(your blocks of 4 digits ending with an empty line, the whole repeated with a +)
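
A hedged Java sketch of the last expression, assuming the intended line-break alternation is \r\n|\r|\n with every r escaped (the \n alternative is an addition so that plain LF input also works). The class name DigitBlocksDemo, the constant names, and the helper isDigitBlocks are made up for illustration.

```java
import java.util.regex.Pattern;

public class DigitBlocksDemo {
    // One line break: CRLF, lone CR, or lone LF (assumed intent, see lead-in).
    static final String EOL = "(?:\\r\\n|\\r|\\n)";

    // One or more blocks; each block is one or more four-digit lines
    // followed by an empty line.
    static final Pattern BLOCKS =
        Pattern.compile("(?:(?:\\d{4}.*" + EOL + ")+" + EOL + ")+");

    // True if the whole text consists only of such blocks.
    static boolean isDigitBlocks(String text) {
        return BLOCKS.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDigitBlocks("1111 a\n2222 b\n\n3333 c\n\n"));
        System.out.println(isDigitBlocks("1111 a\n2222 b\n"));
    }
}
```

If the block count is known in advance, as in the question, the final + could be swapped for a counted quantifier such as {10}.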

answered May 31, 2013 at 12:49

Viktor Pless


1 Comment

Jerry Over a year ago

You didn't escape the r in some expressions :)
