1

I am using Java and want to build two regular expressions that fit two different scenarios:

1:

STARTText blah, blah
\    next line with more text, but the leading backslash
\    next line with more text, but the leading backslash
\    next line with more text, but the leading backslash

The block continues until a line no longer starts with a backslash.

2:

Now you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

This block ends with an additional empty line after the last entry (8978 in this example). I also know that the block of lines with leading digits repeats 10 times and then ends.

Matching an individual line is possible, but how do I match across multiple line breaks? For the first block, I also don't know how to detect where it ends, or how to search for the backslash. My goal is a single self-contained expression for each case, which I could also use with replaceAll().

4 Answers

1

Regex 1:

/^STARTText.*?(\r?\n)(?:^\\.*?\1)+/m

Live Demo: http://www.rubular.com/r/G35kIn3hQ4

Regex 2:

/^.*?(\r?\n)(?:^\d{4}\s.*?\1)+/m

Live Demo: http://www.rubular.com/r/TxFbBP1jLJ

EDIT:

Java Demo 1: http://ideone.com/BPNrm6

Regex 1 in Java:

(?m)^STARTText.*?(\\r?\\n)(?:^\\\\.*?\\1)+

Java Demo 2: http://ideone.com/TQB8Gs

Regex 2 in Java:

(?m)^.*?(\\r?\\n)(?:^\\d{4}\\s.*?\\1)+
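As a rough illustration of how the two Java patterns above might be used, here is a minimal sketch. The class name BlockRegexDemo, the helper firstBlock, and the sample text are assumptions made for this example; only the two pattern strings come from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockRegexDemo {
    // Regex 1 from the answer: a STARTText line followed by one or more
    // lines that begin with a backslash.
    static final Pattern BACKSLASH_BLOCK =
        Pattern.compile("(?m)^STARTText.*?(\\r?\\n)(?:^\\\\.*?\\1)+");

    // Regex 2 from the answer: a header line followed by one or more
    // lines that begin with four digits and a whitespace character.
    static final Pattern DIGIT_BLOCK =
        Pattern.compile("(?m)^.*?(\\r?\\n)(?:^\\d{4}\\s.*?\\1)+");

    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstBlock(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up sample input; backslashes are doubled inside Java literals.
        String text = "intro\n"
            + "STARTText blah, blah\n"
            + "\\    line a\n"
            + "\\    line b\n"
            + "after\n";
        System.out.println(firstBlock(BACKSLASH_BLOCK, text));
        // The same pattern object can drive replaceAll(), as asked in the question.
        System.out.println(BACKSLASH_BLOCK.matcher(text).replaceAll(""));
    }
}
```

Note that the back-reference \1 requires every line in a block to use the same line ending as the first line, so input with mixed \n and \r\n endings would need a different approach.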

2 Comments

Thanks a lot for your effort. Unfortunately the regex doesn't work directly, because Rubular messes up the expression, and without it, it doesn't work in Java. The first regex always gives me a java.util.regex.PatternSyntaxException, and so does the second one. Since I have three other samples, I won't dig too deep to figure out what the problem is. Anyway, thanks.
Tim + anubhava, both solutions work and I really appreciate them. I am struggling to decide which of the two to accept as the solution. Hard decision...
1

In both cases I'm using a zero-width lookahead assertion like (?=^[^\\]) to find where the block ends: the match stops just before the next line that no longer has what I'm looking for.

  • (?= starts the zero-width lookahead assertion; this requires the value to exist but does not consume it
  • ^[^\\] matches the start of a line followed by any character other than a \
  • ) closes the assertion
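To make the "does not consume" point concrete, here is a small sketch. The class LookaheadDemo, the method allMatches, and the pattern around the lookahead are assumptions for illustration; only the lookahead (?=^[^\\]) itself is taken from the explanation above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Matches one whole line (including its line break) only if the NEXT line
    // starts with something other than a backslash. The lookahead checks the
    // next line's first character without making it part of the match.
    static final Pattern LINE_BEFORE_NON_BACKSLASH =
        Pattern.compile("^.*\\r?\\n(?=^[^\\\\])", Pattern.MULTILINE);

    // Hypothetical helper: collects every match in order.
    static List<String> allMatches(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = LINE_BEFORE_NON_BACKSLASH.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Four lines: "a", "\b", "c", "d". Line "a" is skipped because the
        // line after it starts with a backslash.
        String text = "a\n\\b\nc\nd";
        for (String match : allMatches(text)) {
            System.out.print("match: " + match);
        }
    }
}
```

With this input, the line "c" is first inspected by the lookahead while "\b" is being matched, and is then still available to be matched itself on the next find() call, which is what "zero-width" means in practice.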

Part 1

This will match all text for part 1, where the first captured line is followed by any number of lines starting with \.

^([^\\].*?)(?=^[^\\])

    Java Code Example:
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;
    class Module1{
      public static void main(String[] asd){
      // A Java string literal cannot span several lines, so each line ends
      // with \n and every literal backslash is written as \\
      String sourcestring = "STARTFirstText blah, blah\n"
        + "\\    1next line with more text, but the leading backslash\n"
        + "\\    2next line with more text, but the leading backslash\n"
        + "\\    3next line with more text, but the leading backslash\n"
        + "STARTsecondText blah, blah\n"
        + "\\    4next line with more text, but the leading backslash\n"
        + "\\    5next line with more text, but the leading backslash\n"
        + "\\    6next line with more text, but the leading backslash\n"
        + "foo";
      Pattern re = Pattern.compile("^([^\\\\].*?)(?=^[^\\\\])",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
      Matcher m = re.matcher(sourcestring);
      int mIdx = 0;
        // Print every group of every match, indexed as [match][group]
        while (m.find()){
          for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
            System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
          }
          mIdx++;
        }
      }
    }

    $matches Array:
    (
        [0] => Array
            (
                [0] => STARTFirstText blah, blah
    \    1next line with more text, but the leading backslash
    \    2next line with more text, but the leading backslash
    \    3next line with more text, but the leading backslash

                [1] => STARTsecondText blah, blah
    \    4next line with more text, but the leading backslash
    \    5next line with more text, but the leading backslash
    \    6next line with more text, but the leading backslash

            )

        [1] => Array
            (
                [0] => STARTFirstText blah, blah
    \    1next line with more text, but the leading backslash
    \    2next line with more text, but the leading backslash
    \    3next line with more text, but the leading backslash

                [1] => STARTsecondText blah, blah
    \    4next line with more text, but the leading backslash
    \    5next line with more text, but the leading backslash
    \    6next line with more text, but the leading backslash

            )

    )

Part 2

This will match the first line followed by several lines which start with a digit.

^([^\d].*?)(?=^[^\d])

Example

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
  // A Java string literal cannot span several lines, so each line ends with \n
  String sourcestring = "First you will see the following links for the items:\n"
    + "1111 leading 4 digits and then some text\n"
    + "2565 leading 4 digits and then some text\n"
    + "8978 leading 4 digits and then some text\n"
    + "\n"
    + "Second you will see the following links for the items:\n"
    + "2222 leading 4 digits and then some text\n"
    + "3333 leading 4 digits and then some text\n"
    + "4444 leading 4 digits and then some text";
  Pattern re = Pattern.compile("^([^\\d].*?)(?=^[^\\d])",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
  Matcher m = re.matcher(sourcestring);
  int mIdx = 0;
    // Print every group of every match, indexed as [match][group]
    while (m.find()){
      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

$matches Array:
(
    [0] => Array
        (
            [0] => First you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

            [1] => 

        )

    [1] => Array
        (
            [0] => First you will see the following links for the items:
1111 leading 4 digits and then some text
2565 leading 4 digits and then some text
8978 leading 4 digits and then some text

            [1] => 

        )

)

4 Comments

Thanks for your code and results. They work if I am only interested in the question 'Is it included or not?' When I am interested in the result string, the result is quite strange: m.find() finds it, but m.group() shows only the start of the string. Is there a way to find it at the proper position, especially when it's somewhere in the middle of a long text?
Each found match is captured in the for loop and stored in variable m. To get the character location for a particular match, you'd use m.start() to show where the string starts inside the input text and m.end() where the last character is. This would go inside the for loop. See also javacodegeeks.com/2012/11/…
Unfortunately the result doesn't find the string with group. Anyway, thx a lot, and I have two solutions ;)
@Denomales: Sorry, I meant to write that as a comment to my own post. Got confused with the tiny edit window on my phone...
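The m.start() / m.end() suggestion from the comments above can be sketched as follows. The class MatchPositionDemo, the method spans, and the sample text are assumptions for illustration; the pattern is the Part 1 expression from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchPositionDemo {
    // Part 1 pattern from the answer: a block that starts on a line without a
    // leading backslash and runs up to the next such line.
    static final Pattern BLOCK = Pattern.compile(
        "^([^\\\\].*?)(?=^[^\\\\])", Pattern.MULTILINE | Pattern.DOTALL);

    // Hypothetical helper: returns {start, end} offsets for every match.
    static List<int[]> spans(String text) {
        List<int[]> result = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            // start() is the index of the first matched character,
            // end() is the index just past the last matched character.
            result.add(new int[] { m.start(), m.end() });
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input with the block of interest in the middle of the text.
        String text = "intro text\nSTARTText blah\n\\ a\nfoo";
        for (int[] span : spans(text)) {
            System.out.println(span[0] + ".." + span[1] + ": "
                + text.substring(span[0], span[1]));
        }
    }
}
```

Because the pattern accepts any line that does not start with a backslash as a block start, the leading "intro text" line is reported as its own match before the STARTText block; the offsets make it possible to tell the two apart.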
1

The first regex:

Pattern regex = Pattern.compile(
    "^          # Start of line\n" +
    "STARTText  # Match this text\n" +
    ".*\\r?\\n  # Match whatever follows on the line plus (CR)LF\n" +
    "(?:        # Match...\n" +
    " ^\\\\     # Start of line, then a backslash\n" +
    " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
    ")*         # Repeat as needed", 
    Pattern.MULTILINE | Pattern.COMMENTS);

The second regex:

Pattern regex = Pattern.compile(
    "(?:        # Match...\n" +
    " ^         # Start of line\n" +
    " \\d{4}\\b # Match exactly four digits\n" +
    " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
    ")+         # Repeat as needed (at least once)", 
    Pattern.MULTILINE | Pattern.COMMENTS);
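Since the question mentions replaceAll(), here is a minimal sketch of how the two commented patterns above could be applied that way. The class CommentedRegexDemo, the helper removeBlocks, and the sample inputs are assumptions for this example; the pattern text is copied from the answer.

```java
import java.util.regex.Pattern;

class CommentedRegexDemo {
    // First regex from the answer: STARTText line plus following backslash lines.
    static final Pattern START_BLOCK = Pattern.compile(
        "^          # Start of line\n" +
        "STARTText  # Match this text\n" +
        ".*\\r?\\n  # Match whatever follows on the line plus (CR)LF\n" +
        "(?:        # Match...\n" +
        " ^\\\\     # Start of line, then a backslash\n" +
        " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
        ")*         # Repeat as needed",
        Pattern.MULTILINE | Pattern.COMMENTS);

    // Second regex from the answer: one or more lines starting with four digits.
    static final Pattern DIGIT_BLOCK = Pattern.compile(
        "(?:        # Match...\n" +
        " ^         # Start of line\n" +
        " \\d{4}\\b # Match exactly four digits\n" +
        " .*\\r?\\n # Match whatever follows on the line plus (CR)LF\n" +
        ")+         # Repeat as needed (at least once)",
        Pattern.MULTILINE | Pattern.COMMENTS);

    // Hypothetical helper: deletes every block matched by the pattern.
    static String removeBlocks(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String first = "x\nSTARTText blah\n\\ a\n\\ b\nend\n";
        String second = "head:\n1111 a\n2565 b\n\ntail\n";
        System.out.print(removeBlocks(START_BLOCK, first));
        System.out.print(removeBlocks(DIGIT_BLOCK, second));
    }
}
```

Both patterns require each matched line to end with a line break, so a final block line at the very end of the input without a trailing newline would not be included.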

2 Comments

The result is interesting. The first works very well, but the second seems to have problems. Namely, I can perform regex.matcher(myString).find(), but match.group().length() equals 0. So the text is found, but not where, even with the sample given. Any clue how to fine-tune it so that match.group() reveals the search result properly?
Tim + anubhava, both solutions work and I really appreciate them, especially the comments. I am struggling to decide which of the two to accept as the solution. Hard decision...
0

Use '\\' for backslashes, use '\r\n|\r' for one linebreak, use '\d{4}' for the 4 digits:

.*(\r|r\n)

(your first blahblah)

\\.*(\r|r\n)

(your lines with backslash)

((\d{4}.*(\r|r\n))+(\r|\r\n))+

(your blocks of 4 digits ending with an empty line, the whole repeated with a +)

1 Comment

You didn't escape the r in some expressions :)
