I have the following sentences:

     text <MIR-1> GGG-33 <EXP-V-3> text text <VACCVIRUS-PROP-1> some other.
     text <MIR-1> text <ASSC-PHRASE-1> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.

What I want to do is create a single regular expression (regex) that matches both sentences above. Note that the only difference between the patterns in the two sentences is the middle element: <EXP-V-3> versus <ASSC-PHRASE-1>.

I'm stuck with my current attempt, which matches them with two redundant regexes. What's the right way to do it?

    use Data::Dumper;

    @sent = ("text <MIR-1> GGG-33 <EXP-V-3> text text <VACCVIRUS-PROP-1> some other.",
             " text <MIR-1> text <ASSC-PHRASE-1> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.");

    foreach $sent (@sent) {
        if ( $sent =~ /.*<MIR-\d+>.*<EXP-V-\d+>.*<VACCVIRUS-PROP-\d+>.*/gi ) {
            print "$sent\n";
        }
        elsif ( $sent =~ /.*<MIR-\d+>.*<ASSC-PHRASE-\d+>.*<VACCVIRUS-PROP-\d+>/gi ) {
            print "$sent\n";
        }
    }

Live demo

  • So, do you know about the | (alternation) metacharacter? I think it could help you. Commented Jul 23, 2013 at 7:28
  • (?:xxx|yyy)\s*<MIR-1>\s*(?:xxx|yyy)\s*(?:<EXP-V-3>|<ASSC-PHRASE-1>)\s*(?:xxxx|yyy)\s*<VACCVIRUS-PROP-1> Commented Jul 23, 2013 at 7:30

2 Answers

(?:xxx|yyy)\s*<MIR-1>\s*(?:xxx|yyy)\s*(?:<EXP-V-3>|<ASSC-PHRASE-1>)\s*(?:xxxx|yyy)\s*<VACCVIRUS-PROP-1>

This regex may not be optimized, but it works.

Here is what it does:

First Magic:

(?:EXPR) - a non-capturing group: the ?: prefix groups the expression without capturing it

Second Magic:

(a|b|c) - the alternation metacharacter | at work: matches a or b or c
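To make the difference between the two kinds of group concrete, here is a minimal sketch. The variable names are illustrative only, and the tag is taken from the question.

```perl
use strict;
use warnings;

my $tag = '<ASSC-PHRASE-1>';

# Capturing group: the alternation is returned first, then the digits.
my @with_capture = $tag =~ /<(EXP-V|ASSC-PHRASE)-(\d+)>/;

# Non-capturing group: the alternation is only grouped, so only the
# digits are returned.
my @without_capture = $tag =~ /<(?:EXP-V|ASSC-PHRASE)-(\d+)>/;

print scalar(@with_capture),    " value(s) captured: @with_capture\n";
print scalar(@without_capture), " value(s) captured: @without_capture\n";
```

Both patterns match the same strings; the non-capturing form just avoids filling a capture variable you do not need.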

Third Magic:

It works on Rubular.

Generalization:

.+?\s*<MIR-\d+>\s*.+?\s*(?:<EXP-V-\d+>|<ASSC-PHRASE-\d+>)\s*.+?\s*<VACCVIRUS-PROP-\d+>.+
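As a rough check, the generalized pattern can be run against the two sentences from the question in Perl. This is only a sketch: the variable names are made up for illustration, and the pattern is copied unchanged from above.

```perl
use strict;
use warnings;

my @sentences = (
    'text <MIR-1> GGG-33 <EXP-V-3> text text <VACCVIRUS-PROP-1> some other.',
    ' text <MIR-1> text <ASSC-PHRASE-1> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.',
);

# Generalized pattern: the middle tag may be either EXP-V or ASSC-PHRASE.
my $pattern = qr/.+?\s*<MIR-\d+>\s*.+?\s*(?:<EXP-V-\d+>|<ASSC-PHRASE-\d+>)\s*.+?\s*<VACCVIRUS-PROP-\d+>.+/;

# Keep only the sentences that the single pattern matches.
my @matched = grep { $_ =~ $pattern } @sentences;

print "$_\n" for @matched;
```

One pattern now covers both sentences, so the if/elsif pair in the question is no longer needed.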

And with your example:

It works on Rubular too.

To reject the string from your comment:

.+?\s*<MIR-\d+>\s*[^\[]+?\s*(?:<EXP-V-\d+>|<ASSC-PHRASE-\d+>)\s*[^\]]+?\s*<VACCVIRUS-PROP-\d+>.+

Fourth Magic:

[^SYMBOLS] - a negated character class. A ^ at the beginning means 'match any character except these'.

Here is an example:

[abc]{1} - matches one character that is a, b, or c
[^abc]{1} - matches one character that is NOT a, b, or c

It works on Rubular again.
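Here is a small Perl sketch of the rejecting pattern, tried on the two sentences from the question plus the bracketed string that should be refused. The variable names are illustrative; the pattern is the one given above.

```perl
use strict;
use warnings;

my @candidates = (
    'text <MIR-1> GGG-33 <EXP-V-3> text text <VACCVIRUS-PROP-1> some other.',
    ' text <MIR-1> text <ASSC-PHRASE-1> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.',
    'text <MIR-1> text [[express]<EXP-V-0>ion]<EXP-N-0> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.',
);

# [^\[]+? refuses a "[" between <MIR-n> and the middle tag;
# [^\]]+? refuses a "]" between the middle tag and <VACCVIRUS-PROP-n>.
my $pattern = qr/.+?\s*<MIR-\d+>\s*[^\[]+?\s*(?:<EXP-V-\d+>|<ASSC-PHRASE-\d+>)\s*[^\]]+?\s*<VACCVIRUS-PROP-\d+>.+/;

my @accepted = grep { $_ =~ $pattern }  @candidates;
my @rejected = grep { $_ !~ $pattern } @candidates;

print "accepted: $_\n" for @accepted;
print "rejected: $_\n" for @rejected;
```

The bracketed string fails because the only path from <MIR-1> to <EXP-V-0> crosses a "[", which the negated class cannot consume.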

6 Comments

How can I make it more general? 'xxx' or 'yyy' can actually be anything.
Oh, sorry, yes: I took xxx and yyy from your example. You can put anything there. Please provide more examples of what you want to match.
Thanks a million. One last thing: how can I make it reject the following string? text <MIR-1> text [[express]<EXP-V-0>ion]<EXP-N-0> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.
I.e., the EXP-V-\d+ pattern should be unique and followed by whitespace, with no subsequent ].
Please check the update again. Don't forget to accept the answer at the end ;)
Refactor what you have

@sent = ("text <MIR-1> GGG-33 <EXP-V-3> text text <VACCVIRUS-PROP-1> some other.",
         " text <MIR-1> text <ASSC-PHRASE-1> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.");

foreach $sent (@sent) {
    if ( $sent =~ /.*<MIR-\d+>.*<(?:EXP-V|ASSC-PHRASE)-\d+>.*<VACCVIRUS-PROP-\d+>.*/gi ) {
        print "$sent\n";
    }
}

Where

.*<MIR-\d+>.*<EXP-V-\d+>.*<VACCVIRUS-PROP-\d+>.*|.*<MIR-\d+>.*<ASSC-PHRASE-\d+>.*<VACCVIRUS-PROP-\d+>.*

becomes

.*<MIR-\d+>.*<(?:EXP-V|ASSC-PHRASE)-\d+>.*<VACCVIRUS-PROP-\d+>.*
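A quick way to convince yourself that the two forms agree is to run both over a few strings and compare the results. The sketch below does that; the third sample, with an <OTHER-1> tag, is a made-up string that neither form should match, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Original form: two full alternatives joined by |.
my $long  = qr/.*<MIR-\d+>.*<EXP-V-\d+>.*<VACCVIRUS-PROP-\d+>.*|.*<MIR-\d+>.*<ASSC-PHRASE-\d+>.*<VACCVIRUS-PROP-\d+>.*/i;

# Refactored form: only the differing tag name is alternated.
my $short = qr/.*<MIR-\d+>.*<(?:EXP-V|ASSC-PHRASE)-\d+>.*<VACCVIRUS-PROP-\d+>.*/i;

my @samples = (
    'text <MIR-1> GGG-33 <EXP-V-3> text text <VACCVIRUS-PROP-1> some other.',
    ' text <MIR-1> text <ASSC-PHRASE-1> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.',
    'text <MIR-1> text <OTHER-1> text <VACCVIRUS-PROP-1> some other.',   # hypothetical non-match
);

my $disagreements = 0;
for my $sample (@samples) {
    my $by_long  = $sample =~ $long  ? 1 : 0;
    my $by_short = $sample =~ $short ? 1 : 0;
    $disagreements++ if $by_long != $by_short;
    print "long=$by_long short=$by_short : $sample\n";
}
print "disagreements: $disagreements\n";
```

Agreement on a handful of samples is not a proof, but it is a cheap sanity check after refactoring a pattern by hand.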

A regex refactoring tool such as http://regexformat.com can do this for you.

https://regex101.com/r/TiXXO6/1
