Why wrong output from the RegEx?

Question

When I run the script below, I get

$VAR1 = [
          'ok0.ok]][[file:ok1.ok',
          undef,
          undef,
          'ok2.ok|dgdfg]][[file:ok3.ok',
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef
        ];

where I was hoping for ok0.ok ok1.ok ok2.ok ok3.ok and ideally also ok4.ok ok5.ok ok6.ok ok7.ok

Question

Can anyone see what I am doing wrong?

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $html = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";

my @seen = ($html =~ /file:(.*?) |\||\]/g);

print Dumper \@seen;

DavidO · Accepted Answer · 2012-06-26 09:46:16Z

2

A negated character class can simplify things a bit, I think. Be explicit as to your anchors (file:, or media:), and explicit as to what terminates the sequence (a space, pipe, or closing bracket). Then capture.

my @seen = $html =~ m{(?:file|media):([^\|\s\]]+)}g;

Explained:

my @seen = $html =~ m{
    (?:file|media):        # Match either 'file' or 'media', don't capture, ':'
    ( [^\|\s\]]+ )         # Match and capture one or more, anything except |\s]
}gx;

Capturing stops as soon as ], |, or \s is encountered.
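
As a quick check, here is a minimal sketch that runs this pattern against the sample string from the question. Joining the results with spaces is only for display and is not part of the answer above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";

# Capture everything after 'file:' or 'media:' up to the first ], | or whitespace.
my @seen = $html =~ m{(?:file|media):([^\|\s\]]+)}g;

# Prints: ok0.ok ok1.ok ok2.ok ok3.ok ok4.ok ok5.ok ok6.ok ok7.ok
print join(' ', @seen), "\n";
```

Because the terminators are excluded from the captured class itself, no trailing delimiter is needed in the pattern.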

edited Jun 26, 2012 at 9:46

answered Jun 26, 2012 at 9:39

DavidO


1 Comment

rubber boots Over a year ago

[^\|\s\]] reduces to [^]\s|] (or [^] |] if you can't stand backslashes in char classes ;-).

Konerak · Accepted Answer · 2012-06-26 11:44:36Z

1

It looks like you are trying to match everything starting with file: and ending with either a space, a pipe or a closing square bracket.

The alternatives at the end of the regexp need to go inside square brackets (a character class), though:

my @seen = ($html =~ /file:(.*?)[] |]/g);

If you want the media: blocks as well, OR the file part. You might want a non-capturing group here:

my @seen = ($html =~ /(?:file|media):(.*?)[] |]/g);

How it works

The first statement will capture everything between 'file:' and a ], a | or a space.

The second statement does the same, but with both file and media. We use a non-capturing group (?:group) instead of (group) so the word is not put into your @seen.

edited Jun 26, 2012 at 11:44

answered Jun 26, 2012 at 9:37

Konerak


5 Comments

Sandra Schlichting Over a year ago

Very interesting. Is it on purpose that you say (.*?)[\]|\]| ] instead of (.*?)[\]\| ] ?

Borodin Over a year ago

While your solution will do what is required, you seem to misunderstand the way character classes work. Your original regex had [\]|\]| ] which matches any of the list close bracket, pipe, close bracket, pipe, or space. Your revision now matches pipe, pipe, close bracket, pipe, or space. All you require is [] |] which matches close bracket, pipe, or space.

Konerak Over a year ago

@Borodin: I indeed confused character class with group. No need for the pipe to OR characters in a character class []. Thanks!

Sandra Schlichting Over a year ago

Now I am confused. Why doesn't Perl complain about unbalanced bracket with (.*?)[] |] ? I mean, how does it know that the first ] is meant as a character to be matched, and not the end of character class?

rubber boots Over a year ago

@SandraSchlichting - please read Section: Special Characters Inside a Bracketed Character Class in perlrecharclass
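
The rule that section describes is that a ] placed first in a bracketed character class (directly after [ or [^) is treated as a literal member of the class, not as its end. A small sketch to confirm this; the test characters are chosen only for illustration:

```perl
use strict;
use warnings;

# A ']' directly after '[' (or '[^') is a literal member of the class,
# so [] |] means "close bracket, space, or pipe".
for my $char (']', '|', ' ', 'a') {
    my $hit = $char =~ /^[] |]$/ ? 'matches' : 'does not match';
    print "'$char' $hit\n";
}
```

The first three characters match and 'a' does not, which is why Perl does not complain about an unbalanced bracket.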

Martin · Accepted Answer · 2012-06-26 09:41:16Z

1

Try with

my @seen = ($html =~ /\[\[\w+:(\w+\.\w+)\]\]/g);

answered Jun 26, 2012 at 9:41

Martin


rubber boots · Accepted Answer · 2012-06-26 13:19:51Z

this is what your regex does:

 ...
 my $ss = qr {
              file:     # literal 'file' plus colon as the anchor
              (         # start capture group
               .*?      # any characters, in a non-greedy sweep
              )         # end capture group
              \s        # the non-greedy sweep ends at **white space**

              |         # OR (alternative to everything above):
              \|        # => a literal | character
              |         # OR (alternative to everything above):
              \]        # => a literal ] character
              }x;

 my @seen = $html =~ /$ss/g;
 ...
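
To see where the undef entries come from, here is a minimal sketch using a shortened input (the string is a cut-down version of the question's data, chosen for illustration). Only the first alternative contains the capture group, so whenever the bare | or ] alternative matches, the capture is undefined:

```perl
use strict;
use warnings;

my $text = "[[file:ok0.ok]]";

# Top-level alternation: 'file:(.*?) '  OR  '|'  OR  ']'.
# Only the first branch contains the capture group.
my @seen = $text =~ /file:(.*?) |\||\]/g;

# No space follows 'file:' here, so the first branch never matches.
# The bare ']' branch matches twice, each time with an undefined capture.
print scalar(@seen), " entries\n";
print defined($_) ? "'$_'" : 'undef', "\n" for @seen;
```

In the full string from the question the first branch does match, but it runs on to the next space, which is why 'ok0.ok]][[file:ok1.ok' comes back as a single element.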

and this is what your regex is supposed to do:

 ...
 my $rb = qr {
             \w :      # a word character + colon as the front anchor
             (         # start capture group 
              [^]| ]+  # one or more characters that are not ], | or space
             )         # end capture group 
            }x;

 my @seen = $html =~ /$rb/g;
 ...

If you want a short, concise regex and know what you are doing, you could drop the capturing group altogether and return the whole match in list context, together with a positive lookbehind:

 ...
 my @seen = $html =~ /(?<=(?:.file|media):)[^] |]+/g; # no capture group ()
 ...

or, if no other structure in your data as shown is to be dealt with, use the : as the only anchor:

 ...
 my @seen = $html =~ /(?<=:)[^] |]+/g;   # no capture group and short
 ...
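
A minimal sketch of the colon-only variant, run against the question's sample string. The second input ("at 12:30") is a made-up example, added to show the caveat that every colon in the text acts as an anchor:

```perl
use strict;
use warnings;

my $html = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";

# Without a capture group, m//g in list context returns each whole match.
# The lookbehind requires a ':' before the match without consuming it.
my @seen = $html =~ /(?<=:)[^] |]+/g;
print "@seen\n";    # ok0.ok ok1.ok ok2.ok ok3.ok ok4.ok ok5.ok ok6.ok ok7.ok

# Caveat: any other colon in the input is an anchor too.
my @other = "at 12:30" =~ /(?<=:)[^] |]+/g;
print "@other\n";   # 30
```

This is the reason for the condition above that no other structure appears in the data.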

Regards

rbo

Borodin · Accepted Answer · 2012-06-26 10:27:01Z

0

Depending on the possible characters in the file name, I think you probably want

my @seen = $html =~ /(?:file|media):([\w.]+)/g;

which captures all of ok0.ok through to ok7.ok.

It relies on the file names containing alphanumeric characters plus underscore and dot.
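
A short sketch of that dependency. The input string and the name my-clip.ogg are made up for illustration; the hyphen is not in [\w.], so the capture stops early:

```perl
use strict;
use warnings;

# Hypothetical input: one name fits [\w.]+, the other contains a hyphen.
my $text = "[[file:ok0.ok]] [[media:my-clip.ogg|caption]]";

my @seen = $text =~ /(?:file|media):([\w.]+)/g;

print join(', ', @seen), "\n";   # ok0.ok, my
```

If names can contain other characters, a negated class listing the terminators (], | and whitespace) is the more forgiving choice.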

answered Jun 26, 2012 at 10:27

Borodin


Kiran Chaudhary · Accepted Answer · 2012-06-26 12:30:08Z

0

I hope this is what you required.

#!/usr/bin/perl

use strict;  

use warnings;

use Data::Dumper;


my $string = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";

my @matches;

@matches = $string =~ m/ok\d\.ok/g;

print Dumper @matches;

Output:

$VAR1 = 'ok0.ok';

$VAR2 = 'ok1.ok';

$VAR3 = 'ok2.ok';

$VAR4 = 'ok3.ok';

$VAR5 = 'ok4.ok';

$VAR6 = 'ok5.ok';

$VAR7 = 'ok6.ok';

$VAR8 = 'ok7.ok';

Regards, Kiran.

edited Jun 26, 2012 at 12:30

Sandra Schlichting


answered Jun 26, 2012 at 12:06

Kiran Chaudhary
