2

When I run the script below, I get

$VAR1 = [
          'ok0.ok]][[file:ok1.ok',
          undef,
          undef,
          'ok2.ok|dgdfg]][[file:ok3.ok',
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef
        ];

where I was hoping for ok0.ok ok1.ok ok2.ok ok3.ok and ideally also ok4.ok ok5.ok ok6.ok ok7.ok

Question

Can anyone see what I am doing wrong?

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $html = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";

my @seen = ($html =~ /file:(.*?) |\||\]/g);

print Dumper \@seen;

6 Answers

2

A negated character class can simplify things a bit, I think. Be explicit as to your anchors (file:, or media:), and explicit as to what terminates the sequence (a space, pipe, or closing bracket). Then capture.

my @seen = $html =~ m{(?:file|media):([^\|\s\]]+)}g;

Explained:

my @seen = $html =~ m{
    (?:file|media):        # Match 'file' or 'media' (non-capturing), then ':'
    ( [^\|\s\]]+ )         # Capture one or more characters other than |, whitespace or ]
}gx;

Capturing stops as soon as ], |, or \s is encountered.
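
For anyone who wants to try this against the string from the question, here is a minimal sketch. The helper name extract_names is made up for illustration; the pattern is the one above with the redundant backslash before | dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the question): returns every name that
# follows "file:" or "media:", stopping at |, whitespace or ].
sub extract_names {
    my ($text) = @_;
    return $text =~ m{(?:file|media):([^|\s\]]+)}g;
}

my $html = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] "
         . "[[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";

my @seen = extract_names($html);
print "@seen\n";    # prints: ok0.ok ok1.ok ok2.ok ok3.ok ok4.ok ok5.ok ok6.ok ok7.ok
```

Call the helper in list context; in scalar context the match only reports success or failure.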

1 Comment

[^\|\s\]] reduces to [^]\s|] (or [^] |] if you can't stand backslashes in char classes ;-).
1

It looks like you are trying to match everything starting with file: and ending with either a space, a pipe or a closing square bracket.

Your OR-statement at the end of the regexp needs to be between (square) brackets itself though:

my @seen = ($html =~ /file:(.*?)[] |]/g);

If you want the media: blocks as well, OR the file part. You might want a non-capturing group here:

my @seen = ($html =~ /(?:file|media):(.*?)[] |]/g);

How it works

The first statement captures everything between 'file:' and the first ], | or space.

The second statement does the same, but with both file and media. We use a non-capturing group (?:group) instead of (group) so the word is not put into your @seen.

5 Comments

Very interesting. Is it on purpose that you say (.*?)[\]|\]| ] instead of (.*?)[\]\| ] ?
While your solution will do what is required, you seem to misunderstand the way character classes work. Your original regex had [\]|\]| ] which matches any of the list close bracket, pipe, close bracket, pipe, or space. Your revision now matches pipe, pipe, close bracket, pipe, or space. All you require is [] |] which matches close bracket, pipe, or space.
@Borodin: I indeed confused character class with group. No need for the pipe to OR characters in a character class []. Thanks!
Now I am confused. Why doesn't Perl complain about unbalanced bracket with (.*?)[] |] ? I mean, how does it know that the first ] is meant as a character to be matched, and not the end of character class?
@SandraSchlichting - please read the section "Special Characters Inside a Bracketed Character Class" in perlrecharclass.
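
To make the point about the leading ] concrete: a ] that comes directly after the opening [ (or after [^) is taken as a literal member of the class, so [] |] has exactly three members. A small sketch to check this; the helper name in_class is only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character is in the class [] |].
# The "]" right after "[" is a literal, so the class is: ], space, |.
sub in_class {
    my ($char) = @_;
    return $char =~ /^[] |]$/ ? 1 : 0;
}

for my $char (']', ' ', '|', 'a') {
    my $verdict = in_class($char) ? 'matches' : 'does not match';
    print "'$char' $verdict\n";
}
```

Only the last of the four characters should be reported as not matching.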
1

Try with

my @seen = ($html =~ /\[\[\w+:(\w+\.\w+)\]\]/g);

1

this is what your regex does:

 ...
 my $ss = qr {
              file: # start with file + colon as anchor
              (         # start capture group
               .*?      # use any character in a non-greedy sweep
              )         # end capture group
              \s        # end non-greedy search on a **white space**

              |     # OR, as a separate alternative with no capture group:
              \|     # => | character
              |      # OR, as a separate alternative with no capture group:
              \]       # => ] character
              }x;

 my @seen = $html =~ /$ss/g;
 ...
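
The undef entries in the question's output come from those bare alternatives: whenever \| or \] matches on its own, the capture group took no part in the match, so the list gets an undef for it. A minimal sketch on a shortened, made-up input; the sub names are illustrative, and the second pattern uses a non-capturing group (?:...) around the terminators purely to show the contrast.

```perl
use strict;
use warnings;

# Pattern from the question: the alternation is at the top level,
# so only the first branch contains the capture group.
sub loose_matches {
    my ($text) = @_;
    return $text =~ /file:(.*?) |\||\]/g;
}

# Same terminators, but grouped, so every match includes the capture.
sub tight_matches {
    my ($text) = @_;
    return $text =~ /file:(.*?)(?: |\||\])/g;
}

my $text  = "file:a.b ]|";
my @loose = loose_matches($text);
my @tight = tight_matches($text);

# prints: 3 results, 2 undefined
print scalar(@loose), " results, ", scalar(grep { !defined } @loose), " undefined\n";
print "@tight\n";    # prints: a.b
```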

and this is what your regex is supposed to do:

 ...
 my $rb = qr {
             \w :      # alphanumeric + colon as front anchor
             (         # start capture group 
              [^]| ]+  # one or more characters other than ], | or space
             )         # end capture group 
            }x;

 my @seen = $html =~ /$rb/g;
 ...

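If you want a short regex and know what you are doing, you could drop the capture group altogether and rely on the whole match being returned in list context, combined with a positive lookbehind (the leading . presumably pads file to the same length as media, since older Perls require a fixed-width lookbehind):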

 ...
 my @seen = $html =~ /(?<=(?:.file|media):)[^] |]+/g; # no capture group ()
 ...

or, if your data contains no colons other than the ones shown, use the : as the only anchor:

 ...
 my @seen = $html =~ /(?<=:)[^] |]+/g;   # no capture group and short
 ...

Regards

rbo

0

Depending on the possible characters in the file name, I think you probably want

my @seen = $html =~ /(?:file|media):([\w.]+)/g;

which captures all of ok0.ok through to ok7.ok.

It relies on the file names containing only alphanumeric characters, underscores and dots.
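
To illustrate that caveat, here is a small sketch with a made-up file name containing a hyphen (my-file.ok is not from the question). The whitelist class stops at the hyphen, whereas a negated class that only excludes the terminators keeps the whole name. The sub names are hypothetical.

```perl
use strict;
use warnings;

# Whitelist approach: only word characters and dots are accepted.
sub name_by_whitelist {
    my ($text) = @_;
    my ($name) = $text =~ /(?:file|media):([\w.]+)/;
    return $name;
}

# Terminator approach: anything up to ], space or | is accepted.
sub name_by_terminator {
    my ($text) = @_;
    my ($name) = $text =~ /(?:file|media):([^] |]+)/;
    return $name;
}

my $text = "[[file:my-file.ok|caption]]";
print name_by_whitelist($text),  "\n";    # prints: my
print name_by_terminator($text), "\n";    # prints: my-file.ok
```

Which one is preferable depends on what characters can really occur in the names.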

0

I hope this is what you required.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $string = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";

my @matches;

@matches = $string =~ m/ok\d\.ok/g;

print Dumper @matches;

Output:

$VAR1 = 'ok0.ok';
$VAR2 = 'ok1.ok';
$VAR3 = 'ok2.ok';
$VAR4 = 'ok3.ok';
$VAR5 = 'ok4.ok';
$VAR6 = 'ok5.ok';
$VAR7 = 'ok6.ok';
$VAR8 = 'ok7.ok';

Regards, Kiran.
