-2

I have a problem I don't understand: Given a string consisting of multiple lines like this:

  DB<5> x $token
0  'ACCOUNT_CHANGED     = "20250728081545Z"
ACCOUNT_CHANGED_T   = "1753690545"
COMMON_NAME         = "User Test"
CURRENT_TIME_T      = "1753863424"
INSTANCE            = "testing"
...
TEMPLATE_TIME_T     = "1753699621"
USER_ID             = "testuser"
WRAP_COLUMN         = "999"'

I'm trying to process the string line by line using this code:

#...
        while ($token && $token =~ /^([^\n]*?)(\n)(.*)$/m) {
            $add_token->($1)
                if (defined($1));
            $add_token->($2);
        }
        $add_token->($token)
            if (defined($token) && length($token) > 0);
#...

As it seems $3 is just set to be the second line instead of the rest of the string after the first line break. Like this:

  DB<4> x ($1, $2, $3)
0  'ACCOUNT_CHANGED     = "20250728081545Z"'
1  '
'
2  'ACCOUNT_CHANGED_T   = "1753690545"'

I have some questions, including a stupid one:

  1. (stupid one) In my first attempt I made a typing error, making the regex end with (.)*$, causing $3 to be the last character of the line following the newline. Wouldn't * repeat the capture group, creating as many groups as there are characters on the line (so $3 should have been the first character instead of the last)?
  2. Is it correct to use the /m modifier?
  3. Why doesn't the regex work as intended, i.e.: where is the mistake?
  • . matches all characters except line breaks, regex101.com/r/igYTeu/1 - you'd need to add the s modifier to make it match those as well, regex101.com/r/igYTeu/2 Commented Jul 30 at 9:11
  • Are you using file at once mode or undef $/? Your second capture group is always a newline, and there should only be one newline per line unless you are using file at once. Commented Jul 30 at 13:51
  • re "as intended", desired results would make your question much clearer; you have a while loop and talk about "process...line by line", but lack /g and are clearly trying to match 2 or more lines in a single regex, so your intent is very unclear to me Commented Jul 30 at 16:05
  • Don't use /m multi-line mode for this since you're counting just the first 3 lines. Instead use .* in conjunction with the \R line-break construct. No other modifiers. Only capture the second and third lines then. /^.*\R(.*)\R(.*)/ regex101.com/r/9k6bxo/1 . Also, these capture groups will always be defined(); a valid check is on their length. Commented Jul 30 at 19:51
  • @sln Can you explain why I'm "counting just the first 3 lines"? Commented Jul 31 at 14:18

4 Answers

1

I don't quite understand:

As it seems $3 is just set to be the second line instead of the rest of the string after the first line break.

"Rest of the string after first line break" effectively means second line. And if you want to match "rest of string" you need to stop using .* as . won't match newline, unless you turn on single line mode /s (and yes, /m is perfectly fine, it just enables ^ and $ to match start and end of line, instead of entire string).

Here's the regex with corrections: Regex101.
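As a minimal sketch of the difference the /s modifier makes, the snippet below runs the question's pattern on a short, made-up three-line string (the sample data is an assumption, not the asker's real input), once with /m only and once with /ms:

```perl
use strict;
use warnings;

# Hypothetical three-line sample, shaped like the data in the question
my $token = "A = \"1\"\nB = \"2\"\nC = \"3\"";

# /m only: . stops at a newline, so group 3 holds just the second line
my @m_only = $token =~ /^([^\n]*?)(\n)(.*)$/m;

# /ms: . also matches newlines, so group 3 holds the rest of the string
my @m_and_s = $token =~ /^([^\n]*?)(\n)(.*)$/ms;

printf "with /m:  [%s]\n", $m_only[2];
printf "with /ms: [%s]\n", $m_and_s[2];
```

With /ms the greedy .* runs to the end of the string, which is what "rest of the string" asks for.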

Regarding question about (.)* not producing more capturing groups - this is because captured groups are normally numbered in the order of appearance in regex, not based on what they matched and how many times. So this pattern:

^([^\n]*?)(\n)(.)*$

has only 3 capture groups. The third one is simply overwritten each time that group matches again, so it ends up holding the last character it matched.


2 Comments

I regularly forget that when . matches "any character" that \n is not a character ;-) - Actually I cannot remember having read about the /s modifier before.
\n is a character, just not one the unmodified . matches. . matches any character except a newline, which is different than how you are thinking about it.
1

If we check your code and print the matches:

while ($token =~ /^([^\n]*?)(\n)(.*)$/mg) {
    print Dumper $1,$2,$3;
}

We can see:

$VAR1 = 'ACCOUNT_CHANGED     = "20250728081545Z"';
$VAR2 = '
';
$VAR3 = 'ACCOUNT_CHANGED_T   = "1753690545"';
$VAR1 = 'COMMON_NAME         = "User Test"';
$VAR2 = '
';
$VAR3 = 'CURRENT_TIME_T      = "1753863424"';
$VAR1 = 'INSTANCE            = "testing"';
$VAR2 = '
';
$VAR3 = 'TEMPLATE_TIME_T     = "1753699621"';
$VAR1 = 'USER_ID             = "testuser"';
$VAR2 = '
';
$VAR3 = 'WRAP_COLUMN         = "999"';

You are matching two lines at a time, and also uselessly capturing a newline. This would create false mismatches if you have an odd number of lines in your input: the last line would not match.

You will see that I have added the /g flag to allow your regex to be iterated over, which I think you intended. Otherwise it would just be an infinite loop.

[^\n]*? is a fancy way to say .*?, since period . matches any character except newline. The quantifier *? means match as short a string as possible. But since your string is anchored with ^ and \n that is redundant. It will match the same amount, regardless of your minimal match quantifier.

(\n) is not a very useful capture. We don't need to match a constant, we always know what a constant value is.

So this part of your regex really should be written

^(.*)\n

1: What does (.)*$ do?

Well, it captures several single characters at the end of a line. Let's try it out:

$ perl -lwe'$_ = "123456789"; /(.)*$/; print $1;'
9

It captures and overwrites several times until it finds the last match. It is a quite impressively inoperational regex. It does match, after a fashion, but it's hard to understand and predict. I.e.: You probably should not use it.
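If the goal really is one capture per character, a rough sketch of the usual alternative is to drop the quantifier on the group and use /g in list context instead. The sample string is the same one as in the one-liner above:

```perl
use strict;
use warnings;

my $line = "123456789";

# Quantified group: one group, only the last iteration is kept
my ($last) = $line =~ /(.)*$/;

# Unquantified group with /g in list context: one element per match
my @each = $line =~ /(.)/g;

print "quantified group: $last\n";
print "list-context /g:  @each\n";
```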


2: Is it correct to use /m modifier?

Yes, if you wish to use ^ and $ inside a multiline string to mean "match after a newline or beginning of string" (^), and "match before a newline or end of string" ($) respectively.

Is it necessary for this string? No, I will demonstrate a few options below.
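A small sketch of what /m changes, using a made-up three-line string (not the asker's data): without /m the anchors only apply at the ends of the whole string, so a pattern that must span exactly one line finds nothing.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree";

# Without /m: ^ matches only at the start of the string and $ only at the
# end (or before a final newline), so no single line satisfies both
my @without_m = $text =~ /^(\w+)$/g;

# With /m: ^ and $ match at every line boundary
my @with_m = $text =~ /^(\w+)$/mg;

print "without /m: ", scalar(@without_m), " matches\n";
print "with /m:    ", scalar(@with_m), " matches (@with_m)\n";
```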


3: Why doesn't the regex work as intended, i.e.: where is the mistake?

Since I don't know what your intention was, I don't know what your mistake is. The code adds tokens for odd line numbers, and a newline. And then discards even line numbers. What you probably want is to match each line, for which you really only need

/(.*)/g

So basically to match each line, you can just do

while ($token =~ /(.*)/g) {

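You do need the /g modifier, otherwise it will be an infinite loop. Be aware that the unanchored (.*) also yields an empty match at the end of every line, where .* matches zero characters just before the newline, so you would have to skip empty captures. With the /m modifier and anchors, which avoid those empty matches, it would be: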

while ($token =~ /^(.*)$/mg) {
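
The two loop forms are not quite equivalent. As a hedged sketch on a made-up three-line string, the unanchored pattern reports an extra zero-length match after each line, while the anchored /m form reports each line once:

```perl
use strict;
use warnings;

my $token = "first\nsecond\nthird";
my (@unanchored, @anchored);

# Unanchored: after each line, .* can also match the empty string
# sitting just before the newline (or at the end of the string)
while ($token =~ /(.*)/g) {
    push @unanchored, $1;
}

# Anchored with /m: ^ only matches at the start of a line, so the
# empty matches cannot occur
while ($token =~ /^(.*)$/mg) {
    push @anchored, $1;
}

print "unanchored: ", join('|', map { "[$_]" } @unanchored), "\n";
print "anchored:   ", join('|', map { "[$_]" } @anchored), "\n";
```

Using (.+) instead of (.*) in the unanchored form is another way to avoid the empty captures, at the cost of skipping blank lines.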

As solutions go, this is rudimentary. You can just split on newline to get the same result. You might consider these alternatives:

my %items;
while ($token =~ /(.+)\s+=\s+(.+)/g) {  # match non-newline strings around =
        $items{$1} = $2;               # store match
}
print Dumper \%items;

# I would probably have done this:
# Hash version #2
my %items2 = map { split /\s+=\s+/ }   # 2. split each line on the =
                split /\n/, $token;    # 1. split string on newline
print Dumper \%items2;

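Both of these will produce the same pairs. The dump below is from the first version, whose greedy (.+) leaves trailing spaces in the keys; the split version strips them: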

$VAR1 = {
          'ACCOUNT_CHANGED_T  ' => '"1753690545"',
          'USER_ID            ' => '"testuser"',
          'WRAP_COLUMN        ' => '"999"',
          'ACCOUNT_CHANGED    ' => '"20250728081545Z"',
          'CURRENT_TIME_T     ' => '"1753863424"',
          'INSTANCE           ' => '"testing"',
          'TEMPLATE_TIME_T    ' => '"1753699621"',
          'COMMON_NAME        ' => '"User Test"'
        };
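
The keys in the dump above still carry their column padding. A possible variation, sketched here under assumptions, is to split each line on the first = only (limit 2) with optional whitespace on both sides; this trims the key and keeps any = inside the value. EXAMPLE_KEY is a hypothetical entry added for illustration and is not in the original data.

```perl
use strict;
use warnings;

# Two sample lines; EXAMPLE_KEY is hypothetical, to show an = inside a value
my $token = qq{USER_ID             = "testuser"\nEXAMPLE_KEY         = "a = b"};

my %items = map { split /\s*=\s*/, $_, 2 }   # 2. split on the first = only
            split /\n/, $token;              # 1. split string on newline

print "$_ => $items{$_}\n" for sort keys %items;
```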

TLDR: You overcomplicated the regex. (.*) is all that was needed.


1

One more way. Simply match each line's key/val pair and send it into a hash.
At this point you can sort the keys during a print or other activity.

Note that the regex is a little fancy as it trims whitespace, which is not really necessary.
Also, if you'd accept an empty value, use a regex that makes it optional:
/^\h*(\S.*?)\h* = \h*"\h*(\S?.*?)\h*"\h*/mxg

use strict;
use warnings;

$/ = undef;

my $str = <DATA>;  # slurp it all in

my %pair = ($str =~ /^\h*(\S.*?)\h* = \h*"\h*(\S.*?)\h*"\h*/mxg);

for my $key ( sort keys %pair ) {
   printf( "%-20s= %s\n", $key, $pair{$key});
}

__DATA__

ACCOUNT_CHANGED     = "20250728081545Z"
ACCOUNT_CHANGED_T   = "1753690545"
COMMON_NAME         = "User Test"
CURRENT_TIME_T      = "1753863424"
INSTANCE            = "testing"
TEMPLATE_TIME_T     = "1753699621"
USER_ID             = "testuser"
WRAP_COLUMN         = "999"

Output

ACCOUNT_CHANGED     = 20250728081545Z
ACCOUNT_CHANGED_T   = 1753690545
COMMON_NAME         = User Test
CURRENT_TIME_T      = 1753863424
INSTANCE            = testing
TEMPLATE_TIME_T     = 1753699621
USER_ID             = testuser
WRAP_COLUMN         = 999
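
The hash assignment above works because a /g match in list context returns every capture from every match as one flat list, which Perl then pairs up as key, value, key, value. A reduced sketch with a simpler pattern and made-up data:

```perl
use strict;
use warnings;

my $str = qq{A = "1"\nB = "2"\n};

# List-context /g with two groups returns (key, value, key, value, ...)
my @flat = $str =~ /^(\w+)\h*=\h*"(.*?)"/mg;

# Assigning that flat list to a hash pairs the elements up
my %pair = @flat;

print "flat list: @flat\n";
print "B => $pair{B}\n";
```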


0

When processing data, it is often easier to just split it into lines, process each line and split that into key/value pairs. Then once you get the key/value pairs you can process them however you want. Put the keys and values into a separate @array linked by index, a %hash, or even put them into a database using something like DBI.

I put your sample data into a string $s and showed how to process it from there. That will definitely be the easiest. Your regex doesn't look quite right,

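while($s =~ /(\S*) *\= *"(\S*)"(\n|$)/g){ ... } #process line by line

will work if you would rather process with regex instead of split, as long as the keys and values contain no spaces. A value such as "User Test" is skipped, which is why COMMON_NAME is missing from the regex matches in the output below.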

Here is the code...

#!/usr/bin/perl -w

$s = 'ACCOUNT_CHANGED     = "20250728081545Z"
ACCOUNT_CHANGED_T   = "1753690545"
COMMON_NAME         = "User Test"
CURRENT_TIME_T      = "1753863424"
INSTANCE            = "testing"
TEMPLATE_TIME_T     = "1753699621"
USER_ID             = "testuser"
WRAP_COLUMN         = "999"';

#split each string into lines for easier processing
my @lines = split(/\n/,$s);
my (@keys, @values, %keyValuePair);

#split each line into key/value pairs
#NOTE: arrays preserve the order they were entered
#the hash makes it easier to search through everything
#but the order they were entered is NOT preserved
for(@lines){
  my ($key, $value) = split(/ *= */);
  push(@keys, $key);
  push(@values, $value);
  $keyValuePair{$key} = $value;
}

print "All keys (order preserved): @keys\n\n";
print "All values (order preserved): @values\n\n";
print "All key/value pairs stored in hash NOTE: ORDER ENTERED IS NOT PRESERVED\n";
for(keys %keyValuePair){
  print "$_\t=>$keyValuePair{$_}\n";
}
print "\n";
print "Easy O(1) lookup with hash, ACCOUNT_CHANGED = $keyValuePair{ACCOUNT_CHANGED}\n\n";
print "Slightly less easy O(n) lookup with array, but order entered is preserved if that is important\n";
my $i=0;
for(@keys){
  if(/^ACCOUNT_CHANGED$/){ #NOTE: this regex has to be anchored, or it will also match "ACCOUNT_CHANGED_T"
                            #$_ eq "ACCOUNT_CHANGED" also works
    print "$keys[$i] = $values[$i]\n";
  }
  $i++; #@keys and @values are linked by index
}
print "\n";

print "If you absolutely must process this data via regex, this one should work...\n";
#capture anything not a space, followed by 0 or more spaces, an equal sign, 0 or more spaces
#then capture everything not a space between the double quotes, up to the newline or end of string
#NOTE: this regex assumes the keys and values do not contain spaces or equal signs
$i=0;
while($s =~ /(\S*) *\= *"(\S*)"(\n|$)/g){
  ($key, $value) = ($1, $2);
  print "Match $i: key: \"$key\"\tvalue: \"$value\"\n";
  $i++;
}

Output looks like this...

$perl process.data.in.string.pl

All keys (order preserved): ACCOUNT_CHANGED ACCOUNT_CHANGED_T COMMON_NAME CURRENT_TIME_T INSTANCE TEMPLATE_TIME_T USER_ID WRAP_COLUMN

All values (order preserved): "20250728081545Z" "1753690545" "User Test" "1753863424" "testing" "1753699621" "testuser" "999"

All key/value pairs stored in hash NOTE: ORDER ENTERED IS NOT PRESERVED
ACCOUNT_CHANGED  =>"20250728081545Z"
WRAP_COLUMN      =>"999"
CURRENT_TIME_T   =>"1753863424"
TEMPLATE_TIME_T  =>"1753699621"
USER_ID          =>"testuser"
COMMON_NAME      =>"User Test"
INSTANCE         =>"testing"
ACCOUNT_CHANGED_T=>"1753690545"

Easy O(1) lookup with hash, ACCOUNT_CHANGED = "20250728081545Z"

Slightly less easy O(n) lookup with array, but order entered is preserved if that is important
ACCOUNT_CHANGED = "20250728081545Z"

If you absolutely must process this data via regex, this one should work...
Match 0: key: "ACCOUNT_CHANGED"   value: "20250728081545Z"
Match 1: key: "ACCOUNT_CHANGED_T" value: "1753690545"
Match 2: key: "CURRENT_TIME_T"    value: "1753863424"
Match 3: key: "INSTANCE"          value: "testing"
Match 4: key: "TEMPLATE_TIME_T"   value: "1753699621"
Match 5: key: "USER_ID"           value: "testuser"
Match 6: key: "WRAP_COLUMN"       value: "999"
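
The regex matches above jump from ACCOUNT_CHANGED_T to CURRENT_TIME_T: COMMON_NAME is missing because (\S*) cannot match the space in "User Test". One possible variant, offered as a sketch rather than the answer's own code, captures everything up to the closing quote with [^"]* and anchors each line with /m:

```perl
use strict;
use warnings;

# Two lines taken from the sample data, one with a space in the value
my $s = qq{COMMON_NAME         = "User Test"\nUSER_ID             = "testuser"};

my @found;
while ($s =~ /^(\S+) *= *"([^"]*)"$/mg) {
    push @found, "$1=$2";   # key and unquoted value
}

print "$_\n" for @found;
```

This still assumes that values never contain a double quote.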

1 Comment

In general one would use split instead. However if I'd change the regex to /\r?\n/ and I need the actual separator (as I do), I could not use split here. Of course you don't know the context of the code, but it's actually part of a specific line-wrapping algorithm that should not change things unless a very long line is being wrapped...
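
For what it's worth, split can keep the separators: when the pattern contains a capturing group, the captured text is returned as extra list elements between the fields. A minimal sketch on a made-up string with mixed line endings (the wrapping algorithm itself is not shown in the question, so this only illustrates the splitting step):

```perl
use strict;
use warnings;

# Hypothetical input mixing \r\n and \n line endings
my $token = "line one\r\nline two\nline three";

# The capturing group makes split return each separator as its own element
my @parts = split /(\r?\n)/, $token;

# Fields sit at even indexes, the actual separators at odd indexes
for my $i (0 .. $#parts) {
    my $kind = $i % 2 ? 'separator' : 'line';
    (my $shown = $parts[$i]) =~ s/\r/\\r/g;
    $shown =~ s/\n/\\n/g;
    print "$kind: $shown\n";
}
```

Joining @parts back together reproduces the original string unchanged, which matters if the text must not be altered unless a line is actually wrapped.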
