Taking a substring from a larger string that matches a regex in Perl?

Question

I am writing a program that is intended to read through a large log file of web server activity. My intent is to have a few different regular expressions that grab specific bits of each line of the log and place them into a hash to keep track of how many times each IP, browser type, etc. appear. The part that is giving me trouble is taking that specific bit of text that matches the regex out of each line so I can analyze it alone. What I have currently is:

my @regexes = (qr/^\S*/);
# Iterate through each line of the data with each regex
foreach my $regex (@regexes) {
    # Create an empty hash for all the data
    my %dataHash;
    foreach my $line (@data) {
        # Up to this point I have verified that $line contains the correct line I want to take a "substring" of.
        my ($relevantData) = ($line =~ $regex);
        #print("$relevantData\n");

Printing $relevantData is not the end goal here of course, but it's to verify that I'm properly getting what I need. I don't think it's relevant but the @data array is just an array of the aforementioned log split at each line.

When I have this print statement active, it just prints "1" over and over again. Currently, the regex I'm using is just meant to take from the start of each line until the first instance of whitespace, so what I'm expecting is that first word. I've tried messing around with parenthesis placement, and what I have seems to match examples I've found online, but I may be misunderstanding them. This is technically a repost, since it duplicates an existing question, but I had used that post before posting and replicated its approach, and it doesn't seem to be working, so I am unsure what I'm doing wrong. Thank you in advance!

LanX · Accepted Answer · 2023-04-09 12:29:23Z

As general advice: if a Perl expression returns 1 where it shouldn't, you are most probably dealing with a Boolean value (1 is true) or a count (or both, since a count of 0 is false).

That's because regexes are often used in conditional clauses like if(/regex/), where only the success or failure of the match matters.

A good starting point for learning regexes in Perl is https://perldoc.perl.org/perlrequick

You'll find this concise example at perlrequick#Extracting-matches

In list context, a match /regex/ with groupings will return the list of matched values ($1,$2,...). So we could rewrite it as
($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

Your qr/^\S*/ has no capture groups (parentheses), which is why $relevantData just ends up as 1, meaning "the match was true".
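To make the difference concrete, here is a minimal sketch. The sample line is made up for illustration (the question shows no input), and the variable names $without and $with are hypothetical:

```perl
use strict;
use warnings;

# Made-up sample line; the real log format is an assumption
my $line = '192.0.2.1 - - "GET / HTTP/1.1" 200';

# No capture group: a list-context match returns (1) on success
my ($without) = ($line =~ /^\S*/);

# With a capture group: a list-context match returns the captured text
my ($with) = ($line =~ /^(\S*)/);

print "without: $without\n";
print "with:    $with\n";
```

The only change between the two matches is the pair of parentheses in the pattern.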

(You haven't shown any sample input so I can't comment further)

edited Apr 9, 2023 at 12:29

answered Apr 9, 2023 at 12:17


zdim · Accepted Answer · 2023-04-20 17:44:29Z

The match operator in list context indeed returns captures, so the regex has to capture something, not just match. From perlop:

Matching in list context

If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, that is, ($1, $2, $3...) (Note that here $1 etc. are also set). When there are no parentheses in the pattern, the return value is the list (1) for success. With or without parentheses, an empty list is returned upon failure.

So add a capture group: use /^(\S*)/. Or perhaps /^(\S+)/; see below.
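Applied to the loop from the question, that might look roughly like the sketch below. The contents of @data are invented stand-ins for the real log, and %count is a hypothetical name for the tally hash:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the question's @data
my @data = (
    '192.0.2.1 - - [09/Apr/2023] "GET / HTTP/1.1" 200',
    '192.0.2.7 - - [09/Apr/2023] "GET /a HTTP/1.1" 200',
    '192.0.2.1 - - [09/Apr/2023] "GET /b HTTP/1.1" 404',
);

my $regex = qr/^(\S+)/;   # same idea as the original pattern, with a capture group
my %count;

foreach my $line (@data) {
    my ($relevantData) = ($line =~ $regex);
    next unless defined $relevantData;   # failed match assigns an empty list
    $count{$relevantData}++;             # tally occurrences of the first field
}

print "$_: $count{$_}\n" for sort keys %count;
```

With this sample data, the hash ends up with one entry per distinct first field and its number of occurrences.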

That should take care of the immediate question; here are a few more comments.

The regex ^\S* matches even if the line begins with spaces: the * quantifier allows zero occurrences, so the pattern always matches, even if only the empty string between ^ and the first space. Perhaps that's OK in your application, but you could use ^(\S+), for which the match fails if there are spaces at the beginning.
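A small sketch of that difference, using an invented line that starts with a space (the names $star and $plus are just for illustration):

```perl
use strict;
use warnings;

my $line = ' leading space';   # made-up input beginning with a space

my ($star) = ($line =~ /^(\S*)/);   # matches; captures the empty string
my ($plus) = ($line =~ /^(\S+)/);   # fails; $plus stays undefined

printf "star: defined=%d length=%d\n", defined($star) ? 1 : 0, length $star;
printf "plus: defined=%d\n",           defined($plus) ? 1 : 0;
```

With * you get a defined but empty value, while with + you get undef, so the two patterns call for different checks afterwards.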

Next, a variable ($relevantData) is declared, and it may or may not get populated: it stays undefined if the match fails, or gets an empty string with ^(\S*) when the line begins with a space. It is then printed, or in real code otherwise processed, regardless of what happened.

Perhaps it is checked later, and perhaps real code is very different from this. Still, one can right away exclude some cases -- by first skipping lines with no non-space (next if not /\S/; or such, before even chomp-ing), or checking that variable right after the match.

Another way is to branch on whether it matched

if ( my ($relevantData) = /^(\S+)/ ) { 
    ... 
}
else  { ... }

Or one can simply test with if (/^(\S+)/) and assign from $1 inside the block.
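A minimal sketch of that variant, again with made-up input lines and a hypothetical %count hash:

```perl
use strict;
use warnings;

my %count;

# Made-up lines; the second one contains only spaces and is skipped
for ('192.0.2.1 GET /', '   ', '192.0.2.1 GET /x') {
    if (/^(\S+)/) {
        # $1 is used only inside the block, after a successful match
        $count{$1}++;
    }
}

print "$_: $count{$_}\n" for sort keys %count;
```

Using $1 only inside the if block avoids reading a stale capture left over from an earlier successful match.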

Or, if you don't care about the line at all when the match fails:

next if not /^(\S+)/;

my $relevantData = $1;
...

This code could be considered a little tricky, though, as it relies on a side effect in a way. (The first statement primarily checks for a match, as it needs to decide whether to call next or not, but it also captures, and the rest of the code uses that capture.)
