2

I am writing a program that is intended to read through a large log file of web server activity. My intent is to have a few different regular expressions that grab specific bits of each line of the log and place them into a hash to keep track of how many times each IP, browser type, etc. appear. The part that is giving me trouble is taking that specific bit of text that matches the regex out of each line so I can analyze it alone. What I have currently is:

my @regexes = (qr/^\S*/);
# Iterate through each line of the data with each regex
foreach my $regex (@regexes) {
    # Create an empty hash for all the data
    my %dataHash;
    foreach my $line (@data) {
        # Up to this point I have verified that $line contains the correct line I want to take a "substring" of.
        my ($relevantData) = ($line =~ $regex);
        #print("$relevantData\n");

Printing $relevantData is not the end goal here of course, but it's to verify that I'm properly getting what I need. I don't think it's relevant but the @data array is just an array of the aforementioned log split at each line.

When I have this print statement active, it just prints "1" over and over again. Currently, the regex I'm using is just meant to take everything from the start of each line up to the first whitespace character, so what I'm expecting is that first word. I've tried messing around with parenthesis placement, and what I have seems to match examples I've found online, but I may be misunderstanding them. This is technically a repost, since it duplicates an existing question, but I consulted that post before posting and replicated its approach, and it doesn't seem to be working, so I am unsure what I'm doing wrong. Thank you in advance!

2 Answers

3

As general advice: if a Perl expression returns 1 where it shouldn't, you are most probably dealing with a Boolean value (1 is true) or a count (or both, since 0 is false).

That's because regexes are often used in conditional clauses like if(/regex/).

A good starting point for learning regexes in Perl is https://perldoc.perl.org/perlrequick

You'll find this concise example at perlrequick#Extracting-matches

In list context, a match /regex/ with groupings will return the list of matched values ($1,$2,...). So we could rewrite it as

($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

Your qr/^\S*/ has no groupings (capturing parentheses), which is why $relevantData just ends up as 1, meaning "the match succeeded".
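To make the difference concrete, here is a minimal sketch comparing the same pattern with and without a capture group. The log line is a made-up example, since the question did not include sample input, and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical log line; the question did not show real input.
my $line = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512';

# No capture group: in list context a successful match returns the list (1).
my ($without_group) = ($line =~ qr/^\S*/);

# With a capture group: the match returns the captured text (the same as $1).
my ($with_group) = ($line =~ qr/^(\S*)/);

print "without group: $without_group\n";   # without group: 1
print "with group: $with_group\n";         # with group: 192.0.2.10
```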

(You haven't shown any sample input so I can't comment further)

2

The match operator in list context indeed returns captures. So a regex has to capture something, not only to match. It says in perlop

Matching in list context

If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, that is, ($1, $2, $3...) (Note that here $1 etc. are also set). When there are no parentheses in the pattern, the return value is the list (1) for success. With or without parentheses, an empty list is returned upon failure.

So to capture, use /^(\S*)/. Or perhaps /^(\S+)/; see below.


That should take care of the immediate question; here are a few more comments.

The regex ^\S* matches even if the line begins with spaces: the * lets it match zero characters, so it always succeeds, even if it matches only the empty string between ^ and the first space. Perhaps that's OK in your application, but you could use ^(\S+), for which the match will fail if there are spaces at the beginning.
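A small sketch of that difference, using a made-up line with leading spaces (the input and variable names are assumptions for illustration):

```perl
use strict;
use warnings;

# Hypothetical line that starts with spaces.
my $indented = "   10.0.0.1 GET /index.html";

my ($star) = $indented =~ /^(\S*)/;   # matches, but captures the empty string
my ($plus) = $indented =~ /^(\S+)/;   # match fails, so $plus stays undef

printf "star: defined=%d, length=%d\n", defined($star) ? 1 : 0, length $star;
printf "plus: defined=%d\n", defined($plus) ? 1 : 0;
# star: defined=1, length=0
# plus: defined=0
```

With ^(\S*) the caller has to check for an empty string; with ^(\S+) a failed match leaves the variable undefined, which is easier to test for.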

Then the variable $relevantData is declared, and it may get populated, or it may stay undefined if the match fails (or it may get an empty string, with ^(\S*), when the line begins with a space). It is then printed, or otherwise processed, regardless of what happened.

Perhaps it is checked later, and perhaps the real code is very different from this. Still, one can exclude some cases right away -- by first skipping lines with no non-space characters (next if not /\S/; or such, before even chomp-ing), or by checking that variable right after the match.

Another way is to branch on whether it matched

if ( my ($relevantData) = /^(\S+)/ ) { 
    ... 
}
else  { ... }

Or one can simply test with if (/^(\S+)/) and assign $1 inside the block.

Or, if you don't care about the line at all when the match fails

next if not /^(\S+)/;

my $relevantData = $1;
...

This code could be considered a little tricky though, as it relies on a side effect in a way. (The first statement checks for a match, since it needs to decide whether to call next or not -- but it also captures, and the rest of the code uses that capture.)
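Putting this together with the counting goal from the question, a minimal sketch might look like the following. The @data contents and the %count hash are assumptions for illustration; the loop variable is named, so the match is bound to $line explicitly instead of relying on $_.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the question's @data array.
my @data = (
    '192.0.2.10 - - "GET / HTTP/1.1" 200',
    '192.0.2.11 - - "GET /about HTTP/1.1" 200',
    '',
    '192.0.2.10 - - "GET /contact HTTP/1.1" 404',
);

my %count;
foreach my $line (@data) {
    # Skip lines that do not start with a non-space token (here, the empty line).
    next unless $line =~ /^(\S+)/;
    $count{$1}++;   # $1 holds the captured first word
}

foreach my $ip (sort keys %count) {
    print "$ip: $count{$ip}\n";
}
# 192.0.2.10: 2
# 192.0.2.11: 1
```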
