Perl - Multi-line Regex & Appending Based on Capture Group

Question

I'm brand new to Perl, and trying to build a script to parse some output files from IBM SPSS Statistics (SPSS) to automatically generate syntax for some standard procedures (in this example, re-coding and designating missing values).

At this point, I've removed a number of extraneous lines and have my files pretty cleaned up and reformatted via some substitution regexes (where I turned the input record separator off to do my multi-line substitutions). The text I'm working with looks like this:

VALUE LABELS ROAD   
0 'No'   
1 'Yes'.

VALUE LABELS NOCALL   
1 'Refused to be interviewed'   
2 'Not at home'   
3 'No one on Premises'   
8 'Other'   
9997 'Not Applicable'   
9999 'Don't Know'.

VALUE LABELS Q1   
999 'Don't know'.     

VALUE LABELS Q2   
1 'Strongly dislike'   
2 'Somewhat dislike'   
3 'Would not care'   
4 'Somewhat like'   
5 'Strongly like'   
7 'Not Applicable'   
9 'Don't know'.

I want to add regexes to my script that will go through each block between "VALUE LABELS" and the "." at the end and look for either a 7 followed by "Not Applicable" or a 9 followed by "Don't Know", capturing the variable name that comes immediately after "VALUE LABELS" and appending it to the end of my output so that I know which variables have a "Not Applicable" value and which have a "Don't Know" value. So in this example, my output would be the original file with these additional lines at the end:

NOT APPLICABLE: NOCALL Q2  
DON'T KNOW: NOCALL Q1 Q2

At the moment, I can't for the life of me figure out how to get my regex to read only within each block from "VALUE LABELS" to the period. Instead, it will either grab from the first "VALUE LABELS" to the last instance of "7 Not Applicable" across blocks, or from the first "VALUE LABELS" to the first instance of "7 Not Applicable", whether or not the NA value is in the same block.

My current Perl code is as follows:

#!/bin/perl

use strict;
use warnings;

BEGIN {    # Input and Output Record Separators Off
    $\ = undef;
    $/ = undef;
}

open( my $infile, "<", $ARGV[0]);

my $outfile = "t2" . $ARGV[0];
open( my $write, ">", $outfile);

LINE: while ( <$infile> ) {

    # These are the regexes currently cleaning and reformatting the input

    s/\f/\n/g;
    s/(\d+\s.*)(\n\n)/$1\.$2/g;
    s/(\R\R).*\R\R/$1/g;
    s/(\R\R).*\R\R/$1/g;
    s/(\R\R)(.*\R)/$1VALUE LABELS $2/g;
}
continue {
    die "-p destination: $!\n" unless print $write "$_";
# Here is the regex I'm having an issue with
    if ( $infile =~ m/VALUE LABELS(.*)\n(?s).*\d+7 \x27Not Applicable\x27.*?\./g) {
    print $write "\n\nNOT APPLICABLE: $1";
    }
}

Is there a way I can have this return what I'm looking for? Is there maybe a better way to write this entire script that would let me change the line separators part way through?

Is the blank line between blocks guaranteed to always be there?
– Borodin, Commented Mar 20, 2017 at 5:06

zdim · Accepted Answer · 2017-03-22 17:31:33Z

1

On the face of it, you are asking for the range operator.

while (<$fh>)
{   
    if (/^\s*VALUE LABELS/ .. /\.$/) {
        # a line between the two identified above (including them)
        # process as below
    }
}
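
As a minimal sketch of how the range operator behaves on data like yours: the helper name lines_in_blocks and the inline sample are made up for illustration, and the end pattern is loosened to /\.\s*$/ on the assumption that some lines carry trailing spaces after the period.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the lines that fall inside a
# "VALUE LABELS ... ." block, using the range (flip-flop) operator.
sub lines_in_blocks {
    my ($text) = @_;
    my @kept;
    for (split /\n/, $text) {
        # True from a line that starts a block through the line that
        # ends in a period (trailing whitespace tolerated: an assumption)
        if (/^\s*VALUE LABELS/ .. /\.\s*$/) {
            push @kept, $_;
        }
    }
    return @kept;
}

my $sample = join "\n",
    "some stray line",
    "VALUE LABELS ROAD",
    "0 'No'",
    "1 'Yes'.",
    "",
    "another stray line",
    "VALUE LABELS Q1",
    "999 'Don't know'.";

# Only the lines belonging to the two blocks are printed
print "$_\n" for lines_in_blocks($sample);
```

The stray lines and the blank line are dropped because the operator is false outside a block. It keeps its state between iterations, which is what makes it fit this kind of start/end marker.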

Your specification "to the period" is a little simple, but I trust that you know your data.

However, since your files have been "cleaned up" so that they have only blocks of the shown format, you don't really need to identify the range. The rest of the code is fairly straightforward.

Based on the data, I take 7 or 9 to be the last digit of a number at the start of the line, followed by whitespace and one of those phrases. Please clarify if this isn't correct.

my (%res, $label_name);    
while (<$fh>) 
{
    next if /^\s*$/;

    if (/^\s*VALUE LABELS\s*(.*)/) {
        $label_name = $1;
        next;
    }

    if (/^\d*7\s*'(Not Applicable)'/i or /^\d*9\s*'(Don't Know)'/i)  # '
    {
        # $1 has either "Not Applicable" or "Don't Know"
        push @{$res{uc $1}}, $label_name;
    } 
}
print "$_: @{$res{$_}}\n" for keys %res;

This prints the desired output, although the order of hash keys is not guaranteed; use sort keys %res if the order of the two lines matters.

The $label_name is reset each time a VALUE LABELS line is encountered, so it always holds the variable name of the current block. Empty lines are skipped.

The data winds up in the hash %res, whose keys are the two captured phrases (upper-cased). The value for each key is an anonymous array, and the $label_name for the current block is added each time a phrase is detected. This is done by pushing it onto the dereferenced array for that key, @{ $res{uc $1} }.

For references and complex data structures see tutorial perlreftut and cookbook perldsc.

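The uc is used to change to upper case, per the desired output format. This is a little wasteful since uc runs on every match. You can instead omit it and post-process the resulting hash; that involves copying the hash into a new one, which may or may not be more efficient. Or you can apply uc only when printing the results. Either way, keep in mind that uc also folds "Don't Know" and "Don't know" (both appear in the sample data) into a single key, so whatever replaces it has to merge those two spellings as well.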
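
One way to avoid calling uc on every match is to skip the capture and push under fixed keys. This is a variation on the code above, not what it does as written; the helper name collect_labels and the inline sample are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collect variable names per phrase from cleaned-up text.
# Keys are fixed strings, so no uc() call is needed per match, and
# "Don't Know" / "Don't know" end up under the same key.
sub collect_labels {
    my ($text) = @_;
    my (%res, $label_name);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;
        if ($line =~ /^\s*VALUE LABELS\s+(\S+)/) {
            $label_name = $1;
            next;
        }
        next unless defined $label_name;    # ignore anything before the first block
        if    ($line =~ /^\d*7\s+'Not Applicable'/i) { push @{ $res{"NOT APPLICABLE"} }, $label_name }
        elsif ($line =~ /^\d*9\s+'Don't Know'/i)     { push @{ $res{"DON'T KNOW"} },     $label_name }
    }
    return \%res;
}

my $res = collect_labels(<<'END');
VALUE LABELS NOCALL
9997 'Not Applicable'
9999 'Don't Know'.

VALUE LABELS Q1
999 'Don't know'.
END

# Sorting the keys gives a stable order; plain "keys" order is not guaranteed
print "$_: @{ $res->{$_} }\n" for sort keys %$res;
```

The /i flag still handles the mixed capitalization in the input, while the output spelling is set once in the key.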

To append content to a file, open it in append mode with '>>', as in the last snippet of this answer.

What remains is to connect this with the processing you show, which cleans up the data. I don't know why you need to process the file as a string. There may well be good reasons for that, but I would not recommend it for what the question asks once the data has been cleaned up. A regex on multi-line text in place of the simple processing above is much harder to write and more brittle when the data changes.

One change in your code is necessary, in how the record separators are used. Normally you want to localize changes to them, not set them in a BEGIN block. Like so:

my $file_content;
CLEAN_UP_DATA: {
    local $/;  # slurp the file ($/ is now undef)
    open my $fh, '<', $file or die "Can't open $file: $!";
    $file_content = <$fh>;    
    # process file content, for example like with code in the question
};

# Here $/ is whatever it was before the block, likely the good old default

The label on the block (CLEAN_UP_DATA:) is only there for readability and isn't required. The semicolon after the closing brace is harmless but also not required for a bare block. Note that once $/ is undefined, the whole file is read into a string at once. (Your while (<$infile>) loop therefore runs exactly one iteration. You can see this by printing $. inside the loop.)
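
To see the scoping in action, here is a small sketch. The slurp helper and the temporary file are made up for the demonstration; File::Temp ships with Perl.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($file) = @_;
    local $/;    # undef only until this sub returns
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Temporary file created only for this illustration
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print $tmp_fh "line 1\nline 2\nline 3\n";
close $tmp_fh;

my $content = slurp($tmp_name);
print "slurped ", length($content), " characters in one read\n";

# Out here $/ is the default newline again, so reads are line by line
open my $fh, '<', $tmp_name or die "Can't open $tmp_name: $!";
my @lines = <$fh>;
close $fh;
print "read ", scalar(@lines), " separate lines\n";
```

Because the change to $/ is undone when the enclosing scope exits, later code (including ^ and $ handling in line-by-line loops) sees ordinary lines again.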

Then you can continue. One way is to break the string with cleaned up content into lines

foreach my $line (split /\n/, $file_content) {
    # process line by line
}

and use the code in this answer as it stands (or other line-by-line approaches).
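
If you would rather keep a while (<$fh>) loop unchanged instead of splitting, Perl can also open a read handle directly on a string. A minimal sketch, with a made-up sample standing in for the cleaned-up content:

```perl
use strict;
use warnings;

# Stand-in for the cleaned-up content (sample data, for illustration only)
my $file_content = "VALUE LABELS Q2\n7 'Not Applicable'\n9 'Don't know'.\n";

# Open a read-only filehandle on the string itself (an in-memory file)
open my $fh, '<', \$file_content or die "Can't open in-memory handle: $!";

my @names;
while (my $line = <$fh>) {
    chomp $line;
    # Same kind of per-line test as in the loop earlier in this answer
    push @names, $1 if $line =~ /^\s*VALUE LABELS\s+(\S+)/;
}
close $fh;

print "variables seen: @names\n";
```

This reads by whatever $/ is at that point, so it relies on $/ having been localized during the slurp.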

Another way is to simply write out the cleaned-up file and open that afresh.

CLEAN_UP_DATA: {
    local $/;  # slurp the file ($/ is now undef)
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $file_content = <$fh>;    
    # process file content
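    open my $fh_out, '>', $outfile  or die "Can't open $outfile: $!";
    print $fh_out $file_content;    # write out the cleaned-up content
    close $fh_out or die "Can't close $outfile: $!";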
}; 

open my $fh, '<', $outfile  or die "Can't open $outfile: $!";
# Process line by line, obtaining %res
close $fh;

open my $fh_app, '>>', $outfile  or die "Can't open $outfile to append: $!";
# Now append results as needed, for example
print $fh_app "$_: @{$res{$_}}\n" for keys %res;

Here you can also use code in this answer as is, or other line-by-line solutions.

edited Mar 22, 2017 at 17:31

answered Mar 19, 2017 at 20:40

zdim


12 Comments

CGhost Over a year ago

Forgive me, as I'm so new to this, for such simple questions, but 1. how can I write from the $label_name variable into my $write variable, and 2. because I have $\ set to undefined, ^ and $ won't work. Is it possible to do my initial multi-line regexes without having to turn $\ off?

CGhost Over a year ago

OK, I have this partially working, but I seem to only be able to bring one result outside of the block. I need to capture the full list of variable names that have "Don't Know" or "Not Applicable" labels with these matches, and then print them at the end of the continue block. How can I store all of the hits inside the while{} block, and then bring them outside into my continue statement?

zdim Over a year ago

@CGhost I'll add more. As for $/, normally you'd localize it and change it in a block, -- local $/;. Then as the code exits that block the global value is seen again. Like { local $/; # use it ... }; # now global (previous) $/ is on. What you are seeing is precisely the purpose of that.

zdim Over a year ago

@CGhost I updated a lot, restructuring to what I think better fits what you have, and adding a bit. Let me know how it goes.

zdim Over a year ago

@CGhost Another thing I just noticed in your first comment. You don't have to touch record-separators for the regex, and you don't want to. Multi-line text is processed fine with regex, in a number of ways. But note that it generally gets harder.


Borodin · Answer · 2017-03-20 09:21:05Z

1

If the full stops . are guaranteed to appear only at the end of each block then I would recommend using a full stop as the input record separator

This program reads each block into $_ and extracts the variable name after VALUE LABELS. Then the block is checked for 7 Not Applicable and 9 Don't Know, and the variable name is added to the list in %info for each phrase that was present

The output is simply a matter of dumping the hash

use strict;
use warnings 'all';

my ($file) = @ARGV;

my %info;

open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};

local $/ = ".";    # Terminate each read at a full stop

while ( <$fh> ) {

    next unless my ($var) = /VALUE LABELS\s+(\S+)/;

    for my $pattern ( qr/7\s+'(Not Applicable)'/i, qr/9 '(Don't Know)'/i ) {
        push @{ $info{uc $1} }, $var if /$pattern/;
    }
}

while ( my ($label, $vars) = each %info ) {
    printf "%s: %s\n", $label, "@$vars";
}

output

DON'T KNOW: NOCALL Q1 Q2
NOT APPLICABLE: NOCALL Q2

edited Mar 20, 2017 at 9:21

answered Mar 20, 2017 at 5:38

Borodin


5 Comments

zdim Over a year ago

This won't work as a complete program since OP clearly doesn't have the shown data in a file.

Borodin Over a year ago

@zdim: I understood from the question that the data is in a file, as the OP says they want to "parse some output files from one program (SPSS)" and their code uses $ARGV[0] as the input file name.

CGhost Over a year ago

That's the formatting after those substitution regexes run. The period delimiter is actually added through multi-line regexes in the while loop, which is part of why it's been so difficult to grab these variable names with the specified Don't Know and Not Applicable values. I see comments to run the while loop with $/ kept as the default; I briefly tried to do that last night and it resulted in chaos with getting my formatting the way I wanted it. I'll take another stab this morning.

Borodin Over a year ago

@CGhost: Then I think you should show your original data so that we can help with the whole process. How about opening a new question and explaining exactly what output you need, and whether it comes from a file or through a pipe?

CGhost Over a year ago

Thanks! I went with zdim's answer due to the level of detail provided and the way it was broken down, but comparing the similarities and minor differences between the two was very helpful. I've up-voted both of you but my account is too new for them to display just yet.

Hellmar Becker · Answer · 2017-03-19 20:40:03Z

-1

I would read the entire input file into a single variable, and then try to match something like /(VALUE LABELS(.*?)\.\n)/gs. The /s modifier lets . match newlines, so the pattern can span several lines, and the .*? does a non-greedy match up to the first dot that immediately precedes a newline.

Then, inside the result of that match, use a second regex to look for the "Not Applicable" string. Repeat until all input has been consumed.
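
A minimal sketch of this block-by-block idea follows. The helper name vars_with_label and the sample text are hypothetical, and the pattern uses /s together with /m, tolerating trailing spaces after the period, which is an assumption about the data.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a slurped string one block at a time and return
# the variable names of the blocks whose body matches $label_re.
sub vars_with_label {
    my ($text, $label_re) = @_;
    my @vars;
    # /s lets "." cross newlines; the non-greedy .*? stops at the first
    # period that ends a line, so a match never spans two blocks.
    while ($text =~ /VALUE LABELS\s+(\S+)(.*?)\.[ \t]*$/gms) {
        my ($var, $body) = ($1, $2);    # copy before the next match clobbers them
        push @vars, $var if $body =~ $label_re;
    }
    return @vars;
}

my $text = <<'END';
VALUE LABELS ROAD
0 'No'
1 'Yes'.

VALUE LABELS NOCALL
8 'Other'
9997 'Not Applicable'
9999 'Don't Know'.

VALUE LABELS Q1
999 'Don't know'.
END

my @na = vars_with_label($text, qr/^\d*7\s+'Not Applicable'/mi);
my @dk = vars_with_label($text, qr/^\d*9\s+'Don't Know'/mi);
print "NOT APPLICABLE: @na\n";
print "DON'T KNOW: @dk\n";
```

Running the second regex only against the captured block body is what keeps a "Not Applicable" in one block from being attributed to the variable of an earlier one.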

answered Mar 19, 2017 at 20:40

Hellmar Becker
