I'm brand new to Perl and am trying to build a script that parses output files from IBM SPSS Statistics (SPSS) and automatically generates syntax for some standard procedures (in this example, recoding and designating missing values).
At this point, I've removed a number of extraneous lines and have cleaned up and reformatted my files with some substitution regexes (I undefined the input record separator so that the substitutions could span multiple lines). The text I'm working with looks like this:
VALUE LABELS ROAD
0 'No'
1 'Yes'.
VALUE LABELS NOCALL
1 'Refused to be interviewed'
2 'Not at home'
3 'No one on Premises'
8 'Other'
9997 'Not Applicable'
9999 'Don't Know'.
VALUE LABELS Q1
999 'Don't know'.
VALUE LABELS Q2
1 'Strongly dislike'
2 'Somewhat dislike'
3 'Would not care'
4 'Somewhat like'
5 'Strongly like'
7 'Not Applicable'
9 'Don't know'.
I want to add regexes to my script that go through each block, from "VALUE LABELS" to the "." at the end, and look for either a code ending in 7 labeled "Not Applicable" or a code ending in 9 labeled "Don't Know". For each hit, I want to capture the variable name that comes immediately after "VALUE LABELS" and append it to the end of my output, so that I know which variables have a "Not Applicable" value and which have a "Don't Know" value. In this example, my output would be the original file with these additional lines at the end:
NOT APPLICABLE: NOCALL Q2
DON'T KNOW: NOCALL Q1 Q2
At the moment, I can't for the life of me figure out how to get my regex to read only within each block from "VALUE LABELS" to the period. Instead, it will either grab from the first "VALUE LABELS" to the last instance of "7 Not Applicable" across blocks, or from the first "VALUE LABELS" to the first instance of "7 Not Applicable", whether or not the NA value is in the same block.
My current Perl code is as follows:
#!/bin/perl
use strict;
use warnings;
BEGIN { # Input and Output Record Separators Off
    $\ = undef;
    $/ = undef;
}
open( my $infile, "<", $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";
my $outfile = "t2" . $ARGV[0];
open( my $write, ">", $outfile ) or die "Cannot open $outfile: $!\n";
LINE: while ( <$infile> ) {
    # These are the regexes currently cleaning and reformatting the input
    s/\f/\n/g;
    s/(\d+\s.*)(\n\n)/$1\.$2/g;
    s/(\R\R).*\R\R/$1/g;
    s/(\R\R).*\R\R/$1/g;
    s/(\R\R)(.*\R)/$1VALUE LABELS $2/g;
}
continue {
    die "-p destination: $!\n" unless print $write "$_";
    # Here is the regex I'm having an issue with
    # $_ holds the whole slurped file here; $infile is only the filehandle
    if ( $_ =~ m/VALUE LABELS(.*)\n(?s).*\d+7 \x27Not Applicable\x27.*?\./g ) {
        print $write "\n\nNOT APPLICABLE: $1";
    }
}
Is there a way I can have this return what I'm looking for? Is there maybe a better way to write this entire script that would let me change the line separators part way through?
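One possible approach, sketched under assumptions rather than offered as the definitive fix: instead of one regex that spans the whole file, walk the text one block at a time and test each block on its own. The helper name `find_missing_codes` is hypothetical. The sketch assumes the text has already been cleaned into the format shown above, uses Unix newlines, ends every block with a period at the end of a line, and treats any code ending in 7 as "Not Applicable" and any code ending in 9 as "Don't Know" (matched case-insensitively, since the sample has both "Know" and "know").

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): scans cleaned text
# and returns the variable names that carry each kind of missing value.
sub find_missing_codes {
    my ($text) = @_;
    my ( @not_applicable, @dont_know );

    # One loop iteration per block: capture the variable name, then the value
    # lines up to the first period that ends a line. /m lets ^ and $ match at
    # line boundaries, /s lets . cross newlines, and the lazy .*? stops at the
    # nearest terminator instead of running into later blocks.
    while ( $text =~ /^VALUE LABELS[ \t]+(\S+)[ \t]*\n(.*?\.)[ \t]*$/msg ) {
        my ( $name, $body ) = ( $1, $2 );

        # Assumption: NA codes end in 7 and DK codes end in 9
        push @not_applicable, $name if $body =~ /^\d*7 'Not Applicable'/m;
        push @dont_know,      $name if $body =~ /^\d*9 'Don't Know'/mi;
    }
    return ( \@not_applicable, \@dont_know );
}

# Sample input copied from the example above
my $sample = <<'END';
VALUE LABELS ROAD
0 'No'
1 'Yes'.
VALUE LABELS NOCALL
1 'Refused to be interviewed'
2 'Not at home'
3 'No one on Premises'
8 'Other'
9997 'Not Applicable'
9999 'Don't Know'.
VALUE LABELS Q1
999 'Don't know'.
VALUE LABELS Q2
1 'Strongly dislike'
2 'Somewhat dislike'
3 'Would not care'
4 'Somewhat like'
5 'Strongly like'
7 'Not Applicable'
9 'Don't know'.
END

my ( $na, $dk ) = find_missing_codes($sample);
print "NOT APPLICABLE: @$na\n";
print "DON'T KNOW: @$dk\n";
```

Because each block body is captured separately, a match for the 7 or 9 code can never reach across into a neighboring block. On changing the record separator part way through: `$/` can be localized, for example `my $text = do { local $/; <$infile> };`, which slurps the file inside the `do` block and restores the previous separator as soon as the block ends. Once the whole file is in a string, the block loop above works on that string directly, so the separator may not need to change again.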