There are already several good discussions of regular expressions and empty lines on SO. I'll remove this question if it is a duplicate.

Can anyone explain why this script outputs 5 3 4 5 4 3 instead of 4 3 4 4 4 3? When I run it in the debugger, $blank and $classyblank stay at "4" (which I assume is the correct value) until just before the print statement.

my ( $blank, $nonblank, $non_nonblank, 
     $classyblank,  $classyspace, $blanketyblank ) = (0) x 6 ; # initialise all six counters, not just the first

while (<DATA>) {

  $blank++ if /\p{IsBlank}/         ; # POSIXly blank - 4?
  $nonblank++ if /^\P{IsBlank}$/    ; # POSIXly non-blank - 3
  $non_nonblank++ if not /\S/       ; # perlishly not non-blank - 4
  $classyblank++ if /[[:blank:]]/   ; # older(?) charclass blankness - 4?
  $classyspace++ if /^[[:space:]]$/ ; # older(?) charclass whitespace - 4
  $blanketyblank++ if /^$/          ; # perlishly *really empty*  - 3

}

print join " ", $blank, $nonblank, $non_nonblank,
            $classyblank, $classyspace, $blanketyblank , "\n" ;

__DATA__

line above only has a linefeed this one is not blank because: words

this line is followed by a line with white space (you may need to add it)

then another blank line following this one

THE END :-\

Is it something to do with the __DATA__ section or am I misunderstanding POSIX regular expressions?


ps:

As noted in a comment on a timely post elsewhere, "really empty" (/^$/) can miss non-emptiness:

perl -E 'my $string = "\n" . "foo\n\n" ; say "empty" if $string =~ /^$/ ;'
perl -E 'my $string = "\n" . "bar\n\n" ; say "empty" if $string =~ /\A\z/ ;'
perl -E 'my $string = "\n" . "baz\n\n" ; say "empty" if $string =~ /\S/ ;' 
  • Then there's if /\A\Z/ and if /\A\z/ ... which are pretty consistent across different languages except python but that's OK. Commented Mar 22, 2016 at 18:56
  • This is perl 5, version 22, subversion 0 (v5.22.0) built for amd64-freebsd Commented Mar 22, 2016 at 19:07
  • Not related to your core question, but my $string = "\n", "foo\n\n" assigns a single newline to $string. The rest is thrown away because of the comma operator. Commented Mar 22, 2016 at 19:31
  • I addressed this at length in my answer to your recent comment, so I won't write it up again to match your newly-phrased question. The only delinquent in the patterns you have used is $, which will match at the end of a string or before a newline if it is the last character. \p{IsBlank} and [[:blank:]] are simple character classes, and you can check what they do from perldoc perluniprops. Commented Mar 22, 2016 at 19:41
  • @Borodin - thanks. I'm trying to get straightened out about the character classes from the charts in perlrecharclass by lining them up with their well-known perl equivalents (such as /\S/) and/or related "idioms". I was getting results I couldn't explain: specifically how \v, \s and \h interact with \n and " ". I think I have it figured out now and will add a separate answer if one doesn't appear. Commented Mar 23, 2016 at 1:12

1 Answer

/\p{IsBlank}/ doesn't check for an empty string. \p matches a character that has the specified Unicode property.

$ unichars '\p{IsBlank}' | cat
 ---- U+0009 CHARACTER TABULATION
 ---- U+0020 SPACE
 ---- U+00A0 NO-BREAK SPACE
 ---- U+1680 OGHAM SPACE MARK
 ---- U+2000 EN QUAD
 ---- U+2001 EM QUAD
 ---- U+2002 EN SPACE
 ---- U+2003 EM SPACE
 ---- U+2004 THREE-PER-EM SPACE
 ---- U+2005 FOUR-PER-EM SPACE
 ---- U+2006 SIX-PER-EM SPACE
 ---- U+2007 FIGURE SPACE
 ---- U+2008 PUNCTUATION SPACE
 ---- U+2009 THIN SPACE
 ---- U+200A HAIR SPACE
 ---- U+202F NARROW NO-BREAK SPACE
 ---- U+205F MEDIUM MATHEMATICAL SPACE
 ---- U+3000 IDEOGRAPHIC SPACE

It matches " \n" since SPACE has the IsBlank property.
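A quick way to check this yourself, as a minimal sketch (the helper name has_blank and the sample strings are my own, not taken from the question):

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string contains at least one
# character with the Blank property anywhere, 0 otherwise.
sub has_blank { return $_[0] =~ /\p{IsBlank}/ ? 1 : 0 }

for my $line ("\n", " \n", "two words\n", "word\n") {
    (my $shown = $line) =~ s/\n/\\n/;    # make the newline visible when printing
    printf "%-14s %s\n", qq{"$shown"}, has_blank($line) ? "match" : "no match";
}
```

Only the strings that contain a space report a match; the newline on its own does not count as a blank.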


/[[:blank:]]/ doesn't check for an empty string. [...] matches a character that is a member of the specified class.

$ unichars '[[:blank:]]' | cat
 ---- U+0009 CHARACTER TABULATION
 ---- U+0020 SPACE
 ---- U+00A0 NO-BREAK SPACE
 ---- U+1680 OGHAM SPACE MARK
 ---- U+2000 EN QUAD
 ---- U+2001 EM QUAD
 ---- U+2002 EN SPACE
 ---- U+2003 EM SPACE
 ---- U+2004 THREE-PER-EM SPACE
 ---- U+2005 FOUR-PER-EM SPACE
 ---- U+2006 SIX-PER-EM SPACE
 ---- U+2007 FIGURE SPACE
 ---- U+2008 PUNCTUATION SPACE
 ---- U+2009 THIN SPACE
 ---- U+200A HAIR SPACE
 ---- U+202F NARROW NO-BREAK SPACE
 ---- U+205F MEDIUM MATHEMATICAL SPACE
 ---- U+3000 IDEOGRAPHIC SPACE

It matches " \n" since SPACE is a member of the [:blank:] POSIX character class and thus a member of the [[:blank:]] character class.
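If what you want is to count lines that consist only of blanks, rather than lines that merely contain one, the pattern needs anchors and a quantifier. A rough sketch, assuming the same layout as the question's __DATA__ (three empty lines, one line holding a single space, four lines of text); the sample text and the hash keys are made up for illustration:

```perl
use strict;
use warnings;

# Stand-in for the question's __DATA__ section (illustrative text only).
my @lines = (
    "\n",
    "some words\n",
    "\n",
    "more words here\n",
    " \n",
    "another line of text\n",
    "\n",
    "THE END\n",
);

my %count = (contains_blank => 0, only_blanks => 0, empty => 0);

for (@lines) {
    $count{contains_blank}++ if /\p{IsBlank}/;     # a blank anywhere in the line
    $count{only_blanks}++    if /^\p{IsBlank}*$/;  # nothing but blanks (or nothing at all)
    $count{empty}++          if /^$/;              # no characters before the newline
}

print join(" ", @count{qw(contains_blank only_blanks empty)}), "\n";
```

With this input it prints 5 4 3: the unanchored pattern counts the four text lines plus the space-only line, while the anchored version counts the three empty lines plus the space-only line.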

1 Comment

Thanks, beginning to grok ... Since $nonblank++ if /\P{IsBlank}/ (without the anchors) gives me "8" (__DATA__ has 8 lines), I assume it is counting \n as non-{IsBlank} (due to \P) and thus sees 8 matches. Then, as /^\P{IsBlank}$/, incrementing is based on the three lines that consist of a single non-blank character (\n), so I get "3". However, /\p{IsBlank}/ gives me a count of "5" because there are five rows with \s style horizontal "blank characters": the four with text (and whitespace between words), and line number 5, which consists of " \n" and appears as an empty row.
