Parsing a string character by character in Perl

Question

I want to parse a string character by character in Perl. Is there a way to start from the first character of the string and loop over it one character at a time? Right now I split the string into an array and loop through the array.

$var="junk shit here. fkuc lkasjdfie.";
@chars=split("",$var);

But instead of splitting the whole string up front, is there any descriptor that points to the first character of the string and then traverses each character? Is there any way to do this?

With Perl, there is generally a better way of solving a problem without traversing/parsing strings yourself. What is your real goal?
– perreal, Commented Apr 27, 2014 at 7:09
@perreal: My main goal is to split a given (large) string into sentences, splitting on "." and "?". But there are further constraints: "Dr. W. Fletcher" should not be split into 3 sentences just because it contains 3 occurrences of "." (full stop). It is still a single sentence.
– Ashwin, Commented Apr 27, 2014 at 7:13
And how will you know whether the current . marks the end of a sentence or is part of a sentence?
– szabgab, Commented Apr 27, 2014 at 7:23
@Ashwin, sounds like you can use a regex split with look-around assertions.
– perreal, Commented Apr 27, 2014 at 7:24
@szabgab: That is the problem I am trying to solve. As soon as a full stop is encountered, it has to be matched against the preceding characters to determine whether they were initials, titles, etc.
– Ashwin, Commented Apr 27, 2014 at 7:45

mpapec · Accepted Answer · 2014-04-27 06:54:26Z

3

my $var = "junk sit here. fkuc lkasjdfie.";

while ($var =~ /(.)/sg) {
   my $char = $1;
   # do something with $char 
}

or

for my $i (1 .. length $var) {
  my $char = substr($var, $i-1, 1);
}

When benchmarked, the substr method performs better than the while loop:

use Benchmark qw( cmpthese ) ;
my $var = "junk sit here. fkuc lkasjdfie." x1000;

cmpthese( -5, {
    "while" => sub{
      while ($var =~ /(.)/sg) {
         my $char = $1;
         # do something with $char 
      }
    },
    "substr" => sub{
      for my $i (1 .. length $var) {
        my $char = substr($var, $i-1, 1);
      }
    },
});

Result:

         Rate  while substr
while  56.3/s     --   -53%
substr  121/s   114%     --

edited Apr 27, 2014 at 6:54

answered Apr 27, 2014 at 6:16

mpapec


2 Comments

Ashwin Over a year ago

How does the first method work? And would using a regex be faster than the second method?

mpapec Over a year ago

@Ashwin, see perldoc.perl.org/perlrequick.html#More-matching for the first method. To be sure about speed, you need to benchmark; see perldoc.perl.org/Benchmark.html
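
To illustrate how the first method walks the string: in scalar context, a match with the /g modifier resumes where the previous match ended, and pos() reports that offset. The snippet below is only a small sketch of that behavior; the sample string and the @seen array are illustrative and not part of the answer.

```perl
use strict;
use warnings;

my $var = "abc";
my @seen;

# Each pass through the loop matches exactly one character, starting
# at pos($var), the offset where the previous match ended.
# The /s modifier lets "." match a newline as well.
while ($var =~ /(.)/sg) {
    push @seen, $1 . ":" . pos($var);
}

print join(" ", @seen), "\n";   # prints "a:1 b:2 c:3"
```

When the match finally fails at the end of the string, pos($var) is reset, so the same loop can be run again from the start.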

szabgab · Accepted Answer · 2014-04-27 09:29:40Z

2

This can be the skeleton of the script/regex:

use strict;
use warnings;
use Data::Dumper qw(Dumper);

my $str = "The story of Dr. W. Fletcher who is a dentist. The hero of the community.";

my @sentences = split /(?<!Dr| \w)\./, $str;
print Dumper \@sentences;

And the output is:

$VAR1 = [
      'The story of Dr. W. Fletcher who is a dentist',
      ' The hero of the community'
    ];
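
Since the goal stated in the comments also includes splitting on "?" and treating titles and initials specially, here is a hedged variation on the same look-behind idea. It splits on the whitespace after the terminator rather than on the terminator itself, so each sentence keeps its punctuation. The sample string and the short exception list ("Dr." and single-letter initials) are assumptions for illustration; real text would need a longer list.

```perl
use strict;
use warnings;

my $str = "The story of Dr. W. Fletcher who is a dentist. Is he the hero? Yes.";

# Split on whitespace that follows ".", "?" or "!", unless the period
# belongs to the title "Dr." or to a single-letter initial such as " W.".
# Every look-behind here has a fixed width.
my @sentences = split /(?<!Dr\.)(?<! \w\.)(?<=[.?!])\s+/, $str;

print "$_\n" for @sentences;
```

One known limitation of this sketch: a sentence that ends in a single-letter word (for example "Plan B.") is treated like an initial and is not split.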

edited Apr 27, 2014 at 9:29

answered Apr 27, 2014 at 9:19

szabgab


2 Comments

Ashwin Over a year ago

What does the "<" symbol after "?" signify?

szabgab Over a year ago

(?<! ... ) is a negative look-behind. Search metacpan.org/pod/distribution/perl/pod/perlre.pod for 'look-behind'

Victor Mironov · Accepted Answer · 2022-06-06 00:37:39Z

1

Uses less memory than split, faster than "while ( $text =~ /(.)/sg ) { ... }":

my $text = 'Ö' x 10000;  # encoded
if ( open my $fh, '<:encoding(UTF-8)', \$text ) {
    while ( read $fh, my $chr, 1 ) {
        my $enc = $chr;  # decoded
        utf8::encode($enc) if utf8::is_utf8($enc);
        print $enc, ' ';
    }
}

edited Jun 6, 2022 at 0:37

answered Jun 4, 2022 at 0:38

Victor Mironov


2 Comments

brian d foy Over a year ago

You aren't reading a character here: you are reading an octet.

Victor Mironov Over a year ago

Brian, you are right. The code is corrected and now reads by character.
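
The octet-versus-character distinction raised in this exchange can be seen in a short sketch using the core Encode module. The byte string below is an assumption chosen for illustration: it holds the two UTF-8 octets that encode "Ö" (U+00D6).

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\x96";               # UTF-8 octets for "Ö" (U+00D6)
my $chars = decode('UTF-8', $bytes);  # decoded character string

print length($bytes), "\n";   # 2: two octets
print length($chars), "\n";   # 1: one character
```

Functions such as substr and length, or a read on a handle opened with an :encoding layer, count characters only once the data has been decoded; on an undecoded byte string they count octets.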

Andreas Wederbrand · Accepted Answer · 2014-04-27 06:15:42Z

0

I don't know if it's faster than splitting, but you can make a copy, reverse it, and then chop it until it's empty.

my $str = "dude";
my $rev = reverse($str);    # scalar context: reverses the characters
while (length $rev) {
  print chop($rev), "\n";   # chop removes and returns the last character
}

answered Apr 27, 2014 at 6:15

Andreas Wederbrand
