5

I want to parse a string character by character in Perl. Is there a way to start at the first character of the string and then loop over it one character at a time? Right now I split the string into an array and loop through the array.

$var="junk shit here. fkuc lkasjdfie.";
@chars=split("",$var);

But instead of splitting the whole string up front, is there some kind of descriptor that points to the first character of the string and then traverses it character by character?

6
  • With Perl, there is generally a [better] way of solving a problem without traversing/parsing strings yourself. What is your real goal? Commented Apr 27, 2014 at 7:09
  • @perreal: My main goal is to split a given (large) string into sentences, splitting on "." and "?". But there are more constraints: "Dr. W. Fletcher" should not be split into 3 sentences just because it contains 3 occurrences of "." (full stop). It is still a single sentence. Commented Apr 27, 2014 at 7:13
  • And how will you know if the current . is one that marks end of sentence or if it is part of a sentence? Commented Apr 27, 2014 at 7:23
  • @Ashwin, sounds like you can use a regex split with look-around assertions. Commented Apr 27, 2014 at 7:24
  • @szabgab: That is the problem I am trying to solve. As soon as a full stop is encountered, it has to be checked against the preceding characters to determine whether they were initials, titles, etc. Commented Apr 27, 2014 at 7:45
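
The check described in the comments above (look at what precedes each full stop before deciding to split) could be sketched roughly as follows. This is only a sketch under assumptions: the `%abbrev` list and the `split_sentences` helper are hypothetical names, and treating any single capital letter as an initial is a heuristic that will misjudge some real sentences.

```perl
use strict;
use warnings;

# Hypothetical list of titles that should not end a sentence.
my %abbrev = map { $_ => 1 } qw(Dr Mr Mrs Ms Prof);

sub split_sentences {
    my ($text) = @_;
    my @sentences;
    my $start = 0;
    # Walk over candidate terminators without splitting the string first.
    while ($text =~ /[.?]/g) {
        my $end = pos($text);    # offset just past the punctuation mark
        # Text of the current sentence so far, without the punctuation.
        my $before = substr($text, $start, $end - $start - 1);
        my ($word) = $before =~ /(\w+)\z/;
        # Assumption: known titles and single capital letters are not sentence ends.
        next if defined $word && ($abbrev{$word} || $word =~ /\A[A-Z]\z/);
        my $sentence = substr($text, $start, $end - $start);
        $sentence =~ s/\A\s+//;  # drop the space left over from the previous sentence
        push @sentences, $sentence;
        $start = $end;
    }
    return @sentences;
}

my @sentences = split_sentences(
    "The story of Dr. W. Fletcher who is a dentist. Is he the hero?"
);
print "$_\n" for @sentences;
```

Any text after the last terminator is ignored by this sketch; a fuller version would need to decide what to do with it.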

4 Answers

3
my $var = "junk sit here. fkuc lkasjdfie.";

while ($var =~ /(.)/sg) {
   my $char = $1;
   # do something with $char 
}

or

for my $i (1 .. length $var) {
  my $char = substr($var, $i-1, 1);
}

When benchmarked, the substr method performs better than the while loop:

use Benchmark qw( cmpthese ) ;
my $var = "junk sit here. fkuc lkasjdfie." x1000;

cmpthese( -5, {
    "while" => sub{
      while ($var =~ /(.)/sg) {
         my $char = $1;
         # do something with $char 
      }
    },
    "substr" => sub{
      for my $i (1 .. length $var) {
        my $char = substr($var, $i-1, 1);
      }
    },
});

result

         Rate  while substr
while  56.3/s     --   -53%
substr  121/s   114%     --

2 Comments

How does the first method work? And would using a regex be faster than the second method?
@Ashwin: see perldoc.perl.org/perlrequick.html#More-matching. To be sure about speed, you need to benchmark: perldoc.perl.org/Benchmark.html
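
On how the first method works: in scalar context, a match with the /g modifier resumes where the previous match on that string ended, and pos() reports that position. The following is a small sketch of that behaviour; `$text` and `@seen` are illustrative names, not part of the answer.

```perl
use strict;
use warnings;

my $text = "ab\nc";
my @seen;    # illustrative: collects [offset, character] pairs

# Each iteration matches exactly one character and advances the position;
# /s lets "." match the newline too.
while ($text =~ /(.)/sg) {
    my $char = $1;
    # pos() is the offset just past the match, so this character began one earlier.
    push @seen, [ pos($text) - 1, $char ];
}

printf "%d: %s\n", $_->[0], ($_->[1] eq "\n" ? '\n' : $_->[1]) for @seen;
```

The loop ends when the match fails at the end of the string, which also resets pos() so the string can be scanned again.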
2

This can be the skeleton of the script/regex:

use strict;
use warnings;
use Data::Dumper qw(Dumper);

my $str = "The story of Dr. W. Fletcher who is a dentist. The hero of the community.";

my @sentences = split /(?<!(Dr| \w))\./, $str;
print Dumper \@sentences;

And the output is:

$VAR1 = [
      'The story of Dr. W. Fletcher who is a dentist',
      undef,
      ' The hero of the community'
    ];
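
The undef in that output comes from the capturing parentheses inside the look-behind: split adds the value of every capture group to the list it returns. A variant with a non-capturing group, offered as a sketch rather than a complete sentence splitter, avoids the extra element:

```perl
use strict;
use warnings;

my $str = "The story of Dr. W. Fletcher who is a dentist. The hero of the community.";

# (?:...) groups the alternatives without capturing, so split returns only the pieces.
my @sentences = split /(?<!(?:Dr| \w))\./, $str;
print "[$_]\n" for @sentences;
```

The leading space on the second piece is still there and would need trimming separately.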

2 Comments

What does the "<" symbol signify after the "?"?
(?<! ... ) is a negative look-behind. Search metacpan.org/pod/distribution/perl/pod/perlre.pod for 'look-behind'
1

Uses less memory than split, faster than "while ( $text =~ /(.)/sg ) { ... }":

my $text = 'Ö' x 10000;  # encoded
if ( open my $fh, '<:encoding(UTF-8)', \$text ) {
    while ( read $fh, my $chr, 1 ) {
        my $enc = $chr;  # decoded
        utf8::encode($enc) if utf8::is_utf8($enc);
        print $enc, ' ';
    }
}

2 Comments

You aren't reading a character here: you are reading an octet.
Brian, you are right. The code is corrected and now reads by character.
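
The octet-versus-character distinction raised in these comments can be seen directly with the core Encode module. This is a sketch for illustration only; the variable names are assumptions and it is independent of the filehandle approach in the answer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "\xC3\x96" is the UTF-8 encoding of "Ö" (U+00D6), written here as raw octets.
my $octets = "\xC3\x96" x 3;
my $chars  = decode('UTF-8', $octets);    # decoded: a string of 3 characters

# substr and length count octets on the raw string, characters on the decoded one.
my @units_raw     = map { substr($octets, $_, 1) } 0 .. length($octets) - 1;
my @units_decoded = map { substr($chars,  $_, 1) } 0 .. length($chars)  - 1;

printf "octets: %d, characters: %d\n", scalar @units_raw, scalar @units_decoded;

# Encode each character again before printing to a byte-oriented handle.
print encode('UTF-8', $_), ' ' for @units_decoded;
print "\n";
```

Iterating over the undecoded string would hand back half of an "Ö" at a time, which is the problem the comment points out.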
0

I don't know if it's faster than splitting it, but you can make a copy, reverse it, and then chop it until it's empty.

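my $str = "dude";
my $rev = reverse($str);  # scalar context: reverses the characters
for (my $i = length($rev); $i > 0; $i--) {
  # chop removes and returns the last character of $rev
  print chop $rev; print "\n";
}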
