1

I'm writing a parser in Perl and I have a problem with split. Here is my code:

my $str = 'a,b,"c,d",e';
my @arr = split(/,(?=([^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
# try to split the string on commas, but only where the comma is followed by an even number (possibly zero) of quotes

foreach my $val (@arr) {
    print "$val\n"
}

I'm expecting the following:

a
b
"c,d"
e

But this is what I actually get:

a
b,"c,d"
b
"c,d"
"c,d"

e

I can see that the parts I want are in the array, at indices 0, 2, 4 and 6. But how do I avoid the extra elements, such as b,"c,d", in the resulting array? Is there an error in my regexp delimiter, or is there some special split option?

2
  • do matching instead of splitting "[^"]*"|[^,]+ Commented Oct 30, 2015 at 13:02
  • You're using split and a fancy regex in Perl and never heard that split creates elements from capture groups? Commented Oct 30, 2015 at 15:42

4 Answers

4

You need to use a non-capturing group:

my @arr = split(/,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
                      ^^

See IDEONE demo

Otherwise, the text captured by the group is returned as extra elements of the resulting list.

See perldoc reference:

If the regex has groupings, then the list produced contains the matched substrings from the groupings as well
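
As a minimal sketch of that behaviour (the string 'a,b,c' and the variable names are made up for illustration), the same delimiter can be compared with and without a capturing group:

```perl
use strict;
use warnings;

my $str = 'a,b,c';

# Capturing group: each captured separator becomes an extra list element.
my @with_capture = split /(,)/, $str;

# Non-capturing group: only the fields themselves are returned.
my @without_capture = split /(?:,)/, $str;

print scalar(@with_capture), " elements: @with_capture\n";        # 5 elements: a , b , c
print scalar(@without_capture), " elements: @without_capture\n";  # 3 elements: a b c
```

The wanted fields sit at the even indices of the first list, which matches the 0, 2, 4, 6 pattern described in the question.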


2 Comments

Thanks! I enjoy stackoverflow :)
I only showed where those "weird" elements come from. Please check Sobrique's answer. If you really work with CSV, it is best to use the existing proven tools to parse delimited text.
4

What's tripping you up is a feature of split: if the pattern contains a capturing group, split returns the captured text as well.

But rather than using split I would suggest the Text::CSV module, that already handles quoting for you:

#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

my $csv    = Text::CSV->new();
my $fields = $csv->getline( \*DATA );

print join "\n", @$fields;

__DATA__
a,b,"c,d",e

Prints:

a
b
c,d
e

My reasoning is fairly simple: you're doing quote matching and may have to deal with things like quoted or escaped quotes, which means you're trying to do a recursive parse, and that is something regex simply isn't well suited to.

6 Comments

Why advocate Text::CSV, and how does it work? Can't a simple regex work better? I don't get why you mention recursive parsing, which Perl is very good at.
I advocate it because it works, and it makes the code to parse your data very simple. Regular expressions are great for splitting a line by a delimiter or for finding a pattern to search and replace. They are worse at matched delimiters, like quotes, brackets or start/end tags, because that's a recursive problem. You can do it, but you'll find lots of edge cases where it doesn't work. Far better to use a parser that can handle it.
Perl can do recursive parsing: Text::CSV or XML::Twig do it very nicely. Regex cannot do it very well (technically it's possible, but you end up with a very big and hard-to-understand regex).
CSV parsing does not involve recursion, it's delimited. And Perl regex can indeed do recursive function calls (I didn't mean Perl itself, which is just a language).
But what if you've got escaped quotes? What about mismatched quotes within your CSV fields? Or escaped commas? Yes, you can do it by using Perl's extended form of regex, but you reliably end up with something like the OP's pattern: it's hard to follow what it's actually doing and difficult to troubleshoot when it doesn't work. My opinion is that if your CSV is simply comma-delimited (and you don't have to worry about quotes), split /,/ is fine. But for anything more complicated, Text::CSV is the most appropriate tool.
2

You can use parse_line() from Text::ParseWords, if you are not bound to using a regex:

use Text::ParseWords;

my $str = 'a,b,"c,d",e';

my @arr = parse_line(',', 1, $str);

foreach (@arr)
{
    print "$_\n";
}

Output:

a
b
"c,d"
e
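
The second argument to parse_line() is the keep flag. A small sketch of how it changes the result, assuming the same input string as above (the variable names @kept and @stripped are only for illustration):

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $str = 'a,b,"c,d",e';

# keep = 1: the quotes are left in the returned fields.
my @kept = parse_line(',', 1, $str);

# keep = 0: the quotes are stripped from the returned fields.
my @stripped = parse_line(',', 0, $str);

print join('|', @kept), "\n";      # a|b|"c,d"|e
print join('|', @stripped), "\n";  # a|b|c,d|e
```

Passing 0 gives the same unquoted fields as the Text::CSV answer, while 1 matches the output the question asked for.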

3 Comments

Can you explain how Text::ParseWords works, instead of just saying use it?
@sln: Doesn't the linked Perl documentation for Text::ParseWords provide a good explanation?
Well, it might if I want to see nothing but links when I search for solutions. Eventually finding out it's not what I need.
0

Do matching instead of splitting.

use strict; use warnings;

my $str = 'a,b,"c,d",e';
my @matches = $str =~ /"[^"]*"|[^,]+/g;
foreach my $val (@matches) {
    print "$val\n"
}
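
One caveat worth checking before relying on this pattern: because [^,]+ needs at least one character, empty fields are skipped rather than returned. The sketch below uses a made-up input with empty fields and compares the match with the non-capturing split from the accepted answer; the limit of -1 is an addition here so that the trailing empty field is kept.

```perl
use strict;
use warnings;

# Hypothetical input with an empty second field and an empty last field.
my $str = 'a,,"c,d",';

# Matching approach: empty fields produce no match at all.
my @matches = $str =~ /"[^"]*"|[^,]+/g;

# Splitting approach: a limit of -1 keeps trailing empty fields.
my @fields = split /,(?=(?:[^"]*"[^"]*")*[^"]*$)/, $str, -1;

print scalar(@matches), "\n";  # 2
print scalar(@fields), "\n";   # 4
```

If the data can never contain empty fields, the matching version is fine as it stands.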
