A Perl script to process a CSV file, aggregating properties spread over multiple records

Question

Sorry for the vague question, I'm struggling to think how to better word it!

I have a CSV file that looks a little like this, only a lot bigger:

The values in the first column are ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1,2,3,4. Can anyone point me towards how I can begin to produce strings like that for all the ID numbers? My ideal output would be a new CSV file which looks something like:

550672,1;2;3;4
656372,1;2
766153,1;4

etc.

I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution; I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best I can, although I think I've been hampered by not really knowing what to search for. I am able to read in and parse CSV files (I even got as far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!

Look into using some CSV modules. Text::CSV is a favorite of mine.
– squiguy, Commented Sep 16, 2012 at 22:43

Borodin · Accepted Answer · 2012-09-16 23:37:41Z

4

I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience.

It is best to use Text::CSV whenever you are processing CSV data, as all the debugging has already been done for you.

use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new;

open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
  $csv->parse($line) or die "Invalid data line";
  my ($key, $val) = $csv->fields;
  push @{ $data{$key} }, $val;
}

for my $id (sort keys %data) {
  printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}

output

550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
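
To illustrate why a CSV module is worth the dependency, here is a small sketch that is not part of the program above. It assumes a made-up line in which the property field is quoted and contains a comma. A plain split breaks such a line into three pieces, while Text::CSV, if it is installed, should return the two intended fields. The sample line and the binary option are my own choices for the illustration.

```perl
use strict;
use warnings;

# Hypothetical input line: the second field is quoted and contains a comma.
my $line = '550672,"3,4"';

# Naive approach: split on every comma, including the one inside the quotes.
my @naive = split /,/, $line;
print "split gave ", scalar(@naive), " pieces\n";

# Text::CSV is not a core module, so load it only if it is available.
if ( eval { require Text::CSV; 1 } ) {
    my $csv = Text::CSV->new( { binary => 1 } )
        or die "Cannot create Text::CSV object";
    $csv->parse($line) or die "Invalid data line";
    my @fields = $csv->fields;
    print "Text::CSV gave ", scalar(@fields), " fields\n";
}
```

For data as simple as the sample in the question either approach gives the same result; the difference only shows up once fields are quoted.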

answered Sep 16, 2012 at 23:37

Borodin


1 Comment

user1597452 Over a year ago

Thank you for editing my question for the proper formatting, I will make sure in future I follow that! I agree sometimes the best way to learn is by example, this is very kind of you to have taken the time to write this out for me, I look forward to deconstructing it to understand it! Edit: This is very neat, thank you for this.

Community · Accepted Answer · 2017-05-23 12:12:14Z

3

Firstly, props for seeking an approach, not a solution. As you've probably already found with Perl, There Is More Than One Way To Do It.

The approach I would take would be:

use strict;  # will save you big time in the long run

my %ids      # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
  split line into ID and property variable  # google the split function
  append new property to existing properties for this id in the hash table  # If it doesn't exist already, it will be created
}

foreach my $key (keys %ids) {
  deduplicate properties
  print/display/do whatever you need to do with the result
}

This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem. A more sophisticated approach would be to use a hash table of hash tables to do the deduplication in the initial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.
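
For anyone who wants to see the hash-of-hashes variant spelled out, here is a minimal sketch. It assumes simple "id,property" lines with no quoted fields, reads made-up sample rows from an in-memory string rather than a real file, and uses a hypothetical helper called aggregate. Storing each property as a key of an inner hash means duplicates collapse while the file is being read, so no separate deduplication pass is needed.

```perl
use strict;
use warnings;

# Made-up sample rows in the question's layout, with one deliberate duplicate.
my $input = <<'END';
550672,1
656372,1
550672,2
550672,2
656372,2
550672,3
END

# aggregate() is a hypothetical helper: it reads "id,property" lines from a
# filehandle and returns a reference to a hash of hashes, id => { property => 1 }.
sub aggregate {
    my ($fh) = @_;
    my %ids;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;              # skip blank lines
        my ( $id, $prop ) = split /,/, $line;  # assumes no quoted fields
        $ids{$id}{$prop} = 1;                  # inner hash is created on demand; repeats collapse
    }
    return \%ids;
}

# Open the string as if it were a file; swap in a real filename as needed.
open my $in, '<', \$input or die "Cannot open in-memory input: $!";
my $ids = aggregate($in);
close $in;

# Hash keys come back in no particular order, so sort both levels numerically.
for my $id ( sort { $a <=> $b } keys %$ids ) {
    my @props = sort { $a <=> $b } keys %{ $ids->{$id} };
    print "$id,", join( ';', @props ), "\n";
}
```

The trade-off is that a hash does not remember insertion order, which is why the properties are sorted before printing. If the original order matters, a hash of arrays plus a "seen" hash would be the alternative.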

Check out this question for a discussion on how to do the deduplication.

edited May 23, 2017 at 12:12

CommunityBot


answered Sep 16, 2012 at 22:04

TaninDirect


1 Comment

user1597452 Over a year ago

Gah! Again I wish I could give you an up arrow! This is very helpful, thank you very much. Great description, I will be going through this thoroughly- really enjoying learning this stuff!

Piotr Wadas · Accepted Answer · 2012-09-16 22:02:48Z

2

Well, open the file as stdin in Perl and assume each row has two columns. Then iterate over all the lines, using the left column as the hash key and gathering the right column's values into an array referenced by that key. At the end of the input file you'll have a hash of arrays, so iterate over it, printing each hash key followed by its array elements separated by ";" or any other character you wish.

and here you go

dtpwmbp:~ pwadas$ cat input.txt 
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl 
#!/opt/local/bin/perl

my %hash;
while (<>)
{
    chomp;
    my($key, $value) = split /,/;
    push @{$hash{$key}} , $value ;
}

foreach my $key (sort keys %hash)
{
     print $key . "," . join(";", @{$hash{$key}} ) . "\n" ;
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl 
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$

edited Sep 16, 2012 at 22:02

answered Sep 16, 2012 at 21:47

Piotr Wadas


5 Comments

user1597452 Over a year ago

Thanks ever so much for the speedy reply, I have been reading about hash and suspected that might come into play. I'm going to get stuck into reading about these elements. If you don't mind, if I get stuck again, can I come back to you? Thanks again ever so much (I would like to click the up arrow to say your answer was useful but unfortunately it seems I require more rep!)

Piotr Wadas Over a year ago

Note that the approach as presented joins duplicate keys more or less automatically. One could use the Text::CSV module for that; then again, someone else could use a one-liner for it :)

carillonator Over a year ago

hashes of arrays, etc, are not as straightforward as you would hope in perl, check out amon's answer from this post, and read the linked docs: stackoverflow.com/questions/12450851/…

user1597452 Over a year ago

I don't think I can type as fast as you can code.... one day... one day... thanks again, there are a lot of new terms in there for me to learn- I'm looking forward to going through it!

Piotr Wadas Over a year ago

lol, actually saving and prepending a text to paste with four spaces each line to mark it as code when pasted in here took more time and clicks on the damn MBP than coding ;-) thx :)

Vijay · Accepted Answer · 2012-09-17 09:56:50Z

2

perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'
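
For readers who find the one-liner hard to follow, here is a rough expansion into an ordinary script. It is a sketch rather than an exact equivalent: the -n switch wraps the code in a loop over the input lines, and -a with -F"," splits each line on commas into @F. The helper name join_props, the in-memory sample rows, and the use of .= in place of the explicit concatenation are my own choices for illustration.

```perl
use strict;
use warnings;

# join_props() is a hypothetical name for the body of the one-liner.
sub join_props {
    my ($fh) = @_;
    my %X;
    while ( my $line = <$fh> ) {           # -n: loop over every input line
        my @F = split /,/, $line;          # -a -F",": split the line into @F
        chomp $F[1];                       # drop the newline from the last field
        $X{ $F[0] } .= ";$F[1]";           # append ";property" to this id's string
    }
    my @out;
    for my $id ( keys %X ) {               # the if(eof) part: runs once, after the last line
        ( my $props = $X{$id} ) =~ s/;//;  # strip the leading ";"
        push @out, "$id,$props";
    }
    return @out;
}

# Made-up sample rows, read from a string instead of a file.
my $sample = "550672,1\n656372,1\n550672,2\n";
open my $in, '<', \$sample or die "Cannot open in-memory input: $!";
print "$_\n" for join_props($in);
close $in;
```

As with the one-liner, the IDs come out in whatever order keys returns them, so add a sort if the order matters.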

answered Sep 17, 2012 at 9:56

Vijay


2 Comments

user1597452 Over a year ago

Thank you for the answer, although I've got to admit that at my current state of perl knowledge, understanding one liners such as this is very challenging. I do find it amazing, however, that the solution to my problem can be compressed into something as succinct as this!

user1597452 Over a year ago

In fact, I think your answer is helpful as it demonstrates what is possible with such a short amount of code, thank you.

user1666959 · Accepted Answer · 2012-09-16 23:13:09Z

1

Another (not perl) way which incidentally is shorter and more elegant:

#!/opt/local/bin/gawk -f

BEGIN {FS=OFS=",";}

NF > 0 { IDs[$1]=IDs[$1] ";" $2; }

END { for (i in IDs) print i, substr(IDs[i], 2); }

The first line (after specifying the interpreter) sets the input field separator (FS) and the output field separator (OFS) to the comma. The second line checks whether the record has more than zero fields and, if it does, uses the ID ($1) as the key and appends $2 to the value stored under that key. This is done for every line.

The END block will print these pairs out in an unspecified order. If you want to sort them, you have the option of the GNU awk asorti function, or of piping the output of this snippet to sort -t, -k1n,1n.

answered Sep 16, 2012 at 23:13

user1666959


1 Comment

user1597452 Over a year ago

Thank you for the suggestion- although I'm not sure my brain is big enough for two programming languages at once!
