4

Sorry for the vague question, I'm struggling to think how to better word it!

I have a CSV file that looks a little like this, only a lot bigger:

550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4

The values in the first column are a ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1,2,3,4. Can anyone point me towards how I can begin solving how to produce strings such as that for all the ID numbers? My ideal output would be a new csv file which looks something like:

550672,1;2;3;4
656372,1;2
766153,1;4

etc.

I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution, I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best as I can, although I think I've been encumbered by not really knowing what to really search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!

1
  • Look into using some csv modules. Text::CSV is a favorite of mine. Commented Sep 16, 2012 at 22:43

5 Answers 5

4

I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience

It is best to use Text::CSV whenever you are processing CSV data as all the debugging has already been done for you

use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new;

open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
  $csv->parse($line) or die "Invalid data line";
  my ($key, $val) = $csv->fields;
  push @{ $data{$key} }, $val
}

for my $id (sort keys %data) {
  printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}

output

550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you for editing my question for the proper formatting, I will make sure in future I follow that! I agree sometimes the best way to learn is by example, this is very kind of you to have taken the time to write this out for me, I look forward to deconstructing it to understand it! Edit: This is very neat, thank you for this.
3

Firstly props for seeking an approach not a solution. As you've probably already found with perl, There Is More Than One Way To Do It.

The approach I would take would be;

use strict;  # will save you big time in the long run

my %ids      # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
  split line into ID and property variable  # google the split function
  append new property to existing properties for this id in the hash table  # If it doesn't exist already, it will be created
}

foreach my $key (keys %ids) {
  deduplicate properties
  print/display/do whatever you need to do with the result
}

This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem. A more sophisticated approach would be to use a hashtable of hashtables to do the de duplication in the intial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.

Check out this question for a discussion on how to do the deduplication.

1 Comment

Gah! Again I wish I could give you an up arrow! This is very helpful, thank you very much. Great description, I will be going through this thoroughly- really enjoying learning this stuff!
2

Well, open the file as stdin in perl, assume each row is of two columns, then iterate over all lines using left column as hash identifier, and gathering right column into an array pointed by a hash key. At the end of input file you'll get a hash of arrays, so iterate over it, printing a hash key and assigned array elements separated by ";" or any other sign you wish.

and here you go

dtpwmbp:~ pwadas$ cat input.txt 
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl 
#!/opt/local/bin/perl

my %hash;
while (<>)
{
    chomp;
    my($key, $value) = split /,/;
    push @{$hash{$key}} , $value ;
}

foreach my $key (sort keys %hash)
{
     print $key . "," . join(";", @{$hash{$key}} ) . "\n" ;
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl 
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$

5 Comments

Thanks ever so much for the speedy reply, I have been reading about hash and suspected that might come into play. I'm going to get stuck into reading about these elements. If you don't mind, if I get stuck again, can I come back to you? Thanks again ever so much (I would like to click the up arrow to say your answer was useful but unfortunately it seems I require more rep!)
Note, that such approach as presented. joins duplicate keys somewhat automatically. One could use some Text::CSV module for that, however someone else could use a one-liner for it :)
hashes of arrays, etc, are not as straightforward as you would hope in perl, check out amon's answer from this post, and read the linked docs: stackoverflow.com/questions/12450851/…
I don't think I can type as fast as you can code.... one day... one day... thanks again, there are a lot of new terms in there for me to learn- I'm looking forward to going through it!
lol, actually saving and prepending a text to paste with four spaces each line to mark it as code when pasted in here took more time and clicks on the damn MBP than coding ;-) thx :)
2
perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'

2 Comments

Thank you for the answer, although I've got to admit that at my current state of perl knowledge, understanding one liners such as this is very challenging. I do find it amazing, however, that the solution to my problem can be compressed into something as succinct as this!
In fact, I think your answer is helpful as it demonstrates what is possible with such a short amount of code, thank you.
1

Another (not perl) way which incidentally is shorter and more elegant:

#!/opt/local/bin/gawk -f

BEGIN {FS=OFS=",";}

NF > 0 { IDs[$1]=IDs[$1] ";" $2; }

END { for (i in IDs) print i, substr(IDs[i], 2); }

The first line (after specifying the interpreter) sets the input FIELD SEPARATOR and the OUTPUT FIELD SEPARATOR to the comma. The second line checks of we have more than zero fields and if you do it makes the ID ($1) number the key and $2 the value. You do this for all lines.

The END statement will print these pairs out in an unspecified order. If you want to sort them you have to option of asorti gnu awk function or connecting the output of this snippet with a pipe to sort -t, -k1n,1n.

1 Comment

Thank you for the suggestion- although I'm not sure my brain is big enough for two programming languages at once!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.