I have a CSV file in which the first column contains an identifier and the second column contains associated data. Each identifier is repeated an arbitrary number of times, so the file looks like this:
data1,123
data1,345
data1,432
data2,654
data2,431
data3,947
data3,673

I would like to merge the records so that there is a single record for each identifier:
data1,123,345,432
data2,654,431
data3,947,673

Is there an efficient way to do this in Python or NumPy? Dictionaries appear to be out because of the duplicate keys. At the moment I have the lines in a list of lists, and I loop through it comparing the value at index 0 with that of the previous row, but this is very clumsy. Thanks for any help.

  • "Dictionaries appear to be out due to duplicate keys": I'm not sure I understand why this is a problem. Would a dictionary with lists for values not work? In your example it would be pretty easy to cook up something that ends up with {'data1': [123, 345, 432], 'data2': [654, 431], 'data3': [947, 673]}. Commented Jan 27, 2012 at 0:48

3 Answers

If all the instances of a given value in the first column are consecutive, this is a perfect use case for itertools.groupby. It would be used something like this:

from itertools import groupby
from csv import reader
from operator import itemgetter

with open(filename) as f:
    for k, g in groupby(reader(f), key=itemgetter(0)):
        # g yields whole rows such as ['data1', '123']; keep only the second column
        record = ','.join([k] + [row[1] for row in g])
        # do something with record, e.g. write to a file

(str.join takes a single iterable, so the key and the grouped values have to be combined into one list before joining.)
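
For anyone who wants to try the groupby approach without a file on disk, here is a minimal sketch. The SAMPLE string and the merge_consecutive helper are made up for illustration and stand in for the CSV file from the question.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical in-memory stand-in for the CSV file from the question.
SAMPLE = (
    "data1,123\ndata1,345\ndata1,432\n"
    "data2,654\ndata2,431\n"
    "data3,947\ndata3,673\n"
)


def merge_consecutive(lines):
    """Merge consecutive rows that share the same first column."""
    merged = []
    for key, rows in groupby(csv.reader(lines), key=itemgetter(0)):
        # Each row is a list such as ['data1', '123']; keep the second column.
        merged.append(','.join([key] + [row[1] for row in rows]))
    return merged


merged_records = merge_consecutive(io.StringIO(SAMPLE))
for record in merged_records:
    print(record)
```

groupby only groups adjacent rows, so if the identifiers are not already consecutive the rows would need to be sorted by the first column first, e.g. with sorted(rows, key=itemgetter(0)).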

2 Comments

Thanks very much for this. I don't know what the * does in this context, but this approach worked once I converted the g variable to a string when joining it. I got the one below working first and went with that.
The * is the argument unpacking operator, which converts a list into function arguments. So f(x, *y) is equivalent to f(x, y[0], y[1], ...).
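
A small sketch of what the * does at a call site. The describe function is hypothetical and exists only to show how the arguments arrive.

```python
def describe(first, *rest):
    """Return the positional arguments as one tuple, to show how they arrive."""
    return (first,) + rest


values = [1, 2, 3]

# With *, each element of the list becomes its own positional argument.
unpacked = describe('x', *values)

# Without *, the whole list is passed as a single argument.
not_unpacked = describe('x', values)

# str.join accepts exactly one iterable, so unpacking into it fails.
try:
    ','.join('a', *['b', 'c'])
    join_failed = False
except TypeError:
    join_failed = True

print(unpacked)
print(not_unpacked)
print(join_failed)
```

The last part shows why unpacking does not help with str.join: it needs one list (or other iterable) of strings rather than several separate arguments.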

You can use a dictionary if the values are lists. defaultdict in the collections module is very useful for this.
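
As a rough illustration of the "values are lists" idea with a plain dict, here is a sketch that uses dict.setdefault. The SAMPLE string is a made-up stand-in for the file in the question.

```python
import csv
import io

# Hypothetical in-memory stand-in for the CSV file from the question.
SAMPLE = (
    "data1,123\ndata1,345\ndata1,432\n"
    "data2,654\ndata2,431\n"
    "data3,947\ndata3,673\n"
)

records = {}
for key, value in csv.reader(io.StringIO(SAMPLE)):
    # setdefault returns the existing list for key, or stores and returns a new one.
    records.setdefault(key, []).append(value)

print(records)
```

With defaultdict(list) the setdefault call is unnecessary, because a missing key gets an empty list automatically.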

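This is how you can use a defaultdict to do what you need (Python 3 syntax):

import csv
from collections import defaultdict

records = defaultdict(list)
with open(filename) as f:
    # each row is an (identifier, value) pair
    for key, value in csv.reader(f):
        records[key].append(int(value))

for key in records:
    print(key, records[key])

The result (on Python 3.7+, where dicts preserve insertion order; older versions may print the keys in a different order):

data1 [123, 345, 432]
data2 [654, 431]
data3 [947, 673]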
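
To get from the grouped values to the one-line-per-identifier format asked for in the question, something along these lines might work. It is only a sketch: SAMPLE and the in-memory output buffer are assumptions standing in for the real input and output files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory stand-in for the CSV file from the question.
SAMPLE = (
    "data1,123\ndata1,345\ndata1,432\n"
    "data2,654\ndata2,431\n"
    "data3,947\ndata3,673\n"
)

records = defaultdict(list)
for key, value in csv.reader(io.StringIO(SAMPLE)):
    records[key].append(value)

# Write one merged row per identifier; a real script would open an output file here.
out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
for key, values in records.items():
    writer.writerow([key] + values)

merged_text = out.getvalue()
print(merged_text, end='')
```

The values are kept as strings here because they are written straight back out; convert them with int() only if they are needed as numbers.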

1 Comment

Great, thanks (and to MRAB). This did the job and the script is working.
