Merging records in Python or NumPy

Question

I have a CSV file in which the first column contains an identifier and the second column contains the associated data. Each identifier is repeated an arbitrary number of times, so the file looks like this:
data1,123
data1,345
data1,432
data2,654
data2,431
data3,947
data3,673

I would like to merge the records so that there is a single record for each identifier, like this:
data1,123,345,432
data2,654,431
data3,947,673

Is there an efficient way to do this in Python or NumPy? Dictionaries appear to be out due to duplicate keys. At the moment I have the lines in a list of lists and loop through them, comparing the value at index 0 with that of the previous line, but this is very clumsy. Thanks for any help.

"Dictionaries appear to be out due to duplicate keys": I'm not sure I understand why this is a problem. Would a dictionary with lists for values not work? In your example it would be pretty easy to cook up something that ends up with {'data1': [123, 345, 432], 'data2': [654, 431], 'data3': [947, 673]}.
– Free Monica Cellio, commented Jan 27, 2012 at 0:48

David Z · Accepted Answer · 2012-01-27 00:17:43Z

3

If all the instances of a given value in the first column are consecutive, this is a perfect use case for itertools.groupby. It would be used something like this:

from itertools import groupby
from csv import reader
from operator import itemgetter

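with open(filename, newline='') as f:
    for k, g in groupby(reader(f), key=itemgetter(0)):
        # g yields the full rows for this key; keep only the value column
        record = ','.join([k, *(row[1] for row in g)])
        # do something with record, e.g. write to a file

(Note that str.join takes a single iterable, and each item in g is a whole row rather than a string, so the values have to be pulled out of the rows before joining.)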
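A minimal, self-contained sketch of the same idea, using an in-memory string in place of the real file. The sample data and the `merge_consecutive` name are made up for illustration. Because `groupby` only groups adjacent rows, the rows would need to be sorted by the first column beforehand if the identifiers were not already consecutive.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def merge_consecutive(rows):
    """Yield one merged list per run of rows sharing the same first column."""
    for key, group in groupby(rows, key=itemgetter(0)):
        # Each item in `group` is a full row; keep only the value column.
        yield [key] + [row[1] for row in group]


# Stand-in for the real CSV file (assumed: two columns, identifiers consecutive).
sample = io.StringIO(
    "data1,123\ndata1,345\ndata1,432\n"
    "data2,654\ndata2,431\n"
    "data3,947\ndata3,673\n"
)

merged = list(merge_consecutive(csv.reader(sample)))
for record in merged:
    print(",".join(record))
```

For unsorted input, passing `sorted(rows, key=itemgetter(0))` to the helper should give the same grouping, at the cost of loading all rows into memory.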


2 Comments

Andy Ellington Over a year ago

Thanks very much for this. I don't know what the * does in this context, but this approach worked if I converted the g variable to a string when joining it. I got the one below working first and went with that.

David Z Over a year ago

The * is the argument unpacking operator, which converts a list into function arguments. So f(x, *y) is equivalent to f(x, y[0], y[1], ...).
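As a small illustration of that equivalence, here is a sketch with a throwaway function `f` that simply returns whatever positional arguments it receives (the names and values are only for demonstration):

```python
def f(*args):
    """Return the positional arguments exactly as received."""
    return args


x = "data1"
y = ["123", "345", "432"]

unpacked = f(x, *y)                    # y is spread into separate arguments
spelled_out = f(x, y[0], y[1], y[2])   # the same call written out by hand
not_unpacked = f(x, y)                 # without *, y arrives as one list argument
```

Here `unpacked` and `spelled_out` are the same four-element tuple, while `not_unpacked` has only two elements, the second being the list itself.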

MRAB · Accepted Answer · 2012-01-27 01:07:15Z

3

You can use a dictionary if the values are lists. defaultdict in the collections module is very useful for this.
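A rough sketch of the difference, using a hypothetical list of (identifier, value) pairs in place of the parsed file. With a plain dict, the list has to be created the first time a key is seen; `defaultdict(list)` does that step automatically. Neither version needs the identifiers to be consecutive.

```python
from collections import defaultdict

# Hypothetical (identifier, value) pairs standing in for the parsed CSV rows.
rows = [("data1", "123"), ("data1", "345"), ("data2", "654"), ("data1", "432")]

# Plain dict: setdefault creates the list the first time a key is seen.
plain = {}
for key, value in rows:
    plain.setdefault(key, []).append(value)

# defaultdict(list): a missing key gets an empty list automatically.
grouped = defaultdict(list)
for key, value in rows:
    grouped[key].append(value)
```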


Bi Rico · Accepted Answer · 2012-01-27 03:43:09Z

1

This is how you can use a defaultdict to do what you need:

import csv
from collections import defaultdict

records = defaultdict(list)
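with open(filename, newline='') as f:
    for key, value in csv.reader(f):
        records[key].append(int(value))

for key in records:
    print(key, records[key])

The result (from a Python version with arbitrary dict ordering; Python 3.7+ prints the keys in insertion order):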

data1 [123, 345, 432]
data3 [947, 673]
data2 [654, 431]
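To get from the dictionary back to the merged CSV layout asked for in the question, something along these lines should work. This is only a sketch: `io.StringIO` objects stand in for the real input and output files, and the values are kept as strings since they are written straight back out.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the input file; real code would read from open(filename).
source = io.StringIO(
    "data1,123\ndata1,345\ndata1,432\n"
    "data2,654\ndata2,431\n"
    "data3,947\ndata3,673\n"
)

records = defaultdict(list)
for key, value in csv.reader(source):
    records[key].append(value)  # keep values as strings for writing back out

# Stand-in for the output file; csv.writer accepts any object with write().
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for key, values in records.items():
    writer.writerow([key] + values)

merged_text = out.getvalue()
print(merged_text, end="")
```

On Python 3.7 or later the records come out in the order the identifiers first appear in the input.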


1 Comment

Andy Ellington Over a year ago

Great, thanks (and to MRAB). This did the job and script is working.
