simple python sort over csv

Question

I have two CSV files: one is 98 MB and the other is 152 KB. The smaller file is a random subset of the bigger one, and I want to write a third file containing the rows of the big CSV that correspond to the lines in the smaller CSV file.

Big file (excerpt):

ZINC_ID MWT LogP    Desolv_apolar   Desolv_polar    HBD HBA tPSA    Charge  NRB SMILES
ZINC00000017    281.337 1.33    3.07    -19.2   2   6   87  0   4   CCC[S@](=O)c1ccc2c(c1)[nH]/c(=N/C(=O)OC)/[nH]2
ZINC00000036    151.141 0.37    3.51    -45.3   1   3   60  -1  2   c1ccc(cc1)[C@@H](C(=O)[O-])O
ZINC00000048    222.24  2.42    3.78    -8.68   0   4   37  0   4   COc1cc(c(c2c1OCO2)OC)CC=C
ZINC00000053    179.151 1.43    6.59    -56.84  0   4   66  -1  3   CC(=O)Oc1ccccc1C(=O)[O-]

Small File (excerpt):

SMILES
CCOc1ccc(cc1)NC(=O)C[C@@H](C)O
C[C@@H](c1ccc2c(c1)nc(o2)c3ccc(cc3)Cl)C(=O)[O-]
CC(=O)Oc1ccccc1C(=O)[O-]
COc1cc(c(c2c1OCO2)OC)CC=C

Here is my code:

import csv

writer = csv.writer(open('/Users/Eric/Desktop/newZincSubset.csv','wb'))
count = 0
with open('/Users/Eric/Desktop/test700.csv','rU') as i:
    with open('/Users/Eric/Desktop/initial_data.csv','rU') as j:
        subject = csv.reader(i)
        reference = csv.reader(j)
        for row in subject:
            smiles = row[0]
            for reference_row in reference:
                suspect = reference_row[10]
                if (smiles == suspect):
                    writer.writerow(reference_row)

It seems to write the header (ZINC_ID MWT LogP) just fine, but it does not find any of the other lines. Is it a memory issue or is something wrong with my code?

Thanks!

ZINC_ID MWT LogP Desolv_apolar Desolv_polar HBD HBA tPSA Charge NRB SMILES is the header. – ejang, Commented Jan 3, 2012 at 1:19

thesamet · Accepted Answer · 2012-01-03 01:22:06Z

2

A CSV reader can be iterated only once. After the first pass of the inner loop finishes, the underlying file object is at the end of the file, so when you try to iterate over the reference reader a second time there is nothing left to read.
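The exhaustion described above can be reproduced with a short sketch. It assumes Python 3 (the code in the question is Python 2) and uses an in-memory `io.StringIO` as a stand-in for the big file; the two-column data is a made-up excerpt, and the `seek(0)` rewind is shown only for illustration, it is not part of this answer.

```python
import csv
import io

# In-memory stand-in for the big file (tab-separated, shortened to two columns).
data = "ZINC_ID\tSMILES\nZINC00000048\tCOc1cc(c(c2c1OCO2)OC)CC=C\n"
f = io.StringIO(data)
reader = csv.reader(f, delimiter="\t")

first_pass = list(reader)   # consumes every row, leaving the file at EOF
second_pass = list(reader)  # same reader again: nothing left to yield

print(len(first_pass), len(second_pass))

# Rewinding the underlying file and creating a fresh reader allows another pass.
f.seek(0)
third_pass = list(csv.reader(f, delimiter="\t"))
```

Rewinding inside the outer loop would make the nested loops work, but it would re-scan the whole big file once per row of the small file.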

I'd recommend that you first read the small file into a dictionary, and then iterate over the larger file, searching for matches against the data in memory. You can also key the dictionary by the value you will end up looking up (reference_row[10], I think), so there is no need for nested loops.

edited Jan 3, 2012 at 1:22

answered Jan 3, 2012 at 1:11

thesamet


5 Comments

Ricardo Cárdenes Over a year ago

He could just read the sample (quite small) once, store it in memory, and then iterate over it for each line in the big file (avoiding grabbing a big chunk of RAM).

David Z Over a year ago

+1 for identifying the problem, though I would actually recommend only reading the small file into memory and building a set out of its lines. Then you can still iterate over the large file and process it without having to load the whole thing into memory.

thesamet Over a year ago

True, for both previous commenters. I've updated the last paragraph. I'd actually suggest building a dict rather than a set, since he could use the key for efficient lookup of the property he joins on.

Ricardo Cárdenes Over a year ago

@thesamet: a set also allows efficient lookup :), and there's no need to add an empty value per key!

ejang Over a year ago

Thanks, I loaded the smaller file into memory and then used 'for iterable in bigfile:' to loop over the big file, comparing each line against the smaller file held in memory.
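On the set-versus-dict point raised in these comments, a minimal sketch (Python 3, with an in-memory stand-in for the small file; the variable names are illustrative, not from the thread) suggests both give the same membership test:

```python
import csv
import io

# Stand-in for the small file: a header line plus two SMILES strings.
small = "SMILES\nCC(=O)Oc1ccccc1C(=O)[O-]\nCOc1cc(c(c2c1OCO2)OC)CC=C\n"

# Set: stores only the join key.
wanted_set = {row["SMILES"] for row in csv.DictReader(io.StringIO(small))}

# Dict: same keys, with the whole row as the value. This only pays off
# if the small file carried extra columns worth keeping.
wanted_dict = {row["SMILES"]: row for row in csv.DictReader(io.StringIO(small))}

query = "COc1cc(c(c2c1OCO2)OC)CC=C"
print(query in wanted_set, query in wanted_dict)
```

Both containers are hash-based, so the `in` test costs about the same either way; since the small file here has a single column, the set is the leaner choice.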

Ricardo Cárdenes · Accepted Answer · 2012-01-03 01:34:21Z

2

An implementation (using DictReader and DictWriter, to make use of the header):

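import csv

# Read the small file once and keep its SMILES strings in a set for fast lookup.
with open('sample.csv','rU') as i:
    smiles = set(x['SMILES'] for x in csv.DictReader(i))

# Stream the big file one row at a time; only matching rows are written out.
# ('rU' and 'wb' are Python 2 file modes; on Python 3 use 'r' and 'w' with newline=''.)
with open('init.csv','rU') as j, open('newZincSubset.csv','wb') as out:
    reference = csv.DictReader(j, delimiter = '\t')
    fields = reference.fieldnames
    writer = csv.DictWriter(out, fields, delimiter = '\t')
    writer.writerow(dict((x,x) for x in fields))  # header row
    for reference_row in reference:
        if reference_row['SMILES'] in smiles:
            writer.writerow(reference_row)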
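For readers on Python 3, the same approach could be sketched roughly as follows. The function name `write_matching_rows`, its `key` parameter, and the in-memory demo data are assumptions made for illustration; it takes open file objects so it can be tried without touching the disk.

```python
import csv
import io

def write_matching_rows(small_file, big_file, out_file, key="SMILES"):
    """Copy rows of big_file whose `key` column appears in small_file.

    Hypothetical helper: assumes a comma-separated small file and a
    tab-separated big file, both with a header row containing `key`.
    Returns the number of rows written (header excluded).
    """
    wanted = {row[key] for row in csv.DictReader(small_file)}
    reference = csv.DictReader(big_file, delimiter="\t")
    writer = csv.DictWriter(out_file, reference.fieldnames, delimiter="\t")
    writer.writeheader()
    matched = 0
    for row in reference:  # single pass over the big file
        if row[key] in wanted:
            writer.writerow(row)
            matched += 1
    return matched

# Tiny in-memory demo with a shortened header.
small = io.StringIO("SMILES\nCC(=O)Oc1ccccc1C(=O)[O-]\nCCOc1ccc(cc1)NC(=O)C[C@@H](C)O\n")
big = io.StringIO(
    "ZINC_ID\tMWT\tSMILES\n"
    "ZINC00000048\t222.24\tCOc1cc(c(c2c1OCO2)OC)CC=C\n"
    "ZINC00000053\t179.151\tCC(=O)Oc1ccccc1C(=O)[O-]\n"
)
out = io.StringIO()
count = write_matching_rows(small, big, out)
```

With real files, each one would be opened with `open(path, newline='')`, using mode `'w'` for the output, as the `csv` module documentation recommends.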

edited Jan 3, 2012 at 1:34

answered Jan 3, 2012 at 1:28

Ricardo Cárdenes


1 Comment

Ricardo Cárdenes Over a year ago

Note: just edited the code to use tabs as delimiters (which seems to be what the examples use).
