
I have two CSV files: one is 98 MB and the other is 152 KB. The smaller file is a random subset of the bigger one, and I want to write a third file containing the rows of the big CSV that correspond to the lines in the smaller CSV file.

Big file (excerpt):

ZINC_ID MWT LogP    Desolv_apolar   Desolv_polar    HBD HBA tPSA    Charge  NRB SMILES
ZINC00000017    281.337 1.33    3.07    -19.2   2   6   87  0   4   CCC[S@](=O)c1ccc2c(c1)[nH]/c(=N/C(=O)OC)/[nH]2
ZINC00000036    151.141 0.37    3.51    -45.3   1   3   60  -1  2   c1ccc(cc1)[C@@H](C(=O)[O-])O
ZINC00000048    222.24  2.42    3.78    -8.68   0   4   37  0   4   COc1cc(c(c2c1OCO2)OC)CC=C
ZINC00000053    179.151 1.43    6.59    -56.84  0   4   66  -1  3   CC(=O)Oc1ccccc1C(=O)[O-]

Small File (excerpt):

SMILES
CCOc1ccc(cc1)NC(=O)C[C@@H](C)O
C[C@@H](c1ccc2c(c1)nc(o2)c3ccc(cc3)Cl)C(=O)[O-]
CC(=O)Oc1ccccc1C(=O)[O-]
COc1cc(c(c2c1OCO2)OC)CC=C

Here is my code:

import csv

writer = csv.writer(open('/Users/Eric/Desktop/newZincSubset.csv','wb'))
count = 0
with open('/Users/Eric/Desktop/test700.csv','rU') as i:
    with open('/Users/Eric/Desktop/initial_data.csv','rU') as j:
        subject = csv.reader(i)
        reference = csv.reader(j)
        for row in subject:
            smiles = row[0]
            for reference_row in reference:
                suspect = reference_row[10]
                if (smiles == suspect):
                    writer.writerow(reference_row)

It seems to write the header (ZINC_ID MWT LogP) just fine, but then it stops searching instead of checking every line. Is it a memory issue, or is something wrong with my code?

Thanks!

  • Is the header ZINC_ID MWT LogP or the full line? Commented Jan 3, 2012 at 1:13
  • ZINC_ID MWT LogP Desolv_apolar Desolv_polar HBD HBA tPSA Charge NRB SMILES is the header. Commented Jan 3, 2012 at 1:19

2 Answers

A CSV reader can be iterated only once. After the first run of the inner loop, the underlying file object has reached the end of the file, so when you try to iterate over the reference reader a second time there is nothing left to read. That is why only the first line of the small file (the header) ever finds a match.
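The behaviour can be reproduced in a few lines. This is a minimal sketch written for Python 3; the two-row `data` string is a made-up stand-in for the big file, and `io.StringIO` stands in for the open file object.

```python
import csv
import io

# Hypothetical two-row stand-in for the big file (tab-delimited).
data = "ZINC_ID\tSMILES\nZINC00000053\tCC(=O)Oc1ccccc1C(=O)[O-]\n"

reference = csv.reader(io.StringIO(data), delimiter="\t")

first_pass = [row for row in reference]   # consumes the underlying stream
second_pass = [row for row in reference]  # the stream is already at its end

print(len(first_pass))   # 2
print(len(second_pass))  # 0
```

In the question's code the inner `for reference_row in reference:` loop plays the role of `first_pass`: it runs to completion for the first row of the small file, and every later run is an empty `second_pass`.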

I'd recommend that you first read the small file into a dictionary, and then iterate over the larger file, searching for matches against the data in memory. You can also key the elements in the dictionary by the value you will end up looking for (reference_row[10], I think), so there will be no need for nested loops.
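A rough sketch of this single-pass approach, written for Python 3. It makes several assumptions that are not confirmed by the question: both files are tab-delimited, the SMILES string sits in column 10 of the big file, and a set is used in place of the dictionary (the keys are all that is needed for the lookup). `select_rows` and `SMILES_COLUMN` are hypothetical names, and the sample data is built from the excerpts in the question.

```python
import csv
import io

SMILES_COLUMN = 10  # assumed position of SMILES in the big file


def select_rows(small_file, big_file, out_file, delimiter="\t"):
    """Copy rows of big_file whose SMILES appears in small_file to out_file."""
    small = csv.reader(small_file, delimiter=delimiter)
    next(small, None)  # skip the "SMILES" header line
    wanted = {row[0] for row in small if row}  # small file held in memory

    writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
    big = csv.reader(big_file, delimiter=delimiter)
    header = next(big, None)
    if header is not None:
        writer.writerow(header)

    written = 0
    for row in big:  # single pass over the big file, one row at a time
        if len(row) > SMILES_COLUMN and row[SMILES_COLUMN] in wanted:
            writer.writerow(row)
            written += 1
    return written


# Sample data based on the excerpts in the question.
small_text = "SMILES\nCC(=O)Oc1ccccc1C(=O)[O-]\nCOc1cc(c(c2c1OCO2)OC)CC=C\n"
big_text = "\n".join([
    "ZINC_ID\tMWT\tLogP\tDesolv_apolar\tDesolv_polar\tHBD\tHBA\ttPSA\tCharge\tNRB\tSMILES",
    "ZINC00000036\t151.141\t0.37\t3.51\t-45.3\t1\t3\t60\t-1\t2\tc1ccc(cc1)[C@@H](C(=O)[O-])O",
    "ZINC00000048\t222.24\t2.42\t3.78\t-8.68\t0\t4\t37\t0\t4\tCOc1cc(c(c2c1OCO2)OC)CC=C",
    "ZINC00000053\t179.151\t1.43\t6.59\t-56.84\t0\t4\t66\t-1\t3\tCC(=O)Oc1ccccc1C(=O)[O-]",
]) + "\n"

out = io.StringIO()
count = select_rows(io.StringIO(small_text), io.StringIO(big_text), out)
print(count)  # 2
```

With real files, the three arguments would be file objects opened in text mode with `newline=''`, as the `csv` module documentation recommends on Python 3. Only the small file is held in memory; the big file is streamed.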


5 Comments

He could just read the sample (quite small) once, store it in memory, and then check it for each line in the big file (avoiding grabbing a big chunk of RAM).
+1 for identifying the problem, though I would actually recommend only reading the small file into memory and building a set out of its lines. Then you can still iterate over the large file and process it without having to load the whole thing into memory.
True, for both previous commenters. I've updated the last paragraph. I'd actually suggest building a dict rather than a set, since he could use the key for efficient lookup of the property he joins on.
@thesamet: a set also allows efficient lookup :), and there's no need to add an empty value per key!
Thanks, I loaded the smaller file into memory and then used 'for iterable in bigfile:' to loop over each line of the big file, checking it against the smaller file held in memory.

An implementation (using DictReader and DictWriter, to make use of the header):

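import csv

# File modes 'rU' and 'wb' are for Python 2, as in the question.
# (On Python 3, open the files in text mode with newline='' instead.)

# Read the small file once and keep its SMILES strings in a set.
with open('sample.csv','rU') as i:
    smiles = set(x['SMILES'] for x in csv.DictReader(i))

# Stream the big file row by row and write out only the matching rows.
with open('init.csv','rU') as j, open('newZincSubset.csv','wb') as out:
    reference = csv.DictReader(j, delimiter = '\t')
    fields = reference.fieldnames
    writer = csv.DictWriter(out, fields, delimiter = '\t')
    writer.writeheader()  # requires Python 2.7 or later
    for reference_row in reference:
        if reference_row['SMILES'] in smiles:
            writer.writerow(reference_row)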

1 Comment

Note: I just edited the code to use tabs as delimiters (which seems to be what the examples use).
