
The following code works correctly, but far too slowly. I would greatly appreciate any help you can provide:

import gf
import csv

cic = gf.ct    # path to input file 1
cii = gf.cit   # path to input file 2
li = gf.lt     # path to input file 3
oc = "Output.csv"

with open(cic, "rb") as input1:
  reader = csv.DictReader(input1, gf.ctih)
  with open(oc, "wb") as outfile:
    writer = csv.DictWriter(outfile, gf.ctoh)
    writer.writerow(dict((h, h) for h in gf.ctoh))  # write the output header row
    next(reader)  # skip the header row of input file 1
    for ci in reader:
      row = {}
      row["ci"] = ci["id"]
      row["cyf"] = ci["yf"]
      # re-open and fully re-scan input files 2 and 3 for every row of input file 1
      with open(cii, "rb") as ciif:
        reader2 = csv.DictReader(ciif, gf.citih)
        next(reader2)
        with open(li, "rb") as lif:
          reader3 = csv.DictReader(lif, gf.lih)
          next(reader3)
          for ci2 in reader2:
            if ci["id"] == ci2["id"]:
              row["ci"] = ci2["ca"]  # note: overwrites the id stored above
          for lirow in reader3:
            if ci["id"] == lirow["en_id"]:
              row["cc"] = lirow["c"]
      writer.writerow(row)

The reason I open reader2 and reader3 for every row in reader is that reader objects iterate through once and then are done. But there has to be a much more efficient way of doing this, and I would greatly appreciate any help!

If it helps, the intuition behind this code is the following: from input file 1, grab two cells; if input file 2 has the same primary key as input file 1, grab a cell from input file 2 and save it with the two other saved cells; if input file 3 has the same primary key as input file 1, grab a cell from input file 3 and save it. Then output these four values. That is, I'm grabbing metadata from normalized tables and trying to denormalize it. There must be a way of doing this very efficiently in Python. One problem with the current code is that I iterate through the whole of each reader object just to find the relevant ID, when there must be a simpler way of looking up a given ID in a reader object...

  • Is there anything special about the data (like is it sorted)? Is the data small enough that you can hold it in memory? Commented Oct 2, 2013 at 20:27
  • Thanks Michael. The data is not sorted. I'm not positive, but I'm pretty sure I could hold it in memory. Commented Oct 2, 2013 at 20:31
  • It seems like if you read in all of the data and store it in a hash table (i.e., a dict), you should be able to get fast lookups by your index (see the sketch after these comments). Right now you're repeating work by reading in file 2 and file 3 on every loop. Commented Oct 2, 2013 at 20:35
  • Sounds good. I'll give it a try with a hash table. Yeah, reading in files two and three adds a lot of time. Thanks for your help! Commented Oct 2, 2013 at 20:38
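
A minimal sketch of the dict-based lookup discussed in the comments, keeping the question's Python 2-style file handling and the gf names as-is; it assumes input files 2 and 3 fit in memory and, like the original loops, lets the last matching row per id win:

import csv
import gf

# Build the lookup tables once, keyed by the join id.
with open(gf.cit, "rb") as ciif:
    reader2 = csv.DictReader(ciif, gf.citih)
    next(reader2)  # skip header
    ca_by_id = dict((r["id"], r["ca"]) for r in reader2)

with open(gf.lt, "rb") as lif:
    reader3 = csv.DictReader(lif, gf.lih)
    next(reader3)  # skip header
    c_by_id = dict((r["en_id"], r["c"]) for r in reader3)

# Single pass over input file 1, with O(1) lookups instead of nested rescans.
with open(gf.ct, "rb") as input1:
    reader = csv.DictReader(input1, gf.ctih)
    with open("Output.csv", "wb") as outfile:
        writer = csv.DictWriter(outfile, gf.ctoh)
        writer.writerow(dict((h, h) for h in gf.ctoh))
        next(reader)  # skip header
        for ci in reader:
            row = {"ci": ci["id"], "cyf": ci["yf"]}
            if ci["id"] in ca_by_id:
                row["ci"] = ca_by_id[ci["id"]]  # same overwrite as the original code
            if ci["id"] in c_by_id:
                row["cc"] = c_by_id[ci["id"]]
            writer.writerow(row)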

1 Answer


For one, if this really does live in a relational database, why not just do a big join with some carefully phrased selects?

If I were doing this, I would use pandas.DataFrame and merge the three tables together, then iterate over each row and use suitable logic to transform the resulting joined dataset into the single final result.
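
A minimal sketch of that approach, offered as an illustration rather than the poster's exact solution: it reuses the gf paths and header lists and the column names ("id", "yf", "ca", "c", "en_id") taken from the question's code, and assumes all three files fit in memory:

import pandas as pd
import gf

# Read each file once; names= reuses the header lists from the gf module,
# and header=0 tells pandas to discard each file's own header line.
t1 = pd.read_csv(gf.ct, names=gf.ctih, header=0)
t2 = pd.read_csv(gf.cit, names=gf.citih, header=0)
t3 = pd.read_csv(gf.lt, names=gf.lih, header=0)

# Left-join file 2 on "id" and file 3 on "en_id" against file 1's "id".
merged = t1.merge(t2[["id", "ca"]], on="id", how="left")
merged = merged.merge(t3[["en_id", "c"]], left_on="id", right_on="en_id", how="left")

# Keep the four values of interest and write them out.
merged[["id", "yf", "ca", "c"]].to_csv("Output.csv", index=False)

The merges replace the nested rescans with a single join per table; any extra per-row logic can then be applied to merged before writing it out.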


1 Comment

Glad I could help. I discovered pandas about 2 weeks ago, and am using it in two completely unrelated projects. I love how fast those merges are.
