
The following code works correctly, but far too slowly. I would greatly appreciate any help you can provide:

import gf
import csv

cic = gf.ct    # path to input file 1
cii = gf.cit   # path to input file 2
li = gf.lt     # path to input file 3
oc = "Output.csv"

with open(cic, "rb") as input1:
  reader = csv.DictReader(input1, gf.ctih)
  with open(oc, "wb") as outfile:
    writer = csv.DictWriter(outfile, gf.ctoh)
    writer.writerow(dict((h, h) for h in gf.ctoh))  # write the output header row
    next(reader)  # skip the header row of input file 1
    for ci in reader:
      row = {}
      row["ci"] = ci["id"]
      row["cyf"] = ci["yf"]
      # re-open and fully re-scan input files 2 and 3 for every row of input file 1
      with open(cii, "rb") as ciif:
        reader2 = csv.DictReader(ciif, gf.citih)
        next(reader2)
        with open(li, "rb") as lif:
          reader3 = csv.DictReader(lif, gf.lih)
          next(reader3)
          for ci2 in reader2:
            if ci["id"] == ci2["id"]:
              row["ci"] = ci2["ca"]  # note: overwrites the id stored above
          for lirow in reader3:
            if ci["id"] == lirow["en_id"]:
              row["cc"] = lirow["c"]
      writer.writerow(row)

The reason I open reader2 and reader3 for every row in reader is that reader objects iterate through once and then are done. But there has to be a much more efficient way of doing this, and I would greatly appreciate any help!

If it helps, the intuition behind this code is the following: from input file 1, grab two cells; if input file 2 has the same primary key as input file 1, grab a cell from input file 2 and save it with the two other saved cells; if input file 3 has the same primary key as input file 1, grab a cell from input file 3 and save it. Then output these four values. That is, I'm grabbing metadata from normalized tables and trying to denormalize it. There must be a way of doing this very efficiently in Python. One problem with the current code is that I iterate through the whole of each reader object just to find the relevant ID, when there must be a simpler way of looking up a given ID in a reader object...

  • Is there anything special about the data (like is it sorted)? Is the data small enough that you can hold it in memory? Commented Oct 2, 2013 at 20:27
  • Thanks Michael. The data is not sorted. I'm not positive, but I'm pretty sure I could hold it in memory. Commented Oct 2, 2013 at 20:31
  • It seems like if you read in all of the data and store it in a hash table (i.e., a dict), you should be able to get fast lookups by your index (see the sketch after these comments). Right now you're repeating work by reading in file 2 and file 3 on every loop. Commented Oct 2, 2013 at 20:35
  • Sounds good. I'll give it a try with a hash table. Yeah, reading in files two and three adds a lot of time. Thanks for your help! Commented Oct 2, 2013 at 20:38
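
A minimal sketch of the dict-based lookup discussed in the comments, keeping the question's Python 2-style file handling and the gf names as-is; it assumes input files 2 and 3 fit in memory and, like the original loops, lets the last matching row per id win:

import csv
import gf

# Build the lookup tables once, keyed by the join id.
with open(gf.cit, "rb") as ciif:
    reader2 = csv.DictReader(ciif, gf.citih)
    next(reader2)  # skip header
    ca_by_id = dict((r["id"], r["ca"]) for r in reader2)

with open(gf.lt, "rb") as lif:
    reader3 = csv.DictReader(lif, gf.lih)
    next(reader3)  # skip header
    c_by_id = dict((r["en_id"], r["c"]) for r in reader3)

# Single pass over input file 1, with O(1) lookups instead of nested rescans.
with open(gf.ct, "rb") as input1:
    reader = csv.DictReader(input1, gf.ctih)
    with open("Output.csv", "wb") as outfile:
        writer = csv.DictWriter(outfile, gf.ctoh)
        writer.writerow(dict((h, h) for h in gf.ctoh))
        next(reader)  # skip header
        for ci in reader:
            row = {"ci": ci["id"], "cyf": ci["yf"]}
            if ci["id"] in ca_by_id:
                row["ci"] = ca_by_id[ci["id"]]  # same overwrite as the original code
            if ci["id"] in c_by_id:
                row["cc"] = c_by_id[ci["id"]]
            writer.writerow(row)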

1 Answer


For one, if this really does live in a relational database, why not just do a big join with some carefully phrased selects?

If I were doing this, I would use pandas.DataFrame and merge the three tables together, then iterate over each row and use suitable logic to transform the resulting joined dataset into the single final result.
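
A minimal sketch of that approach, offered as an illustration rather than the poster's exact solution: it reuses the gf paths and header lists and the column names ("id", "yf", "ca", "c", "en_id") taken from the question's code, and assumes all three files fit in memory:

import pandas as pd
import gf

# Read each file once; names= reuses the header lists from the gf module,
# and header=0 tells pandas to discard each file's own header line.
t1 = pd.read_csv(gf.ct, names=gf.ctih, header=0)
t2 = pd.read_csv(gf.cit, names=gf.citih, header=0)
t3 = pd.read_csv(gf.lt, names=gf.lih, header=0)

# Left-join file 2 on "id" and file 3 on "en_id" against file 1's "id".
merged = t1.merge(t2[["id", "ca"]], on="id", how="left")
merged = merged.merge(t3[["en_id", "c"]], left_on="id", right_on="en_id", how="left")

# Keep the four values of interest and write them out.
merged[["id", "yf", "ca", "c"]].to_csv("Output.csv", index=False)

The merges replace the nested rescans with a single join per table; any extra per-row logic can then be applied to merged before writing it out.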


1 Comment

Glad I could help. I discovered pandas about 2 weeks ago, and am using it in two completely unrelated projects. I love how fast those merges are.
