
I've got a big table in a CSV file, which has 5 million rows and 4 columns. My objective is to take each row from the first 500k and compare it with all the following rows (i.e. up to 5M − n of them) based on a certain condition. The condition is something like

row(n).column1 == row(m).column1 AND row(n).column2 == row(m).column2 AND row(n).column3 == row(m).column3

OR

row(n).column1 == row(m).column1 AND row(n).column2 == row(m).column2 AND row(m).column4.split()[0] in row(n).column4

where m > n, i.e. row(m) is any later row.

Currently I'm using a simple loop over lists:

for idx, i in enumerate(big[:500000]):
    for jdx, j in enumerate(big):
        # The inline conditional needed parentheses: Python parses
        # "x if cond else '' in i[3]" as "x if cond else ('' in i[3])",
        # so extract the first word of column 4 up front instead.
        first_word = j[3].split()[0] if j[3].split() else ''
        if (jdx > idx and i[0] == j[0] and i[1] == j[1] and i[2] == j[2]) \
                or (i[0] == j[0] and i[1] == j[1] and first_word in i[3]):
            matches.append([idx, jdx])

Which obviously takes a very long time to complete (about a week using a single process). Pandas and NumPy are good for operations on the whole array at a time, but I don't know whether this task can be converted into them somehow.

So the question is: how can I speed up the process?
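One way to recast the problem (not from the original post): both branches of the condition require column1 and column2 to be equal, so rows can be bucketed by that pair first, and only rows within the same bucket ever need comparing. A minimal sketch, assuming only pairs with jdx > idx are wanted, as the prose suggests; `find_matches_bucketed` and the `limit` parameter are my own names:

```python
from collections import defaultdict

def find_matches_bucketed(big, limit=500000):
    # Bucket row indices by (column1, column2): both branches of the
    # condition require these columns to be equal, so only rows in the
    # same bucket can ever match.
    buckets = defaultdict(list)
    for idx, row in enumerate(big):
        buckets[(row[0], row[1])].append(idx)

    matches = []
    for indices in buckets.values():        # indices are in ascending order
        for pos, idx in enumerate(indices):
            if idx >= limit:                # only the first `limit` rows act as row(n)
                break
            i = big[idx]
            for jdx in indices[pos + 1:]:   # guarantees jdx > idx
                j = big[jdx]
                parts = j[3].split()
                if i[2] == j[2] or (parts and parts[0] in i[3]):
                    matches.append([idx, jdx])
    return matches
```

With k distinct (column1, column2) keys, this replaces the full O(n²) scan with work proportional to the sum of squared bucket sizes, which can be dramatically smaller when the keys are well distributed.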

  • You are executing something like 1.2 trillion if-statements. Since you have the whole data set in primary memory, I/O shouldn't be a problem. If it takes a week, then you execute approximately 2 million if-statements per second, which sounds like a pretty good speed. I wonder how much faster it would be, even if the whole loop were written in C? Commented Oct 6, 2014 at 11:46
  • It probably won't change execution time, but you could refactor the terms in the logic statement: if i[0]==j[0] and i[1]==j[1] and ((jdx>idx and i[2]==j[2]) or ((j[3].split()[0] if j[3].split() else '') in i[3])) Commented Oct 6, 2014 at 12:05
  • Avoid this condition by generating indexes in the correct range: jdx > idx. That will save a bit of wasted time. Commented Oct 6, 2014 at 12:37
  • shift allows you to compare whole arrays against a shifted row. What you are doing could be done very quickly using numpy or pandas: for example, you could take a slice of your data (imported into numpy or pandas) and then filter it using your criteria, as all you are doing is looking for matches between 3 consecutive rows. Commented Oct 6, 2014 at 13:28
  • Good spotting by arhuaco. If you do for idx,i in enumerate(big[:500000]): for jdx,j in enumerate(big[idx:]): you will cut the number of iterations. Commented Oct 6, 2014 at 13:57
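A hedged sketch of the pandas idea from the comments above: a self-merge on the first two columns yields only the candidate pairs that already share column1 and column2, and the remaining checks then run over that much smaller set. The column names c1..c4 and the toy data are assumptions, not from the original post:

```python
import pandas as pd

df = pd.DataFrame(
    [['a', 'b', 'x', 'hello world'],
     ['a', 'b', 'x', 'foo'],
     ['a', 'b', 'y', 'hello there'],
     ['c', 'd', 'z', 'bar']],
    columns=['c1', 'c2', 'c3', 'c4'],
)
df = df.reset_index()  # keep the original row numbers as a column

# Self-merge: every pair of rows sharing (c1, c2) becomes one merged row;
# non-key columns get the _i / _j suffixes.
pairs = df.merge(df, on=['c1', 'c2'], suffixes=('_i', '_j'))
pairs = pairs[pairs['index_j'] > pairs['index_i']]  # only "following" rows

def first_word(s):
    parts = s.split()
    return parts[0] if parts else ''

# Apply the remaining condition row-wise on the reduced candidate set.
cond = (pairs['c3_i'] == pairs['c3_j']) | pairs.apply(
    lambda r: first_word(r['c4_j']) in r['c4_i'], axis=1)
matches = pairs.loc[cond, ['index_i', 'index_j']].values.tolist()
```

The merge does the heavy lifting in vectorized C code; only the substring check still runs row by row, and only over pairs that already agree on the first two columns.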

1 Answer


I ended up using the following things to improve performance:

  1. Separated the loop into a function (about 20% gain).
  2. Improved the logical operations.
  3. Used the PyPy interpreter (300+% speed-up).
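As a reconstruction (not the answer author's actual code), the "improved logical operations" point presumably amounts to testing the shared equality checks first, so most pairs are rejected after two cheap comparisons, and generating jdx only in the valid range, as the comments suggested:

```python
def find_matches_refactored(big, limit=500000):
    matches = []
    for idx, i in enumerate(big[:limit]):
        for jdx in range(idx + 1, len(big)):  # only following rows
            j = big[jdx]
            # Cheap shared checks first: both branches need these equalities,
            # so short-circuiting here skips the split() on most pairs.
            if i[0] != j[0] or i[1] != j[1]:
                continue
            parts = j[3].split()
            if i[2] == j[2] or (parts and parts[0] in i[3]):
                matches.append([idx, jdx])
    return matches
```

This keeps the original semantics while avoiding the redundant `i[0]==j[0] and i[1]==j[1]` evaluation in both branches, which is also the kind of tight loop PyPy's JIT handles well.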