
I've got a big table in a CSV file, which has 5 million rows and 4 columns. My objective is to take each row from the first 500k and compare it with all the following rows (i.e. up to 5M − n of them) based on a certain condition. The condition is something like

row(n).column1 == row(m).column1 AND row(n).column2 == row(m).column2 AND row(n).column3 == row(m).column3

OR

row(n).column1 == row(m).column1 AND row(n).column2 == row(m).column2 AND row(m).column4.split()[0] in row(n).column4

where m > n, i.e. row(m) is any later row.

Currently I'm using a simple loop over lists:

for idx, i in enumerate(big[:500000]):
    for jdx, j in enumerate(big):
        # The inline conditional needed parentheses: Python parses
        # "x if cond else '' in i[3]" as "x if cond else ('' in i[3])",
        # so extract the first word of column 4 up front instead.
        first_word = j[3].split()[0] if j[3].split() else ''
        if (jdx > idx and i[0] == j[0] and i[1] == j[1] and i[2] == j[2]) \
                or (i[0] == j[0] and i[1] == j[1] and first_word in i[3]):
            matches.append([idx, jdx])

Which obviously takes a very long time to complete (about a week using a single process). Pandas and NumPy are good for operations on the whole array at a time, but I don't know whether this task can be converted into them somehow.

So the question is: how can I speed up the process?
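One way to recast the problem (not from the original post): both branches of the condition require column1 and column2 to be equal, so rows can be bucketed by that pair first, and only rows within the same bucket ever need comparing. A minimal sketch, assuming only pairs with jdx > idx are wanted, as the prose suggests; `find_matches_bucketed` and the `limit` parameter are my own names:

```python
from collections import defaultdict

def find_matches_bucketed(big, limit=500000):
    # Bucket row indices by (column1, column2): both branches of the
    # condition require these columns to be equal, so only rows in the
    # same bucket can ever match.
    buckets = defaultdict(list)
    for idx, row in enumerate(big):
        buckets[(row[0], row[1])].append(idx)

    matches = []
    for indices in buckets.values():        # indices are in ascending order
        for pos, idx in enumerate(indices):
            if idx >= limit:                # only the first `limit` rows act as row(n)
                break
            i = big[idx]
            for jdx in indices[pos + 1:]:   # guarantees jdx > idx
                j = big[jdx]
                parts = j[3].split()
                if i[2] == j[2] or (parts and parts[0] in i[3]):
                    matches.append([idx, jdx])
    return matches
```

With k distinct (column1, column2) keys, this replaces the full O(n²) scan with work proportional to the sum of squared bucket sizes, which can be dramatically smaller when the keys are well distributed.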

  • You are executing something like 1.2 trillion if-statements. Since you have the whole data set in primary memory, I/O shouldn't be a problem. If it takes a week, then you execute approximately 2 million if-statements per second, which sounds like a pretty good speed. I wonder how much faster it would be, even if the whole loop were written in C? Commented Oct 6, 2014 at 11:46
  • It probably won't change execution time, but you could refactor the terms in the logic statement: if i[0]==j[0] and i[1]==j[1] and ((jdx>idx and i[2]==j[2]) or ((j[3].split()[0] if j[3].split() else '') in i[3])) Commented Oct 6, 2014 at 12:05
  • Avoid this condition by generating indexes in the correct range: jdx > idx. That will save a bit of wasted time. Commented Oct 6, 2014 at 12:37
  • shift allows you to compare whole arrays against a shifted row. What you are doing could be done very quickly using numpy or pandas: for example, you could take a slice of your data (imported into numpy or pandas) and then filter it using your criteria, as all you are doing is looking for matches between 3 consecutive rows. Commented Oct 6, 2014 at 13:28
  • Good spotting by arhuaco. If you do for idx,i in enumerate(big[:500000]): for jdx,j in enumerate(big[idx:]): you will cut the number of iterations. Commented Oct 6, 2014 at 13:57
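A hedged sketch of the pandas idea from the comments above: a self-merge on the first two columns yields only the candidate pairs that already share column1 and column2, and the remaining checks then run over that much smaller set. The column names c1..c4 and the toy data are assumptions, not from the original post:

```python
import pandas as pd

df = pd.DataFrame(
    [['a', 'b', 'x', 'hello world'],
     ['a', 'b', 'x', 'foo'],
     ['a', 'b', 'y', 'hello there'],
     ['c', 'd', 'z', 'bar']],
    columns=['c1', 'c2', 'c3', 'c4'],
)
df = df.reset_index()  # keep the original row numbers as a column

# Self-merge: every pair of rows sharing (c1, c2) becomes one merged row;
# non-key columns get the _i / _j suffixes.
pairs = df.merge(df, on=['c1', 'c2'], suffixes=('_i', '_j'))
pairs = pairs[pairs['index_j'] > pairs['index_i']]  # only "following" rows

def first_word(s):
    parts = s.split()
    return parts[0] if parts else ''

# Apply the remaining condition row-wise on the reduced candidate set.
cond = (pairs['c3_i'] == pairs['c3_j']) | pairs.apply(
    lambda r: first_word(r['c4_j']) in r['c4_i'], axis=1)
matches = pairs.loc[cond, ['index_i', 'index_j']].values.tolist()
```

The merge does the heavy lifting in vectorized C code; only the substring check still runs row by row, and only over pairs that already agree on the first two columns.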

1 Answer


I ended up using the following things to improve performance:

  1. Separated the loop into a function (about 20% gain).
  2. Improved the logical operations.
  3. Used the PyPy interpreter (300+% speed-up).
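As a reconstruction (not the answer author's actual code), the "improved logical operations" point presumably amounts to testing the shared equality checks first, so most pairs are rejected after two cheap comparisons, and generating jdx only in the valid range, as the comments suggested:

```python
def find_matches_refactored(big, limit=500000):
    matches = []
    for idx, i in enumerate(big[:limit]):
        for jdx in range(idx + 1, len(big)):  # only following rows
            j = big[jdx]
            # Cheap shared checks first: both branches need these equalities,
            # so short-circuiting here skips the split() on most pairs.
            if i[0] != j[0] or i[1] != j[1]:
                continue
            parts = j[3].split()
            if i[2] == j[2] or (parts and parts[0] in i[3]):
                matches.append([idx, jdx])
    return matches
```

This keeps the original semantics while avoiding the redundant `i[0]==j[0] and i[1]==j[1]` evaluation in both branches, which is also the kind of tight loop PyPy's JIT handles well.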