I've got a big table in a CSV file with 5 million rows and 4 columns. My objective is to take each row from the first 500k and compare it with every row that follows it (i.e. roughly 5M − n comparisons for row n) based on a certain condition. The condition is something like:
(for each later row m > n)
row(n).column1 == row(m).column1 AND row(n).column2 == row(m).column2 AND row(n).column3 == row(m).column3
OR
row(n).column1 == row(m).column1 AND row(n).column2 == row(m).column2 AND row(m).column4.split()[0] in row(n).column4
Currently I'm using a simple nested loop over lists:
for idx, i in enumerate(big[:500000]):
    for jdx, j in enumerate(big):
        first = j[3].split()[0] if j[3].split() else ''
        if (jdx > idx and i[0] == j[0] and i[1] == j[1] and i[2] == j[2]) \
                or (i[0] == j[0] and i[1] == j[1] and first in i[3]):
            matches.append([idx, jdx])
This obviously takes a very long time to complete (about a week using a single process). Pandas and NumPy are good for operations on a whole array at a time, but I don't know whether this task can be converted into that form somehow.
So the question is: how can I speed up the process?
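The first OR-branch (equality on the first three columns) does map directly onto a pandas self-merge, which avoids the Python-level inner loop entirely. A minimal sketch on toy data, assuming the columns are named c1..c4 (the names and the sample rows are mine):

```python
import pandas as pd

# Toy stand-in for the 5M-row table; each row is [c1, c2, c3, c4].
big = [
    ['a', 'x', '1', 'foo bar'],
    ['a', 'x', '1', 'foo baz'],
    ['a', 'x', '2', 'bar something'],
    ['b', 'y', '3', 'qux'],
]
df = pd.DataFrame(big, columns=['c1', 'c2', 'c3', 'c4']).reset_index()

# Self-merge on the three equality columns: every pair of rows sharing
# c1, c2 and c3 appears once per ordering of (i, j).
pairs = df.merge(df, on=['c1', 'c2', 'c3'], suffixes=('_i', '_j'))

# Keep only pairs where i comes from the first 500k rows and j follows i.
pairs = pairs[(pairs['index_i'] < 500000) & (pairs['index_j'] > pairs['index_i'])]
exact_matches = pairs[['index_i', 'index_j']].values.tolist()
```

The substring check in the second branch doesn't vectorize as cleanly, but the same merge on just c1 and c2 would at least restrict it to candidate pairs.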
For reference, the condition can be factored as:

if i[0] == j[0] and i[1] == j[1] and ((jdx > idx and i[2] == j[2]) or ((j[3].split()[0] if j[3].split() else '') in i[3]))
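Since both OR-branches require column1 and column2 to be equal, one way to cut the quadratic cost without pandas is to bucket row indices by the (column1, column2) pair and only compare rows within a bucket. A pure-Python sketch on toy data (the self-comparison skip and the guard against an empty column4 are my additions, not in the original code):

```python
from collections import defaultdict

# Toy stand-in for the 5M-row table; each row is [col1, col2, col3, col4].
big = [
    ['a', 'x', '1', 'foo bar'],
    ['a', 'x', '1', 'foo baz'],
    ['a', 'x', '2', 'bar something'],
    ['b', 'y', '3', 'qux'],
]
N_FIRST = 500000  # only rows from this prefix act as the "i" side

# Both OR-branches require column1 and column2 to match, so bucket row
# indices by that pair: rows are only ever compared within one bucket.
buckets = defaultdict(list)
for idx, row in enumerate(big):
    buckets[(row[0], row[1])].append(idx)

matches = []
for bucket in buckets.values():
    for idx in bucket:
        if idx >= N_FIRST:
            continue
        i = big[idx]
        for jdx in bucket:
            if jdx == idx:
                continue  # skip self-comparison
            j = big[jdx]
            words = j[3].split()
            first = words[0] if words else ''
            # 'first and ...' guards against an empty column4, which
            # would otherwise substring-match every string.
            if (jdx > idx and i[2] == j[2]) or (first and first in i[3]):
                matches.append([idx, jdx])
```

If the duplicate (column1, column2) pairs are rare, each bucket is tiny and the total work drops from ~2.5 trillion comparisons to something proportional to the sum of squared bucket sizes.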