
I usually use Python for research, but this is my first time handling a large dataset (over a hundred million lines split across multiple files) on an old but capable workstation (Xeon E5-2637 v4 CPU, Quadro K420 GPU).

Any help with speeding up the algorithm below would be greatly appreciated. I am currently looking at tuning performance to make full use of the hardware, and at replacing my for loop with groupby, but to no avail so far. I have also looked through previous questions, but I believe what I need is more elementary.

The data format is as follows (the same format for all files):

C:/../data1.csv
--
  col1  col2  col3
parent abcde   NaN
 child   d3d   a1a
 child   s2s   f4f
parent fghij   NaN
 child   g5g   h6h
 child   j7j   k8k

My original code:

import pandas as pd

# list of file locations
filelist = {'files': ['C:/../data1.csv', 'C:/../data2.csv', 'C:/../data3.csv']}
filelist_df = pd.DataFrame(data=filelist)
filelist_df = filelist_df["files"].str.strip("[]")

# data transformation
column_names = ['1', '2', '3', '4']
temp_parent = []

for i in range(3):
    new_df = pd.DataFrame(columns=column_names)
    data_df = pd.read_csv(filelist_df[i], skiprows=1, names=column_names)
    for j in range(len(data_df)):
        if data_df['1'][j] == 'parent':
            # remember the current parent's col2 value
            temp_parent = data_df['2'][j]
        else:
            # stamp the remembered parent value onto this child row
            data_df.loc[j, '4'] = temp_parent
            temp_row = data_df.loc[j, :]
            new_df = new_df.append(temp_row, ignore_index=True)
    new_df.to_csv('C:/../new%d.csv' % i, index=False, header=False)
    del new_df, data_df, temp_parent, temp_row

Output (just for data1.csv):

C:/../new0.csv
--
 child   d3d   a1a abcde
 child   s2s   f4f abcde
 child   g5g   h6h fghij
 child   j7j   k8k fghij
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. Commented Jul 24, 2022 at 22:07

1 Answer


If I understand you correctly, you want to create a new column holding the value from the parent row's col2 column:

mask = df.col1.eq("parent")

df["col4"] = df.loc[mask, "col2"]  # only parent rows get their col2 value here
df["col4"] = df["col4"].ffill()    # forward-fill it onto the child rows below
print(df[~mask])                   # keep only the child rows

Prints:

    col1 col2 col3   col4
1  child  d3d  a1a  abcde
2  child  s2s  f4f  abcde
4  child  g5g  h6h  fghij
5  child  j7j  k8k  fghij

Input dataframe:

     col1   col2 col3
0  parent  abcde  NaN
1   child    d3d  a1a
2   child    s2s  f4f
3  parent  fghij  NaN
4   child    g5g  h6h
5   child    j7j  k8k
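
For the multi-file setup in the question, the same logic drops into a simple per-file loop. A minimal sketch (paths abbreviated as in the question; column names are illustrative):

import pandas as pd

filelist = ['C:/../data1.csv', 'C:/../data2.csv', 'C:/../data3.csv']

for i, path in enumerate(filelist):
    df = pd.read_csv(path, skiprows=1, names=['col1', 'col2', 'col3'])
    mask = df['col1'].eq('parent')
    df['col4'] = df.loc[mask, 'col2']  # parent rows get their col2 value
    df['col4'] = df['col4'].ffill()    # children inherit the parent above them
    df[~mask].to_csv('C:/../new%d.csv' % i, index=False, header=False)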

5 Comments

Thank you, this has already worked wonders in speeding up the algorithm. I'm sorry if this is out of place, but would it be possible to speed it up even more, for example by utilizing the GPU? As it stands, it would still take days to finish. Thank you so much already, though.
@byeme This algorithm is very simple; it should run quickly even on huge dataframes. Something else must be going on that slows the program.
It is going through each dataframe a lot faster than it was before, but I think the problem is that each file can be up to 100 MB and there are over 20,000 files. I may just split up the files and physically use multiple computers to run them all if this is already very simple.
@byeme In that case you can look at the multiprocessing module. Your processor has 8 threads, so it should speed up processing significantly (see the sketch after these comments).
Thanks, I'll definitely look into it some more.
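
A minimal sketch of the multiprocessing suggestion above, assuming each file can be processed independently by a worker function (process_file is an illustrative name, not from the thread):

from multiprocessing import Pool

import pandas as pd

def process_file(args):
    # apply the masked forward-fill transform from the answer to one file
    i, path = args
    df = pd.read_csv(path, skiprows=1, names=['col1', 'col2', 'col3'])
    mask = df['col1'].eq('parent')
    df['col4'] = df.loc[mask, 'col2']
    df['col4'] = df['col4'].ffill()
    df[~mask].to_csv('C:/../new%d.csv' % i, index=False, header=False)

if __name__ == '__main__':
    filelist = ['C:/../data1.csv', 'C:/../data2.csv', 'C:/../data3.csv']
    with Pool() as pool:  # one worker process per CPU core by default
        pool.map(process_file, list(enumerate(filelist)))

Since the files are independent, this is an embarrassingly parallel workload; with 8 hardware threads a roughly 8x speedup is plausible if the job is CPU-bound rather than disk-bound.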
