
I usually use Python for research, but this is my first time handling a large dataset (over a hundred million lines split across multiple files) on an old but capable workstation (Xeon E5-2637 v4 CPU, Quadro K420 GPU).

Any help with speeding up the algorithm below would be greatly appreciated. I am currently looking at tuning performance to make full use of the hardware, and at replacing my for loop with groupby, but to no avail so far. I have also looked through previous questions, but I believe what I need is more elementary.

The data format is as follows (the same format for all files):

C:/../data1.csv
--
  col1  col2  col3
parent abcde   NaN
 child   d3d   a1a
 child   s2s   f4f
parent fghij   NaN
 child   g5g   h6h
 child   j7j   k8k

My original code:

import pandas as pd

# list of file locations
filelist = {'files': ['C:/../data1.csv', 'C:/../data2.csv', 'C:/../data3.csv']}
filelist_df = pd.DataFrame(data=filelist)
filelist_df = filelist_df["files"].str.strip("[]")

# data transformation
column_names = ['1', '2', '3', '4']
temp_parent = []

for i in range(3):
    new_df = pd.DataFrame(columns=column_names)
    data_df = pd.read_csv(filelist_df[i], skiprows=1, names=column_names)
    for j in range(len(data_df)):
        if data_df['1'][j] == 'parent':
            # remember the current parent's col2 value
            temp_parent = data_df['2'][j]
        else:
            # stamp the remembered parent value onto this child row
            data_df.loc[j, '4'] = temp_parent
            temp_row = data_df.loc[j, :]
            new_df = new_df.append(temp_row, ignore_index=True)
    new_df.to_csv('C:/../new%d.csv' % i, index=False, header=False)
    del new_df, data_df, temp_parent, temp_row

Output (just for data1.csv):

C:/../new0.csv
--
 child   d3d   a1a abcde
 child   s2s   f4f abcde
 child   g5g   h6h fghij
 child   j7j   k8k fghij
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. Commented Jul 24, 2022 at 22:07

1 Answer


If I understand you correctly, you want to create a new column holding the value from the parent row's col2 column:

mask = df.col1.eq("parent")

df["col4"] = df.loc[mask, "col2"]  # only parent rows get their col2 value here
df["col4"] = df["col4"].ffill()    # forward-fill it onto the child rows below
print(df[~mask])                   # keep only the child rows

Prints:

    col1 col2 col3   col4
1  child  d3d  a1a  abcde
2  child  s2s  f4f  abcde
4  child  g5g  h6h  fghij
5  child  j7j  k8k  fghij

Input dataframe:

     col1   col2 col3
0  parent  abcde  NaN
1   child    d3d  a1a
2   child    s2s  f4f
3  parent  fghij  NaN
4   child    g5g  h6h
5   child    j7j  k8k
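
For the multi-file setup in the question, the same logic drops into a simple per-file loop. A minimal sketch (paths abbreviated as in the question; column names are illustrative):

import pandas as pd

filelist = ['C:/../data1.csv', 'C:/../data2.csv', 'C:/../data3.csv']

for i, path in enumerate(filelist):
    df = pd.read_csv(path, skiprows=1, names=['col1', 'col2', 'col3'])
    mask = df['col1'].eq('parent')
    df['col4'] = df.loc[mask, 'col2']  # parent rows get their col2 value
    df['col4'] = df['col4'].ffill()    # children inherit the parent above them
    df[~mask].to_csv('C:/../new%d.csv' % i, index=False, header=False)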

5 Comments

Thank you, this has already worked wonders in speeding up the algorithm. I'm sorry if this is out of place, but would it be possible to speed it up even more, for example by utilizing the GPU? As it stands, it would still take days to finish. Thank you so much already, though.
@byeme This algorithm is very simple; it should run quickly even on huge dataframes. Something else must be going on that slows the program.
It is going through each dataframe a lot faster than it was before, but I think the problem is that each file can be up to 100 MB and there are over 20,000 files. I may just split up the files and physically use multiple computers to run them all if this is already very simple.
@byeme In that case you can look at the multiprocessing module. Your processor has 8 threads, so it should speed up processing significantly (see the sketch after these comments).
Thanks, I'll definitely look into it some more.
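
A minimal sketch of the multiprocessing suggestion above, assuming each file can be processed independently by a worker function (process_file is an illustrative name, not from the thread):

from multiprocessing import Pool

import pandas as pd

def process_file(args):
    # apply the masked forward-fill transform from the answer to one file
    i, path = args
    df = pd.read_csv(path, skiprows=1, names=['col1', 'col2', 'col3'])
    mask = df['col1'].eq('parent')
    df['col4'] = df.loc[mask, 'col2']
    df['col4'] = df['col4'].ffill()
    df[~mask].to_csv('C:/../new%d.csv' % i, index=False, header=False)

if __name__ == '__main__':
    filelist = ['C:/../data1.csv', 'C:/../data2.csv', 'C:/../data3.csv']
    with Pool() as pool:  # one worker process per CPU core by default
        pool.map(process_file, list(enumerate(filelist)))

Since the files are independent, this is an embarrassingly parallel workload; with 8 hardware threads a roughly 8x speedup is plausible if the job is CPU-bound rather than disk-bound.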
