I have a very large DataFrame that contains data like the following:
import pandas as pd
df = pd.DataFrame()
df['CODE'] = [1, 2, 3, 1, 2, 4, 2, 2, 4, 5]
df['DATA'] = ['AA', 'BB', 'CC', 'DD', 'AA', 'BB', 'EE', 'FF', 'GG', 'HH']
df = df.sort_values('CODE')  # sort_values returns a copy, so assign it back
df
   CODE DATA
0     1   AA
3     1   DD
1     2   BB
4     2   AA
6     2   EE
7     2   FF
2     3   CC
5     4   BB
8     4   GG
9     5   HH
Because of its size, I need to split it into chunks and parse each one. However, rows with equal values in the CODE column must not end up in different chunks; instead, they should be added to the previous chunk even if that makes it exceed the chunk size.
Basically, if I choose a chunk size of 4 rows, the first chunk would be extended to include all the rows with CODE 2 and become:
chunk1:
   CODE DATA
0     1   AA
3     1   DD
1     2   BB
4     2   AA
6     2   EE
7     2   FF
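In other words, I'm after something like the sketch below (chunk_by_code is just a placeholder name, and I suspect a plain Python loop over the groups is not the fastest option on a frame this size):
import pandas as pd

def chunk_by_code(df, chunk_size):
    # keep rows with the same CODE contiguous
    df = df.sort_values('CODE')
    parts, rows = [], 0
    # groupby with the default sort=True walks the CODE values in order
    for _, group in df.groupby('CODE'):
        parts.append(group)
        rows += len(group)
        # close the chunk once the target size is reached or exceeded,
        # so a CODE group is never split across two chunks
        if rows >= chunk_size:
            yield pd.concat(parts)
            parts, rows = [], 0
    if parts:  # whatever is left over becomes the last chunk
        yield pd.concat(parts)
With the data above, list(chunk_by_code(df, 4)) gives the six-row chunk1 first and then a single four-row chunk holding the codes 3, 4 and 5.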
I found some posts about chunking and grouping, like the following:
split dataframe into multiple dataframes based on number of rows
However, the above produces equal-size chunks, and I need a smarter chunking that takes the values in the CODE column into account.
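If I understood those answers correctly, they boil down to fixed-size slicing along these lines, which is exactly what I want to avoid because the boundaries can split a CODE group:
n = 4
# fixed-size slicing: the cut points ignore the CODE column, so with the
# sorted example data the rows with CODE 2 land in two different chunks
equal_chunks = [df[i:i + n] for i in range(0, len(df), n)]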
Any ideas on how to do that in a way that scales to a very large DataFrame?