
I have a very large DataFrame that contains data like the following:

import pandas as pd

df = pd.DataFrame()
df['CODE'] = [1, 2, 3, 1, 2, 4, 2, 2, 4, 5]
df['DATA'] = ['AA', 'BB', 'CC', 'DD', 'AA', 'BB', 'EE', 'FF', 'GG', 'HH']
df = df.sort_values('CODE')  # sort_values returns a copy, so reassign
df
   CODE DATA
0     1   AA
3     1   DD
1     2   BB
4     2   AA
6     2   EE
7     2   FF
2     3   CC
5     4   BB
8     4   GG
9     5   HH

Because of its size I need to split it into chunks and process them. However, rows with equal values in the CODE column should not end up in different chunks; instead, they should be added to the previous chunk even if its size is exceeded.

Basically, if I choose a chunk size of 4 rows, the first chunk would be extended to include all elements with CODE 2 and become:

chunk1:
   CODE DATA
0     1   AA
3     1   DD
1     2   BB
4     2   AA
6     2   EE
7     2   FF
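
Presumably the remaining four rows would then fit exactly into the second chunk:

chunk2:
   CODE DATA
2     3   CC
5     4   BB
8     4   GG
9     5   HH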

I found some posts about chunking and grouping like the following:

split dataframe into multiple dataframes based on number of rows

However, the above provides equal-size chunking, and I need smart chunking that takes into account the values in the CODE column.
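
For reference, a minimal sketch of that equal-size approach (cutting purely by position, so a CODE group can straddle two chunks):

import math
import numpy as np

# equal-size chunking: roughly 4 rows per chunk, CODE values ignored
for chunk in np.array_split(df, math.ceil(len(df) / 4)):
    print(chunk)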

Any ideas how to do that?

2 Answers


I may have come up with a solution (still testing all cases), though it is not very elegant.

I created a recursive function that returns the intervals to take:

def findrange(start, step):
    # Try to end the chunk at start + step; if that boundary would split
    # a run of equal CODE values, grow the chunk by one row and retry.
    if start + step >= len(df):
        return [start, len(df)]
    if df.CODE.iloc[start + step] != df.CODE.iloc[start + step - 1]:
        return [start, start + step]
    return findrange(start, step + 1)

Then I call the function to get the ranges and process the data:

N = 2
idx = 0
while idx < len(df):
    interval = findrange(idx, N)
    # process df.iloc[interval[0]:interval[1]] here
    idx = interval[1]
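
With the example DataFrame and N = 2 this produces the intervals [0, 2], [2, 6], [6, 9] and [9, 10], so the four rows with CODE 2 stay together in one chunk.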

I tried it with the DataFrame posted using many different values for N > 0 and it looks good. If you have a more pandas-like approach, I am open to that.
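
For comparison, here is a minimal vectorized sketch of the same idea (the chunk_boundaries helper is a made-up name, and it assumes df is already sorted by CODE):

import numpy as np

def chunk_boundaries(frame, n):
    # positions where a new run of equal CODE values begins
    starts = np.flatnonzero(frame['CODE'].ne(frame['CODE'].shift()).to_numpy())
    run_ends = np.append(starts[1:], len(frame))  # end position of each run
    start = 0
    for end in run_ends:
        if end - start >= n:  # minimum chunk size reached: cut here
            yield start, end
            start = end
    if start < len(frame):  # leftover rows form the last, smaller chunk
        yield start, len(frame)

for start, end in chunk_boundaries(df, 4):
    print(df.iloc[start:end])

Because each cut lands on a CODE boundary, every chunk is at least n rows long except possibly the last one.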


I think you can create a new column GROUPS by cumcount and then floor-divide by N to get chunks for each CODE value:

N = 2
df['GROUPS'] = df.groupby('CODE').cumcount() // N
print (df)
   CODE DATA  GROUPS
0     1   AA       0
3     1   DD       0
1     2   BB       0
4     2   AA       0
6     2   EE       1
7     2   FF       1
2     3   CC       0
5     4   BB       0
8     4   GG       0
9     5   HH       0

groups = df.groupby(['CODE','GROUPS'])
for (code, group_no), frame in groups:
    # the group key is a (CODE, GROUPS) tuple, so build the name from both
    frame.to_csv("%s_%s.csv" % (code, group_no))

You can also create a new Series and use it for the groupby:

chunked_ser = df.groupby('CODE').cumcount() // N
print (chunked_ser)
0    0
3    0
1    0
4    0
6    1
7    1
2    0
5    0
8    0
9    0
dtype: int64

groups = df.groupby([df.CODE, chunked_ser])
for (code, group_no), frame in groups:
    frame.to_csv("%s_%s.csv" % (code, group_no))

Comments

But with N=2, for example, I end up splitting the rows having CODE equal to 2. Instead, I would keep them in the same chunk.
Hmmm, I understood your problem as needing to split each group by the CODE column into more groups, because of the large data. So you can use a large N like 1000 and then split into 1000-row chunks. Or am I missing something?
If you use floor division, e.g. with N=3, you get a last group of size 1: most groups have the same length, but the last one can obviously be smaller.
