
I have a very large DataFrame that contains data like the following:

import pandas as pd

df = pd.DataFrame()
df['CODE'] = [1, 2, 3, 1, 2, 4, 2, 2, 4, 5]
df['DATA'] = ['AA', 'BB', 'CC', 'DD', 'AA', 'BB', 'EE', 'FF', 'GG', 'HH']
df = df.sort_values('CODE')  # sort_values returns a copy, so reassign
df
   CODE DATA
0     1   AA
3     1   DD
1     2   BB
4     2   AA
6     2   EE
7     2   FF
2     3   CC
5     4   BB
8     4   GG
9     5   HH

Because of its size I need to split it into chunks and process them. However, rows with equal values in the CODE column should not end up in different chunks; instead, they should be added to the previous chunk even if its size is exceeded.

Basically, if I choose a chunk size of 4 rows, the first chunk would be extended to include all elements with CODE 2 and become:

chunk1:
   CODE DATA
0     1   AA
3     1   DD
1     2   BB
4     2   AA
6     2   EE
7     2   FF
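
Presumably the remaining four rows would then fit exactly into the second chunk:

chunk2:
   CODE DATA
2     3   CC
5     4   BB
8     4   GG
9     5   HH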

I found some posts about chunking and grouping like the following:

split dataframe into multiple dataframes based on number of rows

However, the above provides equal-size chunking, and I need smart chunking that takes into account the values in the CODE column.
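
For reference, a minimal sketch of that equal-size approach (cutting purely by position, so a CODE group can straddle two chunks):

import math
import numpy as np

# equal-size chunking: roughly 4 rows per chunk, CODE values ignored
for chunk in np.array_split(df, math.ceil(len(df) / 4)):
    print(chunk)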

Any ideas how to do that?

2 Answers


I may have come up with a solution (still testing all cases), though it is not very elegant.

I created a recursive function that returns the intervals to take:

def findrange(start, step):
    # Try to end the chunk at start + step; if that boundary would split
    # a run of equal CODE values, grow the chunk by one row and retry.
    if start + step >= len(df):
        return [start, len(df)]
    if df.CODE.iloc[start + step] != df.CODE.iloc[start + step - 1]:
        return [start, start + step]
    return findrange(start, step + 1)

Then I call the function to get the ranges and process the data:

N = 2
idx = 0
while idx < len(df):
    interval = findrange(idx, N)
    # process df.iloc[interval[0]:interval[1]] here
    idx = interval[1]
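
With the example DataFrame and N = 2 this produces the intervals [0, 2], [2, 6], [6, 9] and [9, 10], so the four rows with CODE 2 stay together in one chunk.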

I tried it with the DataFrame posted using many different values for N > 0 and it looks good. If you have a more pandas-like approach, I am open to that.
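
For comparison, here is a minimal vectorized sketch of the same idea (the chunk_boundaries helper is a made-up name, and it assumes df is already sorted by CODE):

import numpy as np

def chunk_boundaries(frame, n):
    # positions where a new run of equal CODE values begins
    starts = np.flatnonzero(frame['CODE'].ne(frame['CODE'].shift()).to_numpy())
    run_ends = np.append(starts[1:], len(frame))  # end position of each run
    start = 0
    for end in run_ends:
        if end - start >= n:  # minimum chunk size reached: cut here
            yield start, end
            start = end
    if start < len(frame):  # leftover rows form the last, smaller chunk
        yield start, len(frame)

for start, end in chunk_boundaries(df, 4):
    print(df.iloc[start:end])

Because each cut lands on a CODE boundary, every chunk is at least n rows long except possibly the last one.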


I think you can create a new column GROUPS by cumcount and then floor-divide by N to get chunks for each CODE value:

N = 2
df['GROUPS'] = df.groupby('CODE').cumcount() // N
print (df)
   CODE DATA  GROUPS
0     1   AA       0
3     1   DD       0
1     2   BB       0
4     2   AA       0
6     2   EE       1
7     2   FF       1
2     3   CC       0
5     4   BB       0
8     4   GG       0
9     5   HH       0

groups = df.groupby(['CODE','GROUPS'])
for (code, group_no), frame in groups:
    # the group key is a (CODE, GROUPS) tuple, so build the name from both
    frame.to_csv("%s_%s.csv" % (code, group_no))

You can also create a new Series and use it for the groupby:

chunked_ser = df.groupby('CODE').cumcount() // N
print (chunked_ser)
0    0
3    0
1    0
4    0
6    1
7    1
2    0
5    0
8    0
9    0
dtype: int64

groups = df.groupby([df.CODE, chunked_ser])
for (code, group_no), frame in groups:
    frame.to_csv("%s_%s.csv" % (code, group_no))

Comments

But with N=2, for example, I end up splitting the rows having CODE equal to 2. Instead, I would keep them in the same chunk.
Hmmm, I understood your problem as needing to split each group by the CODE column into more groups, because of the large data. So you can use a large N like 1000 and then split into 1000-row chunks. Or am I missing something?
If you use floor division, e.g. with N=3, you get a last group of size 1: most groups have the same length, but the last one can obviously be smaller.
