0

I have a dataframe with 1M+ rows. A sample of the dataframe is shown below:

df

    ID      Type        File    
 0  123     Phone       1       
 1  122     Computer    2       
 2  126     Computer    1       

I want to split this dataframe based on Type and File. If the total count of Type is 2 (Phone and Computer), total number of files is 2 (1,2), then the total number of splits will be 4.

In short, total splits is as given below:

 total_splits=len(set(df['Type']))*len(set(df['File']))

In this example, total_splits=4. Now, I want to split the dataframe df in 4 based on Type and File.

So the new dataframes should be:

  1. df1 (having data of type=Phone and File=1)
  2. df2 (having data of type=Computer and File=1)
  3. df3 (having data of type=Phone and File=2)
  4. df4 (having data of type=Computer and File=2)

The splitting should be done inside a loop.

I know we can split a dataframe based on one condition (shown below), but how do you split it based on two ?

My Code:

data = {'ID' : ['123', '122', '126'],'Type' :['Phone','Computer','Computer'],'File' : [1,2,1]}

df=pd.DataFrame(data) 

types=list(set(df['Type']))
total_splits=len(set(df['Type']))*len(set(df['File']))

cnt=1

for i in range(0,total_splits):
      for j in types:
            locals()["df"+str(cnt)] = df[df['Type'] == j]
            cnt += 1

The result of the above code gives 2 dataframes, df1 and df2. df1 will have data of Type='Phone' and df2 will have data of Type='Computer'.

But this is just half of what I want to do. Is there a way we can make 4 dataframes here based on 2 conditions ?

Note: I know I can first split on 'Type' and then split the resulting dataframe based on 'File' to get the output. However, I want to know of a more efficient way of performing the split instead of having to create multiple dataframes to get the job done.

EDIT

This is not a duplicate question as I want to split the dataframe based on multiple column values, not just one!

1

1 Answer 1

1

You can make do with groupby:

dfs = {}
for k, d in df.groupby(['Type','File']):
    type, file = k

    # do want ever you want here
    # d is the dataframe corresponding with type, file
    dfs[k] = d

You can also create a mask:

df['mask'] = df['File'].eq(1) * 2 + df['Type'].eq('Phone')

Then, for example:

df[df['mask'].eq(0)]

gives you the first dataframe you want, i.e. Type==Phone and File==1, and so on.

Sign up to request clarification or add additional context in comments.

2 Comments

I used the groupby method but ended up with 8 dataframes with a lot of duplicate dataframes generated. Code I used is: cnt=1 for k, d in df.groupby(['File','Type']): type, file = k locals()["df"+str(cnt)] = d cnt += 1
That means you have more categories, e.g, 3 Types. You probably want to filter them out first.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.