How to divide a dataframe into several dataframes

Question

How can I divide a large DataFrame that has multiple categorical columns, each containing multiple labels (classes)?

For example, I have 1 million rows and 100 columns, 50 of which hold categorical data with different labels.

How can I divide the DataFrame into 2 or 3 parts (subsets) so that every label in every categorical column is present in each subset? Is this possible for large datasets?

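import numpy as np

# df (the full DataFrame) and df_cat_cols (the names of its categorical
# columns) are defined elsewhere.
def rec():
    print('#rec Started')
    shuf_data = df.sample(frac=1)  # shuffle all rows
    ran_data = np.random.rand(len(shuf_data)) < 0.5  # random mask, roughly 50:50
    p_d = shuf_data[ran_data]
    d = shuf_data[~ran_data]

    def rrec(p_d,d):
        print('#rrec Started')
        for col in df_cat_cols:
            p_dcol = p_d[col].unique()
            dcol = d[col].unique()
            # True only if every label found in d also occurs in p_d
            outcome = all(elem in p_dcol for elem in dcol)
            if outcome:
                print("Yes, list1 contains all elements in list2")
            else:
                print("No, list1 does not contain all elements in list2")
                return rec()  # reshuffle and retry from scratch
        return p_d,d

    return rrec(p_d,d)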

The process running the code above gets killed because the dataset is very large (1 million records). Please suggest a better, more efficient approach. Thank you.

Here is an example:

    Fruits  Color   Price
0   Banana  Yellow  60
1   Grape   Black   100
2   Apple   Red     200
3   Papaya  Yellow  50
4   Dragon  Pink    150
5   Mango   Yellow  400
6   Banana  Yellow  75
7   Grape   Black   106
8   Apple   Red     190
9   Papaya  Yellow  60
10  Dragon  Pink    120
11  Mango   Yellow  390

Expected 50:50 split:

df1:

3   Papaya  Yellow  50
4   Dragon  Pink    150
5   Mango   Yellow  400
6   Banana  Yellow  75
7   Grape   Black   106
8   Apple   Red     190

df2:
0   Banana  Yellow  60
1   Grape   Black   100
2   Apple   Red     200
9   Papaya  Yellow  60
10  Dragon  Pink    120
11  Mango   Yellow  390

Would you mind providing a minimal reproducible example?
– rpanai, Commented Nov 4, 2021 at 15:07

Hi @rpanai, do you want an example?
– swarna, Commented Nov 4, 2021 at 15:30

Yes, I do. Please have a look at How to Ask too.
– rpanai, Commented Nov 4, 2021 at 15:52

Hi @rpanai, I added an example. Can you please check it?
– swarna, Commented Nov 4, 2021 at 16:12

Quang Hoang · Accepted Answer · 2021-11-04 14:44:32Z

1

Yes, one way is to enumerate all rows with the same categories:

cat_cols = ['cat_col1', 'cat_col2']

groups = df.groupby(cat_cols).cumcount() // 3

sub_df = {g: d for g,d in df.groupby(groups)}
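For context, `cumcount()` numbers the rows inside each label combination (0, 1, 2, ...), so `// 3` puts the first three rows of every combination into group 0, the next three into group 1, and so on. If a fixed number of parts is wanted instead, one possible variation (not part of the original answer) is to deal the rows out in turn with `% n_parts`. The sketch below uses the question's example data; `n_parts` and `part_id` are hypothetical names. Note that it groups by the combination of categorical columns, and a combination with fewer rows than `n_parts` cannot appear in every part.

```python
import pandas as pd

# Example data taken from the question.
df = pd.DataFrame({
    "Fruits": ["Banana", "Grape", "Apple", "Papaya", "Dragon", "Mango"] * 2,
    "Color": ["Yellow", "Black", "Red", "Yellow", "Pink", "Yellow"] * 2,
    "Price": [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})
cat_cols = ["Fruits", "Color"]
n_parts = 2  # hypothetical: number of subsets wanted

# Number the rows within each label combination, then assign them to the
# parts in round-robin order (0, 1, 0, 1, ...).
part_id = df.groupby(cat_cols).cumcount() % n_parts

# One DataFrame per part, keyed by part number.
sub_df = {g: d for g, d in df.groupby(part_id)}

for g, d in sub_df.items():
    print(g, len(d), sorted(d["Fruits"].unique()))
```

Each part can then be saved as in the answer, for example with `sub_df[0].to_csv('file0.csv')`.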


5 Comments

swarna Over a year ago

Hi @QuangHoang, how do I assign the two DataFrames to variables so that I can save them?

Quang Hoang Over a year ago

You already have it. You can save, for example sub_df[0].to_csv('file0.csv').

swarna Over a year ago

When I tested with 80k samples, I got only 1400 records in each CSV file.

swarna Over a year ago

I checked it and it's working. Each CSV file has all the labels, but how do I make two CSV files with a 50:50 or 70:30 split of the data, etc.? This is the only thing I still need. Please advise.

Quang Hoang Over a year ago

train = df.groupby(cat_cols).sample(frac=.7); test = df.drop(train.index)
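The one-liner in the last comment can be expanded into a small runnable sketch. It assumes a pandas version that provides `DataFrameGroupBy.sample` and uses the question's example data; `frac=0.5` and `random_state=0` are illustrative choices, and `train.index` is used to drop the sampled rows. A label combination that occurs in only one row ends up on only one side, so it is worth checking the result.

```python
import pandas as pd

# Example data taken from the question.
df = pd.DataFrame({
    "Fruits": ["Banana", "Grape", "Apple", "Papaya", "Dragon", "Mango"] * 2,
    "Color": ["Yellow", "Black", "Red", "Yellow", "Pink", "Yellow"] * 2,
    "Price": [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})
cat_cols = ["Fruits", "Color"]

# Sample the same fraction from every label combination
# (use frac=0.7 for a 70:30 split).
train = df.groupby(cat_cols).sample(frac=0.5, random_state=0)

# Everything that was not sampled goes to the other subset.
test = df.drop(train.index)

# Check that each subset still contains every label of every column.
for col in cat_cols:
    all_labels = set(df[col])
    print(col, set(train[col]) == all_labels, set(test[col]) == all_labels)
```

Because sampling happens per group, the split ratio holds approximately within each combination rather than only for the DataFrame as a whole.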

yagyesh · Accepted Answer · 2021-11-04 14:43:52Z

1

Why don't you try the train_test_split() function from scikit-learn, together with OneHotEncoder() to break the categorical columns down? This is more of a machine learning approach, and I have used it to split a dataset with 1 million rows before, so it should work.


1 Comment

Quang Hoang Over a year ago

A train/test split doesn't guarantee that every feature value is present in both parts. Although it's unlikely, you might end up with, e.g., train having all the ones while test has all the zeros.
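Whichever splitting method is used, the result can be checked with one pass per categorical column instead of retrying random splits. The following is a minimal sketch; `missing_labels` is a hypothetical helper, not a pandas or scikit-learn function, and the tiny DataFrame is made up to show a split that loses labels.

```python
import pandas as pd

def missing_labels(full, part, cat_cols):
    """Return {column: labels that occur in `full` but not in `part`}."""
    missing = {}
    for col in cat_cols:
        absent = set(full[col].dropna().unique()) - set(part[col].dropna().unique())
        if absent:
            missing[col] = absent
    return missing

# Made-up data: a naive positional split puts all the Bananas in one half.
df = pd.DataFrame({
    "Fruits": ["Banana", "Banana", "Grape", "Apple"],
    "Color": ["Yellow", "Yellow", "Black", "Red"],
})
cat_cols = ["Fruits", "Color"]
first_half, second_half = df.iloc[:2], df.iloc[2:]

report_first = missing_labels(df, first_half, cat_cols)
report_second = missing_labels(df, second_half, cat_cols)
print(report_first)
print(report_second)
```

An empty dictionary means the subset contains every label of every listed column.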
