
How can I divide a large data frame that has multiple categorical columns, each with multiple labels (classes)?

For example, I have 1 million rows and 100 columns, 50 of which hold categorical data with different labels.

How can I divide the data frame into 2 or 3 parts (subsets) such that every label of every categorical column is present in each subset? Is that possible for large datasets?

import numpy as np

# df is the full data frame; df_cat_cols lists its categorical columns.
def rec():
    print('#rec Started')
    shuf_data = df.sample(frac=1)  # shuffle all rows
    ran_data = np.random.rand(len(shuf_data)) < 0.5  # random, roughly 50:50 mask
    p_d = shuf_data[ran_data]
    d = shuf_data[~ran_data]

    def rrec(p_d,d):
        print('#rrec Started')
        for col in df_cat_cols:
            p_dcol = p_d[col].unique()
            dcol = d[col].unique()
            # check that every label seen in d also appears in p_d
            outcome = all(elem in p_dcol for elem in dcol)
            if outcome:
                print("Yes, list1 contains all elements in list2")
            else:
                print("No, list1 does not contain all elements in list2")
                # reshuffle and retry from scratch
                return rec()
        return p_d,d

    return rrec(p_d,d)

The above code kills the process because the dataset is very large (1 million records). Please suggest a better, more efficient approach. Thank you.

Here is an example:

    Fruits  Color   Price
0   Banana  Yellow  60
1   Grape   Black   100
2   Apple   Red     200
3   Papaya  Yellow  50
4   Dragon  Pink    150
5   Mango   Yellow  400
6   Banana  Yellow  75
7   Grape   Black   106
8   Apple   Red     190
9   Papaya  Yellow  60
10  Dragon  Pink    120
11  Mango   Yellow  390

Expected 50:50 split:

df1:
3   Papaya  Yellow  50
4   Dragon  Pink    150
5   Mango   Yellow  400
6   Banana  Yellow  75
7   Grape   Black   106
8   Apple   Red     190

df2:
0   Banana  Yellow  60
1   Grape   Black   100
2   Apple   Red     200
9   Papaya  Yellow  60
10  Dragon  Pink    120
11  Mango   Yellow  390
  • Would you mind providing a minimal reproducible example? Commented Nov 4, 2021 at 15:07
  • Hi @rpanai, do you want an example? Commented Nov 4, 2021 at 15:30
  • Yes, I do. Please have a look at How to Ask too. Commented Nov 4, 2021 at 15:52
  • Hi @rpanai, I added an example; can you please check it? Commented Nov 4, 2021 at 16:12

2 Answers


Yes, one way is to enumerate the rows within each combination of categories and cut that numbering into chunks (here, three rows per combination per chunk):

cat_cols = ['cat_col1', 'cat_col2']

groups = df.groupby(cat_cols).cumcount() // 3

sub_df = {g: d for g,d in df.groupby(groups)}
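
If a fixed number of subsets is wanted instead of fixed-size chunks, a variation on the same idea is to take the per-combination row number modulo the number of parts. This is only a sketch on the question's toy data; `n_parts`, `part_id` and `parts` are names made up for illustration, and a combination is only guaranteed to reach every part if it occurs at least `n_parts` times.

```python
import pandas as pd

# Toy data taken from the example in the question.
df = pd.DataFrame({
    "Fruits": ["Banana", "Grape", "Apple", "Papaya", "Dragon", "Mango"] * 2,
    "Color": ["Yellow", "Black", "Red", "Yellow", "Pink", "Yellow"] * 2,
    "Price": [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})
cat_cols = ["Fruits", "Color"]
n_parts = 2  # hypothetical: how many subsets are wanted

# Number the rows inside each category combination (0, 1, 2, ...) and deal
# them out round-robin: row k of a combination goes to part k % n_parts.
part_id = df.groupby(cat_cols).cumcount() % n_parts

# One data frame per part, keyed by part number.
parts = {p: d for p, d in df.groupby(part_id)}
```

With many categorical columns most combinations may be unique, so it may be necessary to group on fewer columns and then check the remaining ones separately.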

5 Comments

Hi @QuangHoang, how do I assign the two data frames to variables so that I can save them?
You already have them. You can save one with, for example, sub_df[0].to_csv('file0.csv').
When I tested with 80k samples, I got only 1400 records in each CSV file.
I checked it and it works: each CSV file has all the labels. But how do I make two CSV files with a 50:50 or 70:30 split of the data, etc.? That is the only thing I still need. Please advise.
train = df.groupby(cat_cols).sample(frac=.7); test = df.drop(train.index).
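
Spelled out, the 70:30 suggestion from the last comment might look like the sketch below. The toy frame and `random_state=0` are assumptions added for illustration. It relies on `df` having a unique index, and on every category combination having enough rows that the rounded 70% is neither zero nor the whole group.

```python
import pandas as pd

# Hypothetical toy data: three category combinations, ten rows each.
df = pd.DataFrame({
    "Fruits": ["Banana", "Grape", "Apple"] * 10,
    "Color": ["Yellow", "Black", "Red"] * 10,
    "Price": range(30),
})
cat_cols = ["Fruits", "Color"]

# Sample 70% of the rows inside every category combination.
train = df.groupby(cat_cols).sample(frac=0.7, random_state=0)

# Everything that was not sampled becomes the other subset.
test = df.drop(train.index)
```

Changing `frac` to 0.5 would give a 50:50 split in the same way.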

Why don't you try the train_test_split() function from scikit-learn, together with OneHotEncoder() to break the categorical columns down? This is more of a machine learning approach, and I have used it to split a dataset with 1 million rows before, so it should work.

1 Comment

A train/test split doesn't guarantee that all feature values exist in both parts. Although it's unlikely, you might end up with, e.g., train having all the ones while test has all the zeros.
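
Whichever splitting method is used, the result can be checked once instead of reshuffling recursively. The helper below is a hypothetical sketch (`missing_labels` is not a pandas or scikit-learn function) that reports, per categorical column, which labels of the full frame are absent from a subset.

```python
import pandas as pd

def missing_labels(full, part, cat_cols):
    """Return {column: labels present in `full` but absent from `part`}."""
    missing = {}
    for col in cat_cols:
        # Compare the sets of distinct non-null labels in each frame.
        absent = set(full[col].dropna().unique()) - set(part[col].dropna().unique())
        if absent:
            missing[col] = absent
    return missing

# Hypothetical toy data to show the two possible outcomes.
df = pd.DataFrame({"Fruits": ["Banana", "Grape", "Banana", "Grape"],
                   "Color": ["Yellow", "Black", "Yellow", "Black"]})
good = df.iloc[:2]      # holds both fruits and both colors
bad = df.iloc[[0, 2]]   # holds only Banana / Yellow
```

An empty dictionary means the subset covers every label; otherwise the returned labels show which rows would still have to be moved into that subset.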
