2

the data which i have is in below format :

col_1         col_2                            col_3

NaN            NaN                              NaN
Date         21-04-2022                         NaN
Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN
NaN            NaN                             NaN
Date         22-04-2022                        NaN
Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

Expected Output : df_1 -

Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN

df_2 -

Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

Above is just the sample data. Actual Number of columns which i have is around 24K. thus many number of df's will be created how it can be approached..?

4 Answers 4

2

You can use:

grp = df['col_1'].eq('Id').cumsum()  # create virtual groups
msk = ~df.isna().all(axis=1) & df['col_1'].ne('Date')  # keep wanted rows

# create a dict with subset dataframes
dfs = {f'df{name}': pd.DataFrame(temp.values[1:], columns=temp.iloc[0].tolist()) 
           for name, temp in df[msk].groupby(grp)}

Output:

>>> dfs['df1']
        Id                       Name status
0       01                        A11   Pass
1       02                        A22    F_1
2       03                        A33    P_2
3  SUMMARY  Total :$20  Approved $ 10    NaN

>>> dfs['df2']
        Id                       Name status
0       04                        A12    P_2
1       05                        A23    F_1
2       06                        A34    P_2
3  SUMMARY  Total :$30  Approved $ 20    NaN

Update: export to excel:

with pd.ExcelWriter('data.xlsx') as writer:
    for name, temp in dfs.items():
        temp.to_excel(writer, index=False, sheet_name=name)
Sign up to request clarification or add additional context in comments.

4 Comments

how to export the created dataframes into excel with incremental names i.e. df1 , df2, etc
@Romi. I updated my answer, can you check it please? (I used one excel and multiple sheets)
its working perfectly fine..!! one last question if we want to apply a common function to all these sheets at once(in a loop) how it could be done
dfs = {name: my_common_function(temp) for name, temp in dfs.items()}
0

You could create an auxiliary column of booleans and use that to slice the dataframe into smaller parts:

import pandas as pd
df = pd.DataFrame({'col_1': [1,2,'Id',3,4,5,'SUMMARY',1,2,'Id',3,4,5,'SUMMARY']})

mask = df['col_1'].eq('Id') | df['col_1'].eq('SUMMARY').shift()
df['group_id'] = mask.cumsum()
dfs = list()
for group_id in df['group_id'].unique():
    if group_id % 2 != 0:
        dfs.append(df[df['group_id'].eq(group_id)])

print(dfs[0])
print(dfs[1])

Comments

0

Highly inspired by piRSquared's answer here, you can approach your goal like this :

import pandas as pd
import numpy as np

df.columns = ["Id", "Name", "Status"]

# is the row a Margin ?
m = df["Id"].eq("SUMMARY")

l_df = list(filter(lambda d: not d.empty, np.split(df, np.flatnonzero(m) + 1)))

_ = [exec(f"globals()['df_{idx}'] = df.reset_index(drop=True) \
                                      .loc[3:].reset_index(drop=True)")
     for idx, df in enumerate(l_df, start=1)]

NB : We used globals to create the variables/sub-dataframes dynamically.

# Output :
print(len(l_df), "DataFrames was created!")
2 DataFrames was created!

print(df_1, type(df_1)), print(df_2, type(df_2)))
    
        Id                         Name Status
0       04                          A12    P_2
1       05                          A23    F_1
2       06                          A34    P_2
3  SUMMARY   'Total :$30 Approved $ 20'    NaN <class 'pandas.core.frame.DataFrame'>

        Id                         Name Status
0       01                          A11   Pass
1       02                          A22    F_1
2       03                          A33    P_2
3  SUMMARY   'Total :$20 Approved $ 10'    NaN <class 'pandas.core.frame.DataFrame'>

Comments

0

With the following code, you can select the row you want as column names and put the rows you want as new data frame rows.

import numpy as np

df_1 = pd.DataFrame(data = np.array(df.iloc[3:7]),columns=np.array(df.iloc[2:3])[0])
df_2 = pd.DataFrame(data = np.array(df.iloc[11:15]),columns=np.array(df.iloc[10:11])[0])

Output df_1:

Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN

Output df_2:

Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

1 Comment

Please read How to Answer and edit your answer to contain an explanation as to why this code would actually solve the problem at hand. Always remember that you're not only solving the problem, but are also educating the OP and any future readers of this post

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.