
I have a CSV file with a format like this:

Header 1, Header 2, Header 3
''          ''        ''
value 1,  value2,   value 3
value 1,  value2,   value 3
value 1,  value2,   value 3
''          ''        ''
value 1,  value 2,   value 3
value 1,  value 2,   value 3
value 1,  value 2,   value 3
 ''          ''        ''

I can read it into a pandas dataframe, but the segments surrounded by empty rows (denoted by '') each need to be processed individually. What would be the simplest way to split it into smaller dataframes, one per segment between empty rows? I have quite a few of these segments to go through.

Would it be easier to divide them into smaller dataframes, or would it be even easier to remove each segment from the original dataframe after processing it?

EDIT:

IanS's answer was correct, but in my case some of my files simply had no quotes in the empty rows, so the values were NaN rather than strings. I modified his answer a little and this worked for them:

df['counter'] = df['Header 1'].isnull().cumsum()
df = df[df['Header 1'].notnull()]  # remove empty rows
df.groupby('counter').apply(lambda df: df.iloc[0])
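For files where the separator rows come in as NaN, the whole approach can be sketched end to end. The sample values and the `segments` dict below are illustrative, not from the original files:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the parsed CSV: separator rows are all-NaN
df = pd.DataFrame({
    'Header 1': [np.nan, 'a1', 'a2', np.nan, 'b1', 'b2', np.nan],
    'Header 2': [np.nan, 'x1', 'x2', np.nan, 'y1', 'y2', np.nan],
})

# Each NaN separator bumps the counter, so rows of one segment share a value
df['counter'] = df['Header 1'].isnull().cumsum()
df = df[df['Header 1'].notnull()]  # drop the separator rows themselves

# One small dataframe per segment
segments = {idx: group.drop(columns='counter')
            for idx, group in df.groupby('counter')}

print(len(segments))  # 2 segments
```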
  • The simplest would be to add a counter that increments each time it encounters an empty row. You can then get your individual dataframes via df.groupby('counter'). If you are interested I can write an answer. Commented Apr 8, 2016 at 10:19
  • That's a good idea, I'll try writing it on my end but if you write yours I shall accept it as an answer Commented Apr 8, 2016 at 10:24

2 Answers


You can find the empty rows with str.contains, build a counter series with cumsum, group by it, and then get the small DataFrames in a loop:

print(df['Header 1'].str.contains("''").cumsum())
0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
8    3
Name: Header 1, dtype: int32

for idx, group in df.groupby(df['Header 1'].str.contains("''").cumsum()):
    print(idx)
    print(group[1:])
1
  Header 1  Header 2    Header 3
1  value 1    value2     value 3
2  value 1    value2     value 3
3  value 1    value2     value 3
2
  Header 1   Header 2    Header 3
5  value 1    value 2     value 3
6  value 1    value 2     value 3
7  value 1    value 2     value 3
3
Empty DataFrame
Columns: [Header 1,  Header 2,  Header 3]
Index: []

If you want, you can create a dictionary of DataFrames:

dfs = {}
for idx, group in df.groupby(df['Header 1'].str.contains("''").cumsum()):
    dfs[idx] = group[1:]



The simplest would be to add a counter that increments each time it encounters an empty row. You can then get your individual dataframes via groupby.

df['counter'] = (df['Header1'] == "''").cumsum()
df = df[df['Header1'] != "''"]  # remove empty rows
df.groupby('counter').apply(lambda df: df.iloc[0])

The last line applies your processing function to each dataframe separately (I just put a dummy example).

Note that the exact condition testing for empty rows (here df['Header1'] == "''") should be adapted to your exact situation.
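If a single file can mix both conventions (the "''" marker in some separator rows, genuinely empty cells in others), one way to adapt the condition is to combine the two tests into a single mask. The column name and data below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data: separator rows appear both as the "''" marker and as NaN
df = pd.DataFrame({'Header1': ["''", 'v1', 'v2', np.nan, 'v3', "''"]})

# A row is a separator if it is NaN or holds the quote marker
is_sep = df['Header1'].isnull() | (df['Header1'] == "''")

df['counter'] = is_sep.cumsum()
df = df[~is_sep]  # keep only data rows

segments = {idx: g for idx, g in df.groupby('counter')}
```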

