
I have a CSV file with a format like this:

Header 1, Header 2, Header 3
''          ''        ''
value 1,  value2,   value 3
value 1,  value2,   value 3
value 1,  value2,   value 3
''          ''        ''
value 1,  value 2,   value 3
value 1,  value 2,   value 3
value 1,  value 2,   value 3
 ''          ''        ''

I can read it into a pandas dataframe, but the segments surrounded by empty rows (denoted by '') each need to be processed individually. What would be the simplest way to split it into smaller dataframes, one per segment between empty rows? I have quite a few of these segments to go through.

Would it be easier to divide them into smaller dataframes, or would it be even easier to remove each segment from the original dataframe after processing it?

EDIT:

IanS's answer was correct, but in my case some of my files simply had no quotes in the empty rows, so the values were NaN rather than strings. I modified his answer a little and this worked for them:

df['counter'] = df['Header 1'].isnull().cumsum()
df = df[df['Header 1'].notnull()]  # remove empty rows
df.groupby('counter').apply(lambda df: df.iloc[0])
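For files where the separator rows come in as NaN, the whole approach can be sketched end to end. The sample values and the `segments` dict below are illustrative, not from the original files:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the parsed CSV: separator rows are all-NaN
df = pd.DataFrame({
    'Header 1': [np.nan, 'a1', 'a2', np.nan, 'b1', 'b2', np.nan],
    'Header 2': [np.nan, 'x1', 'x2', np.nan, 'y1', 'y2', np.nan],
})

# Each NaN separator bumps the counter, so rows of one segment share a value
df['counter'] = df['Header 1'].isnull().cumsum()
df = df[df['Header 1'].notnull()]  # drop the separator rows themselves

# One small dataframe per segment
segments = {idx: group.drop(columns='counter')
            for idx, group in df.groupby('counter')}

print(len(segments))  # 2 segments
```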
  • The simplest would be to add a counter that increments each time it encounters an empty row. You can then get your individual dataframes via df.groupby('counter'). If you are interested I can write an answer. Commented Apr 8, 2016 at 10:19
  • That's a good idea, I'll try writing it on my end but if you write yours I shall accept it as an answer Commented Apr 8, 2016 at 10:24

2 Answers


You can find the empty rows with str.contains, build a counter series with cumsum, group by it, and then get the small DataFrames in a loop:

print(df['Header 1'].str.contains("''").cumsum())
0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
8    3
Name: Header 1, dtype: int32

for idx, group in df.groupby(df['Header 1'].str.contains("''").cumsum()):
    print(idx)
    print(group[1:])
1
  Header 1  Header 2    Header 3
1  value 1    value2     value 3
2  value 1    value2     value 3
3  value 1    value2     value 3
2
  Header 1   Header 2    Header 3
5  value 1    value 2     value 3
6  value 1    value 2     value 3
7  value 1    value 2     value 3
3
Empty DataFrame
Columns: [Header 1,  Header 2,  Header 3]
Index: []

If you want, you can create a dictionary of DataFrames:

dfs = {}
for idx, group in df.groupby(df['Header 1'].str.contains("''").cumsum()):
    dfs[idx] = group[1:]



The simplest would be to add a counter that increments each time it encounters an empty row. You can then get your individual dataframes via groupby.

df['counter'] = (df['Header1'] == "''").cumsum()
df = df[df['Header1'] != "''"]  # remove empty rows
df.groupby('counter').apply(lambda df: df.iloc[0])

The last line applies your processing function to each dataframe separately (I just put a dummy example).

Note that the exact condition testing for empty rows (here df['Header1'] == "''") should be adapted to your exact situation.
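If a single file can mix both conventions (the "''" marker in some separator rows, genuinely empty cells in others), one way to adapt the condition is to combine the two tests into a single mask. The column name and data below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data: separator rows appear both as the "''" marker and as NaN
df = pd.DataFrame({'Header1': ["''", 'v1', 'v2', np.nan, 'v3', "''"]})

# A row is a separator if it is NaN or holds the quote marker
is_sep = df['Header1'].isnull() | (df['Header1'] == "''")

df['counter'] = is_sep.cumsum()
df = df[~is_sep]  # keep only data rows

segments = {idx: g for idx, g in df.groupby('counter')}
```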

