2

I have several Excel files that I am processing with pandas. I need to remove a certain number of rows from the top of each file. These extra rows may be empty or may contain text. Pandas seems to combine some of the rows, so I am not sure how many need to be removed.

Here is an example Excel file (represented as CSV):

,,
,,
some text,,
,,
,,
,,
name, date, task
Jason,1-Jan,swim 
Aem,2-Jan,workout 

Here is my current Python script:

import pandas as pd 
xl = pd.ExcelFile('extra_rows.xlsx') 
dfs = xl.parse(xl.sheet_names[0]) 
print ("dfs: ", dfs) 

Here are the results when I print the DataFrame:

dfs:          Unnamed: 0           Unnamed: 1 Unnamed: 2
0  some other text                  NaN        NaN
1              NaN                  NaN        NaN
2              NaN                  NaN        NaN
3              NaN                  NaN        NaN
4             name                 date       task
5            Jason  2016-01-01 00:00:00       swim
6              Aem  2016-01-02 00:00:00    workout

From the file, I would remove the first 6 rows. However, from the dataframe I would only remove 4. Is there a way to read in the Excel file with the data in its raw state so the number of rows remains consistent?

3 Answers

2

I used Python 3 and pandas 0.18.1. The function for loading Excel files is pandas.read_excel. Passing header=None keeps the leading blank rows. Here is some sample code:

(1) With the default parameters, the result drops the leading blank rows:

In [12]: pd.read_excel('test.xlsx')
Out[12]: 
  Unnamed: 0 Unnamed: 1 Unnamed: 2
0      text1        NaN        NaN
1        NaN        NaN        NaN
2         n1         t2         c3
3        NaN        NaN        NaN
4        NaN        NaN        NaN
5        jim        sum        tim

(2) With header=None, the result keeps the leading blank rows:

In [13]: pd.read_excel('test.xlsx', header=None)
Out[13]: 
       0    1    2
0    NaN  NaN  NaN
1    NaN  NaN  NaN
2  text1  NaN  NaN
3    NaN  NaN  NaN
4     n1   t2   c3
5    NaN  NaN  NaN
6    NaN  NaN  NaN
7    jim  sum  tim
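
Once the blank rows are preserved, the position of the real header row can be found programmatically instead of counted by hand. The following is only a sketch: the DataFrame is built in memory as a stand-in for the header=None output above, find_header_row is a hypothetical helper (not a pandas function), and it assumes the header is the first row in which every cell is populated, which may not hold for every file.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel('test.xlsx', header=None); mirrors the output above.
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ["text1", np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ["n1", "t2", "c3"],
    [np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ["jim", "sum", "tim"],
])


def find_header_row(frame):
    """Hypothetical helper: position of the first row with no missing cells."""
    full_rows = frame.notna().all(axis=1)
    if not full_rows.any():
        raise ValueError("no fully populated row found")
    # argmax on a boolean array returns the position of the first True.
    return int(full_rows.to_numpy().argmax())


header_pos = find_header_row(raw)
print(header_pos)  # 4
```

The "all cells populated" rule is an assumption about the layout; a file whose stray text rows fill every column would need a different test, such as matching known column names.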

2

Here is what you are looking for:

import pandas as pd 
xl = pd.ExcelFile('extra_rows.xlsx') 
dfs = xl.parse(skiprows=6) 
print ("dfs: ", dfs) 

Check the docs on ExcelFile for more details.

2

If you read your file in with pd.read_excel and pass header=None, the blank rows should be included:

In [286]: df = pd.read_excel("test.xlsx", header=None)

In [287]: df
Out[287]:
           0     1      2
0        NaN   NaN    NaN
1        NaN   NaN    NaN
2  something   NaN    NaN
3        NaN   NaN    NaN
4       name  date  other
5          1     2      3
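
With the raw rows intact, the header row can be promoted to column names and everything above it discarded. This is a minimal sketch, not a definitive method: the DataFrame is an in-memory stand-in for the header=None output above, and header_pos is assumed to be known already (4 for this layout).

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel("test.xlsx", header=None); mirrors the output above.
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ["something", np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ["name", "date", "other"],
    [1, 2, 3],
])

header_pos = 4  # assumption: position of the real header row in the raw frame

# Keep only the rows below the header, then use the header row as column names.
clean = raw.iloc[header_pos + 1:].reset_index(drop=True)
clean.columns = raw.iloc[header_pos].tolist()

print(clean)
```

Because the raw columns mix text and numbers, the resulting columns keep the object dtype; converting them (for example with pd.to_numeric) is left as a separate step.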
