Using pandas to read certain columns from an xlsx, with a condition from another row

Question

I have an xlsx where the first 9 rows are headers. Row 1 contains a name, like "Bob" and "Alice".

Row 4 contains either 'Monthly' or 'Quarterly'.

Sometimes there are two fields called 'Bob' but one has 'Monthly' and the other has 'Quarterly' in row 4.

I understand I could read in the column called 'Bob' into a dataframe, but is there a way to specify which one should be loaded into the dataframe?

For example, below I have Bob and Alice, and as it stands I would read in 2 Bob fields and 2 Alice fields. Is there a way of reducing these somehow on the initial read?

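import pandas as pd

fields = ['Bob', 'Alice']
# Renamed from `type` so the Python builtin is not shadowed
frequencies = ['Monthly', 'Quarterly']

# The keyword is `sheet_name`, not `sheet`
df = pd.read_excel('data.xlsx', sheet_name='Sheet1', usecols=fields)
# See the keys
print(df.keys())
# See content in 'Bob' (column labels are case-sensitive)
print(df['Bob'])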

Alternatively, is there a way I can read all 4 columns - Bob and Alice - and then only keep the one I want (e.g. monthly for Bob, quarterly for Alice)?

Example xlsx file is as follows (formatted as a csv to make it look nicer here though):

Mnemonic:,Alice,Bob,Mnemonic:,Alice,Bob
Description:,Test results for Alice,Test results for Bob,Description:,Test results for Alice,Test results for Bob
Source:,(na),(na),Source:,(na),(na)
Native Frequency:,Monthly,Monthly,Native Frequency:,Quarterly,Quarterly
Transformation:,None,None,Transformation:,None,None
Begin Date:,10/31/2006,10/31/2006,Begin Date:,09/30/2006,09/30/2006
Last Updated:,,,Last Updated:,,
Historical End Date:,12/30/2017,12/30/2017,Historical End Date:,12/30/2017,12/30/2017
Geography:,(na),(na),Geography:,(na),(na)
10/31/2006,3,2,09/30/2006,3,2
11/30/2006,3,2,12/31/2006,5,1
12/31/2006,3,2,03/31/2007,7,4
01/31/2007,5,1,06/30/2007,8,7
02/28/2007,5,1,09/30/2007,1,2
03/31/2007,5,1,12/31/2007,6,9
04/30/2007,7,4,03/31/2008,1,5
05/31/2007,7,4,06/30/2008,9,7
06/30/2007,7,4,09/30/2008,9,2
07/31/2007,8,7,12/31/2008,8,7
08/31/2007,8,7,03/31/2009,5,8
09/30/2007,8,7,06/30/2009,3,6

jpp · Accepted Answer · 2018-02-05 13:19:39Z

There isn't an option to filter the rows before the Excel file is loaded into a pandas object.

If your file were in CSV format, you could iterate over it in chunks, filter each chunk, and then concatenate the filtered chunks into one dataframe. See this answer for details.
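A minimal sketch of that chunked approach, assuming a long-format CSV with one row per observation. The column names (`name`, `frequency`, `value`), the in-memory `csv_text`, and the helper `read_filtered` are hypothetical stand-ins for illustration and are not taken from the question's file:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
csv_text = (
    "name,frequency,value\n"
    "Bob,Monthly,2\n"
    "Bob,Quarterly,1\n"
    "Alice,Monthly,3\n"
    "Alice,Quarterly,5\n"
)


def read_filtered(source, frequency, chunksize=2):
    """Read a CSV in chunks, keeping only rows with the given frequency."""
    kept = []
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        kept.append(chunk[chunk["frequency"] == frequency])
    # Combine the filtered chunks into a single dataframe.
    return pd.concat(kept, ignore_index=True)


monthly = read_filtered(io.StringIO(csv_text), "Monthly")
print(monthly)
```

In practice `source` would be a file path rather than a `StringIO`, and `chunksize` would be much larger. Only the filtered rows are held in memory at once.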


4 Comments

Aaraeus Over a year ago

What about filtering after it's in a pandas object? Can I load ALL the rows (including the headers), then delete the columns I don't want?

jpp Over a year ago

@Aaraeus, of course! would you mind amending your question accordingly? that way other users can find a problem-solution combo easily.

jpp Over a year ago

@Aaraeus, at the same time please provide a sample of the data so we can test.

Aaraeus Over a year ago

Just did both. Thanks in advance!
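One possible way to do the load-then-filter approach discussed in these comments is sketched below. It assumes the layout from the sample in the question: names in row 1, frequency in row 4, 9 header rows, and a date column headed 'Mnemonic:' to the left of each block. The first rows of that sample are read from a string with `read_csv` so the example is self-contained; for the real file, `pd.read_excel('data.xlsx', header=None)` should give a similarly shaped frame, although Excel date cells may already arrive as datetimes. The helper `pick_columns` and the `wanted` mapping are hypothetical names:

```python
import io

import pandas as pd

# First rows of the sample from the question, used as a stand-in for the xlsx.
sample = """Mnemonic:,Alice,Bob,Mnemonic:,Alice,Bob
Description:,Test results for Alice,Test results for Bob,Description:,Test results for Alice,Test results for Bob
Source:,(na),(na),Source:,(na),(na)
Native Frequency:,Monthly,Monthly,Native Frequency:,Quarterly,Quarterly
Transformation:,None,None,Transformation:,None,None
Begin Date:,10/31/2006,10/31/2006,Begin Date:,09/30/2006,09/30/2006
Last Updated:,,,Last Updated:,,
Historical End Date:,12/30/2017,12/30/2017,Historical End Date:,12/30/2017,12/30/2017
Geography:,(na),(na),Geography:,(na),(na)
10/31/2006,3,2,09/30/2006,3,2
11/30/2006,3,2,12/31/2006,5,1
12/31/2006,3,2,03/31/2007,7,4
"""

# header=None keeps all 9 header rows as ordinary rows we can inspect.
raw = pd.read_csv(io.StringIO(sample), header=None, dtype=str,
                  keep_default_na=False)


def pick_columns(raw, wanted, n_header=9):
    """Return {name: Series} for each (name, frequency) pair in `wanted`."""
    names = raw.iloc[0]   # row 1: mnemonic
    freqs = raw.iloc[3]   # row 4: native frequency
    body = raw.iloc[n_header:].reset_index(drop=True)
    result = {}
    for name, freq in wanted.items():
        mask = (names == name) & (freqs == freq)
        cols = mask[mask].index
        if len(cols) != 1:
            raise ValueError(f"expected one column for {name}/{freq}, got {len(cols)}")
        col = cols[0]
        # Assumption: the dates sit in the nearest 'Mnemonic:' column to the left.
        date_col = max(c for c in raw.columns
                       if names.loc[c] == "Mnemonic:" and c < col)
        dates = pd.DatetimeIndex(pd.to_datetime(body[date_col], format="%m/%d/%Y"))
        result[name] = pd.Series(body[col].astype(int).to_numpy(),
                                 index=dates, name=name)
    return result


wanted = {"Bob": "Monthly", "Alice": "Quarterly"}
selected = pick_columns(raw, wanted)
print(selected["Bob"])
print(selected["Alice"])
```

The series are returned separately rather than as one dataframe because the monthly and quarterly blocks have different date indexes.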
