
I'm trying to write a Pandas script that extracts data from several Excel files. They contain between 10 and 15 columns. From these columns I need the 1st one, which has a different header in every file, and some other columns which always have the same header names ('TOTAL', 'CLEAR', 'NON-CLEAR' and 'SYSTEM') but sit at different column positions in the different files. (I mean that in one of the files 'TOTAL' is the 3rd column of the table, but in another file it is the 5th column.)

I know that with the usecols keyword I can specify which columns to use, but it looks like this argument takes either header names or column indices, never a combination of both.

Is it possible to write a statement that takes the 1st column by its index and the other ones by header name at the same time?

The below statement doesn't work:

df = pd.read_excel(file, usecols=[0, 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM'])
  • I don't think this is possible with the usecols arg; you could read 0 rows and just splice your columns together. Commented Jul 9, 2020 at 16:23
  • @Datanovice usecols does take a callable. Commented Jul 9, 2020 at 16:39

3 Answers


You could use pd.read_excel() twice and then join both DataFrames:

df1 = pd.read_excel(file, usecols=[0])
df2 = pd.read_excel(file, usecols=['TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM'])
df = pd.concat([df1, df2], axis=1, join='outer')
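To see what the joined result looks like without an actual Excel file, the same selection logic can be exercised on an in-memory DataFrame (the data and the 'REGION' header below are made up; in practice the 1st column's name differs per file):

```python
import pandas as pd

# Hypothetical data standing in for one of the Excel files.
df_full = pd.DataFrame({
    'REGION': ['north', 'south'],
    'EXTRA': [9, 9],
    'TOTAL': [10, 20],
    'CLEAR': [6, 12],
    'NON-CLEAR': [4, 8],
    'SYSTEM': [1, 2],
})

# Same idea as the two-read approach, applied in memory:
df1 = df_full.iloc[:, [0]]                                # 1st column by position
df2 = df_full[['TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']]  # the rest by name
df = pd.concat([df1, df2], axis=1)
print(list(df.columns))  # → ['REGION', 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
```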

2 Comments

Currently I use the "hotfix" of just reading the whole data files without specifying any usecols. Regarding performance and memory usage, which of these two solutions (yours or mine) is better? I'm talking about really big files with lots of data in them.
I'd guess that mine is more efficient, but I'm not an expert on that. You can run both solutions in Jupyter notebooks and use %%time to track how fast they are.
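Outside a notebook, %%time is not available; a minimal sketch of the same measurement with the standard library's time.perf_counter (the `timed` helper is an invented name, not a pandas feature) would be:

```python
import time

def timed(fn):
    """Run fn once, print elapsed wall-clock seconds, return its result."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f'{fn.__name__}: {elapsed:.3f}s')
    return result

# Usage sketch: wrap each candidate in a zero-argument callable, e.g.
#   df = timed(lambda: pd.read_excel(file))
```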

If it has only fifteen columns, it is probably faster not to read the file twice. You can read the whole file into memory and then extract the columns you need with the much nicer pandas interface:

df = pd.read_excel(file)
df = df[[df.columns[0], 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']]
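Since the 1st header differs per file, it can also help to rename it to one fixed label after the selection, so frames from different files line up later. A small sketch with invented headers ('Branch-2020' and the target name 'ID' are arbitrary choices, not from the question):

```python
import pandas as pd

# Made-up headers: the 1st column's name varies from file to file.
df = pd.DataFrame({'Branch-2020': [1, 2], 'EXTRA': [0, 0], 'TOTAL': [3, 4]})
df = df[[df.columns[0], 'TOTAL']]

# Normalise the varying first header to a fixed one.
df = df.rename(columns={df.columns[0]: 'ID'})
print(list(df.columns))  # → ['ID', 'TOTAL']
```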



Here you go — usecols also accepts a callable, so you can accept the 1st column by arrival order and the remaining ones by name:

import pandas as pd

def make_usecols():
    first_column = None
    def process(column_name):
        nonlocal first_column
        if first_column is None:
            # pandas offers the header names in order, so the first call
            # sees the 1st column, whatever it is named
            first_column = column_name
            return True
        return column_name in ['TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
    return process

print(pd.read_excel(file, usecols=make_usecols()))
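This trick relies on pandas feeding the header names to the callable in left-to-right order, so the selection logic can be checked without any file I/O by applying it to a list of headers the way pandas would (the header names below are invented):

```python
# Self-contained restatement of the stateful-callable trick.
def make_usecols():
    first_column = None
    def process(column_name):
        nonlocal first_column
        if first_column is None:
            first_column = column_name  # first header offered wins
            return True
        return column_name in ['TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
    return process

select = make_usecols()
headers = ['Branch', 'EXTRA', 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
kept = [h for h in headers if select(h)]
print(kept)  # → ['Branch', 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
```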

