
I'm trying to write a Pandas script that extracts data from several Excel files. They contain between 10 and 15 columns. From these columns I need the 1st one, which has a different header in every file, and some other columns which always have the same header names ('TOTAL', 'CLEAR', 'NON-CLEAR' and 'SYSTEM') but sit at different column positions in the different files. (I mean that in one of the files 'TOTAL' is the 3rd column of the table, but in another file it is the 5th column.)

I know that with the usecols keyword I can specify which columns to use, but it looks like this argument takes either header names or column indices, never a combination of both.

Is it possible to write a statement that takes the 1st column by its index and the other ones by header name at the same time?

The below statement doesn't work:

df = pd.read_excel(file, usecols=[0, 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM'])
  • I don't think this is possible with the usecols arg; you could read 0 rows and just splice your columns together. Commented Jul 9, 2020 at 16:23
  • @Datanovice usecols does take a callable. Commented Jul 9, 2020 at 16:39

3 Answers


You could use pd.read_excel() twice and then join both DataFrames:

df1 = pd.read_excel(file, usecols=[0])
df2 = pd.read_excel(file, usecols=['TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM'])
df = pd.concat([df1, df2], axis=1, join='outer')
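To see what the joined result looks like without an actual Excel file, the same selection logic can be exercised on an in-memory DataFrame (the data and the 'REGION' header below are made up; in practice the 1st column's name differs per file):

```python
import pandas as pd

# Hypothetical data standing in for one of the Excel files.
df_full = pd.DataFrame({
    'REGION': ['north', 'south'],
    'EXTRA': [9, 9],
    'TOTAL': [10, 20],
    'CLEAR': [6, 12],
    'NON-CLEAR': [4, 8],
    'SYSTEM': [1, 2],
})

# Same idea as the two-read approach, applied in memory:
df1 = df_full.iloc[:, [0]]                                # 1st column by position
df2 = df_full[['TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']]  # the rest by name
df = pd.concat([df1, df2], axis=1)
print(list(df.columns))  # → ['REGION', 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
```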

2 Comments

Currently I use the "hotfix" of just reading the whole data files without specifying any usecols. Regarding performance and memory usage, which of these two solutions (yours or mine) is better? I'm talking about really big files with lots of data in them.
I'd guess that mine is more efficient, but I'm not an expert on that. You can run both solutions in Jupyter notebooks and use %%time to track how fast they are.
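Outside a notebook, %%time is not available; a minimal sketch of the same measurement with the standard library's time.perf_counter (the `timed` helper is an invented name, not a pandas feature) would be:

```python
import time

def timed(fn):
    """Run fn once, print elapsed wall-clock seconds, return its result."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f'{fn.__name__}: {elapsed:.3f}s')
    return result

# Usage sketch: wrap each candidate in a zero-argument callable, e.g.
#   df = timed(lambda: pd.read_excel(file))
```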

If it has only fifteen columns, it is probably faster not to read the file twice. You can read the whole file into memory and then extract the columns you need with the much nicer pandas interface:

df = pd.read_excel(file)
df = df[[df.columns[0], 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']]
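Since the 1st header differs per file, it can also help to rename it to one fixed label after the selection, so frames from different files line up later. A small sketch with invented headers ('Branch-2020' and the target name 'ID' are arbitrary choices, not from the question):

```python
import pandas as pd

# Made-up headers: the 1st column's name varies from file to file.
df = pd.DataFrame({'Branch-2020': [1, 2], 'EXTRA': [0, 0], 'TOTAL': [3, 4]})
df = df[[df.columns[0], 'TOTAL']]

# Normalise the varying first header to a fixed one.
df = df.rename(columns={df.columns[0]: 'ID'})
print(list(df.columns))  # → ['ID', 'TOTAL']
```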



Here you go — usecols also accepts a callable, so you can accept the 1st column by arrival order and the remaining ones by name:

import pandas as pd

def make_usecols():
    first_column = None
    def process(column_name):
        nonlocal first_column
        if first_column is None:
            # pandas offers the header names in order, so the first call
            # sees the 1st column, whatever it is named
            first_column = column_name
            return True
        return column_name in ['TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
    return process

print(pd.read_excel(file, usecols=make_usecols()))
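This trick relies on pandas feeding the header names to the callable in left-to-right order, so the selection logic can be checked without any file I/O by applying it to a list of headers the way pandas would (the header names below are invented):

```python
# Self-contained restatement of the stateful-callable trick.
def make_usecols():
    first_column = None
    def process(column_name):
        nonlocal first_column
        if first_column is None:
            first_column = column_name  # first header offered wins
            return True
        return column_name in ['TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
    return process

select = make_usecols()
headers = ['Branch', 'EXTRA', 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
kept = [h for h in headers if select(h)]
print(kept)  # → ['Branch', 'TOTAL', 'CLEAR', 'NON-CLEAR', 'SYSTEM']
```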

