
I created a function that iterates over a folder of Excel files and builds a list of all the headers across all sheets. It works fine but is VERY slow. Do you have any ideas on how to improve it? Thanks!

import glob

import pandas as pd

# file directory
path = r'C:\Users\John\Excel_folder'
all_files = glob.glob(path + "/*.xlsx")

def get_columns(file):
    sheets = pd.ExcelFile(file).sheet_names
    for sheet in sheets:
        for i in pd.read_excel(file, sheet_name=sheet, nrows=0).columns:
            col.append(i)

col = []
for file in all_files:
    get_columns(file)

col

1 Answer

You can pass `sheet_name=None` to `read_excel` to read all sheets at once. It returns a dictionary of DataFrames, so at the end you can flatten the columns with a list comprehension.

def get_columns(file):
    return [c 
            for df in pd.read_excel(file, 
                                    sheet_name=None, 
                                    nrows=0).values() 
            for c in df.columns]

col = [c for file in all_files for c in get_columns(file)]

It should be faster because each file is opened once instead of once per sheet.
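To make the flattening pattern concrete without needing any Excel files on disk, here is a minimal sketch where plain lists of column names stand in for the DataFrames that `pd.read_excel(..., sheet_name=None)` would return (the sheet names and columns below are made up for illustration):

```python
# sheet_name=None returns a dict mapping sheet names to DataFrames;
# here plain lists of column names play the role of the DataFrames.
sheets = {
    "Sheet1": ["id", "name"],
    "Sheet2": ["id", "amount"],
}

# Same shape as the answer's comprehension: iterate the dict's values,
# then iterate each sheet's columns, collecting everything into one list.
col = [c for columns in sheets.values() for c in columns]
print(col)  # ['id', 'name', 'id', 'amount']
```

Note that duplicate headers (like `id` here) are kept; wrap the result in `set(...)` if you only want unique column names.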


2 Comments

Thanks Ben! I had never come across the collections.OrderedDict class. How do I access its elements?
@AlmogWoldenberg I'm not sure what you mean — `pd.read_excel` with `sheet_name=None` returns a regular dict for me. In any case, an OrderedDict can be used like a "regular" dict: `items()`, `keys()`, `values()`, just like in the code above. The one difference is that it keeps keys in the order they were added, but other than that I don't know enough about it.
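To illustrate the comment above: older pandas versions returned a `collections.OrderedDict` from `read_excel(..., sheet_name=None)`, but it supports the same access patterns as a plain dict. A quick sketch, again using lists of column names as stand-ins for DataFrames:

```python
from collections import OrderedDict

# Build an OrderedDict the same way read_excel would: one entry per sheet.
d = OrderedDict()
d["Sheet1"] = ["id", "name"]
d["Sheet2"] = ["id", "amount"]

# All the usual dict access patterns work.
print(list(d.keys()))   # ['Sheet1', 'Sheet2'] - insertion order preserved
print(d["Sheet1"])      # ['id', 'name']
for sheet, columns in d.items():
    print(sheet, columns)
```

Since Python 3.7 the built-in dict also preserves insertion order, which is why modern pandas can return a plain dict here.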
