I have some Excel .xlsx files, each containing multiple sheets. I have used the following code to read and extract data from the files:

import pandas as pd
file = pd.ExcelFile('my_file.xlsx')
file.sheet_names  # Displays the sheet names
df = file.parse('Sheet1')  # Parses Sheet1 into a DataFrame
df.columns  # Lists the columns

I am interested in the email column in each sheet. So far I have been doing this almost manually with the code above. I need code that automatically iterates over the sheets and extracts all the emails. Help!

1 Answer

You can loop over every file and every sheet with nested for loops:

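import os
import pandas as pd

emails = []
files_dir = "/your_path_to_the_xlsx_files"
for file in os.listdir(files_dir):
    # Skip anything in the directory that is not an .xlsx workbook
    if not file.endswith('.xlsx'):
        continue
    # os.listdir returns bare names, so join them with the directory
    excel = pd.ExcelFile(os.path.join(files_dir, file))
    for sheet in excel.sheet_names:
        df = excel.parse(sheet)
        # Some sheets have no 'email' column; skip those
        if 'email' not in df.columns:
            continue
        emails.extend(df['email'].tolist())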

Now you have all the emails in the emails list.
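
If the column header is not always spelled exactly 'email', or some cells are empty, a variation along these lines might help. This is only a sketch under stated assumptions, not part of the original answer: collect_emails is a hypothetical helper name, and the two DataFrames are made-up stand-ins for real sheets. It relies on pd.read_excel(path, sheet_name=None) returning a dict that maps each sheet name to a DataFrame, so the same function can be fed a real workbook.

```python
import pandas as pd


def collect_emails(sheets):
    """Gather non-empty values from every column named 'email' (any case).

    `sheets` maps sheet names to DataFrames, the shape that
    pd.read_excel(path, sheet_name=None) returns. Hypothetical helper.
    """
    emails = []
    for df in sheets.values():
        for col in df.columns:
            # Column labels are not always strings, so convert before comparing.
            if str(col).strip().lower() == "email":
                # Drop empty cells so the result holds only real values.
                emails.extend(df[col].dropna().astype(str).tolist())
    return emails


# Made-up in-memory stand-ins for two sheets; with a real workbook this would be
# sheets = pd.read_excel("my_file.xlsx", sheet_name=None)
sheets = {
    "Sheet1": pd.DataFrame({"name": ["a", "b"], "Email": ["a@x.com", None]}),
    "Sheet2": pd.DataFrame({"phone": [1, 2]}),
}
print(collect_emails(sheets))
```

Sheets without a matching column simply contribute nothing, so no explicit skip is needed.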

5 Comments

Forgot to mention that some sheets don't have an 'email' column, so I am getting an error.
Just edited: if a sheet doesn't have an email column, the loop now skips it.
Thanks @Bruno. Can you advise on why I am getting this error: FileNotFoundError: [Errno 2] No such file or directory: 'bongaigaon 500.xls.xlsx'? I have the file in my /tmp/work path.
Sorry! My fault, I forgot to prepend the directory name to the file name. See if it works now, @urbanmonk.
Working now! Thanks mate
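
The FileNotFoundError discussed in these comments happens because os.listdir returns bare file names, which are then resolved against the current working directory rather than the folder that was listed. One way to avoid the problem, sketched here with pathlib instead of os.path.join, is to glob for the files so that each result already carries its directory. The helper name list_xlsx and the temporary directory are assumptions made for illustration; only the file name is taken from the comment.

```python
import tempfile
from pathlib import Path


def list_xlsx(files_dir):
    """Return the .xlsx files directly inside files_dir, sorted, with their directory."""
    # Path.glob yields paths that already include files_dir, unlike
    # os.listdir, which returns bare file names.
    return sorted(p for p in Path(files_dir).glob("*.xlsx") if p.is_file())


# Demonstration in a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bongaigaon 500.xls.xlsx").touch()
    (Path(tmp) / "notes.txt").touch()
    found = list_xlsx(tmp)
    print([p.name for p in found])
```

Each returned path can be passed straight to pd.ExcelFile, and the "*.xlsx" pattern also filters out unrelated files in the folder.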
