I have 4 Excel files: 'a1.xlsx', 'a2.xlsx', 'a3.xlsx', 'a4.xlsx'. All of the files have the same format.

For example, a1.xlsx looks like:

id    code    name
1      100    abc
2      200    zxc
...    ...    ...

I have to read these files into pandas DataFrames and check whether the same value of the code column exists in multiple Excel files.

Something like this:

if code=100 exists in 'a1.xlsx' and 'a3.xlsx', and code=200 exists only in 'a1.xlsx',

the final dataframe should look like:

code    filename
100   a1.xlsx,a3.xlsx
200   a1.xlsx
...   ....
and so on

I have all the files in a directory and tried to iterate over them in a loop:

import pandas as pd
import os
x = next(os.walk('path/to/files/'))[2]  #list all files in directory
os.chdir('path/to/files/')

for fname in x:
    df = pd.read_excel(fname)

How should I proceed? Any leads?

1 Answer

Use:

import glob
import os

import pandas as pd

# get all filenames
files = glob.glob('path/to/files/*.xlsx')
# read each file and record its filename (with extension) in a new column
dfs = [pd.read_excel(fp).assign(filename=os.path.basename(fp)) for fp in files]
# one big df from the list of dfs
df = pd.concat(dfs, ignore_index=True)
# join the filenames for each code
df1 = df.groupby('code')['filename'].apply(','.join).reset_index()
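
To try the grouping step without any Excel files on disk, here is a minimal sketch that uses small in-memory frames as stand-ins for the workbooks. The data in `frames` is made up for illustration (only the a1.xlsx rows come from the question), and the `drop_duplicates` call is an extra step that assumes a code may repeat inside one file; skip it if that cannot happen in your data.

```python
import pandas as pd

# Hypothetical stand-ins for the frames that pd.read_excel would return
frames = {
    "a1.xlsx": pd.DataFrame({"id": [1, 2], "code": [100, 200], "name": ["abc", "zxc"]}),
    "a2.xlsx": pd.DataFrame({"id": [1], "code": [300], "name": ["qwe"]}),
    "a3.xlsx": pd.DataFrame({"id": [1, 2], "code": [100, 100], "name": ["abc", "abc"]}),
}

# Tag each frame with its source filename, then stack them into one frame
df = pd.concat(
    [frame.assign(filename=name) for name, frame in frames.items()],
    ignore_index=True,
)

# Keep one row per (code, filename) pair so a file is listed once per code
pairs = df[["code", "filename"]].drop_duplicates()

# One row per code, with the filenames joined into a single string
result = pairs.groupby("code")["filename"].agg(",".join).reset_index()

# Codes that appear in more than one file
counts = pairs.groupby("code")["filename"].nunique()
shared = counts[counts > 1].index.tolist()

print(result)
print(shared)
```

Here `result` has one row per code, and `shared` lists only the codes found in at least two files, which may be closer to what you need if the goal is just to detect overlaps.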