Add column to pandas dataframe with partial file name while importing many files

Question

I use this to make a giant dataframe from many files in a directory:

import glob
import os

import pandas as pd

path = r'C:\\Users\\me\\data\\'
all_files = glob.glob(os.path.join(path, "*"))

df_from_each_file = (pd.read_csv(f, sep='\t') for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

The files in that path have names like

AAA.etc.etc.
AAA.etc.etc
BBB.etc.etc.

As I import each file, I want to add a column to the dataframe that has AAA or BBB next to all the rows imported from that file, like this:

col1  col2  col3
data1 data2 AAA
data3 data4 AAA
data1 data2 AAA
data3 data4 AAA
data1 data2 BBB
data3 data4 BBB

To clarify, the value comes from the name of each file as it is imported: when I call .read_csv on a file, before concatenating, I want to add a column holding the partial filename.
– Liquidity, commented Jan 3, 2019 at 23:45

gold_cy · Accepted Answer · 2019-01-03 23:52:32Z

2

This is one way to do it:

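from pathlib import PureWindowsPath

import pandas as pd

def fn_helper(fn):
    """Read one tab-separated file and tag its rows with the filename prefix."""
    df = pd.read_csv(fn, sep='\t')
    p = PureWindowsPath(fn)
    # 'AAA.etc.etc' -> 'AAA'
    part = p.name.split('.')[0]
    df['col3'] = part
    return df

df_from_each_file = (fn_helper(f) for f in all_files)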
...

Or, as the other answers show, as a one-liner:

(pd.read_csv(f, sep='\t').assign(col3=PureWindowsPath(f).name.split('.')[0]) for f in all_files)
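To see what the PureWindowsPath(...).name.split('.')[0] expression does on its own, here is a minimal sketch. The sample paths and the prefix helper name are made up for illustration. PureWindowsPath parses backslash-separated paths on any operating system, so this should also run outside Windows.

```python
from pathlib import PureWindowsPath


def prefix(path_str):
    """Return the part of the file name before the first dot."""
    # .name drops the directory part; split('.')[0] keeps the text before the first dot
    return PureWindowsPath(path_str).name.split('.')[0]


# Hypothetical paths shaped like the ones in the question
sample_paths = [
    r'C:\Users\me\data\AAA.etc.etc',
    r'C:\Users\me\data\BBB.etc.etc',
]
prefixes = [prefix(p) for p in sample_paths]
print(prefixes)
```

Keeping the extraction in a small function like this makes it easy to check the prefix logic separately from the file reading.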

BENY · Accepted Answer · 2019-01-04 01:07:15Z

1

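You can use the keys argument of pd.concat together with reset_index:

key = [PureWindowsPath(i).name.split('.', 1)[0] for i in all_files]
# ignore_index=True would discard the keys, so renumber the rows afterwards instead
concatdf = pd.concat(df_from_each_file, keys=key, names=['col3', None]).reset_index(level=0).reset_index(drop=True)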
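As a rough sketch of how keys and reset_index fit together, the example below uses two small in-memory frames in place of the files. The frame contents and the key values are invented for illustration, not taken from real data.

```python
import pandas as pd

# Two small stand-in frames; in the question these would come from pd.read_csv
frames = [
    pd.DataFrame({'col1': ['data1', 'data3'], 'col2': ['data2', 'data4']}),
    pd.DataFrame({'col1': ['data1', 'data3'], 'col2': ['data2', 'data4']}),
]
keys = ['AAA', 'BBB']  # hypothetical filename prefixes, one per frame

# keys builds an outer index level; naming it lets reset_index turn it into a column
tagged = (
    pd.concat(frames, keys=keys, names=['col3', None])
    .reset_index(level='col3')   # move the key level into a regular column
    .reset_index(drop=True)      # replace the leftover per-file row numbers with 0..n-1
)
print(tagged)
```

Note that with this approach col3 ends up as the first column rather than the last, and the list of keys has to be in the same order as the frames.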

edited Jan 4, 2019 at 1:07

answered Jan 3, 2019 at 23:47

3 Comments

gold_cy Over a year ago

this won't work since each filename has the whole path included in it

BENY Over a year ago

@aws_apprentice check the update, I borrowed your PureWindowsPath

BENY Over a year ago

@Liquidity check the update, this should be slightly faster than the for loop

Joe Patten · Accepted Answer · 2019-01-04 00:16:35Z

0

I usually change the current working directory to the path:

import os
os.chdir(path)

You can then set col3 to whichever part of the filename you want by using assign.

df_from_each_file = (pd.read_csv(f, sep='\t').assign(col3=f.split('.')[0]) for f in all_files)

So your code would look like:

os.chdir(path)
all_files = glob.glob('*')

df_from_each_file = (pd.read_csv(f, sep='\t').assign(col3=f.split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

If you don't want to change the current working directory, build the glob pattern from path and use os.path.basename(f) to strip the directory from each filename. Your code would then look like:

all_files = glob.glob(os.path.join(path, '*'))
df_from_each_file = (pd.read_csv(f, sep='\t').assign(col3=os.path.basename(f).split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
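For a self-contained way to try the basename variant, the sketch below writes two tab-separated files into a temporary directory and then reads them back. The file contents and the use of tempfile are assumptions made only so the example can run anywhere.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway directory with two tab-separated files (made-up contents)
with tempfile.TemporaryDirectory() as path:
    for name in ('AAA.etc.etc', 'BBB.etc.etc'):
        with open(os.path.join(path, name), 'w') as fh:
            fh.write('col1\tcol2\ndata1\tdata2\ndata3\tdata4\n')

    # sorted() makes the row order deterministic; glob order is otherwise arbitrary
    all_files = sorted(glob.glob(os.path.join(path, '*')))

    # basename strips the directory, split('.')[0] keeps the prefix before the first dot
    df_from_each_file = (
        pd.read_csv(f, sep='\t').assign(col3=os.path.basename(f).split('.')[0])
        for f in all_files
    )
    # The generator is consumed here, while the temporary files still exist
    concatdf = pd.concat(df_from_each_file, ignore_index=True)

print(concatdf)
```

Because the pattern is built with os.path.join, the working directory never changes and col3 still holds only the prefix.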

edited Jan 4, 2019 at 0:16

answered Jan 3, 2019 at 23:51

2 Comments

Liquidity Over a year ago

Using f.split('.') truncates the file name but keeps the leading path, so the column is C:\\Users\\me\\data\\AAA instead of just AAA.

Joe Patten Over a year ago

Oh, I see. I usually use os.chdir(path) to change the current working directory to the path. I will update my answer a bit.
