
I use this to make a giant dataframe from many files in a directory:

import glob
import os
import pandas as pd

path = r'C:\\Users\\me\\data\\'
all_files = glob.glob(os.path.join(path, "*"))

df_from_each_file = (pd.read_csv(f, sep='\t') for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

The files in that path have names like

AAA.etc.etc.
AAA.etc.etc
BBB.etc.etc.

As I import each file, I want to add a column to the dataframe that has AAA or BBB next to all the rows imported from that file, like this:

col1  col2  col3
data1 data2 AAA
data3 data4 AAA
data1 data2 AAA
data3 data4 AAA
data1 data2 BBB
data3 data4 BBB
  • What is the rule for deciding whether to put AAA or BBB? Commented Jan 3, 2019 at 23:43
  • It's the name of the file, as it's imported. As I call .read_csv for each file, before concatenating, I want to add a column that holds the first part of the filename. Commented Jan 3, 2019 at 23:45

3 Answers

This is one way to do it:

from pathlib import PureWindowsPath

def fn_helper(fn):
    df = pd.read_csv(fn, sep='\t')
    p = PureWindowsPath(fn)
    part = p.name.split('.')[0]
    df['col3'] = part
    return df

df_from_each_file = (fn_helper(f) for f in all_files)
...

Or as other people are showing with one-liners:

(pd.read_csv(f, sep='\t').assign(col3=PureWindowsPath(f).name.split('.')[0]) for f in all_files)
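For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea. The function names `read_with_prefix` and `concat_directory` are made up for illustration, the demo files are throwaway ones, and it uses `pathlib.Path` on native paths rather than `PureWindowsPath`, so treat it as one possible arrangement rather than the definitive version.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_with_prefix(path, sep="\t"):
    """Read one delimited file and tag every row with the filename prefix."""
    prefix = Path(path).name.split(".")[0]  # 'AAA.etc.etc' -> 'AAA'
    return pd.read_csv(path, sep=sep).assign(col3=prefix)


def concat_directory(directory, pattern="*"):
    """Concatenate every matching file in `directory`, in sorted name order."""
    files = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    return pd.concat((read_with_prefix(p) for p in files), ignore_index=True)


# Demo on throwaway files (names and contents are invented for the example).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "AAA.etc.etc").write_text("col1\tcol2\ndata1\tdata2\n")
    Path(tmp, "BBB.etc.etc").write_text("col1\tcol2\ndata3\tdata4\n")
    demo = concat_directory(tmp)

print(demo)
```

Note that `pd.concat` raises a `ValueError` if the directory has no matching files, so you may want to check for that first.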

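You can also do this with keys + reset_index. Note that ignore_index=True discards the keys, so leave it out and reset the index afterwards:

key = [PureWindowsPath(i).name.split('.', 1)[0] for i in all_files]
# keys become the outer index level (named col3), which is then moved into a column
concatdf = (pd.concat(df_from_each_file, keys=key, names=['col3'])
              .reset_index(level=0)
              .reset_index(drop=True))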
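One thing to watch with the keys approach is how it interacts with `ignore_index`. The sketch below uses small in-memory frames (the data and the `labels` list are invented for the example) to show that passing `ignore_index=True` produces a plain 0..n-1 index with no key level, while leaving it out keeps the keys as an index level that can be turned into a column.

```python
import pandas as pd

frames = [
    pd.DataFrame({"col1": ["data1", "data3"], "col2": ["data2", "data4"]}),
    pd.DataFrame({"col1": ["data1"], "col2": ["data2"]}),
]
labels = ["AAA", "BBB"]  # stand-ins for the filename prefixes

# With ignore_index=True the result gets a fresh integer index; the keys are dropped.
flat = pd.concat(frames, keys=labels, ignore_index=True)

# Without it, the keys form the outer index level, here named 'col3'.
keyed = pd.concat(frames, keys=labels, names=["col3"])

# Move that level into a regular column, then renumber the rows.
tagged = keyed.reset_index(level="col3").reset_index(drop=True)

print(tagged)
```

With this approach the new column ends up first rather than last; reorder the columns afterwards if that matters.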

3 Comments

This won't work, since each filename has the whole path included in it.
@aws_apprentice check the update, I borrowed your PureWindowsPath.
@Liquidity check the update, this should be slightly faster than the for loop.

I usually change the current working directory to the path:

import os
os.chdir(path)

You can assign col3 to be the part of the filename you wish by using assign.

df_from_each_file = (pd.read_csv(f, sep='\t').assign(col3=f.split('.')[0]) for f in all_files)

So your code would look like:

os.chdir(path)
all_files = glob.glob('*')

df_from_each_file = (pd.read_csv(f, sep='\t').assign(col3=f.split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

If you don't want to change the current working directory, you can keep the full paths and use os.path.basename(f) to strip the directory from each one, so your code would look like:

all_files = glob.glob(os.path.join(path, '*'))
df_from_each_file = (pd.read_csv(f, sep='\t').assign(col3=os.path.basename(f).split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
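To see why the basename step matters when the glob results carry the full path, here is a small sketch. The path is a made-up example, and it uses `ntpath` (the Windows flavour of `os.path`, importable on any platform) so the behaviour is the same wherever you run it.

```python
import ntpath

full_path = r"C:\Users\me\data\AAA.etc.etc"  # example path, not a real file

# Splitting the full path on '.' keeps the directory part in front of the prefix.
prefix_from_full = full_path.split(".")[0]

# Taking the basename first leaves only the filename to split.
prefix = ntpath.basename(full_path).split(".")[0]

print(prefix_from_full)  # C:\Users\me\data\AAA
print(prefix)            # AAA
```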

2 Comments

Using f.split('.') truncates the file, but includes the path before, so the column is C:\\Users\\me\\data\\AAA instead of just AAA.
Oh, I see. I usually use os.chdir(path) to change the current working directory to the path. I will update my answer a bit.
