How to import multiple csvs, assign variable and concatenate into one DataFrame with Pandas concat?

Question

I would like to optimize the code below. It works, but I would like suggestions on whether it can be done more concisely and efficiently.

import os
import glob
import pandas as pd

files = glob.glob(os.path.join('data', '*.csv'))

dfs = []

for file in files:
    variable = os.path.basename(file).split("_")[0]  # filename prefix before the first underscore
    df = pd.read_csv(file)
    df['variable'] = variable  # assign variable
    dfs.append(df)

finalDf = pd.concat(dfs, ignore_index=True)

Any ideas? Thank you in advance.

Pandas 0.21.1 and Python 3.6.5

It looks good to me

– kosnik, Commented Jun 8, 2018 at 16:14

jpp · Accepted Answer · 2018-06-08 16:20:16Z

The structure of your code is perfectly fine. Concatenating a list of dataframes is more efficient than repeatedly appending to an existing dataframe.

Set dtype

What you can try to optimize is reading the CSV file, i.e. df = pd.read_csv(file). My only suggestion is to specify the dtype parameter with a dictionary mapping column names to types. In particular, if you have columns with categorical data, map them to 'category' to reduce memory usage.
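A minimal sketch of the dtype suggestion, assuming a CSV with hypothetical columns city, count, and price (none of these names come from the question). The in-memory StringIO stands in for a real file:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for one of the CSV files.
csv_text = "city,count,price\nParis,3,1.5\nLyon,5,2.0\nParis,2,0.5\n"

# Map column names to types; 'category' stores repeated strings compactly.
dtypes = {"city": "category", "count": "int32", "price": "float32"}

df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(df.dtypes)
```

One caveat worth checking in your own data: if the files contain different sets of categories, pd.concat falls back to object dtype for that column, so the memory saving may need to be reapplied after concatenation.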

List comprehension + assign

You mention more concise code. You can use pd.DataFrame.assign to add a new column holding the filename prefix. In addition, you can use a list comprehension:

dfs = [pd.read_csv(file).assign(variable=os.path.basename(file).split('_')[0])
       for file in glob.glob(os.path.join('data', '*.csv'))]

finalDf = pd.concat(dfs, ignore_index=True)

If you choose this method, you may lose readability, so document what you are doing.
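One way to keep the one-liner readable is to move the per-file step into a small documented function. This is only a sketch: the helper names read_with_prefix and load_all are hypothetical, the demonstration files are throwaway examples, and sorting the paths is an addition here to make the row order deterministic.

```python
import glob
import os
import tempfile

import pandas as pd


def read_with_prefix(path):
    """Read one CSV and add a 'variable' column holding the filename prefix.

    The prefix is the part of the base filename before the first underscore,
    e.g. 'temp_2018.csv' -> 'temp'.
    """
    prefix = os.path.basename(path).split("_")[0]
    return pd.read_csv(path).assign(variable=prefix)


def load_all(folder):
    """Concatenate every CSV in `folder` into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    return pd.concat([read_with_prefix(p) for p in paths], ignore_index=True)


# Small demonstration with two throwaway files (hypothetical names and data).
with tempfile.TemporaryDirectory() as folder:
    for name in ("temp_2018.csv", "rain_2018.csv"):
        with open(os.path.join(folder, name), "w") as f:
            f.write("value\n1\n2\n")
    final_df = load_all(folder)

print(final_df)
```

With your layout, the call would be load_all('data').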

1 Comment

Francisco Over a year ago

Thank you. I wasn't aware assign could be chained like that and wrapped in a list comprehension. Also, thanks for the dtype mapping parameter for optimisation.
