
I am trying to read in a folder of CSV files, process them one by one to remove duplicates, and then add them to a master dataframe, which will finally be output to a CSV. I have this...

import pandas as pd
import os
import sys

output = pd.DataFrame(columns=['col1', 'col2'])

for root, dirs, files in os.walk("sourcefolder", topdown=False):

    for name in files:

        data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
        output.append(data)

output.to_csv("output.csv", index=False, encoding='utf8')

But my output CSV is empty apart from the column names. Anyone any idea where I am going wrong?

  • If your folder is correct, try: output = output.append(data) Commented Aug 23, 2020 at 22:34
  • @DeepSpace If you were talking about a list you'd be correct, but output is a dataframe, so the result is the two frames added together. Commented Aug 23, 2020 at 22:38
  • Don't append to dataframes; use concat, merge, or update, and think of it like a database. If you instead set output = [] and append each frame to it, you can then call pd.concat(output). Commented Aug 23, 2020 at 22:39

2 Answers


Pandas dataframes don't act like lists, so you can't use append like that. DataFrame.append returns a new dataframe rather than modifying the original in place, so you need to assign the result back. Try:

import pandas as pd
import os
import sys

output = pd.DataFrame(columns=['col1', 'col2'])

for root, dirs, files in os.walk("sourcefolder", topdown=False):

    for name in files:

        data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
        output = output.append(data)  # note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0

output.to_csv("output.csv", index=False, encoding='utf8')

Alternatively, you can make output a list of dataframes and use pd.concat to build a consolidated dataframe at the end. Depending on the volume of data, this can be more efficient, since it avoids copying the accumulated frame on every iteration.
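That list-and-concat pattern might look like the sketch below. The temporary folder and sample files are made up for illustration (they stand in for "sourcefolder"), and the usecols/skiprows options from the question are omitted for clarity:

```python
import os
import tempfile
import pandas as pd

# Create a few sample CSVs in a temporary folder (stands in for "sourcefolder")
tmpdir = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({"col1": [i, i + 10], "col2": ["a", "b"]}).to_csv(
        os.path.join(tmpdir, f"file{i}.csv"), index=False
    )

# Collect each file's frame in a plain Python list...
frames = []
for root, dirs, files in os.walk(tmpdir):
    for name in files:
        frames.append(pd.read_csv(os.path.join(root, name)))

# ...then concatenate once at the end, instead of growing a dataframe per file
output = pd.concat(frames, ignore_index=True)
output.to_csv(os.path.join(tmpdir, "output.csv"), index=False, encoding="utf8")
```

Appending to a list is O(1), whereas each DataFrame.append copied the whole accumulated frame, so this scales much better over many files.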




The built-in pandas method concat is also pretty good. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat

import pandas as pd
import os
import sys

output = pd.DataFrame(columns=['col1', 'col2'])

for root, dirs, files in os.walk("sourcefolder", topdown=False):

    for name in files:

        data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
        output = pd.concat([output, data], ignore_index=True)

output.to_csv("output.csv", index=False, encoding='utf8')
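Since the question also mentions removing duplicates, a minimal sketch of deduplicating after concatenation, using drop_duplicates (the two sample frames below are made up, with one overlapping row):

```python
import pandas as pd

# Two frames with one overlapping row, as might come from separate CSV files
a = pd.DataFrame({"col1": [1, 2], "col2": ["x", "y"]})
b = pd.DataFrame({"col1": [2, 3], "col2": ["y", "z"]})

combined = pd.concat([a, b], ignore_index=True)

# drop_duplicates keeps the first occurrence of each identical row
deduped = combined.drop_duplicates().reset_index(drop=True)
```

You can also call drop_duplicates on each per-file frame before concatenating, but a single pass at the end also catches rows duplicated across files.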


