
I am trying to read in a folder of CSV files, process them one by one to remove duplicates, and then add them to a master dataframe, which will finally be output to a CSV. I have this...

import pandas as pd
import os
import sys

output = pd.DataFrame(columns=['col1', 'col2'])

for root, dirs, files in os.walk("sourcefolder", topdown=False):

    for name in files:

        data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
        output.append(data)

output.to_csv("output.csv", index=False, encoding='utf8')

But my output CSV is empty apart from the column names. Anyone any idea where I am going wrong?

  • If your folder is correct, try: output = output.append(data) Commented Aug 23, 2020 at 22:34
  • @DeepSpace If you were talking about a list you'd be correct, but output is a dataframe, so the result is the two frames added together. Commented Aug 23, 2020 at 22:38
  • Don't append to dataframes; use concat, merge, or update, and think of it like a database. If you instead set output = [] and append each frame to it, you can then call pd.concat(output). Commented Aug 23, 2020 at 22:39

2 Answers


Pandas dataframes don't act like lists, so you can't use append like that. DataFrame.append returns a new dataframe rather than modifying the original in place, so you need to assign the result back. Try:

import pandas as pd
import os
import sys

output = pd.DataFrame(columns=['col1', 'col2'])

for root, dirs, files in os.walk("sourcefolder", topdown=False):

    for name in files:

        data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
        output = output.append(data)  # note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0

output.to_csv("output.csv", index=False, encoding='utf8')

Alternatively, you can make output a list of dataframes and use pd.concat to build a consolidated dataframe at the end. Depending on the volume of data, this can be more efficient, since it avoids copying the accumulated frame on every iteration.
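That list-and-concat pattern might look like the sketch below. The temporary folder and sample files are made up for illustration (they stand in for "sourcefolder"), and the usecols/skiprows options from the question are omitted for clarity:

```python
import os
import tempfile
import pandas as pd

# Create a few sample CSVs in a temporary folder (stands in for "sourcefolder")
tmpdir = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({"col1": [i, i + 10], "col2": ["a", "b"]}).to_csv(
        os.path.join(tmpdir, f"file{i}.csv"), index=False
    )

# Collect each file's frame in a plain Python list...
frames = []
for root, dirs, files in os.walk(tmpdir):
    for name in files:
        frames.append(pd.read_csv(os.path.join(root, name)))

# ...then concatenate once at the end, instead of growing a dataframe per file
output = pd.concat(frames, ignore_index=True)
output.to_csv(os.path.join(tmpdir, "output.csv"), index=False, encoding="utf8")
```

Appending to a list is O(1), whereas each DataFrame.append copied the whole accumulated frame, so this scales much better over many files.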




The built-in pandas method concat is also pretty good. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat

import pandas as pd
import os
import sys

output = pd.DataFrame(columns=['col1', 'col2'])

for root, dirs, files in os.walk("sourcefolder", topdown=False):

    for name in files:

        data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
        output = pd.concat([output, data], ignore_index=True)

output.to_csv("output.csv", index=False, encoding='utf8')
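Since the question also mentions removing duplicates, a minimal sketch of deduplicating after concatenation, using drop_duplicates (the two sample frames below are made up, with one overlapping row):

```python
import pandas as pd

# Two frames with one overlapping row, as might come from separate CSV files
a = pd.DataFrame({"col1": [1, 2], "col2": ["x", "y"]})
b = pd.DataFrame({"col1": [2, 3], "col2": ["y", "z"]})

combined = pd.concat([a, b], ignore_index=True)

# drop_duplicates keeps the first occurrence of each identical row
deduped = combined.drop_duplicates().reset_index(drop=True)
```

You can also call drop_duplicates on each per-file frame before concatenating, but a single pass at the end also catches rows duplicated across files.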


