I have n files in a directory that I need to combine into one. They all have the same number of columns. For example, the contents of test1.csv are:

test1,test1,test1  
test1,test1,test1  
test1,test1,test1  

Similarly, the contents of test2.csv are:

test2,test2,test2  
test2,test2,test2  
test2,test2,test2  

I want final.csv to look like this:

test1,test1,test1  
test1,test1,test1  
test1,test1,test1  
test2,test2,test2  
test2,test2,test2  
test2,test2,test2  

But instead it comes out like this:

test file 1,test file 1.1,test file 1.2,test file 2,test file 2.1,test file 2.2  
,,,test file 2,test file 2,test file 2  
,,,test file 2,test file 2,test file 2  
test file 1,test file 1,test file 1,,,  
test file 1,test file 1,test file 1,,,  

Can someone help me figure out what is going on here? I have pasted my code below:

import csv
import glob
import pandas as pd
import numpy as np 

all_data = pd.DataFrame() #initializes DF which will hold aggregated csv files

for f in glob.glob("*.csv"): #for all csv files in pwd
    df = pd.read_csv(f) #create dataframe for reading current csv
    all_data = all_data.append(df) #appends current csv to final DF

all_data.to_csv("final.csv", index=None)
  • Why are you using pandas just to create a single csv? Commented Dec 12, 2015 at 18:29
  • I'm a noob and I thought this was the best way to do it. :/ Commented Dec 12, 2015 at 21:28

3 Answers


I think there are several problems:

  1. I removed import csv and import numpy as np, because they are not used in this demo (add them back if the rest of your script needs them).
  2. I created a list dfs of all the dataframes, appending each one with dfs.append(df), and then used the function concat to join this list into the final dataframe.
  3. In the function read_csv I added the parameter header=None, because the main problem was that read_csv reads the first row of each file as a header.
  4. In the function to_csv I added the parameter header=None to omit the header.
  5. I wrote the final file to a subfolder test, because if it stayed in the working directory, glob.glob("*.csv") would pick up the output file as an input file on the next run.
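To see point 3 in isolation, here is a quick sketch (using an in-memory StringIO in place of a real file, my own stand-in) of what read_csv does to a header-less file with and without header=None. The mangled duplicate names test1.1, test1.2 are exactly the pattern visible in the broken final.csv:

```python
import pandas as pd
from io import StringIO

# Three data rows, no header -- mimics test1.csv from the question
data = "test1,test1,test1\ntest1,test1,test1\ntest1,test1,test1\n"

# Default behaviour: the first row is consumed as the header, and the
# duplicate names are mangled to test1, test1.1, test1.2
df_default = pd.read_csv(StringIO(data))
print(list(df_default.columns))  # ['test1', 'test1.1', 'test1.2']
print(len(df_default))           # 2 -- one row was lost to the header

# With header=None every row is kept as data
df_fixed = pd.read_csv(StringIO(data), header=None)
print(len(df_fixed))             # 3
```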

Solution:

import glob
import pandas as pd

#list of all df
dfs = []
for f in glob.glob("*.csv"): #for all csv files in pwd
    #add parameters to read_csv
    df = pd.read_csv(f, header=None) #create dataframe for reading current csv
    #print df
    dfs.append(df) #appends current csv to final DF
all_data = pd.concat(dfs, ignore_index=True)
print(all_data)
#       0      1      2
#0  test1  test1  test1
#1  test1  test1  test1
#2  test1  test1  test1
#3  test2  test2  test2
#4  test2  test2  test2
#5  test2  test2  test2
all_data.to_csv("test/final.csv", index=None, header=None)

The next solution is similar. I added the parameter header=None to both read_csv and to_csv, and the parameter ignore_index=True to append.

import glob
import pandas as pd

all_data = pd.DataFrame() #initializes DF which will hold aggregated csv files

for f in glob.glob("*.csv"): #for all csv files in pwd
    df = pd.read_csv(f, header=None) #create dataframe for reading current csv
    all_data = all_data.append(df, ignore_index=True) #appends current csv to final DF
print(all_data)
#       0      1      2
#0  test1  test1  test1
#1  test1  test1  test1
#2  test1  test1  test1
#3  test2  test2  test2
#4  test2  test2  test2
#5  test2  test2  test2

all_data.to_csv("test/final.csv", index=None, header=None)

1 Comment

I think pandas is a very good library for data processing, so you can try it. And if you are new to Stack Overflow, you can check this.

You can use concat. Let df1 be your first dataframe and df2 the second; then:

df = pd.concat([df1, df2], ignore_index=True)

The ignore_index parameter is optional; set it to True if you don't need to keep the original indexes of the individual dataframes.
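A minimal sketch of the difference, using two small frames standing in for the parsed test1.csv and test2.csv:

```python
import pandas as pd

# Two 3x3 frames of repeated values, like the files in the question
df1 = pd.DataFrame([["test1"] * 3] * 3)
df2 = pd.DataFrame([["test2"] * 3] * 3)

# Without ignore_index the original row labels repeat
kept = pd.concat([df1, df2])
print(list(kept.index))   # [0, 1, 2, 0, 1, 2]

# With ignore_index=True the result is renumbered 0..5
fresh = pd.concat([df1, df2], ignore_index=True)
print(list(fresh.index))  # [0, 1, 2, 3, 4, 5]
```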

4 Comments

This will work if you pass "axis=0" as a parameter.
@hahdawg thanks for pointing it out. Actually, 0 is the default value for axis in concat.
@JackBauer you're welcome. Please consider accepting one of the two answers received to help other users.
I have limited experience with this stuff so it will take me some time to go through it all but I definitely will.

pandas is not the tool to use when all you want is to create a single csv file; you can simply write each csv to a new file as you go:

import glob

with open("out.csv","w") as out:
    for fle in glob.glob("*.csv"):
        with open(fle) as f:
            out.writelines(f)

Or with the csv lib if you prefer:

import glob
import csv

with open("out.csv", "w") as out:
    wr = csv.writer(out)
    for fle in glob.glob("*.csv"):
        with open(fle) as f:
            wr.writerows(csv.reader(f))  

Creating a large dataframe just to eventually write it back to disk makes no real sense; furthermore, if you had a lot of large files it might not even be possible.
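One caveat with the line-copying approach: it assumes the files have no header row, as in the question. If each file did start with a header, a small variation can keep the first header and skip the rest. This is my own hypothetical helper (the name merge_csvs is not from the answer):

```python
import glob
from itertools import islice

def merge_csvs(pattern, out_path):
    """Concatenate all files matching pattern, keeping only the
    first file's header row (hypothetical helper)."""
    with open(out_path, "w") as out:
        for i, fle in enumerate(sorted(glob.glob(pattern))):
            with open(fle) as f:
                # islice(f, 1, None) skips the header line of every
                # file after the first
                out.writelines(f if i == 0 else islice(f, 1, None))
```

Writing the output outside the glob pattern (different folder or extension) avoids the self-read pitfall mentioned in the accepted answer.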

1 Comment

No worries. pandas is a great tool if you actually want to do some computation on the data, but it is not the tool to use just to concat a few files into one.
