2

I need to create a .csv file and append subsets of multiple dataframes into it.

All the dataframes are structured identically; however, I need to create the output data set with headers and then append all the subsequent dataframes without headers.

I know I could just create the output file using the headers from the first data frame and then do an append loop with no headers from there, but I'd really like to learn how to do this in a more efficient way.

import glob

import numpy as np
import pandas as pd

path = '/Desktop/NYC TAXI/Green/*.csv'
allFiles = glob.glob(path)

for file in allFiles:
    # Skip rows 1 and 2 and read only the first 20 columns
    df = pd.read_csv(file, skiprows=[1,2], usecols=np.arange(20))
    metsdf = df.loc[df['Stadium_Code'] == 2]
    yankdf = df.loc[df['Stadium_Code'] == 1]
    # header=False (capital F): Python's boolean literal is False, not false
    with open('greenyankeetaxi.csv','a') as yankeetaxi:
        yankdf.to_csv(yankeetaxi, header=False)
    with open('greenmetstaxi.csv','a') as metstaxi:
        metsdf.to_csv(metstaxi, header=False)
    print(file + " done")

2 Answers

3

An efficient way to append multiple subsets of a dataframe to one large file with a single header is the following:

    import os

    for df in dataframes:
        if not os.path.isfile(filename):
            # first write: the file does not exist yet, so include the column names
            df.to_csv(filename, header=True, index=False)
        else:  # else it exists so append without writing the header
            df.to_csv(filename, mode='a', header=False, index=False)

The code above writes the file with a header the first time through the loop. On every later iteration the file already exists, so the dataframe is appended without a header.

You can use this pattern in any scenario where you need to append multiple dataframes to the same file while writing the header only once.
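As a rough sketch of how that check could be wrapped up for reuse: the helper name `append_frame` and the sample columns and values below are made up for illustration, and the output goes to a temporary directory rather than the real taxi data. It also treats an empty existing file as new, which is an extra assumption on top of the answer's check.

```python
import os
import tempfile

import pandas as pd


def append_frame(df, filename):
    """Append df to filename; write the header only if the file is missing or empty."""
    write_header = not os.path.isfile(filename) or os.path.getsize(filename) == 0
    df.to_csv(filename, mode="a", header=write_header, index=False)


# Hypothetical sample data standing in for the per-file subsets.
frames = [
    pd.DataFrame({"Stadium_Code": [1, 1], "fare": [10.0, 12.5]}),
    pd.DataFrame({"Stadium_Code": [1], "fare": [7.0]}),
]

out_path = os.path.join(tempfile.mkdtemp(), "greenyankeetaxi.csv")
for frame in frames:
    append_frame(frame, out_path)

# Reading the file back should give one header row and three data rows.
combined = pd.read_csv(out_path)
```

One thing to watch: because the decision depends on whether the file exists, an output file left over from an earlier run will be appended to without a new header, so delete it first if you want a fresh file.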

2

To do it efficiently, you can combine the data with one of the pandas merge, join, or concatenate operations so that you end up with two complete dataframes (yankdf and metsdf), then write each one to CSV using to_csv as you have been doing.


Current data

Here we have two dataframes, one from each file:

First dataframe df

   a  b  c
0  1  2  3
1  4  5  6

Second dataframe df2

   a   b   c
0  7   6   8
1  9  10  11

Using append (pd.concat)

df = pd.concat([df, df2])  # on pandas < 2.0 this was df.append(df2); DataFrame.append was removed in 2.0

The above line will result in a single df which can be written to file:

   a   b   c
0  1   2   3
1  4   5   6
0  7   6   8
1  9  10  11

In short:

  • Loop through the files in the directory
  • Add each file's data to the dataframe by appending instead of re-assigning every time
  • Write the single dataframe to file once
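The steps above could be sketched roughly as follows for the two stadium subsets. The small in-memory frames stand in for what `pd.read_csv` would return for each file, and the column `a` and its values are invented for illustration. Collecting the pieces in lists and calling `pd.concat` once is a swapped-in variant of appending inside the loop.

```python
import pandas as pd

# Two small frames standing in for the data read from two files.
per_file_frames = [
    pd.DataFrame({"Stadium_Code": [1, 2], "a": [1, 4]}),
    pd.DataFrame({"Stadium_Code": [2, 1], "a": [7, 9]}),
]

yank_parts, mets_parts = [], []
for df in per_file_frames:
    # Keep each file's subset; nothing is written inside the loop.
    yank_parts.append(df.loc[df["Stadium_Code"] == 1])
    mets_parts.append(df.loc[df["Stadium_Code"] == 2])

# One concatenation per output, then one write each (header written once).
yankdf = pd.concat(yank_parts, ignore_index=True)
metsdf = pd.concat(mets_parts, ignore_index=True)

# With no path argument, to_csv returns the CSV text instead of writing a file.
yank_csv = yankdf.to_csv(index=False)
mets_csv = metsdf.to_csv(index=False)
```

In real use you would pass a filename to `to_csv` instead of capturing the string. This holds all matching rows in memory at once, so it only suits cases where the combined subsets fit in RAM.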

5 Comments

That definitely helps efficiency wise, but I was more stuck on how to import the headers from the first iteration of the loop and then only the data from there on
Having one dataframe will take care of that for you. The goal is to minimize the loops.
sorry... maybe I'm confused, but I'm sorta new to python... The directory I'm looking in has about 20 files in it, so the loop has to happen for each of those larger files, both of which create two unique data frames (mets and yankee). So instead of having 40 writes, there would be 20, but I still think I would run into the issue of the headers.
No worries, you're only looping to read the files then as you're reading you append to dataframe. Once all the files are done being read and you have a single df, then write to csv without loops. How big are all the files combined?
oh, I thought it was performing all of those steps for each loop... so would recreate a new mets and yankees data frame every loop and overwrite what I had. I'll try implementing what you said and see what happens. The files are about 1.5M-6M lines each
