
I have split up a CSV file into many smaller ones using code from here (scroll down to see the full code): https://dzone.com/articles/splitting-csv-files-in-python

The files have been successfully split up with their structure preserved, but the headers have disappeared. I suspect something is off with the parameters in the pd.read_csv() call.

Please take a look at this:

Input file:

    Text Header    tag
0    textbody1    Y
1    textbody2    N
2    textbody2    Y

Outcome (the structure is still there, but my headers are gone in the split-up CSV files):

0    textbody1    Y
1    textbody2    N
2    textbody2    Y

Please see the full script below:

    import pandas as pd

    # CSV file name to be read in
    in_csv = 'iii_baiterEmailTagged.csv'

    # get the number of lines of the CSV file to be read
    number_lines = sum(1 for row in open(in_csv))

    # number of rows of data to write to each output CSV;
    # you can change the row size according to your needs
    rowsize = 10000

    # start looping through the data, writing each chunk to a new file
    for i in range(1, number_lines, rowsize):

        df = pd.read_csv(in_csv,
              header=None,
              nrows=rowsize,   # number of rows to read on each loop
              skiprows=i)      # skip rows that have already been read

        # CSV to write data to a new file with an indexed name: Enronset1.csv etc.
        out_csv = 'Enronset' + str(i) + '.csv'

        df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',            # append data to the CSV file
              chunksize=rowsize)   # size of data to append on each loop

Thanks

Comments:

  • Remove header=False in to_csv. (Jul 22, 2021 at 8:29)
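A minimal sketch of the change the comment suggests, using the question's own names (in_csv and rowsize come from the script above):

    import pandas as pd

    in_csv = 'iii_baiterEmailTagged.csv'
    rowsize = 10000

    # read the first chunk; the header is inferred from the file's first line
    df = pd.read_csv(in_csv, nrows=rowsize)

    # with header=False removed, to_csv writes the column names by default
    df.to_csv('Enronset0.csv', index=False)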

5 Answers


You are skipping the first row in your for loop (1 instead of 0):

for i in range(1,number_lines,rowsize):

and you are explicitly telling pandas that there is no header when reading (simply omit the argument):

pd.read_csv(...,header=None)

and not to write one (replace False with True):

df.to_csv(..., header=False, ...)

One more subtlety: with skiprows=i, every chunk after the first treats its first data row as the header. Skipping only the data rows that have already been read (skiprows=range(1, i + 1)) keeps the real header row in every chunk.

Here is the fully working code:

import pandas as pd

# CSV file name to be read in
in_csv = 'iii_baiterEmailTagged.csv'

# get the number of lines of the CSV file to be read
number_lines = sum(1 for row in open(in_csv))

# number of rows of data to write to each output CSV;
# you can change the row size according to your needs
rowsize = 10000

# loop over the data rows (line 0 is the header, so there are number_lines - 1 of them)
for i in range(0, number_lines - 1, rowsize):

    df = pd.read_csv(in_csv,
          nrows=rowsize,              # number of rows to read on each loop
          skiprows=range(1, i + 1))   # skip the data rows already read, but keep the header row

    # CSV to write data to a new file with an indexed name: Enronset0.csv etc.
    out_csv = 'Enronset' + str(i) + '.csv'

    df.to_csv(out_csv,
          index=False,
          header=True,         # write the column names to each output file
          mode='a',            # append data to the CSV file
          chunksize=rowsize)   # size of data to append on each loop
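As a side note, here is a minimal alternative sketch (assuming the same input file name): read_csv's chunksize parameter returns an iterator of DataFrames, so pandas parses the header once and every chunk carries the original column names without any skiprows arithmetic:

    import pandas as pd

    in_csv = 'iii_baiterEmailTagged.csv'
    rowsize = 10000

    # chunksize makes read_csv return an iterator of DataFrames;
    # the header is parsed once and shared by every chunk
    for n, chunk in enumerate(pd.read_csv(in_csv, chunksize=rowsize)):
        # header=True is the default, so each output file gets the column names
        chunk.to_csv(f'Enronset{n}.csv', index=False)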


You can slice your DataFrame using iloc[]. The code below builds a DataFrame of 1,000 rows and splits it into 100-row CSVs with headers.

import numpy as np
import pandas as pd

# build a demo DataFrame of 1,000 rows
df = pd.DataFrame({
    "id": range(1000),
    "value": np.random.uniform(1, 5, 1000),
    "cat": np.random.choice(list("ABCD"), 1000),
})

# write each 100-row slice to its own CSV; the column names come along for free
for s in range(0, len(df), 100):
    df.iloc[s:s+100].to_csv(f"SO_{s//100}.csv", index=False)
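Applied to the question's file, the same pattern looks like this (a sketch, assuming the whole file fits in memory; 'iii_baiterEmailTagged.csv' is the question's input name):

    import pandas as pd

    # the header is read once, up front, and kept by every slice
    df = pd.read_csv('iii_baiterEmailTagged.csv')

    # each 100-row slice keeps the column names, so every output file gets the header
    for s in range(0, len(df), 100):
        df.iloc[s:s+100].to_csv(f'Enronset_{s//100}.csv', index=False)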



This worked like a charm for me. I am splitting big files with the header included in each split file:

import pandas as pd

# CSV file name to be read in
in_csv = 'asd.csv'

# get the number of lines of the CSV file to be read
number_lines = sum(1 for row in open(in_csv))

# number of rows of data to write to each output CSV;
# you can change the row size according to your needs
rowsize = 5000
header = ['Year', 'Versions', 'Periods', 'ref3', 'ref2', 'ref1', 'Value']

# start looping through the data, writing each chunk to a new file
for i in range(0, number_lines, rowsize):

    df = pd.read_csv(in_csv,
          sep=';',
          nrows=rowsize,   # number of rows to read on each loop
          skiprows=i)      # skip rows that have already been read
    df.columns = header

    # CSV to write data to a new file with an indexed name: Enronset0.csv etc.
    out_csv = 'Enronset' + str(i) + '.csv'

    df.to_csv(out_csv,
          index=False,
          header=True,
          mode='a',            # append data to the CSV file
          chunksize=rowsize)   # size of data to append on each loop
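A hedged variant of the same idea: if the source file has no header row of its own (which assigning the names manually suggests), passing header=None stops read_csv from consuming the first row of each chunk as an inferred header:

    import pandas as pd

    in_csv = 'asd.csv'  # same input file as above
    rowsize = 5000
    header = ['Year', 'Versions', 'Periods', 'ref3', 'ref2', 'ref1', 'Value']

    number_lines = sum(1 for row in open(in_csv))

    for i in range(0, number_lines, rowsize):
        # header=None keeps every row of the chunk as data
        df = pd.read_csv(in_csv, sep=';', header=None,
                         nrows=rowsize, skiprows=i)
        df.columns = header
        df.to_csv('Enronset' + str(i) + '.csv', index=False, header=True)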

Comments:

  • Add some explanations and justifications to the answer.

This code works for me.



    import pandas as pd
    import random

    ## Provide the file name with path, for example: "C:\Users\xxxxx\flights.csv"
    split_source_file = input("File Name with absolute Path? : ")

    ## find the number of lines using pandas
    pd_dataframe = pd.read_csv(split_source_file, header=0, encoding='latin_1')
    number_of_rows = len(pd_dataframe.index) + 1  # +1 accounts for the header line
    the_header = pd_dataframe.columns.tolist()

    print(number_of_rows)

    ## In case of an equal split, provide the same number for min and max
    min_rows = int(input("Minimum Number of rows per file? : "))
    max_rows = int(input("Maximum Number of rows per file? : "))

    file_increment = 1
    skip_rows = 1  # start at 1 so the header line is skipped when reading chunks

    ## random size for the first file
    number_of_rows_perfile = random.randint(min_rows, max_rows)

    while True:
        if number_of_rows_perfile <= 0:
            break

        ## read a chunk of the CSV, skipping the lines already consumed
        df = pd.read_csv(split_source_file, header=None, nrows=number_of_rows_perfile,
                         skiprows=skip_rows, encoding='latin_1', lineterminator='\n')

        ## Thanks to this Gist: https://gist.github.com/smram/d6ded3c9028272360eb65bcab564a18a
        ## handle both escaped and literal tabs, newlines and carriage returns
        df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["", ""], regex=True, inplace=True)

        ## target file name
        split_target_file = split_source_file[:-4] + "_" + str(file_increment) + ".csv"

        ## write to CSV, re-attaching the original header to every file
        df.to_csv(split_target_file, index=False, header=the_header, mode='a',
                  chunksize=number_of_rows_perfile)

        file_increment += 1
        skip_rows += number_of_rows_perfile

        ## last-file handler: once all lines are consumed, the next size is <= 0 and the loop ends
        if skip_rows >= number_of_rows:
            number_of_rows_perfile = number_of_rows - skip_rows
        else:
            number_of_rows_perfile = random.randint(min_rows, max_rows)




A small addition to this script that helped me iterate through a folder of CSV files using pathlib:

import pandas as pd
from pathlib import Path

folder = "C:\\Path\\to\\Folder"

# loop over every CSV file in the folder
for file in Path(folder).glob('*.csv'):
    in_csv = file

    # get the number of lines of the CSV file to be read
    number_lines = sum(1 for row in open(in_csv))

    # number of rows of data to write to each output CSV;
    # you can change the row size according to your needs
    rowsize = 48000
    header = ['Column_01', 'Column_02', 'Column_03', 'Column_04']

    # start looping through the data, writing each chunk to a new file
    for i in range(0, number_lines, rowsize):

        df = pd.read_csv(in_csv,
            nrows=rowsize,   # number of rows to read on each loop
            skiprows=i)      # skip rows that have already been read
        df.columns = header  # adds the headers

        # output file name: the original path plus the row offset, e.g. data.csv0.csv
        out_csv = str(in_csv) + str(i) + '.csv'

        df.to_csv(out_csv,
            index=False,
            header=True,
            mode='a',            # append data to the CSV file
            chunksize=rowsize)   # size of data to append on each loop

