
I have split up a CSV file into many smaller ones using code from here (scroll down to see the full code): https://dzone.com/articles/splitting-csv-files-in-python

The files have been successfully split up with their structure preserved, but the headers have disappeared. I suspect something is off with the parameters in the pd.read_csv() call.

Please take a look at this:

Input file:

    Text Header    tag
0    textbody1    Y
1    textbody2    N
2    textbody2    Y

Outcome (the structure is still there, but my headers are gone in the split-up CSV files):

0    textbody1    Y
1    textbody2    N
2    textbody2    Y

Please see the full script below:

    import pandas as pd

    # CSV file name to be read in
    in_csv = 'iii_baiterEmailTagged.csv'

    # get the number of lines of the CSV file to be read
    number_lines = sum(1 for row in open(in_csv))

    # number of rows of data to write to each output CSV;
    # you can change the row size according to your needs
    rowsize = 10000

    # start looping through the data, writing each chunk to a new file
    for i in range(1, number_lines, rowsize):

        df = pd.read_csv(in_csv,
              header=None,
              nrows=rowsize,   # number of rows to read on each loop
              skiprows=i)      # skip rows that have already been read

        # CSV to write data to a new file with an indexed name: Enronset1.csv etc.
        out_csv = 'Enronset' + str(i) + '.csv'

        df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',            # append data to the CSV file
              chunksize=rowsize)   # size of data to append on each loop

Thanks

Comments:

  • Remove header=False in to_csv. (Jul 22, 2021 at 8:29)
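A minimal sketch of the change the comment suggests, using the question's own names (in_csv and rowsize come from the script above):

    import pandas as pd

    in_csv = 'iii_baiterEmailTagged.csv'
    rowsize = 10000

    # read the first chunk; the header is inferred from the file's first line
    df = pd.read_csv(in_csv, nrows=rowsize)

    # with header=False removed, to_csv writes the column names by default
    df.to_csv('Enronset0.csv', index=False)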

5 Answers


You are skipping the first row in your for loop (1 instead of 0):

for i in range(1,number_lines,rowsize):

and you are explicitly telling pandas that there is no header when reading (simply omit the argument):

pd.read_csv(...,header=None)

and not to write one (replace False with True):

df.to_csv(..., header=False, ...)

One more subtlety: with skiprows=i, every chunk after the first treats its first data row as the header. Skipping only the data rows that have already been read (skiprows=range(1, i + 1)) keeps the real header row in every chunk.

Here is the fully working code:

import pandas as pd

# CSV file name to be read in
in_csv = 'iii_baiterEmailTagged.csv'

# get the number of lines of the CSV file to be read
number_lines = sum(1 for row in open(in_csv))

# number of rows of data to write to each output CSV;
# you can change the row size according to your needs
rowsize = 10000

# loop over the data rows (line 0 is the header, so there are number_lines - 1 of them)
for i in range(0, number_lines - 1, rowsize):

    df = pd.read_csv(in_csv,
          nrows=rowsize,              # number of rows to read on each loop
          skiprows=range(1, i + 1))   # skip the data rows already read, but keep the header row

    # CSV to write data to a new file with an indexed name: Enronset0.csv etc.
    out_csv = 'Enronset' + str(i) + '.csv'

    df.to_csv(out_csv,
          index=False,
          header=True,         # write the column names to each output file
          mode='a',            # append data to the CSV file
          chunksize=rowsize)   # size of data to append on each loop
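As a side note, here is a minimal alternative sketch (assuming the same input file name): read_csv's chunksize parameter returns an iterator of DataFrames, so pandas parses the header once and every chunk carries the original column names without any skiprows arithmetic:

    import pandas as pd

    in_csv = 'iii_baiterEmailTagged.csv'
    rowsize = 10000

    # chunksize makes read_csv return an iterator of DataFrames;
    # the header is parsed once and shared by every chunk
    for n, chunk in enumerate(pd.read_csv(in_csv, chunksize=rowsize)):
        # header=True is the default, so each output file gets the column names
        chunk.to_csv(f'Enronset{n}.csv', index=False)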


You can slice your DataFrame using iloc[]. The code below builds a DataFrame of 1,000 rows and splits it into 100-row CSVs with headers.

import numpy as np
import pandas as pd

# build a demo DataFrame of 1,000 rows
df = pd.DataFrame({
    "id": range(1000),
    "value": np.random.uniform(1, 5, 1000),
    "cat": np.random.choice(list("ABCD"), 1000),
})

# write each 100-row slice to its own CSV; the column names come along for free
for s in range(0, len(df), 100):
    df.iloc[s:s+100].to_csv(f"SO_{s//100}.csv", index=False)
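Applied to the question's file, the same pattern looks like this (a sketch, assuming the whole file fits in memory; 'iii_baiterEmailTagged.csv' is the question's input name):

    import pandas as pd

    # the header is read once, up front, and kept by every slice
    df = pd.read_csv('iii_baiterEmailTagged.csv')

    # each 100-row slice keeps the column names, so every output file gets the header
    for s in range(0, len(df), 100):
        df.iloc[s:s+100].to_csv(f'Enronset_{s//100}.csv', index=False)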



This worked like a charm for me. I am splitting big files with the header included in each split file:

import pandas as pd

# CSV file name to be read in
in_csv = 'asd.csv'

# get the number of lines of the CSV file to be read
number_lines = sum(1 for row in open(in_csv))

# number of rows of data to write to each output CSV;
# you can change the row size according to your needs
rowsize = 5000
header = ['Year', 'Versions', 'Periods', 'ref3', 'ref2', 'ref1', 'Value']

# start looping through the data, writing each chunk to a new file
for i in range(0, number_lines, rowsize):

    df = pd.read_csv(in_csv,
          sep=';',
          nrows=rowsize,   # number of rows to read on each loop
          skiprows=i)      # skip rows that have already been read
    df.columns = header

    # CSV to write data to a new file with an indexed name: Enronset0.csv etc.
    out_csv = 'Enronset' + str(i) + '.csv'

    df.to_csv(out_csv,
          index=False,
          header=True,
          mode='a',            # append data to the CSV file
          chunksize=rowsize)   # size of data to append on each loop
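A hedged variant of the same idea: if the source file has no header row of its own (which assigning the names manually suggests), passing header=None stops read_csv from consuming the first row of each chunk as an inferred header:

    import pandas as pd

    in_csv = 'asd.csv'  # same input file as above
    rowsize = 5000
    header = ['Year', 'Versions', 'Periods', 'ref3', 'ref2', 'ref1', 'Value']

    number_lines = sum(1 for row in open(in_csv))

    for i in range(0, number_lines, rowsize):
        # header=None keeps every row of the chunk as data
        df = pd.read_csv(in_csv, sep=';', header=None,
                         nrows=rowsize, skiprows=i)
        df.columns = header
        df.to_csv('Enronset' + str(i) + '.csv', index=False, header=True)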

Comments:

  • Add some explanations and justifications to the answer.

This code works for me.



    import pandas as pd
    import random

    ## Provide the file name with path, for example: "C:\Users\xxxxx\flights.csv"
    split_source_file = input("File Name with absolute Path? : ")

    ## find the number of lines using pandas
    pd_dataframe = pd.read_csv(split_source_file, header=0, encoding='latin_1')
    number_of_rows = len(pd_dataframe.index) + 1  # +1 accounts for the header line
    the_header = pd_dataframe.columns.tolist()

    print(number_of_rows)

    ## In case of an equal split, provide the same number for min and max
    min_rows = int(input("Minimum Number of rows per file? : "))
    max_rows = int(input("Maximum Number of rows per file? : "))

    file_increment = 1
    skip_rows = 1  # start at 1 so the header line is skipped when reading chunks

    ## random size for the first file
    number_of_rows_perfile = random.randint(min_rows, max_rows)

    while True:
        if number_of_rows_perfile <= 0:
            break

        ## read a chunk of the CSV, skipping the lines already consumed
        df = pd.read_csv(split_source_file, header=None, nrows=number_of_rows_perfile,
                         skiprows=skip_rows, encoding='latin_1', lineterminator='\n')

        ## Thanks to this Gist: https://gist.github.com/smram/d6ded3c9028272360eb65bcab564a18a
        ## handle both escaped and literal tabs, newlines and carriage returns
        df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["", ""], regex=True, inplace=True)

        ## target file name
        split_target_file = split_source_file[:-4] + "_" + str(file_increment) + ".csv"

        ## write to CSV, re-attaching the original header to every file
        df.to_csv(split_target_file, index=False, header=the_header, mode='a',
                  chunksize=number_of_rows_perfile)

        file_increment += 1
        skip_rows += number_of_rows_perfile

        ## last-file handler: once all lines are consumed, the next size is <= 0 and the loop ends
        if skip_rows >= number_of_rows:
            number_of_rows_perfile = number_of_rows - skip_rows
        else:
            number_of_rows_perfile = random.randint(min_rows, max_rows)




A small addition to this script that helped me iterate through a folder of CSV files using pathlib:

import pandas as pd
from pathlib import Path

folder = "C:\\Path\\to\\Folder"

# loop over every CSV file in the folder
for file in Path(folder).glob('*.csv'):
    in_csv = file

    # get the number of lines of the CSV file to be read
    number_lines = sum(1 for row in open(in_csv))

    # number of rows of data to write to each output CSV;
    # you can change the row size according to your needs
    rowsize = 48000
    header = ['Column_01', 'Column_02', 'Column_03', 'Column_04']

    # start looping through the data, writing each chunk to a new file
    for i in range(0, number_lines, rowsize):

        df = pd.read_csv(in_csv,
            nrows=rowsize,   # number of rows to read on each loop
            skiprows=i)      # skip rows that have already been read
        df.columns = header  # adds the headers

        # output file name: the original path plus the row offset, e.g. data.csv0.csv
        out_csv = str(in_csv) + str(i) + '.csv'

        df.to_csv(out_csv,
            index=False,
            header=True,
            mode='a',            # append data to the CSV file
            chunksize=rowsize)   # size of data to append on each loop

