1

I have been attempting to combine a large number of csv files (46 files, about 86 MB each about 1 million rows each) using the windows command line with command:

copy *.csv output.csv

This has worked for me in the past but apparently with this dataset it is not working. Upon completion the dataset produces several "inconsistent number of columns detected" errors. They always occur at the same location (for instance, row 12944). When looking at the rows with errors, it appears to be cutting off the first couple of columns and shifting the data left, causing the errors in that row but does not appear to affect the data below it. Strange.

The issue is if I just use say, 3 or less files, there is no error at location 12944 nor does there appear to be an issue with the data if I inspect it via the individual files.

Also tried using a python script to do a similar thing:

import glob
import pandas as pd

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')

This was impossible to run with all 46 files as I run out of RAM but even trying with 4 files gives similar errors to the command line output.

It's almost as if handling more than 3 files is causing an issue when combining them but I have never seen this error occur. I am totally stumped. Any ideas?

Edit: output when using more than 3 files

6
  • This question might help you with the python issue. That way you don't have to store the full df in memory, but can simply append your files to the "complete" file : stackoverflow.com/questions/17530542/… Commented Jan 17, 2021 at 22:26
  • I would advise that something like this would seem safer, copy *.csv output.txt, then ren output.txt output.csv than copy *.csv output.csv. Commented Jan 17, 2021 at 23:44
  • I've uploaded a screencap of the output so you can see what I am talking about. Again this only happens when I use more than 3 files. It doesn't make any sense to me. I had to sanitize the output but I think you can see what I mean. It randomly decides to cut off columns from the row and shifts the data over. Commented Jan 18, 2021 at 0:11
  • I tried that doing txt first than changing to csv but the error persists. Commented Jan 18, 2021 at 0:12
  • Could be that one of the files has a byte order mark (BOM) at the start of the file. Simply combining files without removing the BOM when necessary could be the issue. Use utf-8-sig for reading the files in Python and it wil remove it, assuming your files are encoded in UTF-8 to begin with... Commented Jan 18, 2021 at 0:16

1 Answer 1

1

To reduce the amount of memory used, you could read a small chunk of the file at a time. This is what shutil.copyfileobj() does for you:

import glob
import shutil

with open('combined_csv.csv', 'wb') as output_fp:
    for file_name in glob.glob('*.csv'):
        with open(file_name, 'rb') as fp:
            shutil.copyfileobj(fp, output_fp)

It's possible depending on the file endings that this will cause corruption when combining, in which case adding something like output_fp.write(b'\n') might help.

Sign up to request clarification or add additional context in comments.

5 Comments

That might help but the issue is even using 4 files to reduce memory as a test, it still gives me an error at that row in the output file. So it's not the size of the dataset
Could you try running this and give the specific error? I don't quite understand your description of the error
I've uploaded a screencap of the output so you can see what I am talking about. Again this only happens when I use more than 3 files. It doesn't make any sense to me. I had to sanitize the output but I think you can see what I mean. It randomly decides to cut off columns from the row and shifts the data over.
Ran your script and it still throws that same error at line 12944 with 4 or more files. With 3 files, no error. Dumb founded.
Two things I can think of is the newline tip I gave or your input has issues (but I'm assuming you've checked this). The only real thing common to all these methods are the inputs. Maybe there's an issue in the 4th file? It might not even be visible, for example mixing different line endings

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.