
Attempting to upload a bunch of CSVs to a database. The CSVs are not necessarily always separated by a comma, so I used a regular expression separator to ensure the correct delimiters are matched. I then added

error_bad_lines=False

in order to handle

CParserError: Error tokenizing data. C error: Expected 3 fields in line 127, saw 4

which resulted in me getting this error instead:

ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'error_bad_lines' to be ignored as it is not supported by the 'python' engine. 

for the following code. Is there a workaround?

import psycopg2
import pandas as pd
import sqlalchemy as sa

csvList = []
tableList = []
filenames = find_csv_filenames(directory)  # helper defined elsewhere in my script
for name in filenames:
    lhs, rhs = str(name).split(".", 1)  # table name = filename without extension
    print name
    dataRaw = pd.read_csv(name, sep=";|,", chunksize=5000000, error_bad_lines=False)
    for chunk in dataRaw:
        chunk.to_sql(name=str(lhs), if_exists='append', con=con)
  • What does your data look like? If your fields aren't always separated by commas, it's not really CSV. You may be able to hack something together, but if even using a regex separator doesn't allow you to consistently extract the fields, it sounds like you may be getting beyond what a CSV parser will handle. Commented Dec 18, 2015 at 17:02
  • The fields are separated by commas and semicolons as far as I know. I can manually go into each file and upload one at a time, but then I have defeated the purpose of programming. Commented Dec 18, 2015 at 17:04
  • Could you change these files? If yes, you could preprocess your files and change ; to , with Python re.sub or Linux sed, for example. Commented Dec 18, 2015 at 21:50
  • I suppose I could do that. Some of the files have problems with loading into memory; they have 30 columns and 55 million rows, that kind of thing, and it seems to blow up my 32 GB of RAM pretty quickly. I'll look into re.sub. Commented Dec 18, 2015 at 21:54
  • You could do it line by line and create another, clean file (if you have enough storage to store it); see the sketch below. Commented Dec 18, 2015 at 22:53
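
A minimal sketch of the line-by-line cleanup suggested above (raw.csv and clean.csv are placeholder names): only one line is held in memory at a time, so it works for arbitrarily large files.

with open('raw.csv') as src, open('clean.csv', 'w') as dst:
    for line in src:
        # normalize the delimiter while streaming through the file
        dst.write(line.replace(';', ','))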

2 Answers


As per the pandas read_csv documentation (Pandas-link), if the separator is more than one character you need to set the engine parameter to 'python'.

Try this:

dataRaw = pd.read_csv(name, sep=";|,", engine='python', chunksize=5000000,
                      error_bad_lines=False)
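
For readers on newer pandas: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0 in favor of on_bad_lines, which the python engine does support, so an equivalent call today would be:

# pandas >= 1.3: on_bad_lines replaces error_bad_lines/warn_bad_lines
dataRaw = pd.read_csv(name, sep=";|,", engine='python',
                      chunksize=5000000, on_bad_lines='skip')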

1 Comment

  • I don't understand why this was downvoted. Good answer.

If you can preprocess your files, try changing the ; separator to , to produce a clean CSV file. You can do it in place with fileinput:

import fileinput

# inplace=True redirects stdout into the file, so print() writes the
# replacement back; end='' avoids doubling each line's trailing newline
with fileinput.FileInput('your_file', inplace=True) as f:
    for line in f:
        print(line.replace(';', ','), end='')

Then you could use read_csv with the c engine and the error_bad_lines parameter, or you could also preprocess the bad lines away with that same loop.
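
For example (a sketch, reusing the placeholder file name from above), once the file is comma-only the default c engine applies and error_bad_lines is honored:

# a single-character separator keeps the fast default c engine,
# and error_bad_lines=False skips any remaining malformed rows
dataRaw = pd.read_csv('your_file', sep=',', chunksize=5000000,
                      error_bad_lines=False)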

Note: if you want to keep a backup of your original file, you can use the backup parameter of FileInput.
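
For instance, the same loop with a backup kept (the original is saved as your_file.bak):

import fileinput

# backup='.bak' copies the original aside before the in-place rewrite
with fileinput.FileInput('your_file', inplace=True, backup='.bak') as f:
    for line in f:
        print(line.replace(';', ','), end='')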

