NULL in CSV file, what encoding to use [duplicate]

Question

I have the following to extract data from a CSV into an SQLite db:

    elif file.endswith('.csv'):
    read_path = os.path.join(dir_read, file)
    with open(read_path,'r',encoding='utf-8') as fin:
        dr = csv.DictReader(fin)
        to_db = [(i['InvoiceNumber'],i['InvoiceType'], i['ChargeType'],i['SupplierID'], i['Net_Amount']) for i in dr]
    
    c.executemany("INSERT INTO mf (InvoiceNumber, InvoiceType, ChargeType, SupplierID, Net_Amount) VALUES (?, ?, ?, ?, ?) ;", to_db)
    con.commit()

This works for most of the files I am importing (200 files), but there are a couple with a NUL in one of the fields, this is not one of the fields in the to_db. I can see this in Notepad++, one row has a null all the others have an empty string. The error I get is:

_csv.Error: line contains NULL byte

I have tried different encoding (utf-16, le, be), but then they fail on all the other csv files.

Is there a way to remove these NULL values and replace with a blank string when importing?

This has nothing to do with encodings. Whoever created that file decided to emit a NUL for missing values instead of an empty string or the NULL string. You'll have to handle this explicitly — Panagiotis Kanavos
– Panagiotis Kanavos, Commented Dec 10, 2020 at 11:59
BTW you aren't extracting data from a CSV into SQLite, you're importing it. Something that can be done from the command line or the .import SQLite command. You can even handle that file as a CSV virtual table. — Panagiotis Kanavos
– Panagiotis Kanavos, Commented Dec 10, 2020 at 12:01

Dharman · Accepted Answer · 2020-12-10 12:08:38Z

0

You can use codec library to handle null byte error.

import csv
import codecs
csv_reader = csv.reader(codecs.open(read_path, 'rU', 'utf-16'))

edited Dec 10, 2020 at 12:08

Dharman♦

33.9k27 gold badges106 silver badges157 bronze badges

answered Dec 10, 2020 at 12:03

Caner Burc BASKAYA

1765 silver badges13 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Panagiotis Kanavos Over a year ago

How is this going to fix the problem? NUL is valid in UTF8 and the OP already tried different encodings, including utf-16. The file is definitely not UTF16, otherwise all English characters, including commas, would have a leading 0x00

Caner Burc BASKAYA Over a year ago

Hi Panagiotis, thanks for your feedback. Actually the thing that supposed to work is the codec.open() function. And you may be right about the encoding because he says none of them worked. We are not 100% sure about the encoding part since we do not have the file itself to try. But codecs.open method may sometimes help some problematic issues, such as null byte error. You may have a look at here if you are interested docs.python.org/3/library/codecs.html

Panagiotis Kanavos Over a year ago

The null byte isn't an error in UTF8 or any other encoding. open already handles encodings and the page you linked to says

While the builtin open() and the associated io module are the recommended approach for working with encoded text files, this module provides additional utility functions and classes that allow the use of a wider range of codecs when working with binary files

. That's not the case with codecs.open, in fact there are plans to deprecate it

Panagiotis Kanavos Over a year ago

Check Difference between open and codecs.open in Python. Deprecation hasn't happened yet for compatibility reasons, not because there are missing features

Collectives™ on Stack Overflow

NULL in CSV file, what encoding to use [duplicate]

1 Answer 1

4 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Linked

Related