0

I have the following to extract data from a CSV into an SQLite db:

    elif file.endswith('.csv'):
    read_path = os.path.join(dir_read, file)
    with open(read_path,'r',encoding='utf-8') as fin:
        dr = csv.DictReader(fin)
        to_db = [(i['InvoiceNumber'],i['InvoiceType'], i['ChargeType'],i['SupplierID'], i['Net_Amount']) for i in dr]
    
    c.executemany("INSERT INTO mf (InvoiceNumber, InvoiceType, ChargeType, SupplierID, Net_Amount) VALUES (?, ?, ?, ?, ?) ;", to_db)
    con.commit()

This works for most of the files I am importing (200 files), but there are a couple with a NUL in one of the fields, this is not one of the fields in the to_db. I can see this in Notepad++, one row has a null all the others have an empty string. The error I get is:

_csv.Error: line contains NULL byte

I have tried different encoding (utf-16, le, be), but then they fail on all the other csv files.

Is there a way to remove these NULL values and replace with a blank string when importing?

3
  • This has nothing to do with encodings. Whoever created that file decided to emit a NUL for missing values instead of an empty string or the NULL string. You'll have to handle this explicitly Commented Dec 10, 2020 at 11:59
  • BTW you aren't extracting data from a CSV into SQLite, you're importing it. Something that can be done from the command line or the .import SQLite command. You can even handle that file as a CSV virtual table. Commented Dec 10, 2020 at 12:01
  • BTW NUL is valid in UTF8 Commented Dec 10, 2020 at 12:14

1 Answer 1

0

You can use codec library to handle null byte error.

import csv
import codecs
csv_reader = csv.reader(codecs.open(read_path, 'rU', 'utf-16'))
Sign up to request clarification or add additional context in comments.

4 Comments

How is this going to fix the problem? NUL is valid in UTF8 and the OP already tried different encodings, including utf-16. The file is definitely not UTF16, otherwise all English characters, including commas, would have a leading 0x00
Hi Panagiotis, thanks for your feedback. Actually the thing that supposed to work is the codec.open() function. And you may be right about the encoding because he says none of them worked. We are not 100% sure about the encoding part since we do not have the file itself to try. But codecs.open method may sometimes help some problematic issues, such as null byte error. You may have a look at here if you are interested docs.python.org/3/library/codecs.html
The null byte isn't an error in UTF8 or any other encoding. open already handles encodings and the page you linked to says While the builtin open() and the associated io module are the recommended approach for working with encoded text files, this module provides additional utility functions and classes that allow the use of a wider range of codecs when working with binary files. That's not the case with codecs.open, in fact there are plans to deprecate it
Check Difference between open and codecs.open in Python. Deprecation hasn't happened yet for compatibility reasons, not because there are missing features

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.