
I'm trying to load a CSV file into a DuckDB table from Python, but the load keeps failing because one of the integer columns uses '\N' to represent null values. I can't figure out how to deal with this; I'd like to convert those values on the way in.

It would be great if I could ignore the error and insert the rest of the record into the table, but that doesn't seem to be possible.

Any help would be much appreciated.

So the following code

con.sql("INSERT INTO getNBBOtimes SELECT * FROM read_csv_auto('G:/temp/timeexport.csv')")

results in the following error

InvalidInputException                     Traceback (most recent call last)
<timed eval> in <module>

InvalidInputException: Invalid Input Error: Could not convert string '\N' to INT64 in column "column3", at line 856438.

Parser options:
  file=G:/temp/timeexport.csv
  delimiter=',' (auto detected)
  quote='"' (auto detected)
  escape='"' (auto detected)
  header=0 (auto detected)
  sample_size=20480
  ignore_errors=0
  all_varchar=0.

Consider either increasing the sample size (SAMPLE_SIZE=X [X rows] or SAMPLE_SIZE=-1 [all rows]), or skipping column conversion (ALL_VARCHAR=1)

I figured I would try to handle the '\N' values on the way in, but nothing seems to work:

con.sql("CREATE TABLE test1 as seLECT NULLIF(column1,'\\N') , NULLIF(column2,'\\N'),NULLIF(column3,'\\N'),NULLIF(column4,'\\N'),NULLIF(column2,'\\N') FROM read_csv_auto('G:/temp/timeexport.csv')")

returns the following error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 46-47: malformed \N character escape

I tried this

con.sql("CREATE TABLE test1 as seLECT NULLIF(column1,repr('\\N')) , NULLIF(column2,repr('\\N')),NULLIF(column3,repr('\\N')),NULLIF(column4,(repr'\\N')),NULLIF(column2,repr('\\N')) FROM read_csv_auto('G:/temp/timeexport.csv')")

and got this error

CatalogException: Catalog Error: Scalar Function with name repr does not exist!
Did you mean "exp"?

1 Answer

You haven't provided any sample data, so let's assume you're starting with:

id,hours_worked
1,8
2,\N
3,10
4,\N

We start by creating our target table:

>>> import duckdb
>>> con = duckdb.connect()
>>> con.sql('create table getnbbotimes (id int, hours_worked int64)')

We can use DuckDB's if() function to turn '\N' into NULL as the file is read:

>>> con.sql("INSERT INTO getNBBOtimes SELECT id,if(hours_worked == '\\N',NULL,hours_worked) FROM read_csv_auto('timeexport.csv')")

Which gets us:

>>> con.sql('select * from getnbbotimes')
┌───────┬──────────────┐
│  id   │ hours_worked │
│ int32 │    int64     │
├───────┼──────────────┤
│     1 │            8 │
│     2 │         NULL │
│     3 │           10 │
│     4 │         NULL │
└───────┴──────────────┘

...which is what I think you were after.


You can make your solution using NULLIF work if you're willing to treat all columns as VARCHAR:

>>> con.sql("""CREATE TABLE test1 as select NULLIF(id,'\\N')
... as id, NULLIF(hours_worked,'\\N') as hours_worked
... FROM read_csv_auto('timeexport.csv', all_varchar=1)""")

Which gets us:

>>> con.sql('select * from test1')
┌─────────┬──────────────┐
│   id    │ hours_worked │
│ varchar │   varchar    │
├─────────┼──────────────┤
│ 1       │ 8            │
│ 2       │ NULL         │
│ 3       │ 10           │
│ 4       │ NULL         │
└─────────┴──────────────┘

You could then use a second select to convert those varchar values to int64.
