
I'm using pandas and exporting data as JSON like this:

import pandas as pd
df = pd.DataFrame({'a': ['Têst']})

df.to_json(orient='records', lines=True)
> u'{"a":"T\\u00east"}'

This makes sense, since ê is the Unicode character U+00EA: to_json writes it as the JSON escape \u00ea, and the backslash appears doubled in the repr of the Python string.

But then I import the JSON strings into Postgres with COPY:

import cStringIO

buffer = cStringIO.StringIO()
buffer.write(df.to_json(orient='records', lines=True))
buffer.seek(0)

with connection.cursor() as cursor:
    cursor.copy_expert(sql="""
    COPY tmp (json_data) FROM STDIN WITH NULL AS '' ENCODING 'UTF8';
    """, file=buffer)

The problem is that the result in the database ends up being

{"a": "Tu00east"}

and, as you can see, the backslash is gone.

I tried using CSV as the COPY format, but it mangles the data since some fields contain commas, and setting the ESCAPE character and DELIMITER to something else always seemed to cause failures.

The table column has a jsonb type. I read in the docs that PG rejects Unicode escapes for non-ASCII characters (above U+007F) unless the database encoding is UTF8, which it is in my case, so that shouldn't be an issue.

I'm trying to figure out why the backslash is removed here, and how to import into Postgres while preserving the escaped JSON.
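For what it's worth, I can reproduce the single literal backslash without Postgres involved (a minimal sketch; json.dumps stands in for to_json here, since both emit non-ASCII characters as \uXXXX escapes). Doubling every backslash before writing to the buffer would let COPY's text format undo the doubling and store the JSON intact:

```python
import json

# json.dumps stands in for to_json: both emit non-ASCII characters as
# \uXXXX escapes, so the line contains one literal backslash.
line = json.dumps({"a": "T\u00east"})

# COPY's text format treats that backslash as its own escape character
# and consumes it; doubling each backslash first survives the round trip.
escaped = line.replace("\\", "\\\\")
```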

1 Answer


Use the csv option for COPY, with DELIMITER e'\x01' QUOTE e'\x02'. I'm not sure whether this works for all possible valid JSON, but I've never had it fail.

$ psql -X testdb -c 'create table t(d jsonb)'
CREATE TABLE
$ cat foo.json
{"a":"Têst"}
$ cat foo.json | psql -X testdb -c "COPY t from stdin csv delimiter e'\x01' quote e'\x02'" 
COPY 1
$ psql -X testdb -c 'select * from t';                                                    
       d       
---------------
 {"a": "Têst"}
(1 row)
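The same COPY options can be driven from Python via psycopg2's copy_expert (a sketch under assumptions: a psycopg2 connection, a pandas DataFrame, and the table t(d jsonb) from the demo above):

```python
import io

# Same trick as the psql demo: CSV format with control characters as
# delimiter and quote, so the escaped JSON passes through untouched.
copy_sql = (
    "COPY t FROM STDIN WITH "
    "(FORMAT csv, DELIMITER e'\\x01', QUOTE e'\\x02')"
)

def load_json_lines(conn, df):
    """Stream df.to_json(orient='records', lines=True) into t
    (assumes a psycopg2 connection and a pandas DataFrame)."""
    buf = io.StringIO(df.to_json(orient='records', lines=True))
    with conn.cursor() as cur:
        cur.copy_expert(copy_sql, file=buf)
    conn.commit()
```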

4 Comments

Ok, interesting. Can you explain why \x01 and \x02? I guess just anything that's not going to appear in the data, right?
Seems like the trick with CSV is to avoid any common character. I was trying to set the quote, escape, etc. to characters I already had, but I see the trick now: instead, PG needs to get the strings as-is, since they're already escaped and encoded correctly.
What about encoding support at import?
This is black magic, but it works! It also works with multiple fields in your CSV; in Python: csv.writer(csvfile, delimiter=chr(1), escapechar="'", quotechar=chr(2))
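The commenter's multi-column variant can be sketched like this (the row values are illustrative; with the default minimal quoting and control-character delimiters, the writer never needs to quote or escape the JSON text):

```python
import csv
import io

# Control characters as delimiter/quote mean the JSON field is never
# quoted or escaped, so its backslashes survive verbatim.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=chr(1), quotechar=chr(2),
                    escapechar="'")
writer.writerow(["42", '{"a":"T\\u00east"}'])
row = buf.getvalue()
```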
