
The encoding of my postgres database is UTF-8. In a certain table I have a text column into which I would like to insert some data. Now, the data is mostly valid UTF-8, but there are a number of instances of invalid byte sequences which I do not want to remove or substitute. My question is, is there any way of inserting the data into the text column without removing or substituting its invalid byte sequences?

Here's a simple example, executed from the shell (bash) command-line courtesy of psql:

psql main postgres <<<"create table t1 (a text); insert into t1 (a) values (E'a\xC0b');";
## CREATE TABLE
## ERROR:  invalid byte sequence for encoding "UTF8": 0xc0 0x62

I know this is probably a long shot, but is there any way of disabling postgres's validation of inserted text, perhaps on an ad hoc basis? I don't see how it would trouble postgres to have some byte sequences in text column data that happen to not be valid for the database's configured character encoding.

If this is not possible, I guess the only recourse is to store the data as straight binary data using the bytea data type, but please let me know if there's a better solution out there.
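For context, here is a quick byte-level look (sketched in Python, purely for illustration) at why `0xC0 0x62` is rejected: `0xC0` announces a two-byte UTF-8 sequence, but the following byte `0x62` (`b`) is not a continuation byte, and `0xC0` is in fact an overlong lead byte that modern UTF-8 forbids in any position.

```python
# The exact payload from the psql example above.
data = b"a\xc0b"

# Strict decoding fails, just as Postgres's validation does.
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # "invalid start byte"

# Python's 'surrogateescape' handler smuggles the bad byte through
# as a lone surrogate and round-trips back to the original bytes --
# but this is a client-side trick; Postgres still won't accept it as text.
text = data.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == data
```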

  • AFAIK text always has an encoding (otherwise the database wouldn't know how to convert the bytes to characters, especially with variable-length encodings such as UTF-8). If you just have a stream of bytes then you have bytea data, not text. Of course, things like length will work differently (compare length('µ') and length('µ'::bytea) for an example), so you're left with a choice of which pain you want to suffer. Commented Aug 6, 2017 at 17:22
  • No idea why the downvotes? Commented Aug 7, 2017 at 0:26
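The length difference mentioned in the comment above (length('µ') vs length('µ'::bytea)) can be reproduced outside the database (Python here, just for illustration): 'µ' is one character but two bytes (0xC2 0xB5) in UTF-8.

```python
s = "µ"                    # U+00B5 MICRO SIGN
print(len(s))              # 1 character -- analogous to length('µ') in SQL
b = s.encode("utf-8")
print(len(b))              # 2 bytes -- analogous to length('µ'::bytea)
assert b == b"\xc2\xb5"
```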

1 Answer


If you want to store invalidly encoded data, use bytea. As mu alludes to in the comments, you'll have to deal with the fact that substrings, lengths, etc. are now byte-oriented, not character-oriented.

It is a problem to have invalidly encoded text. How would left(string, n) know how many characters to grab? How would an index determine a correct lexical sort order? And so on. Not to mention that PostgreSQL can't do on-the-fly character-encoding conversion (e.g. for a client with client_encoding = 'LATIN1') if you have badly encoded data in a table.

You seem to want some kind of lax or forgiving mode for encodings, where it falls back to a byte-based interpretation if the data isn't valid in the current encoding, or replaces the offending bytes with ? or something. That's a valid thing to want, but it is not supported by PostgreSQL.
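PostgreSQL itself has no such forgiving mode, but something like it can be approximated on the client before the data ever reaches the server. This is a hypothetical Python sketch (clean_for_text is an illustrative helper, not a library function): decode with errors='replace' if lossy substitution is tolerable, or keep the raw bytes for a bytea column if it isn't.

```python
def clean_for_text(raw: bytes) -> str:
    """Lossy: each invalid sequence becomes U+FFFD, safe for a text column."""
    return raw.decode("utf-8", errors="replace")

raw = b"a\xc0b"
print(clean_for_text(raw))  # 'a\ufffdb' -- now valid UTF-8, insertable as text
# Lossless alternative: pass `raw` through unchanged to a bytea column,
# accepting byte-oriented length/substring semantics as described above.
```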


