
The encoding of my postgres database is UTF-8. In a certain table I have a text column into which I would like to insert some data. Now, the data is mostly valid UTF-8, but there are a number of instances of invalid byte sequences which I do not want to remove or substitute. My question is, is there any way of inserting the data into the text column without removing or substituting its invalid byte sequences?

Here's a simple example, executed from the shell (bash) command-line courtesy of psql:

psql main postgres <<<"create table t1 (a text); insert into t1 (a) values (E'a\xC0b');";
## CREATE TABLE
## ERROR:  invalid byte sequence for encoding "UTF8": 0xc0 0x62

I know this is probably a long shot, but is there any way of disabling postgres's validation of inserted text, perhaps on an ad hoc basis? I don't see how it would trouble postgres to have some byte sequences in text column data that happen to not be valid for the database's configured character encoding.

If this is not possible, I guess the only recourse is to store the data as straight binary data using the bytea data type, but please let me know if there's a better solution out there.
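For context, here is a quick byte-level look (sketched in Python, purely for illustration) at why `0xC0 0x62` is rejected: `0xC0` announces a two-byte UTF-8 sequence, but the following byte `0x62` (`b`) is not a continuation byte, and `0xC0` is in fact an overlong lead byte that modern UTF-8 forbids in any position.

```python
# The exact payload from the psql example above.
data = b"a\xc0b"

# Strict decoding fails, just as Postgres's validation does.
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # "invalid start byte"

# Python's 'surrogateescape' handler smuggles the bad byte through
# as a lone surrogate and round-trips back to the original bytes --
# but this is a client-side trick; Postgres still won't accept it as text.
text = data.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == data
```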

  • AFAIK text always has an encoding (otherwise the database wouldn't know how to convert the bytes to characters, especially with variable-length encodings such as UTF-8). If you just have a stream of bytes then you have bytea data, not text. Of course, things like length will work differently (compare length('µ') and length('µ'::bytea) for an example), so you're left with a choice of which pain you want to suffer. Commented Aug 6, 2017 at 17:22
  • No idea why the downvotes? Commented Aug 7, 2017 at 0:26
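The length difference mentioned in the comment above (length('µ') vs length('µ'::bytea)) can be reproduced outside the database (Python here, just for illustration): 'µ' is one character but two bytes (0xC2 0xB5) in UTF-8.

```python
s = "µ"                    # U+00B5 MICRO SIGN
print(len(s))              # 1 character -- analogous to length('µ') in SQL
b = s.encode("utf-8")
print(len(b))              # 2 bytes -- analogous to length('µ'::bytea)
assert b == b"\xc2\xb5"
```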

1 Answer


If you want to store invalidly encoded data, use bytea. As mu alludes to in the comments, you'll have to deal with the fact that substrings, lengths, etc. are now byte-oriented, not character-oriented.

It is a problem to have invalidly encoded text. How would left(string, n) know how many characters to grab? How would an index determine a correct lexical sort order? And so on. Not to mention that PostgreSQL can't do on-the-fly character-encoding conversion (e.g. for a client with client_encoding = 'LATIN1') if you have badly encoded data in a table.

You seem to want some kind of lax or forgiving mode for encodings, where it falls back to a byte-based interpretation if the data isn't valid in the current encoding, or replaces the offending bytes with ? or something. That's a valid thing to want, but it is not supported by PostgreSQL.
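PostgreSQL itself has no such forgiving mode, but something like it can be approximated on the client before the data ever reaches the server. This is a hypothetical Python sketch (clean_for_text is an illustrative helper, not a library function): decode with errors='replace' if lossy substitution is tolerable, or keep the raw bytes for a bytea column if it isn't.

```python
def clean_for_text(raw: bytes) -> str:
    """Lossy: each invalid sequence becomes U+FFFD, safe for a text column."""
    return raw.decode("utf-8", errors="replace")

raw = b"a\xc0b"
print(clean_for_text(raw))  # 'a\ufffdb' -- now valid UTF-8, insertable as text
# Lossless alternative: pass `raw` through unchanged to a bytea column,
# accepting byte-oriented length/substring semantics as described above.
```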


