
I have a PostgreSQL database with UTF8 encoding and LC_* en_US.UTF8. The database stores text columns in many different languages.

On some columns, however, I am 100% sure there will never be any special characters, e.g. ISO country & currency codes.

I've tried doing something like:

"countryCode" char(3) CHARACTER SET "C" NOT NULL

and

 "countryCode" char(3) CHARACTER SET "SQL_ASCII" NOT NULL

but this comes back with the error

ERROR: type "pg_catalog.bpchar_C" does not exist
ERROR: type "pg_catalog.bpchar_SQL_ASCII" does not exist

What am I doing wrong?

More importantly, should I even bother with this? I'm coming from a MySQL background, where doing this was a performance and space enhancement; is that also the case with PostgreSQL?

TIA

  • Storing ASCII text in a UTF-8 field will take exactly the same amount of space as storing it as ASCII text (unless you have a system that actually uses only 7 bits for storing ASCII, but I don't think any of those exist in any relevant distribution these days). Commented Jun 28, 2012 at 11:54
  • In MySQL you do get a minor speed improvement. Commented Jun 28, 2012 at 11:57
  • For which operations? I can only think of sorting and case insensitive searching in the field that could possibly be influenced. Pure read/write operations should be exactly the same. Also note that I was commenting primarily on space used, not necessarily on performance. Commented Jun 28, 2012 at 11:59
  • Yes, mainly those functions, like when sorting a list of users by their country code. You're right though that in regards to disk space there should be no difference. Commented Jun 28, 2012 at 12:10
  • There is a size benefit when dealing with FIXED length chars. In UTF-8, these require a varying number of BYTES to store the fixed THREE characters, so a length has to be added, which then requires additional bytes. I would LOVE to be able to have an ASCII (7 or 8 bit) char(3), since this would not need to store a length. Commented Jan 22, 2014 at 4:40

1 Answer


Honestly, I do not see the purpose of such settings, as:

  • as @JoachimSauer mentions, the ASCII subset of UTF-8 occupies exactly the same number of bytes, since keeping ASCII unchanged was the main point of inventing UTF-8. Therefore I see no size benefit;
  • all software capable of processing strings in different encodings uses a common internal encoding, which is UTF-8 by default for PostgreSQL nowadays. When textual data comes in for processing, the database converts it into the internal encoding if the encodings do not match. Therefore, if you declare some columns as non-UTF8, this leads to extra processing of the data, so you will lose some cycles (though I don't think it would be a noticeable performance hit).
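The no-size-difference point is easy to verify from psql in a UTF8 database: `octet_length()` returns the storage size in bytes, while `char_length()` counts characters, and for ASCII-only text the two are equal. A quick sketch:

```sql
-- In a UTF8 database, every ASCII character is exactly one byte:
SELECT octet_length('USD'), char_length('USD');  -- 3 bytes, 3 characters

-- Only non-ASCII characters take more than one byte in UTF-8:
SELECT octet_length('€'), char_length('€');      -- 3 bytes, 1 character
```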

Given that there's no space benefit and there is a potential performance hit, I think it is better to leave things as they are, i.e. keep all columns in the database's default encoding.

I think it is for the same reasons that PostgreSQL does not allow specifying encodings for individual objects within a database: character set and locale are set at the per-database level.
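Note that while per-column encodings are not supported, per-column collations are (since PostgreSQL 9.1), and that covers the sorting/comparison side of what the question is after. A sketch, using a hypothetical table name:

```sql
-- A COLLATE clause is accepted on a column where CHARACTER SET is not.
-- The "C" collation compares by byte value, which is cheap and
-- perfectly adequate for ASCII-only codes such as ISO country codes.
CREATE TABLE currency_rates (
    "countryCode"  char(3) COLLATE "C" NOT NULL,
    "currencyCode" char(3) COLLATE "C" NOT NULL
);

-- Sorting on these columns now uses fast byte-wise comparison
-- instead of the database's en_US.UTF8 locale rules:
SELECT * FROM currency_rates ORDER BY "countryCode";
```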


1 Comment

No disagreement, but there can be a point to using a special collation when you have data of a particular language and want to order the rows according to the conventions of that language.
