
I have a PostgreSQL database with UTF8 encoding and LC_* en_US.UTF8. The database stores text columns in many different languages.

On some columns, however, I am 100% sure there will never be any special characters, e.g. ISO country & currency codes.

I've tried doing something like:

"countryCode" char(3) CHARACTER SET "C" NOT NULL

and

 "countryCode" char(3) CHARACTER SET "SQL_ASCII" NOT NULL

but this comes back with the error

ERROR: type "pg_catalog.bpchar_C" does not exist
ERROR: type "pg_catalog.bpchar_SQL_ASCII" does not exist

What am I doing wrong?

More importantly, should I even bother with this? I'm coming from a MySQL background, where doing this was a performance and space enhancement; is that also the case with PostgreSQL?

TIA

  • Storing ASCII text in a UTF-8 field will take exactly the same amount of space as storing it as ASCII text (unless you have a system that actually uses only 7 bits for storing ASCII, but I don't think any of those exist in any relevant distribution these days). Commented Jun 28, 2012 at 11:54
  • In MySQL you do get a minor speed improvement. Commented Jun 28, 2012 at 11:57
  • For which operations? I can only think of sorting and case insensitive searching in the field that could possibly be influenced. Pure read/write operations should be exactly the same. Also note that I was commenting primarily on space used, not necessarily on performance. Commented Jun 28, 2012 at 11:59
  • Yes, mainly those functions, like when sorting a list of users by their country code. You're right though that in regards to disk space there should be no difference. Commented Jun 28, 2012 at 12:10
  • There is a size benefit when dealing with FIXED length chars. In UTF-8, these require a varying number of BYTES to store the fixed THREE characters, so a length has to be added, which then requires additional bytes. I would LOVE to be able to have an ASCII (7 or 8 bit) char(3), since this would not need to store a length. Commented Jan 22, 2014 at 4:40

1 Answer


Honestly, I do not see the purpose of such settings, as:

  • as @JoachimSauer mentions, the ASCII subset of UTF-8 occupies exactly the same number of bytes, since keeping ASCII unchanged was the main point of inventing UTF-8. Therefore I see no size benefit;
  • all software capable of processing strings in different encodings uses a common internal encoding, which is UTF-8 by default for PostgreSQL nowadays. When textual data comes in for processing, the database converts it into the internal encoding if the encodings do not match. Therefore, if you declare some columns as non-UTF8, this leads to extra processing of the data, so you will lose some cycles (though I don't think it would be a noticeable performance hit).
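The no-size-difference point is easy to verify from psql in a UTF8 database: `octet_length()` returns the storage size in bytes, while `char_length()` counts characters, and for ASCII-only text the two are equal. A quick sketch:

```sql
-- In a UTF8 database, every ASCII character is exactly one byte:
SELECT octet_length('USD'), char_length('USD');  -- 3 bytes, 3 characters

-- Only non-ASCII characters take more than one byte in UTF-8:
SELECT octet_length('€'), char_length('€');      -- 3 bytes, 1 character
```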

Given that there's no space benefit and there is a potential performance hit, I think it is better to leave things as they are, i.e. keep all columns in the database's default encoding.

I think it is for the same reasons that PostgreSQL does not allow specifying encodings for individual objects within a database: character set and locale are set at the per-database level.
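Note that while per-column encodings are not supported, per-column collations are (since PostgreSQL 9.1), and that covers the sorting/comparison side of what the question is after. A sketch, using a hypothetical table name:

```sql
-- A COLLATE clause is accepted on a column where CHARACTER SET is not.
-- The "C" collation compares by byte value, which is cheap and
-- perfectly adequate for ASCII-only codes such as ISO country codes.
CREATE TABLE currency_rates (
    "countryCode"  char(3) COLLATE "C" NOT NULL,
    "currencyCode" char(3) COLLATE "C" NOT NULL
);

-- Sorting on these columns now uses fast byte-wise comparison
-- instead of the database's en_US.UTF8 locale rules:
SELECT * FROM currency_rates ORDER BY "countryCode";
```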


1 Comment

No disagreement, but there can be a point to using a special collation when you have data of a particular language and want to order the rows according to the conventions of that language.
