
Based on our client requirements, we configure our Oracle (version 12c) deployments to support single-byte or multi-byte data (through the character set setting). We need to cache third-party multi-byte data (JSON) for performance reasons. We found that we could encode the data in UTF-8 and persist it (after converting it to bytes) in a BLOB column of an Oracle table. This is a hack that allows us to store multi-byte data in single-byte deployments. There are certain limitations that come with this approach, such as:

  1. The data cannot be queried or updated through SQL code (stored procedures).
  2. Search operations using, for example, LIKE operators cannot be performed.
  3. Marshaling and unmarshaling overhead for every operation at the application layer (Java); see the sketch after this list.
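
For reference, the round trip we currently do at the application layer looks roughly like this. It is only a minimal JDBC sketch: the json_cache table, its columns, and the connection details are placeholders, not our real schema.

    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class BlobJsonCacheDemo {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB", "app_user", "secret")) {

                String json = "{\"UnicodeCharsTest\":\"niño\"}";

                // Marshal: encode the JSON text as UTF-8 bytes and store them in the BLOB column.
                try (PreparedStatement ins = con.prepareStatement(
                        "INSERT INTO json_cache (cache_key, payload) VALUES (?, ?)")) {
                    ins.setString(1, "item-42");
                    ins.setBytes(2, json.getBytes(StandardCharsets.UTF_8));
                    ins.executeUpdate();
                }

                // Unmarshal: read the raw bytes back and decode them as UTF-8.
                try (PreparedStatement sel = con.prepareStatement(
                        "SELECT payload FROM json_cache WHERE cache_key = ?")) {
                    sel.setString(1, "item-42");
                    try (ResultSet rs = sel.executeQuery()) {
                        if (rs.next()) {
                            String back = new String(rs.getBytes(1), StandardCharsets.UTF_8);
                            System.out.println(back); // {"UnicodeCharsTest":"niño"}
                        }
                    }
                }
            }
        }
    }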

Assuming we accept these limitations, are there any other drawbacks that we should be aware of?

Thanks.

  • 1) NVARCHAR, NCLOB and NCHAR columns should be able to store multi-byte data even on single-byte installations. Can't you just declare all columns that are expected to contain multi-byte data as Nxxxx columns? Commented Nov 22, 2018 at 6:17
  • 1
    2) if the requirement comes only for JSON columns: it is entirely possible to store json data as pure ASCII data by escaping all characters whose code is greater than 127. For example the Json string '{"UnicodeCharsTest":"ni\u00f1o"}' represents the very same object of this other one: '{"UnicodeCharsTest" : "niño"}'. you could re-encode all json strings this way and you could store them. And Oracle 12 has both the JSON_VALUE function and the "field is json" constraint that allows you to correctly query values stored in json objects (you don't have to decode yourself escape sequences) Commented Nov 22, 2018 at 6:25
  • Nowadays the default is NLS_CHARACTERSET=AL32UTF8, i.e. UTF-8. Of course UTF-8 also supports single-byte characters. Why do you want to use a single-byte character set still in 2018? Commented Nov 22, 2018 at 9:55
  • 1
    @AndyDufresne NCLOB/NCHAR/NVARCHAR has always been the official type to use for multi-byte character strings. It has always worked this way. Whay I am suggesting is not a hack. Commented Nov 22, 2018 at 17:47
  • 1
    @AndyDufresne: let me elaborate: any installation of oracle supports TWO character sets: the normal character set (which in old versions of oracle defaulted to the single byte character set that matched the language using the installation process... and this is used for all normal varchar, char and clob fields... and also for table names, column names, etc...) and the "national" character set used for storing string with weird characters (Nxxx columns). personally I have never found an oracle installation where the character set used for these other columns isn't a UNICODE charset. Commented Nov 22, 2018 at 17:58

1 Answer


Ok, I am summarizing my comments in a proper answer.

You have two possible solutions:

  1. store them in NVARCHAR2/NCLOB columns
  2. re-encode JSON values in order to use only ASCII characters

1. NCLOB/NVARCHAR

The "N" character in "NVARCHAR2" stands for "National": this type of column has been introduced exactly to store characters that can't be represented in the "database character set".

Oracle actually supports TWO character sets:

  1. "Database Character Set" it is the one used for regular varchar/char/clob fields and for the internal data-dictionary (in other words: it is the character set you can use for naming tables, triggers, columns, etc...)

  2. "National Character Sets": the character set used for storing NCLOB/NCHAR/NVARCHAR values, which is supposed to be used to be able to store "weird" characters used in national languages.

Normally the second one is a UNICODE character set, so you can store any kind of character data in there, even in older installations.
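
To make this concrete, a minimal sketch from the Java side could look like the following. The docs_nls table and the connection details are placeholders; setNString/getNString are the standard JDBC 4.0 calls for national-character columns, and the oracle.jdbc.defaultNChar connection property should make the Oracle driver bind plain String parameters in the national character set (check your driver documentation for the exact property).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.Properties;

    public class NationalCharsetDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("user", "app_user");
            props.setProperty("password", "secret");
            // Ask the Oracle JDBC driver to send String binds as national-character
            // data so they are not squeezed through the database character set.
            props.setProperty("oracle.jdbc.defaultNChar", "true");

            try (Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB", props)) {

                // Assumed table: CREATE TABLE docs_nls (id NUMBER PRIMARY KEY, doc NCLOB)
                try (PreparedStatement ins = con.prepareStatement(
                        "INSERT INTO docs_nls (id, doc) VALUES (?, ?)")) {
                    ins.setInt(1, 1);
                    ins.setNString(2, "{\"UnicodeCharsTest\":\"niño\"}");
                    ins.executeUpdate();
                }

                try (PreparedStatement sel = con.prepareStatement(
                        "SELECT doc FROM docs_nls WHERE id = ?")) {
                    sel.setInt(1, 1);
                    try (ResultSet rs = sel.executeQuery()) {
                        if (rs.next()) {
                            System.out.println(rs.getNString(1)); // multi-byte characters preserved
                        }
                    }
                }
            }
        }
    }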

2. encode JSON values using only ASCII characters

It is true that the JSON standard is designed with Unicode in mind, but it is also true that it allows characters to be expressed as escape sequences using the hexadecimal representation of their code points. If you do so for every character with a code point greater than 127, you can express ANY Unicode object using only ASCII characters.

This ASCII JSON string: '{"UnicodeCharsTest":"ni\u00f1o"}' represents the very same object as this other one: '{"UnicodeCharsTest" : "niño"}'.
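
A minimal sketch of such re-encoding in plain Java (no JSON library needed; it simply escapes every UTF-16 code unit above 127, so supplementary characters come out as the escaped surrogate pairs that JSON allows):

    public class JsonAsciiEscaper {

        // Re-encode a JSON document so it contains only ASCII characters.
        // Every UTF-16 code unit above 127 is replaced by its six-character
        // escape sequence (a backslash, the letter u, and four hex digits).
        static String toAscii(String json) {
            StringBuilder out = new StringBuilder(json.length());
            for (int i = 0; i < json.length(); i++) {
                char c = json.charAt(i);
                if (c < 128) {
                    out.append(c);
                } else {
                    out.append(String.format("\\u%04x", (int) c));
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            String original = "{\"UnicodeCharsTest\":\"niño\"}";
            // The ñ is replaced by its escape for code point 00f1;
            // the printed string contains only ASCII characters.
            System.out.println(toAscii(original));
        }
    }

Escaping the whole document this way is safe because, outside of string literals, valid JSON contains only ASCII characters anyway.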

Personally, I prefer this second approach because it lets me easily share these JSON strings with systems that use antiquated legacy protocols, and it also lets me be sure that the JSON strings are read correctly by any client regardless of its national settings (the Oracle client protocol can try to translate strings into the character set used by the client... and this is a complication I don't want to deal with. By the way, this might be the reason for the problems you are experiencing with SQL clients).


2 Comments

Can we search multi-byte data in a single-byte deployment after storing it in NCLOB? I am unable to search. Can you please suggest a way forward?
Multi-byte storage in the national character set results in data loss: for example, छत्रपती is stored as ¿¿¿¿¿¿¿. How can we retain the multi-byte data stored in NCLOB?
