Questions tagged [unicode]
Unicode is intended to be a universal character set for describing all the characters required for written text incorporating all writing systems, technical symbols and punctuation.
63 questions
9
votes
7
answers
3k
views
Are there historical problems with non-ASCII identifier characters in code?
I frequently encounter recommendations to specifically keep to ASCII characters in field and function names in documentation, even though non-ASCII (modern Unicode) generally works perfectly. An ...
6
votes
0
answers
790
views
How to OCR and/or recreate lines of Egyptian Hieroglyphs in Unicode/HTML?
I am wondering how to take these Hieroglyphs and make them into Unicode. I read through the Tesseract docs on how to create training data, but it seems largely tailored toward "traditional" ...
2
votes
3
answers
154
views
What is the name of the type of program to produce Unicode characters from ASCII combinations?
For example, in Vietnamese, there are Unicode characters like "â", "ê", "ô", "ư", etc. To type them from the keyboard, I need to type aa, ee, oo, w, then a program ...
10
votes
0
answers
268
views
Is there any guideline from Unicode on how to deal with graphemes that have no base character?
A valid sequence of code-points can begin with one or more combining marks, which form a grapheme cluster that has no base glyph.
I'm unsure how that should be handled, if at all.
For example, consider ...
1
vote
1
answer
87
views
Layout Behavior of Characters (question about unicode standard)
I've been reading Unicode's core specification (see https://www.unicode.org/versions/latest/). I mostly understood what the text was explaining in section 2.1 Architectural Context until it started ...
9
votes
3
answers
737
views
What was the first language to allow Unicode in function names?
People often get excited about JuliaLang supporting Unicode function names.
But it's not new at all, it's just that the Julia community decided that it was sometimes appropriate, and built tooling to ...
5
votes
1
answer
428
views
UTF-8 questions
When you encode a code point to code units based on UTF-8, then if the code point fits on 7 bits, the most significant bit is set to zero so that it tells you it is a character which is stored on 1 ...
2
votes
1
answer
443
views
Differentiating Between ASCII and Unicode in File Spec
I am developing against a file spec that lists the data type for certain fields as
CHAR(<length>)
The spec is for a fixed width flat file. In most cases, possible values to populate the fields ...
3
votes
4
answers
5k
views
How to align on both word size and cache lines in x86
From what it sounds like, a 64 bit processor means aligning to 64 bits, which means if you have unicode utf-8 stored in there, each 8-bit chunk would take up 64 bits of space. That doesn't really make ...
0
votes
2
answers
555
views
How does MS Word render different fonts?
My main goal is described here.
How can Microsoft Word or Wordpad or other word editing software render fonts when these fonts seem to not follow the same rules?
How do they render characters ...
0
votes
2
answers
2k
views
Why Unicode Encoding/Decoding is Necessary in JavaScript
I am wondering why unicode encoding is necessary in JavaScript. I am looking at utf8.js as an example. I am also looking at the utf8 spec, but am not really following the different pieces of data. ...
0
votes
1
answer
1k
views
Java takes 2 bytes to represent character?
In general a character is represented in 1 byte, i.e. 8 bits. This is, I believe, true for all text editors and even for databases like Oracle. 1 byte can represent 2^8 = 256 characters.
My question is when ...
50
votes
4
answers
46k
views
Should UTF-8 CSV files contain a BOM (byte order mark)?
Our line-of-business software allows the user to save certain data as CSV. Since there are a lot of different formats (all called "CSV") in use in the wild, we are trying to decide what the ...
8
votes
1
answer
4k
views
Is the BOM optional for UTF-16 and UTF-32?
I used to think that the BOM is optional for UTF-8, but mandatory for UTF-16 and UTF-32.
But then I have read the following (in this article):
Let's look just at the ones that Notepad supports.
...
6
votes
3
answers
3k
views
Why does Unicode have separate codepoints for characters with identical glyphs?
(Not entirely sure whether this should go in the information-security StackExchange instead; feel free to move it there if that's where it belongs.)
Unicode has many, many instances of pairs or ...
1
vote
1
answer
390
views
Unicode Telugu language characters
I am developing a mobile app in android in which I use Telugu (Indian language) texts. On my mobile Telugu language alphabets are available. Therefore, I am not facing any problem for testing my app. ...
8
votes
1
answer
615
views
Do C++'s iterator categories forbid writing a UTF-8 iterator adapter?
I've been working on a UTF-8 iterator adapter. By which, I mean an adapter that turns an iterator to a char or unsigned char sequence into an iterator to a char32_t sequence. My work here was inspired ...
9
votes
2
answers
73k
views
I can type ⅓, ⅔ and ½ but can I type 3/3 and 2/2 using unicode? [closed]
I can type ⅓, ⅔ and ½ but can I type 3/3 and 2/2 using unicode? I know that from a mathematical point of view the fractions 2/2 = 3/3 = 1 but I am typing a list where I want to indicate that you have ...
2
votes
1
answer
222
views
What Unicode Transformation Format is being represented when just Unicode is written?
Many programs will supply one or more of the following as file encoding formats: UTF-8, UTF-16, UTF-32 and simply Unicode. How do I know what Unicode Transformation Format Unicode is referring to? I'm ...
9
votes
3
answers
5k
views
Compiling for string and wstring
I'm creating a library. I want to use it in multiple projects which may use multi-byte or unicode (std::string or std::wstring). I've adopted the old MS method of conditional compiling:
namespace ...
1
vote
1
answer
604
views
Cross-platform unicode support for Python command line tool
I am working on a large command line tool, written for Python 2.6+ and supported for Windows, OS X and Linux. The target users are developers but it is also being auto-invoked by CI-systems etc. In ...
10
votes
1
answer
2k
views
How do you mix left-to-right and right-to-left scripts without your files looking crazy?
Say your native language is Hebrew, and you're working in a programming language like Python 3, which lets you put Hebrew in source code. Good for you! You've got a dict:
d = {'a': 1}
and you want to ...
89
votes
5
answers
10k
views
Would UTF-8 be able to support the inclusion of a vast alien language with millions of new characters?
In the event an alien invasion occurred and we were forced to support their languages in all of our existing computer systems, is UTF-8 designed in a way to allow for their possibly vast amount of ...
4
votes
2
answers
417
views
Why does ISO 8859-1 contain letter-free diacritics?
ISO 8859-1 contains a few letter-free diacritics: The diaeresis (¨), the acute accent (´), the cedilla (¸) and the macron (¯).¹
Why were they included? As far as I know (please correct me if I am ...
4
votes
1
answer
8k
views
What is the difference between "Wide character" set and "Unicode character set"? [closed]
Today I was reading my favourite C++ Programming book (C++ Primer Plus) and the section which was about variables and character sets in C++,
however I got really confused about Unicode and Wide ...
0
votes
0
answers
160
views
How can I resolve Unicode Hex Value Mismatches between WordML and XSL:FO?
We have an important legal document that our app generates in WordML, with foreign characters represented via Unicode. These foreign characters vary widely, and include languages with special ...
1
vote
2
answers
360
views
Is there accepted decimal-based Unicode notation for technical audiences?
When writing for technical audiences, there are various ways to type Unicode representations, but they all seem to be Hexadecimal:
\uFFFF - From C# / Java Strings
\U0000FFFF - From C# / Java Strings
...
8
votes
4
answers
5k
views
Prime symbol in Python variable name
So I'm a terrible person and I want to name a variable in my mathy-python3 code s′ (that's U+2032 PRIME).
I was under the impression Unicode literals work as identifiers in Python 3, which is why my ɣ,...
12
votes
5
answers
6k
views
Why does "charset" really mean "encoding" in common usage?
Something that has long confused me is that so much software uses the terms "charset" and "encoding" as synonyms.
When people refer to a unicode "encoding", they always mean a ruleset for ...
0
votes
1
answer
345
views
Simple unicode application?
I want to create simple language learning applications to help friends in learning languages. A simple Java console application would do the trick, but the Windows console does not seem to handle ...
4
votes
4
answers
271
views
Technical reasons to prefer coding business logic to support Unicode (when not required)
I have a legacy application in which the UI and business logic are already reasonably well-separated. There is a proposal to separate them even further, turning the core application into a "service" (...
9
votes
1
answer
4k
views
Does it make sense to choose UTF-32, based on concern that some basic rule will be broken for UTF-8?
I'm working on a cross-platform C++ project, which doesn't consider Unicode and needs to change to support it.
There are the following two choices, and I need to decide which one to choose.
Using UTF-8 ...
5
votes
3
answers
3k
views
When should I *not* use Unicode? [duplicate]
It seems that Unicode is becoming more and more ubiquitous these days, if it isn't already, but I have to wonder if there are any domains where Unicode isn't the best implementation choice. Are there any ...
8
votes
5
answers
5k
views
Using π, φ, λ etc. as variable names while programming? [duplicate]
This is a function in the d3.v3.js file (the graph library D3.js):
function d3_geo_areaRingStart() {
var λ00, φ00, λ0, cosφ0, sinφ0;
d3_geo_area.point = function(λ, φ) {
d3_geo_area....
1
vote
1
answer
197
views
How can I learn about typography, fonts, glyphs, etc.? [closed]
I know so little about this that I'm having trouble formulating the question.
Apparently due to technical limitations, nastaleeq style of writing Urdu is very difficult, perhaps impossible, given ...
5
votes
3
answers
603
views
Consequences of "naïve" vs "naive"?
While using IE autocorrect "naive" got transformed to "naïve"!
My regional settings are Au English, from a Unicode search point of view the two are nothing alike. I am not even sure whether there are ...
9
votes
2
answers
1k
views
Languages supporting unicode logic operators
Are there any programming languages that support the use of unicode logic operators? For example, many programming languages use "!=" as the "does not equal"
operator, but in mathematics the symbol ...
14
votes
3
answers
2k
views
A Unicode sentinel value I can use?
I am designing a file format and I want to do it right. Since it is a binary format, the very first byte (or bytes) of the file should not form valid textual characters (just like in the PNG file ...
4
votes
2
answers
212
views
Strategy for website with international strings
What things need to be considered for a website that contains international strings, for instance Simplified Chinese and English mixed?
UTF8 seems to me a natural choice, including a meta tag. Still,...
6
votes
3
answers
1k
views
Is O(1) random access into variable length encoding strings useful?
I remember reading that there are no existing data structures which allow for random-access into a variable length encoding, like UTF-8, without requiring additional lookup tables.
The main question ...
32
votes
2
answers
16k
views
Why does Java use UTF-16 for internal string representation?
I would imagine the reason was fast, array like access to the character at index, but some characters won't fit into 16 bits, so it wouldn't work...
So if you have to handle special cases anyways, ...
36
votes
2
answers
2k
views
Unicode license
The Unicode Terms of Use state that any software that uses their data files (or a modification of them) should carry the Unicode license references. It seems to me that most Unicode libraries have ...
3
votes
1
answer
4k
views
understanding the encoding scheme in python 3
I got this error in my program, which grabs data from different websites and writes it to a file:
'charmap' codec can't encode characters in position 151618-151624: character maps to <undefined>
...
3
votes
3
answers
22k
views
How does it matter if a character is 8 bit or 16 bit or 32 bit
Well, I am reading Programming Windows with MFC, and I came across Unicode and ASCII code characters. I understood the point of using Unicode over ASCII, but what I do not get is how and why is it ...
41
votes
3
answers
172k
views
Why do we need to put N before strings in Microsoft SQL Server?
I'm learning T-SQL. From the examples I've seen, to insert text in a varchar() cell, I can write just the string to insert, but for nvarchar() cells, every example prefixes the strings with the letter N....
15
votes
2
answers
6k
views
Efficient Trie implementation for unicode strings
I have been looking for an efficient String trie implementation. Mostly I have found code like this:
Referential implementation in Java (per wikipedia)
I dislike these implementations for mostly two ...
7
votes
1
answer
2k
views
How on earth does CHRW produce Unicode codes given that it only accepts 65k possible inputs?
http://babelstone.blogspot.com/2005/11/how-many-unicode-characters-are-there.html says there are 1 million Unicode characters, around 240k of which are already assigned.
1 million > 240k > 65k
...
4
votes
2
answers
217
views
Prerequisites for developing an application with Unicode support [closed]
What prerequisites are necessary when developing an application with Unicode support in the context of
Web applications
Desktop applications
Embedded applications
Prerequisites to ...
5
votes
3
answers
731
views
What limitation will we face if each user-perceived character is assigned to one codepoint?
What limitations would we face if the Unicode standard had decided to assign one and only one codepoint to every user-perceived character?
Currently, Unicode has code-points that correspond to combining ...
16
votes
8
answers
3k
views
What's the point of adding Unicode identifier support to various language implementations?
I personally find reading code full of Unicode identifiers confusing. In my opinion, it also prevents the code from being easily maintained. Not to mention all the effort required for authors of ...