path: root/src/corelib/text/qunicodetables.cpp
Commit message | Author | Age | Files | Lines
* QUnicodeTables: separate Properties::cases from the rest | Marc Mutz | 2025-10-27 | 1 | -3373/+3824
  These entries are quite repetitive, esp. the all-zero entry for uncased characters (but not only: there are also 137 non-zero duplicates), and each one takes 8 bytes of the total 20 bytes of sizeof(Properties).

  Make a new array with these entries and only store an index into it in Properties. The new array happens to have a size of 448 entries (down from 3372 unique Properties), so 9 bits would suffice for the index, but a sizeof(Properties) == 14 is probably rather pointless, so add a reserved field to prop the struct up to 16. That sounds like the ideal size for rapid indexing and probably improves qGetProp() performance, esp. if case information is not needed.

  Theoretically, this should save 3372 * 4 - 448 * 8 = 9904 bytes. The TEXT size of libQtCore, however, shrinks by a bit more, 10596 bytes, on optimized Linux AMD64 Clang 19 builds.

  Picking to all active branches, because the Unicode tables are still maintained in all of them.

  Fixes: QTBUG-139427
  Pick-to: 6.10 6.9 6.8 6.5
  Change-Id: If4dc47ef06c674ad0263f0623ec408a25b977b3a
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
  Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
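The layout change and the size arithmetic in this message can be sketched as follows. This is a minimal illustration, not Qt's actual definitions: `Cases`, `caseTable`, `casesOf()`, `bytesSaved()` and the field names are hypothetical, and the 12-byte `other` array merely stands in for the remaining properties.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in: four 16-bit case entries (lower, upper, title, fold),
// 8 bytes in total, as described in the commit message.
using Cases = std::array<std::int16_t, 4>;

// Shared table of unique case entries; values are illustrative only.
constexpr std::array<Cases, 2> caseTable = {{
    {{0, 0, 0, 0}},     // the all-zero entry shared by uncased characters
    {{32, 0, 0, 32}},   // e.g. an upper-case Latin letter (assumed values)
}};

// Simplified Properties: the 8 bytes of case data are replaced by an index.
struct Properties {
    std::uint16_t casesIndex;   // index into caseTable
    std::uint16_t reserved;     // pads the struct from 14 to 16 bytes
    std::uint8_t other[12];     // stand-in for all remaining fields
};
static_assert(sizeof(Properties) == 16, "sketch should match the stated size");

inline const Cases &casesOf(const Properties &p)
{
    assert(p.casesIndex < caseTable.size());
    return caseTable[p.casesIndex];
}

// Bytes saved: each Properties entry shrinks, but the new table costs space.
constexpr std::size_t bytesSaved(std::size_t nProps, std::size_t oldSize,
                                 std::size_t newSize, std::size_t nCases,
                                 std::size_t caseSize)
{
    return nProps * (oldSize - newSize) - nCases * caseSize;
}
static_assert(bytesSaved(3372, 20, 16, 448, 8) == 9904,
              "matches the figure quoted in the commit message");
```

The factor of 4 in the quoted formula is the per-entry shrink from 20 to 16 bytes; the 448 * 8 term is the cost of the new shared table.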
* Mark QUnicodeTables as security-significant (= default) | Marc Mutz | 2025-09-26 | 1 | -0/+1
  This component does not process external data, so it is not security-critical.

  Yes, the characters presented to its functions may come from external sources, but that's not different from, say, a QRect being parsed from -geometry. The fact that there is code that parses a -geometry into a QRect doesn't make QRect a data-parser, or security-critical. It's just a container for the result, and so is QUnicodeTables: a container for char32_t-indexed properties.

  The accompanying qt_attribution.json confirms that this component is not security-critical.

  Task-number: QTBUG-135195
  Pick-to: 6.10 6.8
  Change-Id: I565bd885220c0282ce7fb801411f12a80052465f
  Reviewed-by: Ivan Solovev <ivan.solovev@qt.io>
* QUnicodeTables: abstract access to Properties::cases | Marc Mutz | 2025-09-18 | 1 | -3372/+3377
  This is in preparation for storing this information in a separate array to save space by removing the many duplicates in there.

  Pick to all active branches to have the same internal API going forward, even if we don't pick the storage optimization back as far.

  Qt 6.5 doesn't have QSpan, yet (not even as private API), but returning a reference to const std::array<.,4> will be an adequate replacement. To enable that without casting, convert `cases` from a C array to a std::array. For some reason, this requires extra parentheses, so add them.

  Task-number: QTBUG-139427
  Pick-to: 6.10 6.9 6.8 6.5
  Change-Id: I5331fd6d71a6a447b0445d8235b5eb8e38178e2e
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* QChar: micro-optimization: force inlining of qGetProp() | Thiago Macieira | 2025-09-07 | 1 | -11/+2
  Clang was already doing it, but GCC (at least in LTO mode) wasn't and was repeatedly calling qGetProp(). This has the benefit that, in most cases, the input character whose property we seek is UTF-16, so dead-code elimination removes the extra branch - this can happen when QString functions go through the QChar front-end, like QChar::isSpace() or isSymbol(), which route through the char32_t overload.

  This forced inlining allows us to remove the UCS2 overloads of qGetProp() and properties(), because the same const-propagation will apply to all but one of the places where UTF-16 code units were being compared. The 16-bit qGetProp() was only used in qstring.cpp's convertCase_helper(), whose 16-bit overload was only used in foldCase().

  The one exception to this is qtextengine.cpp's QBidiAlgorithm::resolveN0():

      const QUnicodeTables::Properties *p = QUnicodeTables::properties(char16_t{text[pos].unicode()});

  This will now call the full UTF-32 overload.

  Pick-to: 6.10
  Change-Id: Ifa4f2d77475877f26be2fffd9a987ff994dc8ef1
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* QUnicodeTools: fix attribute location on properties() functions | Marc Mutz | 2025-08-27 | 1 | -2/+2
  The Q_DECL_CONST_FUNCTION needs to be on the declaration to have any effect on callers, but it was only on the (out-of-line) definition.

  Amends 2fe90a61bdf16bb1a08817ba544e2309b524a052.

  As a drive-by, also remove the export macros from the definitions; they, too, are only needed on the declaration.

  Pick-to: 6.10 6.9 6.8 6.5
  Change-Id: Id69b58c50440b8b835f7be7ba873927d07b11219
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* QChar::isSpace: optimize by lowering the upper limit check | Thiago Macieira | 2025-08-19 | 1 | -0/+1
  Of all the Category categories, separators are the only ones that currently have assigned codepoints exclusively in the BMP. This allows us to lower the maximum check from the LastValidCodepoint to a category-specific one.

  This will also cause the compiler to dead-code eliminate the check inside of qGetProperty and emit only the BMP check of the property tables:

      if (ucs4 < 0x11000)
          return uc_properties + uc_property_trie[uc_property_trie[ucs4 >> 5] + (ucs4 & 0x1f)];

  Pick-to: 6.10
  Change-Id: I31eda5d79cc2c3560d90fffd74a546d1e7cda7bb
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
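The quoted lookup is a two-stage trie: the high bits of the code point select a block offset, and the low five bits index into that block. A toy version, assuming a table that covers only U+0000..U+007F and stores a property value directly instead of an index into `uc_properties`, might look like this. All names here (`Trie`, `makeTrie()`, `lookup()`, `isSpaceToy()`, `ToyLimit`) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned BlockShift = 5;                      // 32 code points per block
constexpr unsigned BlockSize = 1u << BlockShift;
constexpr unsigned BlockCount = 4;                      // blocks covered by the toy table
constexpr char32_t ToyLimit = BlockCount * BlockSize;   // 0x80, the only range check

enum Prop : std::uint8_t { Other = 0, Space = 1 };

// Layout: [4 block offsets][block 0][block 1][one empty block shared by 2 and 3]
struct Trie { std::uint8_t v[BlockCount + 3 * BlockSize]; };

constexpr Trie makeTrie()
{
    Trie t{};
    t.v[0] = BlockCount;
    t.v[1] = BlockCount + BlockSize;
    t.v[2] = BlockCount + 2 * BlockSize;    // blocks 2 and 3 contain no spaces,
    t.v[3] = BlockCount + 2 * BlockSize;    // so they share one data block
    for (unsigned c = 0x09; c <= 0x0d; ++c) // TAB..CR live in block 0
        t.v[t.v[0] + c] = Space;
    t.v[t.v[1] + (0x20 & (BlockSize - 1))] = Space;  // U+0020 is slot 0 of block 1
    return t;
}

constexpr Trie trie = makeTrie();

constexpr Prop lookup(char32_t ucs4)
{
    // One bound check, then two dependent loads from the same array.
    return ucs4 < ToyLimit
        ? static_cast<Prop>(trie.v[trie.v[ucs4 >> BlockShift] + (ucs4 & (BlockSize - 1))])
        : Other;
}

constexpr bool isSpaceToy(char32_t c) { return lookup(c) == Space; }

static_assert(isSpaceToy(U' ') && isSpaceToy(U'\t') && !isSpaceToy(U'a'),
              "toy table classifies ASCII whitespace");
```

Sharing the empty block between blocks 2 and 3 is the same deduplication that keeps the real tables small; the single comparison against a small limit corresponds to the lowered upper bound described above.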
* Update UCD to Unicode 16.0.0 | Mårten Nordheim | 2025-02-10 | 1 | -11193/+12267
  They added some new scripts.

  There were a few changes to the line break algorithm, most notably there are more rules that require more context than before. While not major, there was some shuffling and additions to our implementation to match the new rules.

  IDNA test data now disallows the trailing dot/empty root label, technically to be toggled off by an option that controls a few things, but we don't have options.

  For test-data they changed the format a little - "" is used to mean empty string, while a blank segment is null/no string, update the parser to read this.

  [ChangeLog][Third-Party Code] Updated the Unicode Character Database to UCD revision 34/Unicode 16.

  Fixes: QTBUG-132902
  Task-number: QTBUG-132851
  Pick-to: 6.9 6.8 6.5
  Change-Id: I4569703659f6fd0f20943110a03301c1cf8cc1ed
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Extract emoji data from Unicode files | Eskil Abrahamsen Blomfeldt | 2024-08-06 | 1 | -9317/+9337
  Expand unicode data to include information needed to parse emoji sequences. This is a pre-requisite for automatically preferring color fonts for emojis.

  As a drive-by, this also fixes a double space in the output of the uc_properties array.

  Task-number: QTBUG-111801
  Change-Id: Icd993803c87c69ed278c7724377028f3706d0272
  Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Revise UCD-generated data files' SPDX headers | Edward Welbourne | 2024-04-22 | 1 | -1/+1
  The existing data comes under Unicode-DFS-2016 but future updates shall come under Unicode-3.0, so update the existing headers with the former and the generator script with the latter. Leave a note in the attribution file about this transitional state and how to resolve it.

  Replaced UNICODE_LICENSE.txt from src/corelib/text/ with LICENSES/Unicode-DFS-2016.txt, as fetched using reuse download. This doesn't look like a rename but only actually adds some irrelevant lines about where on the Unicode website the upstream files (to which we do not apply this license) come from and changes some spacing.

  Pick-to: 6.7 6.5
  Fixes: QTBUG-121653
  Change-Id: I50c9f4badc77a9aa402af946561aff58ae9e3e7a
  Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>
  Reviewed-by: Kai Köhne <kai.koehne@qt.io>
* Unicode line breaking: Implement rules LB15a and LB15b | Ievgenii Meshcheriakov | 2024-02-08 | 1 | -3123/+3123
  The new rules were added in Unicode 15.1 (TR #14, revision 51). The rules read:

      LB15a: (sot | BK | CR | LF | NL | OP | QU | GL | SP | ZW) [\p{Pi}&QU] SP* ×
      LB15b: × [\p{Pf}&QU] (SP | GL | WJ | CL | QU | CP | EX | IS | SY | BK | CR | LF | NL | ZW | eot)

  Add two new line breaking classes LineBreak_QU_Pi and _QU_Pf to represent quotation characters with context that matches the left side of LB15a and the right side of LB15b respectively. This way it is still possible to use the line breaking classes table.

  Also add a comment about the original source of the line break table.

  Task-number: QTBUG-121529
  Change-Id: Ib35f400e39e76819cd1c3299691f7b040ea37178
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
  Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
* unicode: Import version 15.1 (UCD version 32) | Ievgenii Meshcheriakov | 2024-02-08 | 1 | -4384/+4458
  Add enumerator for the new Unicode version to QChar::UnicodeVersion.

  Remap new line breaking classes to their Unicode 15.0 values:
  * AK, AP and AS to AL,
  * VI and VF to CM.
  These are classes for new line breaking support for Indic scripts that require more work.

  Blacklist failing tests for now:
  * tst_QUrlUts46::idnaTestV2
  * tst_QTextBoundaryFinder::lineBoundariesDefault
  * tst_QTextBoundaryFinder::graphemeBoundariesDefault

  Regenerate the source files.

  Task-number: QTBUG-121529
  Change-Id: I869cc9fbaa53765d8ae6265c22cdbef9f19d05bf
  Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
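The remapping of the new line breaking classes described in this message amounts to a small lookup. A sketch, assuming a hypothetical `LineBreakClass` enum named after the UAX #14 abbreviations (these are not Qt's enumerators, and the real remapping may well happen in the table generator rather than at run time):

```cpp
#include <cassert>

// Hypothetical enum; only the classes mentioned in the message are listed.
enum class LineBreakClass { AL, CM, AK, AP, AS, VI, VF, Other };

// Map classes introduced in Unicode 15.1 back to their Unicode 15.0 values.
constexpr LineBreakClass remapTo15_0(LineBreakClass c)
{
    switch (c) {
    case LineBreakClass::AK:
    case LineBreakClass::AP:
    case LineBreakClass::AS:
        return LineBreakClass::AL;
    case LineBreakClass::VI:
    case LineBreakClass::VF:
        return LineBreakClass::CM;
    default:
        return c;   // every other class is left untouched
    }
}

static_assert(remapTo15_0(LineBreakClass::AK) == LineBreakClass::AL,
              "AK falls back to AL");
static_assert(remapTo15_0(LineBreakClass::VF) == LineBreakClass::CM,
              "VF falls back to CM");
```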
* Update Unicode data version string | Ievgenii Meshcheriakov | 2024-01-25 | 1 | -1/+1
  This amends c4e550703c2bdc1ee710507b8df9c0c9a118402e. The data version update was just forgotten when updating to Unicode 15.0.

  Pick-to: 6.5 6.6 6.7
  Change-Id: Ibb3e9cb81e9bbcb5d4aaf4e4df6231485531c128
  Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Update UCD to Revision 30 | Ievgenii Meshcheriakov | 2022-10-11 | 1 | -5285/+5587
  This corresponds to Unicode version 15.0.0.

  Added the following scripts:
  * Kawi
  * Nag Mundari

  Full support of these scripts requires harfbuzz version 5.2.0, this version adds support for Unicode 15.0: https://github.com/harfbuzz/harfbuzz/releases/tag/5.2.0

  Fixes: QTBUG-106810
  Change-Id: Ib06c526e49b0f01ef9f21123bcf875c6b19f2601
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Core: make Unicode Database constexpr | Yuhang Zhao | 2022-05-26 | 1 | -11/+11
  Task-number: QTBUG-100485
  Pick-to: 6.3 6.2
  Change-Id: I41480a34b14fd86a68a5c10b7e0f3d250e785d0f
  Reviewed-by: Marc Mutz <marc.mutz@qt.io>
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Unicode: Extract EastAsianWidth property | Ievgenii Meshcheriakov | 2022-05-24 | 1 | -9758/+9896
  This property is needed to properly implement the line breaking algorithm from UAX #14.

  Task-number: QTBUG-97537
  Pick-to: 6.3
  Change-Id: Ia83cc553c9ef19fae33560721630849d2a95af84
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Unicode: Remove obsolete word break classes | Ievgenii Meshcheriakov | 2022-05-24 | 1 | -5/+5
  Remove E_Base, Glue_After_Zwj, E_Base_GAZ, and E_Modifier obsoleted by UTS #29, version 33 (Unicode 11.0.0).

  Task-number: QTBUG-97537
  Pick-to: 6.2 6.3
  Change-Id: If5dc36ae17cd8746bbe81b73bbcc0863181e4a7a
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Use SPDX license identifiers | Lucie Gérard | 2022-05-16 | 1 | -38/+2
  Replace the current license disclaimer in files with an SPDX-License-Identifier. Files that have to be modified by hand are modified. License files are organized under the LICENSES directory.

  Task-number: QTBUG-67283
  Change-Id: Id880c92784c40f3bbde861c0d93f58151c18b9f1
  Reviewed-by: Qt CI Bot <qt_ci_bot@qt-project.org>
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
  Reviewed-by: Jörg Bornemann <joerg.bornemann@qt.io>
* Update UCD to Revision 28 | Ievgenii Meshcheriakov | 2021-10-18 | 1 | -6588/+7011
  This corresponds to Unicode version 14.0.0.

  Added the following scripts:
  * CyproMinoan
  * OldUyghur
  * Tangsa
  * Toto
  * Vithkuqi

  Full support of these scripts requires harfbuzz version 3.0.0, this version adds support for Unicode 14.0: https://github.com/harfbuzz/harfbuzz/releases/tag/3.0.0

  With this release 10 test cases in tst_qurluts46 were fixed, one additional test case is failing in tst_qtextboundaryfinder and is commented out. In total 62 line break test cases and 44 word break test cases are failing.

  A comment in src/corelib/text/qt_attribution.json was updated to include the URL of the page containing UCD version number.

  Fixes: QTBUG-94359
  Change-Id: Iefc9ff13f3df279f91cbdb1246d56f75b20ecb35
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* unicode: Regenerate qunicodetables{.cpp,_p.h} | Ievgenii Meshcheriakov | 2021-09-03 | 1 | -5762/+5845
  Run unicode utility to regenerate the Unicode tables. This reduces size of the IDNA mapping tables. Adjust the QUrl client code to use the new API.

  Task-number: QTBUG-85323
  Change-Id: Iaa8d6932e611f7aa4009a3fae2972de87b875cf8
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* unicode: Regenerate Unicode tables | Ievgenii Meshcheriakov | 2021-08-26 | 1 | -9404/+15286
  Re-run unicode utility to update the Unicode tables. This adds properties and mappings needed to implement UTS #46 (IDNA).

  Task-number: QTBUG-85323
  Change-Id: Id1de91caddd82095f8f8f2301bfd7bb2ee3fcafd
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Unicode: fix the extended grapheme cluster algorithm | Giuseppe D'Angelo | 2021-04-16 | 1 | -6230/+6340
  UAX #29 in Unicode 11 changed the EGC algorithm to its current form. Although Qt has upgraded the Unicode tables all the way up to Unicode 13, the algorithm has never been adapted; in other words, it has been working by chance for years. Luckily, MOST of the cases were dealt with correctly, but emoji handling actually manages to break it.

  This commit:
  * Adds parsing of emoji-data.txt into the unicode table generator. That is necessary to extract the Extended_Pictographic property, which is used by the EGC algorithm.
  * Regenerates the tables.
  * Removes some obsoleted grapheme cluster break properties, and adds the ones added in the meanwhile.
  * Rewrites the EGC algorithm according to Unicode 13. This is done by greatly simplifying the lookup table. Some rules (GB11, GB12, GB13) can't be done by the table alone so some hand-rolled code is necessary in that case.
  * Thanks to these fixes, the complete upstream GraphemeBreakTest now passes. Remove the "edited" version that ignored some rows (because they were failing).

  Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
  Pick-to: 6.1 6.0 5.15
  Fixes: QTBUG-92822
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
  Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
* Use checked string iteration in case conversions | Edward Welbourne | 2020-08-29 | 1 | -0/+1
  The Unicode table code can only be safely called on valid code-points. So code that calls it must only pass it valid Unicode data. The string iterator's Unchecked methods only provide this guarantee when the string being iterated is guaranteed to be valid UTF-16; while client code should only use QString, QStringView and friends on valid UTF-16 data, we have no way to be sure they have respected that.

  So take the few extra cycles to actually check validity in the course of iterating strings, when the resulting code-points are to be passed to the Unicode table look-ups. Add tests that case mapping doesn't access Unicode tables out of range (it'll trigger the new assertion).

  Added some comments to qchar.h that helped me understand surrogates.

  Change-Id: Iec2c3106bf1a875bdaa1d622f6cf94d7007e281e
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
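To illustrate what "checked" iteration means here, the following sketch decodes UTF-16 into code points and never yields a lone surrogate. It is written from scratch rather than taken from Qt: `decodeChecked()`, `isHigh()` and `isLow()` are hypothetical names, and substituting U+FFFD for invalid input is an assumption about one reasonable policy, not a statement of what Qt's iterator does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr bool isHigh(char16_t u) { return (u & 0xfc00) == 0xd800; }
constexpr bool isLow(char16_t u) { return (u & 0xfc00) == 0xdc00; }

// Decode UTF-16, validating surrogate pairs as we go. Every value returned
// is a valid code point, so it is safe to use as an index into tables that
// only cover U+0000..U+10FFFF.
std::vector<char32_t> decodeChecked(const std::u16string &s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (isHigh(u) && i + 1 < s.size() && isLow(s[i + 1])) {
            // Well-formed pair: combine both halves into one code point.
            out.push_back((char32_t(u) << 10) + s[i + 1] - 0x35fdc00);
            ++i;
        } else if (isHigh(u) || isLow(u)) {
            out.push_back(0xfffd);      // lone surrogate: substitute (assumed policy)
        } else {
            out.push_back(u);           // ordinary BMP code unit
        }
    }
    return out;
}
```

An unchecked variant would skip the `isLow()` test and the lone-surrogate branch, which is faster but only correct if the input is already known to be valid UTF-16.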
* Inline two macros in the unicode tables | Edward Welbourne | 2020-08-12 | 1 | -10/+6
  They were only used by one function each, in unicodetables.cpp, so don't need to be macros.

  Change-Id: I3e7f9f661568862d0a0d265bb8f657a8e0782b13
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Tidy up unicode table generation | Edward Welbourne | 2020-08-05 | 1 | -11/+8
  Eliminate some needless parentheses, tidy up some spacing and indentation and split some long lines. Change first += after declaration to initializer.

  Change-Id: I05ff2a6337b7ed14e0a2dc9c03fc784c92b63515
  Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* QChar/QString: centralize case folding in qchar.cpp | Marc Mutz | 2020-05-09 | 1 | -0/+2
  There are (at least) two implementations of the low-level case-folding algorithm, one of which (for QChar::toLower()) seems to be wrong (it doesn't deal with special cases which expand to more than one code point).

  The algorithm hidden in QString and entangled with the QString detaching code makes reusing the code much harder.

  At the same time, the dependency of the algorithm on the unicode tables makes exposing a non-allocating result type in the public API hard. std::u16string would be an alternative if we can assure that all implementations use SSO with at least four characters.

  So, for the time being, leave this as internal API for use in an upcoming QStringView::toLower() as well as case-insensitive hashing.

  Change-Id: Iabb2611846f6176776aa20e634f44d8464f3305c
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* QUnicodeTables: port to charNN_t | Marc Mutz | 2020-04-27 | 1 | -18/+8
  This makes existing calls passing uint or ushort ambiguous, so fix all the callers. There do not appear to be callers outside QtBase.

  In fact, the ...BreakClass() functions appear to be utterly unused.

  Change-Id: I1c2251920beba48d4909650bc1d501375c6a3ecf
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
  Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
* QChar: port low-level functions from uint/ushort to char32/16_t | Marc Mutz | 2020-04-24 | 1 | -0/+10
  Now that the standard gives us proper types for UTF-16 and UTF-32 characters, use them. Will eventually make the code much easier to read than today, where uint could be an index as well as a char32_t.

  It also ensures that the result of e.g. QChar::highSurrogate() can still be implicitly converted to a QChar now that the QChar(non-character-integral-types) ctors are being made explicit.

  [ChangeLog][QtCore][QChar] All low-level functions (e.g. highSurrogate()) now take and return char16_t instead of ushort and char32_t instead of uint.

  Change-Id: I9cd8ebf6fb998fe1075dae96c7c4484a057f0b91
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
  Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
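As an illustration of the typed signatures this message describes, the sketch below writes surrogate helpers with char16_t and char32_t in place of ushort and uint. The arithmetic is the standard UTF-16 encoding; the free functions are stand-ins written for this note, named after the `highSurrogate()` mentioned in the ChangeLog entry, and are not copied from QChar.

```cpp
#include <cassert>

// Code points at or above U+10000 need two UTF-16 code units.
constexpr bool requiresSurrogates(char32_t ucs4) { return ucs4 >= 0x10000; }

// UTF-32 in, UTF-16 out: the types document which encoding each value is in.
constexpr char16_t highSurrogate(char32_t ucs4)
{
    return char16_t((ucs4 >> 10) + 0xd7c0);
}

constexpr char16_t lowSurrogate(char32_t ucs4)
{
    return char16_t(ucs4 % 0x400 + 0xdc00);
}

// UTF-16 pair in, UTF-32 out.
constexpr char32_t surrogateToUcs4(char16_t high, char16_t low)
{
    return (char32_t(high) << 10) + low - 0x35fdc00;
}

static_assert(highSurrogate(0x1F600) == 0xD83D, "high half of U+1F600");
static_assert(lowSurrogate(0x1F600) == 0xDE00, "low half of U+1F600");
static_assert(surrogateToUcs4(0xD83D, 0xDE00) == 0x1F600, "round trip");
```

With plain `uint` parameters nothing in the signature distinguishes a code point from an array index; with `char32_t` and `char16_t` the intent is visible at every call site.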
* Update UCD to Revision 26 | Edward Welbourne | 2020-03-14 | 1 | -6336/+6731
  Include WordBreakTest.html, since a test uses sample strings from it, albeit without actually reading the file. Had to comment out more of the new tests, as at Revision 24, pending an update to harfbuzz and the text boundary detection code.

  Task-number: QTBUG-79631
  Task-number: QTBUG-79418
  Task-number: QTBUG-82747
  Change-Id: I0082294b09d67ffdc6a9b5c15acf77ad3b86f65f
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
* Update UCD data to Unicode 12.1.0's Revision 24 | Edward Welbourne | 2019-10-30 | 1 | -7341/+7849
  Had to teach the update program to accept category Lm as for Joining_Transparent, for the sake of a new ArabicShaping.txt entry. Added three new Unicode versions, several new scripts and a new word-break class.

  Updated UCD's test data for tst_QTextBoundaryFinder. This left 57 tests failing; I have commented out the data rows for those tests, pending someone with more knowledge addressing this.

  Task-number: QTBUG-79631
  Task-number: QTBUG-79418
  Change-Id: Ic33d3b3551195d47a84d98e84020f57a68f0b201
  Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
* QUnicodeTables: use array for case folding tables | Marc Mutz | 2019-09-04 | 1 | -2652/+2652
  Instead of four pairs of :1 :15 bit fields, use an array of four :1, :15 structs. This allows replacing the case folding traits classes with a simple enum that indexes into said array.

  I don't know what the WASM #ifdef'ed code is supposed to effect (a :0 bit-field is only useful to separate adjacent bit-fields into separate memory locations for multi-threading), but I thought it safer to leave it in, and that means the array must be a 64-bit block of its own, so I had to move two fields around.

  Saves ~4.5KiB in text size on optimized GCC 10 LTO Linux AMD64 builds.

  Change-Id: Ib52cd7706342d5227b50b57545d073829c45da9a
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
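The "array of four :1, :15 structs indexed by an enum" can be pictured roughly as below. This is a simplified sketch, not the real struct: the enumerator names, `CaseEntry`, `convertCase()` and `makeProps()` are assumptions, the delta values are illustrative, and special (multi-code-point) mappings are simply left unconverted here.

```cpp
#include <cassert>

enum Case { LowerCase, UpperCase, TitleCase, CaseFold, NumCases };

struct Properties {
    struct CaseEntry {
        unsigned short special : 1;   // mapping needs the special-case table
        signed short diff : 15;       // signed offset to the converted code point
    } cases[NumCases];                // one entry per conversion, indexed by Case
};

// A single function parameterised by the enum replaces one traits class
// per conversion direction.
inline char32_t convertCase(char32_t c, const Properties &p, Case which)
{
    const Properties::CaseEntry &e = p.cases[which];
    if (e.special)
        return c;                     // sketch only: special mappings not handled
    return char32_t(c + e.diff);
}

// Hypothetical helper to build an entry for a simple cased letter.
inline Properties makeProps(int toLower, int toUpper)
{
    Properties p{};
    p.cases[LowerCase].diff = toLower;
    p.cases[UpperCase].diff = toUpper;
    p.cases[TitleCase].diff = toUpper;
    p.cases[CaseFold].diff = toLower;
    return p;
}
```

For example, with `makeProps(0, -32)` describing 'a', `convertCase(U'a', props, UpperCase)` yields 'A' while `LowerCase` and `CaseFold` leave it unchanged.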
* QUnicodeTables: pack Properties struct | Marc Mutz | 2019-09-04 | 1 | -2638/+2638
  GCC doesn't like the sequence : 5 : 5 : 8 : 6 : 8 and inserts a :6 padding between the :5 and the :8 and a :2 padding between the :6 and the :8, growing the bitfield by 8 bits of embedded padding and another byte to bring the struct back to sizeof % 2 == 0.

  Fix by reshuffling the elements and adding a static_assert for the next round.

  Saves ~5KiB in QtCore executable size.

  Change-Id: I4758a6f48ba389abc2aee92f60997d42ebb0e5b8
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
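The padding effect can be reproduced in isolation. The sketch below assumes the fields are declared as `unsigned short` bit-fields, which the message does not state, and uses made-up field names. Bit-field layout is implementation-defined, so the sizes noted in the comments are what GCC and Clang typically produce on x86-64 Linux, not a guarantee.

```cpp
#include <cassert>

// Original order: a bit-field may not straddle a 16-bit storage unit, so the
// compiler pads before `c` (6 bits) and before `e` (2 bits).
struct Loose {
    unsigned short a : 5;
    unsigned short b : 5;
    unsigned short c : 8;
    unsigned short d : 6;
    unsigned short e : 8;
};

// Reshuffled: 5 + 5 + 6 fill one 16-bit unit exactly, 8 + 8 fill the next.
struct Packed {
    unsigned short a : 5;
    unsigned short b : 5;
    unsigned short d : 6;
    unsigned short c : 8;
    unsigned short e : 8;
};

// Guard against regressions, in the spirit of the static_assert the commit adds.
static_assert(sizeof(Packed) == 4, "32 bits of fields should occupy 4 bytes");
static_assert(sizeof(Packed) <= sizeof(Loose), "reordering must not grow the struct");
```

On the toolchains mentioned, `sizeof(Loose)` typically comes out as 6: 40 bits including padding, rounded up to a multiple of the 2-byte alignment, which matches the growth described in the message.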
* Move text-related code out of corelib/tools/ to corelib/text/ | Edward Welbourne | 2019-07-10 | 1 | -0/+13446
  This includes byte array, string, char, unicode, locale, collation and regular expressions.

  Change-Id: I8b125fa52c8c513eb57a0f1298b91910e5a0d786
  Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>