| Commit message | Author | Age | Files | Lines |
|
These entries are quite repetitive, esp. the all-zero entry for
uncased characters (but not only: there are also 137 non-zero
duplicates), and each one takes 8 bytes of the total 20 bytes of
sizeof(Properties).
Make a new array with these entries and only store an index into it in
Properties. The new array happens to have a size of 448 entries (down
from 3372 unique Properties), so 9 bits would suffice for the index,
but a sizeof(Properties) == 14 is probably rather pointless, so add a
reserved field to prop the struct up to 16. That sounds like the ideal
size for rapid indexing and probably improves qGetProp() performance,
esp. if case information is not needed.
Theoretically, this should save 3372 * 4 - 448 * 8 = 9904 bytes. The
TEXT size of libQtCore, however, shrinks by a bit more, 10596 bytes, on
optimized Linux AMD64 Clang 19 builds.
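A minimal sketch of the layout change, for illustration only. The names CaseInfo, Props, caseTable and caseInfo(), the split of the remaining 12 bytes, and the table contents are all made up here; only the sizes (8, 16, 20), the entry counts and the arithmetic are taken from the text above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the 8 bytes of per-entry case data.
struct CaseInfo {
    std::array<std::uint16_t, 4> cases;
};
static_assert(sizeof(CaseInfo) == 8);

// Hypothetical stand-in for Properties after the change: 12 bytes of other
// data, a 16-bit index into the shared table, and a reserved field that
// pads the struct to 16 bytes.
struct Props {
    std::array<std::uint8_t, 12> other;
    std::uint16_t caseIndex;
    std::uint16_t reserved;
};
static_assert(sizeof(Props) == 16);

// Shared, deduplicated table. Entry 0 plays the role of the all-zero entry
// for uncased characters; the second entry holds invented values.
inline constexpr std::array<CaseInfo, 2> caseTable = {{
    {{0, 0, 0, 0}},
    {{1, 0, 32, 0}},
}};

// One extra indirection replaces the embedded 8 bytes.
inline const CaseInfo &caseInfo(const Props &p)
{
    assert(p.caseIndex < caseTable.size());
    return caseTable[p.caseIndex];
}

// The estimate from the message: 4 bytes saved per Properties entry,
// minus the cost of the new 448-entry table.
constexpr long theoreticalSaving = 3372L * (20 - 16) - 448L * 8;
static_assert(theoreticalSaving == 9904);
```

Callers that never look at case information skip the second lookup entirely, which is where the hoped-for qGetProp() benefit would come from.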
Picking to all active branches, because the Unicode tables are
still maintained in all of them.
Fixes: QTBUG-139427
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: If4dc47ef06c674ad0263f0623ec408a25b977b3a
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
|
This component does not process external data, so it is not
security-critical.
Yes, the characters presented to its functions may come from external
sources, but that's not different from, say, a QRect being parsed from
-geometry. The fact that there is code that parses a -geometry into a
QRect doesn't make QRect a data-parser, or security-critical. It's
just a container for the result, and so is QUnicodeTables: a container
for char32_t-indexed properties.
The accompanying qt_attribution.json confirms that this component is
not security-critical.
Task-number: QTBUG-135195
Pick-to: 6.10 6.8
Change-Id: I565bd885220c0282ce7fb801411f12a80052465f
Reviewed-by: Ivan Solovev <ivan.solovev@qt.io>
|
This is in preparation for storing this information in a separate
array to save space by removing the many duplicates in there.
Pick to all active branches to have the same internal API going
forward, even if we don't pick the storage optimization back as far.
Qt 6.5 doesn't have QSpan, yet (not even as private API), but
returning a reference to const std::array<.,4> will be an adequate
replacement. To enable that without casting, convert `cases` from a C
array to a std::array. For some reason, this requires extra
parentheses, so add them.
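A minimal sketch of such an accessor, assuming a simplified Properties. The names Props, CaseEntry, caseEntries() and caseDiff() are hypothetical and only illustrate returning a reference to a const std::array instead of a QSpan.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum Case { LowerCase, UpperCase, TitleCase, CaseFold, NumCases };

struct CaseEntry {
    unsigned short special : 1;
    signed short diff : 15;
};

struct Props {
    // Previously a C array (CaseEntry cases[NumCases]); a std::array member
    // can be handed out by reference without any cast.
    std::array<CaseEntry, NumCases> cases;
};

// Accessor that hides where the case data lives. Callers that only go
// through it keep compiling if the storage later moves to a separate table.
inline const std::array<CaseEntry, NumCases> &caseEntries(const Props &p)
{
    return p.cases;
}

inline int caseDiff(const Props &p, Case which)
{
    assert(which < NumCases);
    return caseEntries(p)[std::size_t(which)].diff;
}
```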
Task-number: QTBUG-139427
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I5331fd6d71a6a447b0445d8235b5eb8e38178e2e
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
Clang was already doing it, but GCC (at least in LTO mode) wasn't and
was repeatedly calling qGetProp(). This has the benefit that, in most
cases, the input character whose property we seek is UTF-16, so dead-code
elimination removes the extra branch - this can happen when QString
functions go through the QChar front-end, like QChar::isSpace() or
isSymbol(), which route through the char32_t overload.
This forced inlining allows us to remove the UCS2 overloads of qGetProp()
and properties(), because the same const-propagation will apply to all
but one of the places where UTF-16 code units were being compared. The
16-bit qGetProp() was only used in qstring.cpp's convertCase_helper(),
whose 16-bit overload was only used in foldCase(). The one exception to
this is qtextengine.cpp's QBidiAlgorithm::resolveN0():
const QUnicodeTables::Properties *p = QUnicodeTables::properties(char16_t{text[pos].unicode()});
This will now call the full UTF-32 overload.
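To illustrate why the 16-bit overloads become redundant, here is a hedged sketch. SKETCH_ALWAYS_INLINE, tableBranch() and tableBranchForCodeUnit() are invented names, and the function returns a branch number instead of a table pointer; the 0x11000 threshold is the one quoted elsewhere in this log.

```cpp
#include <cassert>

#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#  define SKETCH_ALWAYS_INLINE inline
#endif

constexpr char32_t LastValidCodePoint = 0x10FFFF;

// Stand-in for the char32_t lookup: reports which of the two table layouts
// would serve the code point.
SKETCH_ALWAYS_INLINE int tableBranch(char32_t ucs4)
{
    assert(ucs4 <= LastValidCodePoint);
    if (ucs4 < 0x11000)
        return 0; // layout used for low code points
    return 1;     // layout used for everything above
}

// A UTF-16 code unit is at most 0xFFFF, which is below 0x11000. Once
// tableBranch() is inlined here, the compiler can prove the second branch
// unreachable and drop it, so no separate 16-bit overload is needed.
inline int tableBranchForCodeUnit(char16_t unit)
{
    return tableBranch(char32_t{unit});
}
```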
Pick-to: 6.10
Change-Id: Ifa4f2d77475877f26be2fffd9a987ff994dc8ef1
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
The Q_DECL_CONST_FUNCTION needs to be on the declaration to have any
effect on callers, but it was only on the (out-of-line) definition.
Amends 2fe90a61bdf16bb1a08817ba544e2309b524a052.
As a drive-by, also remove the export macros from the definitions;
they, too, are only needed on the declaration.
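A small sketch of the placement issue, assuming GCC or Clang. SKETCH_CONST_FUNCTION stands in for Q_DECL_CONST_FUNCTION, and digitValue() and sumOfTwoLookups() are hypothetical functions, not the ones touched by this change.

```cpp
#include <cassert>

#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_CONST_FUNCTION __attribute__((const))
#else
#  define SKETCH_CONST_FUNCTION
#endif

// Declaration, as a header would carry it. Callers only ever see this, so
// this is where the attribute has to be for them to benefit.
SKETCH_CONST_FUNCTION int digitValue(char32_t ucs4) noexcept;

// Out-of-line definition, as the .cpp file would carry it. Repeating the
// attribute (or an export macro) here adds nothing for callers.
int digitValue(char32_t ucs4) noexcept
{
    return (ucs4 >= U'0' && ucs4 <= U'9') ? int(ucs4 - U'0') : -1;
}

// With the attribute visible, the compiler may merge the two identical
// calls into one, because the result depends only on the argument.
inline int sumOfTwoLookups(char32_t c)
{
    assert(c <= 0x10FFFF);
    return digitValue(c) + digitValue(c);
}
```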
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Id69b58c50440b8b835f7be7ba873927d07b11219
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
Of all the Category categories, separators are the only ones that currently
have assigned codepoints exclusively in the BMP. This allows us to lower
the maximum check from the LastValidCodepoint to a category-specific
one. This will also cause the compiler to dead-code eliminate the check
inside of qGetProperty and emit only the BMP check of the property
tables:
if (ucs4 < 0x11000)
return uc_properties + uc_property_trie[uc_property_trie[ucs4 >> 5] + (ucs4 & 0x1f)];
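The indexing expression above is a two-stage trie lookup. The following toy version shows the same shape on a made-up table covering only U+0000..U+005F; makeTrie(), propertyIndex() and the table contents are assumptions for illustration, not the generated Qt data.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned BlockBits = 5;               // 32 code points per block
constexpr unsigned BlockSize = 1u << BlockBits;
constexpr unsigned NumBlocks = 3;               // toy range: U+0000..U+005F

// Layout: NumBlocks first-stage offsets, followed by the second-stage
// blocks those offsets point at. Identical blocks are stored only once.
constexpr std::array<std::uint16_t, NumBlocks + 2 * BlockSize> makeTrie()
{
    std::array<std::uint16_t, NumBlocks + 2 * BlockSize> t{};
    const unsigned plain = NumBlocks;               // all-zero block
    const unsigned digits = NumBlocks + BlockSize;  // block containing '0'..'9'
    t[0] = plain;
    t[1] = digits;
    t[2] = plain;                                   // shares the all-zero block
    for (unsigned c = U'0'; c <= U'9'; ++c)
        t[digits + (c & (BlockSize - 1))] = 1;      // property index 1 = "digit"
    return t;
}

inline constexpr auto trie = makeTrie();

// Same shape as the expression quoted above: the high bits select a block
// offset, the low bits select the entry inside that block.
inline unsigned propertyIndex(char32_t ucs4)
{
    assert(ucs4 < NumBlocks * BlockSize);
    return trie[trie[ucs4 >> BlockBits] + (ucs4 & (BlockSize - 1))];
}
```

Blocks 0 and 2 point at the same second-stage block, which is the kind of sharing that keeps such tables small.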
Pick-to: 6.10
Change-Id: I31eda5d79cc2c3560d90fffd74a546d1e7cda7bb
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
They added some new scripts.
There were a few changes to the line break algorithm;
most notably, there are more rules that require more context than before.
While not major, there was some shuffling and additions to our
implementation to match the new rules.
IDNA test data now disallows the trailing dot/empty root label,
technically to be toggled off by an option that controls a few things,
but we don't have options. For test data they changed the format a
little: "" is used to mean the empty string, while a blank segment is
null/no string. Update the parser to read this.
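The distinction between an empty string and no string could be modelled as below. This is only a sketch: trimmed() and parseField() are hypothetical helpers, not the functions in the actual test parser, and splitting a line into fields is left out.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Strip surrounding spaces from one field of a test-data line.
inline std::string_view trimmed(std::string_view s)
{
    while (!s.empty() && s.front() == ' ')
        s.remove_prefix(1);
    while (!s.empty() && s.back() == ' ')
        s.remove_suffix(1);
    return s;
}

// A blank field means "no string" (nullopt); a field consisting of two
// double quotes means the empty string; anything else is taken verbatim.
inline std::optional<std::string> parseField(std::string_view raw)
{
    const std::string_view field = trimmed(raw);
    if (field.empty())
        return std::nullopt;
    if (field == "\"\"")
        return std::string();
    return std::string(field);
}

inline void usageExample()
{
    assert(!parseField("  ").has_value());
    assert(parseField(" \"\" ") == std::string());
    assert(parseField(" a.example ") == std::string("a.example"));
}
```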
[ChangeLog][Third-Party Code] Updated the Unicode Character Database to
UCD revision 34/Unicode 16.
Fixes: QTBUG-132902
Task-number: QTBUG-132851
Pick-to: 6.9 6.8 6.5
Change-Id: I4569703659f6fd0f20943110a03301c1cf8cc1ed
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
Expand unicode data to include information needed to
parse emoji sequences. This is a pre-requisite for
automatically preferring color fonts for emojis.
As a drive-by, this also fixes a double space in the
output of the uc_properties array.
Task-number: QTBUG-111801
Change-Id: Icd993803c87c69ed278c7724377028f3706d0272
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
The existing data comes under Unicode-DFS-2016 but future updates
shall come under Unicode-3.0, so update the existing headers with the
former and the generator script with the latter. Leave a note in the
attribution file about this transitional state and how to resolve it.
Replaced UNICODE_LICENSE.txt from src/corelib/text/ with
LICENSES/Unicode-DFS-2016.txt, as fetched using reuse download.
This doesn't look like a rename but only actually adds some irrelevant
lines about where on the Unicode website the upstream files (to which
we do not apply this license) come from and changes some spacing.
Pick-to: 6.7 6.5
Fixes: QTBUG-121653
Change-Id: I50c9f4badc77a9aa402af946561aff58ae9e3e7a
Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>
Reviewed-by: Kai Köhne <kai.koehne@qt.io>
|
The new rules were added in Unicode 15.1 (TR #14, revision 51).
The rules read:
LB15a: (sot | BK | CR | LF | NL | OP | QU | GL | SP | ZW)
[\p{Pi}&QU] SP* ×
LB15b: × [\p{Pf}&QU] (SP | GL | WJ | CL | QU | CP | EX
| IS | SY | BK | CR | LF | NL | ZW | eot)
Add two new line breaking classes LineBreak_QU_Pi and _QU_Pf to
represent quotation characters with context that matches left
side of LB15a and right side of LB15b respectively. This way
it is still possible to use the line breaking classes table.
Also add a comment about the original source of the line
break table.
Task-number: QTBUG-121529
Change-Id: Ib35f400e39e76819cd1c3299691f7b040ea37178
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
|
Add enumerator for the new Unicode version to QChar::UnicodeVersion.
Remap new line breaking classes to their Unicode 15.0 values:
* AK, AP and AS to AL,
* VI and VF to CM.
These are classes for new line breaking support for Indic scripts
that require more work.
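The remapping described above amounts to a small lookup. In this sketch the enum is reduced to the classes named in this message plus a catch-all, and remapTo15_0() is a hypothetical name; the real generator works on the full class list.

```cpp
#include <cassert>

// Reduced, illustrative set of line breaking classes.
enum class LineBreakClass { AL, CM, AK, AP, AS, VI, VF, Other };

// Map the classes introduced in Unicode 15.1 back to the classes the same
// characters had in Unicode 15.0; everything else is left alone.
constexpr LineBreakClass remapTo15_0(LineBreakClass c)
{
    switch (c) {
    case LineBreakClass::AK:
    case LineBreakClass::AP:
    case LineBreakClass::AS:
        return LineBreakClass::AL;
    case LineBreakClass::VI:
    case LineBreakClass::VF:
        return LineBreakClass::CM;
    default:
        return c;
    }
}

static_assert(remapTo15_0(LineBreakClass::AK) == LineBreakClass::AL);
static_assert(remapTo15_0(LineBreakClass::VF) == LineBreakClass::CM);

inline void usageExample()
{
    assert(remapTo15_0(LineBreakClass::Other) == LineBreakClass::Other);
}
```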
Blacklist failing tests for now:
* tst_QUrlUts46::idnaTestV2
* tst_QTextBoundaryFinder::lineBoundariesDefault
* tst_QTextBoundaryFinder::graphemeBoundariesDefault
Regenerate the source files.
Task-number: QTBUG-121529
Change-Id: I869cc9fbaa53765d8ae6265c22cdbef9f19d05bf
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
This amends c4e550703c2bdc1ee710507b8df9c0c9a118402e. The data version
update was just forgotten when updating to Unicode 15.0.
Pick-to: 6.5 6.6 6.7
Change-Id: Ibb3e9cb81e9bbcb5d4aaf4e4df6231485531c128
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
This corresponds to Unicode version 15.0.0.
Added the following scripts:
* Kawi
* Nag Mundari
Full support of these scripts requires harfbuzz version 5.2.0,
which adds support for Unicode 15.0:
https://github.com/harfbuzz/harfbuzz/releases/tag/5.2.0
Fixes: QTBUG-106810
Change-Id: Ib06c526e49b0f01ef9f21123bcf875c6b19f2601
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
Task-number: QTBUG-100485
Pick-to: 6.3 6.2
Change-Id: I41480a34b14fd86a68a5c10b7e0f3d250e785d0f
Reviewed-by: Marc Mutz <marc.mutz@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
This property is needed to properly implement the line breaking
algorithm from UAX #14.
Task-number: QTBUG-97537
Pick-to: 6.3
Change-Id: Ia83cc553c9ef19fae33560721630849d2a95af84
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
Remove E_Base, Glue_After_Zwj, E_Base_GAZ, and E_Modifier obsoleted by
UTS #29, version 33 (Unicode 11.0.0).
Task-number: QTBUG-97537
Pick-to: 6.2 6.3
Change-Id: If5dc36ae17cd8746bbe81b73bbcc0863181e4a7a
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
Replace the current license disclaimer in files by
a SPDX-License-Identifier.
Files that have to be modified by hand are modified.
License files are organized under LICENSES directory.
Task-number: QTBUG-67283
Change-Id: Id880c92784c40f3bbde861c0d93f58151c18b9f1
Reviewed-by: Qt CI Bot <qt_ci_bot@qt-project.org>
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Jörg Bornemann <joerg.bornemann@qt.io>
|
This corresponds to Unicode version 14.0.0.
Added the following scripts:
* CyproMinoan
* OldUyghur
* Tangsa
* Toto
* Vithkuqi
Full support of these scripts requires harfbuzz version 3.0.0,
which adds support for Unicode 14.0:
https://github.com/harfbuzz/harfbuzz/releases/tag/3.0.0
With this release 10 test cases in tst_qurluts46 were fixed, one
additional test case is failing in tst_qtextboundaryfinder and
is commented out. In total 62 line break test cases and 44 word
break test cases are failing.
A comment in src/corelib/text/qt_attribution.json was updated to
include the URL of the page containing UCD version number.
Fixes: QTBUG-94359
Change-Id: Iefc9ff13f3df279f91cbdb1246d56f75b20ecb35
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
Run the unicode utility to regenerate the Unicode tables. This reduces the
size of the IDNA mapping tables. Adjust the QUrl client code to use
the new API.
Task-number: QTBUG-85323
Change-Id: Iaa8d6932e611f7aa4009a3fae2972de87b875cf8
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
Re-run unicode utility to update the Unicode tables. This adds
properties and mappings needed to implement UTS #46 (IDNA).
Task-number: QTBUG-85323
Change-Id: Id1de91caddd82095f8f8f2301bfd7bb2ee3fcafd
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
|
UAX #29 in Unicode 11 changed the EGC algorithm to its current form.
Although Qt has upgraded the Unicode tables all the way up to
Unicode 13, the algorithm has never been adapted; in other words,
it has been working by chance for years. Luckily, MOST
of the cases were dealt with correctly, but emoji handling
actually manages to break it.
This commit:
* Adds parsing of emoji-data.txt into the unicode table generator.
That is necessary to extract the Extended_Pictographic property,
which is used by the EGC algorithm.
* Regenerates the tables.
* Removes some obsoleted grapheme cluster break properties, and
adds the ones added in the meantime.
* Rewrites the EGC algorithm according to Unicode 13. This is
done by greatly simplifying the lookup table. Some rules (GB11,
GB12, GB13) can't be handled by the table alone, so some hand-rolled
code is necessary for them.
* Thanks to these fixes, the complete upstream GraphemeBreakTest
now passes. Remove the "edited" version that ignored some rows
(because they were failing).
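To show why a pairwise table cannot express rules such as GB12 and GB13, here is a hedged sketch of the regional-indicator case. It works on UTF-32 for brevity, and gb12_13AllowsBreak() is an invented helper, not the code added by this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr bool isRegionalIndicator(char32_t c)
{
    return c >= 0x1F1E6 && c <= 0x1F1FF;
}

// Considers only GB12/GB13: returns false when those rules forbid a break
// between text[pos - 1] and text[pos]. When the rules do not apply it
// returns true, meaning the decision is left to the other rules.
inline bool gb12_13AllowsBreak(std::u32string_view text, std::size_t pos)
{
    assert(pos > 0 && pos < text.size());
    if (!isRegionalIndicator(text[pos - 1]) || !isRegionalIndicator(text[pos]))
        return true;

    // Count the run of regional indicators that ends at text[pos - 1].
    // A pairwise table cannot know this; it needs unbounded look-behind.
    std::size_t run = 0;
    for (std::size_t i = pos; i > 0 && isRegionalIndicator(text[i - 1]); --i)
        ++run;

    // Odd run: text[pos] completes a flag pair, so no break here.
    return run % 2 == 0;
}
```

For four consecutive regional indicators (two flags), this permits a break only in the middle.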
Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
Pick-to: 6.1 6.0 5.15
Fixes: QTBUG-92822
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
|
The Unicode table code can only be safely called on valid code-points.
So code that calls it must only pass it valid Unicode data. The string
iterator's Unchecked methods only provide this guarantee
when the string being iterated is guaranteed to be valid UTF-16; while
client code should only use QString, QStringView and friends on valid
UTF-16 data, we have no way to be sure they have respected that.
So take the few extra cycles to actually check validity in the course
of iterating strings, when the resulting code-points are to be passed
to the Unicode table look-ups. Add tests that case mapping doesn't
access Unicode tables out of range (it'll trigger the new assertion).
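A sketch of what checked decoding means here, assuming that invalid input is replaced by U+FFFD. nextChecked() and the replacement policy are assumptions for illustration; the string iterator in Qt has its own interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr bool isHighSurrogate(char16_t u) { return (u & 0xFC00) == 0xD800; }
constexpr bool isLowSurrogate(char16_t u) { return (u & 0xFC00) == 0xDC00; }
constexpr char32_t ReplacementCharacter = 0xFFFD;

// Decode the code point starting at s[pos] and advance pos past it.
// A high surrogate is only combined with the next unit if that unit really
// is a low surrogate; lone surrogates yield U+FFFD, so the result is always
// something that is safe to look up.
inline char32_t nextChecked(std::u16string_view s, std::size_t &pos)
{
    assert(pos < s.size());
    const char16_t u = s[pos++];
    if (isHighSurrogate(u)) {
        if (pos < s.size() && isLowSurrogate(s[pos])) {
            const char16_t low = s[pos++];
            return 0x10000 + ((char32_t(u) - 0xD800) << 10)
                           + (char32_t(low) - 0xDC00);
        }
        return ReplacementCharacter; // high surrogate without its partner
    }
    if (isLowSurrogate(u))
        return ReplacementCharacter; // low surrogate on its own
    return u;
}
```

An unchecked variant would skip the isLowSurrogate() test and combine whatever unit follows, which is how out-of-range values could reach the tables.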
Added some comments to qchar.h that helped me understand surrogates.
Change-Id: Iec2c3106bf1a875bdaa1d622f6cf94d7007e281e
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
They were only used by one function each, in unicodetables.cpp, so
don't need to be macros.
Change-Id: I3e7f9f661568862d0a0d265bb8f657a8e0782b13
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
Eliminate some needless parentheses, tidy up some spacing and
indentation and split some long lines. Change first += after
declaration to initializer.
Change-Id: I05ff2a6337b7ed14e0a2dc9c03fc784c92b63515
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
There are (at least) two implementations of the low-level case-folding
algorithm, one of which (for QChar::toLower()) seems to be wrong (it
doesn't deal with special cases which expand to more than one code
point).
The algorithm hidden in QString and entangled with the QString
detaching code makes reusing the code much harder.
At the same time, the dependency of the algorithm on the unicode
tables makes exposing a non-allocating result type in the public API
hard. std::u16string would be an alternative if we can assure that all
implementations use SSO with at least four characters.
So, for the time being, leave this as internal API for use in an
upcoming QStringView::toLower() as well as case-insensitive hashing.
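One possible shape for a non-allocating result type is sketched below. CaseResult, its capacity of four code units and toUpperSketch() are assumptions made for illustration; the hard-coded mapping replaces the Unicode table lookup and covers only ASCII letters and U+00DF.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string_view>

// Fixed-capacity result: a case mapping may expand one character into
// several code units, but never into many, so no heap allocation is needed.
struct CaseResult {
    std::array<char16_t, 4> chars{};
    std::uint8_t size = 0;

    std::u16string_view view() const { return {chars.data(), size}; }
};

// Toy mapping standing in for the table-driven algorithm.
inline CaseResult toUpperSketch(char16_t c)
{
    if (c == u'\u00DF')                 // sharp s uppercases to "SS"
        return {{u'S', u'S'}, 2};
    if (c >= u'a' && c <= u'z')
        return {{char16_t(c - 32)}, 1};
    return {{c}, 1};
}

inline void usageExample()
{
    assert(toUpperSketch(u'\u00DF').view() == u"SS");
    assert(toUpperSketch(u'q').view() == u"Q");
}
```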
Change-Id: Iabb2611846f6176776aa20e634f44d8464f3305c
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
This makes existing calls passing uint or ushort ambiguous, so
fix all the callers. There do not appear to be callers outside
QtBase. In fact, the ...BreakClass() functions appear to be
utterly unused.
Change-Id: I1c2251920beba48d4909650bc1d501375c6a3ecf
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
|
Now that the standard gives us proper types for UTF-16 and UTF-32
characters, use them. Will eventually make the code much easier to
read than today, where uint could be an index as well as a char32_t.
It also ensures that the result of e.g. QChar::highSurrogate() can
still be implicitly converted to a QChar now that the
QChar(non-character-integral-types) ctors are being made explicit.
[ChangeLog][QtCore][QChar] All low-level functions
(e.g. highSurrogate()) now take and return char16_t instead of ushort
and char32_t instead of uint.
Change-Id: I9cd8ebf6fb998fe1075dae96c7c4484a057f0b91
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
|
Include WordBreakTest.html, since a test uses sample strings from it,
albeit without actually reading the file.
Had to comment out more of the new tests, as at Revision 24, pending
an update to harfbuzz and the text boundary detection code.
Task-number: QTBUG-79631
Task-number: QTBUG-79418
Task-number: QTBUG-82747
Change-Id: I0082294b09d67ffdc6a9b5c15acf77ad3b86f65f
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
|
Had to teach the update program to accept category Lm as for
Joining_Transparent, for the sake of a new ArabicShaping.txt entry.
Added three new Unicode versions, several new scripts and a new
word-break class.
Updated UCD's test data for tst_QTextBoundaryFinder. This left 57
tests failing; I have commented out the data rows for those tests,
pending someone with more knowledge addressing this.
Task-number: QTBUG-79631
Task-number: QTBUG-79418
Change-Id: Ic33d3b3551195d47a84d98e84020f57a68f0b201
Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
|
Instead of four pairs of :1 :15 bit fields, use an array of four :1,
:15 structs. This allows replacing the case folding traits classes
with a simple enum that indexes into said array.
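A hedged sketch of the resulting shape. Props and convertCase() are simplified, hypothetical stand-ins; special (multi-character) mappings and the WASM-specific layout are left out.

```cpp
#include <cassert>
#include <cstdint>

enum Case { LowerCase, UpperCase, TitleCase, CaseFold, NumCases };

struct Props {
    // One :1/:15 pair per kind of case conversion, instead of four
    // separately named pairs of bit-fields.
    struct {
        unsigned short special : 1;
        signed short diff : 15;
    } cases[NumCases];
};

// A single function parameterised by the enum replaces one traits class
// per conversion: the enumerator simply selects the array slot.
inline char32_t convertCase(char32_t ucs4, const Props &p, Case which)
{
    assert(which < NumCases);
    const auto &entry = p.cases[which];
    if (entry.special)
        return ucs4; // special mappings are outside this sketch
    return char32_t(std::int32_t(ucs4) + entry.diff);
}
```

With traits classes, each conversion instantiated its own copy of the algorithm, which is presumably where the text-size saving quoted below comes from.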
I don't know what the WASM #ifdef'ed code is supposed to effect (a :0
bit-field is only useful to separate adjacent bit-fields into separate
memory locations for multi-threading), but I thought it safer to leave
it in, and that means the array must be a 64-bit block of its own, so
I had to move two fields around.
Saves ~4.5KiB in text size on optimized GCC 10 LTO Linux AMD64 builds.
Change-Id: Ib52cd7706342d5227b50b57545d073829c45da9a
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
|
GCC doesn't like the sequence
: 5
: 5
: 8
: 6
: 8
and inserts a :6 padding between the :5 and the :8 and a :2 padding
between the :6 and the :8, growing the bitfield by 8 bits of embedded
padding and another byte to bring the struct back to sizeof % 2 == 0.
Fix by reshuffling the elements and adding a static_assert for the
next round.
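The effect can be reproduced with a toy struct. The field widths below are the ones listed above; the 16-bit field type, the field names and the chosen new order are assumptions, and the exact layout is ABI-specific.

```cpp
#include <cassert>
#include <cstdint>

// Original order: with 16-bit storage units, the first :8 cannot follow the
// two :5 fields without straddling a unit, and neither can the second :8
// after the :6, so padding is inserted in both places.
struct Padded {
    std::uint16_t a : 5;
    std::uint16_t b : 5;
    std::uint16_t c : 8;
    std::uint16_t d : 6;
    std::uint16_t e : 8;
};

// Reshuffled order: 8 + 8 fills one unit, 5 + 5 + 6 fills the next.
struct Packed {
    std::uint16_t c : 8;
    std::uint16_t e : 8;
    std::uint16_t a : 5;
    std::uint16_t b : 5;
    std::uint16_t d : 6;
};

// Guard for the next round of changes: fails to compile if a later edit
// reintroduces padding.
static_assert(sizeof(Packed) == 4, "bit-field layout changed; recheck the field order");

inline void usageExample()
{
    Packed p{};
    p.c = 200;
    p.a = 17;
    assert(p.c == 200 && p.a == 17);
    assert(sizeof(Padded) >= sizeof(Packed));
}
```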
Saves ~5KiB in QtCore executable size.
Change-Id: I4758a6f48ba389abc2aee92f60997d42ebb0e5b8
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
|
This includes byte array, string, char, unicode, locale, collation and
regular expressions.
Change-Id: I8b125fa52c8c513eb57a0f1298b91910e5a0d786
Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>
|