
Context (skip if you don't care; read if you suspect I'm totally on the wrong track)

For an embedded system with small memory, I want to generate fonts that contain only those glyphs actually needed. So at build time, I need to scan the language files, extract the characters from the strings, and pass their codes as arguments to the font generation tool.

Translation file with the relevant strings (just an example, of course, but at least it covers some Unicode stuff):

TEXT_1=Foo
TEXT_2=Bar
TEXT_3=Baz
TEXT_4=Ünicødé
TEXT_5=ελληνικά

Expected output

0x42,0x61,0x72,0x42,0x61,0x7A,0x46,0x6F,0x6F,0xDC,0x6E,0x69,0x63,0xF8,0x64,0xE9,0x3B5,0x3BB,0x3BB,0x3B7,0x3BD,0x3B9,0x3BA,0x3AC

My approach so far

This script simply does what I described: sed reads the file, extracts the strings, and prepares them to be formatted by printf; sort -u removes duplicates:

for char in $(sed "s/[^=]*=//;s/./'& /g" myLang.translation|sort -u); do
  printf "0x%02X\n" $char
done

This works for the example, but it feels ugly, unreliable, wrong, and probably slow for real files. Can you name a better tool, a better approach, a better whatever?

  • What are your normalization requirements? Typically you won't normalize things like file names. Do you need to recover the original bytes at some point? Commented Apr 24, 2023 at 21:00
  • @jubilatious1 Actually, I don't understand what you mean by »normalization«. I don't need to normalize anything and I don't need to recover the bytes. The firmware will have access to the original translation table. Commented Apr 25, 2023 at 5:17
  • Here's the reference page on Unicode Normalization Forms. The different normalization forms include NFC and NFG, and different languages do it differently (e.g. Raku). Commented Apr 25, 2023 at 10:09
    @jubilatious1 Thank you for helping me understand. The input file is supposed to be normalized by the translation database tool, and no further changes to the codes are allowed. The codes are needed as they appear in the input file. Commented Apr 25, 2023 at 11:21
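As a side note on the normalization point raised in these comments, here is a minimal Python sketch (my own illustration, not from the thread) of why it matters: the same visible string can decompose into different codepoint sets depending on the normalization form, which would change the extracted glyph list.

```python
# NFC keeps the precomposed character; NFD decomposes it into a base
# letter plus a combining mark -- two different codepoint sets.
import unicodedata

s = "é"  # U+00E9, precomposed
nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)

print([hex(ord(c)) for c in nfc])  # ['0xe9']
print([hex(ord(c)) for c in nfd])  # ['0x65', '0x301']  (e + combining acute)
```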

4 Answers


With perl:

perl -C -lne '
  if (/=(.*)/) {$c{$_}++ for split //, $1}
  END{print join ",", map {sprintf "0x%X", ord$_} sort keys %c}
  ' your-file

Gives:

0x42,0x46,0x61,0x63,0x64,0x69,0x6E,0x6F,0x72,0x7A,0xDC,0xE9,0xF8,0x3AC,0x3B5,0x3B7,0x3B9,0x3BA,0x3BB,0x3BD
  • -C does UTF-8 I/O if the locale uses UTF-8 as its charmap
  • -ln sed -n mode, where the code is run on each line of the input. -l removes the line delimiter from the input, and adds it back on output (does a $\ = $/)
  • -e 'code' to specify the code to run on the command line instead of from a script.
  • /=(.*)/ to match on lines containing at least one =, capturing what's after the first occurrence of = in $1 (the first capture group).
  • split //, $1 splits it with an empty separator, so into the individual characters
  • $c{$_}++ for that-above loops over that list of characters and increments the corresponding associative array element. %c maps characters to their number of occurrences. We don't use that count here.
  • END{code}: code only run at the end.
  • sort keys %c sorts the keys of that associative array lexically
  • map { code } @list to transform a list by applying the code on each element.
  • ord$_ gets the numeric value of the character.
  • sprintf "0x%X" formats it as hex (with capital ABCDEF, but 0x in lower case).
  • join ",", @list joins the list with ,
  • print prints it followed by $\ (newline).
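For comparison, the same logic can be transcribed to Python (my sketch, not part of the original answer; it assumes UTF-8 input, which Python's default text mode handles via the locale):

```python
# Hypothetical Python transcription of the Perl one-liner above: count the
# characters after the first '=' on each line, then print the sorted
# unique codepoints as comma-separated hex.
from collections import Counter

def collect_codepoints(lines):
    counts = Counter()
    for line in lines:
        key, sep, value = line.rstrip("\n").partition("=")
        if sep:                      # only lines containing '='
            counts.update(value)     # count each character
    return ",".join("0x%X" % ord(c) for c in sorted(counts))

sample = ["TEXT_1=Foo", "TEXT_2=Bar", "TEXT_3=Baz",
          "TEXT_4=Ünicødé", "TEXT_5=ελληνικά"]
print(collect_codepoints(sample))
# → 0x42,0x46,0x61,0x63,0x64,0x69,0x6E,0x6F,0x72,0x7A,0xDC,0xE9,0xF8,0x3AC,0x3B5,0x3B7,0x3B9,0x3BA,0x3BB,0x3BD
```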

In zsh (likely a lot less efficient):

$ set -o cbases -o extendedglob
$ LC_COLLATE=C
$ echo ${(j[,])${(ous[])"$(<your-file cut -sd= -f2- | tr -d '\n')"}/(#m)?/$(([#16]#MATCH))}
0x42,0x46,0x61,0x63,0x64,0x69,0x6E,0x6F,0x72,0x7A,0xDC,0xE9,0xF8,0x3AC,0x3B5,0x3B7,0x3B9,0x3BA,0x3BB,0x3BD

Or without using external utilities:

$ set -o cbases -o extendedglob
$ LC_COLLATE=C
$ echo ${(j[,])${(@ous[])${(f)"$(<your-file)"}#*=}/(#m)?/$(([#16]#MATCH))}
0x42,0x46,0x61,0x63,0x64,0x69,0x6E,0x6F,0x72,0x7A,0xDC,0xE9,0xF8,0x3AC,0x3B5,0x3B7,0x3B9,0x3BA,0x3BB,0x3BD
  • "$(<your-file)": contents of the file, with trailing newline characters removed, quoted so it's not IFS-split
  • ${(f)param} splits on line-feed to get the lines as a list
  • ${array#*=} removes the shortest leading part matching *= from array elements.
  • @ flag to ensure list processing
  • o order lexically (based on code point in the C locale)
  • unique removes duplicates
  • s[] splits into individual characters.
  • ${array/(#m)?/$(([#16]#MATCH))} substitutes each character (?), captured in $MATCH thanks to (#m), with its value (#MATCH in the arithmetic expression) formatted in base 16 ([#16]). With the cbases option, that is rendered as 0xBEEF instead of 16#BEEF.
  • j[,] joins with ,.

Breaking it down into the individual steps would make it more legible:

set -o cbases -o extendedglob
LC_COLLATE=C
contents=$(<your-file)
lines=( ${(f)contents} )
values=( ${lines#*=} )
chars=( ${(@ous[])values} )
codepoints=( ${chars/(#m)?/$(( [#16] #MATCH ))} )
echo ${(j[,])codepoints}
2

Should be doable with an iconv | hexdump pipeline.

Terse proof-of-concept for your sample input:

cut -d= -f2- | iconv -t UTF-32LE | hexdump -ve '"0x%02X,"'

NOTE: the above command works as expected if run on a little-endian CPU architecture such as the x86 family. More on this caveat below.

To uniquify the codepoints, as well as strip the spurious comma and 0x0A:

cut -d= -f2- | sed 's/./&\
/g' | sort -u | tr -d '\n' | iconv -t UTF-32LE | hexdump -ve '",0x%02X"' | cut -d, -f2-

NOTE: In this latter example the sed command has a literal newline character embedded in the replacement part of its s/// command, because I wanted to provide a more portable example. If your shell supports the $'...' quoting syntax, you may replace that entire sed command with sed $'s/./&\\\n/g' in order to handily embed that newline character on the same line. The $'...' syntax should normally be available in any version of bash recent enough to support the '<char> argument to the %X conversion of the printf builtin you're using in your example.

A few notes about this solution:

  • CPU endianness: because we're feeding hexdump, which can only work with fixed-size integers in the CPU's own endianness, we need iconv to convert into a character stream compatible with that requirement. UTF-32LE is a fixed 4-byte-per-character encoding of the entire Unicode space, expressed in the little-endian flavor. Should you need to run that command on a big-endian CPU, you'd want iconv -t UTF-32BE instead.
  • The examples above assume that the input is encoded with the same charset used by the current locale, just like your own example does. If the encodings differ, you might be tempted to use iconv's -f option to explicitly specify the input data's encoding, but it is safest to switch the entire pipeline to a locale carrying the same encoding as the input data, because the cut command above needs to detect the = character and sed needs to correctly detect each character entity.
  • conversion availability: whether your host's iconv can actually convert from your input data's encoding to UTF-32 is system-dependent. At least GNU's (glibc) iconv is very capable when all its "gconv modules" are installed on the host system. For the commands above you need at least your input data's encoding and the UTF-32 encoding. Note that UTF-8 support is typically built into glibc's own main libc file, while the UTF-32 flavors live in a specific gconv module's .so file. A typical full-blown operating system based on GNU's glibc normally carries the entire set of loadable gconv modules, which cover pretty much every encoding in the world.
  • conversion speed: GNU's iconv most typically "triangulates", i.e. converts from the input encoding to iconv's own internal representation, and then from that to the wanted output encoding. AFAIK very few of iconv's gconv modules provide direct conversion from one encoding to another, skipping the intermediate step. Whether this "triangulating" behavior, plus the auxiliary transformations (cut | sed | sort etc.) performed by the pipeline as a whole, is quicker or slower than non-GNU iconv (or non-iconv) conversions, I don't know.

Note finally that none of what is said above pertains to your target embedded system, of course. This solution is meant to be run on a normally powerful, fully featured build system.
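To illustrate the endianness caveat above, here is a small Python sketch (my own illustration, using Python's codecs as a stand-in for iconv): the same character yields the same 32-bit value, but in opposite byte orders, so the unpacking side must agree with the encoding side.

```python
# The same codepoint, serialized in both UTF-32 byte orders; struct's
# "<" (little-endian) and ">" (big-endian) format prefixes must match
# the flavor chosen on the encoding side.
import struct

ch = "λ"  # U+03BB
le = ch.encode("utf-32-le")
be = ch.encode("utf-32-be")

print(le.hex())  # bb030000
print(be.hex())  # 000003bb
print(hex(struct.unpack("<I", le)[0]))  # 0x3bb
print(hex(struct.unpack(">I", be)[0]))  # 0x3bb
```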

1

Using Raku (formerly known as Perl_6)

From here: https://docs.raku.org/language/faq.html#String:_How_can_I_get_the_hexadecimal_representation_of_a_string%3F

~$ raku -ne 'BEGIN my %chars; 
             %chars{$_.encode.gist}++ for .split("=", limit => 2, :skip-empty)[1..*].comb; 
             END put .keys for %chars.sort;'  file

#OR

~$ raku -ne 'BEGIN my %chars; 
             %chars{$_.encode.gist}++ for .comb(/<?after TEXT_ \d+ \= > .+ $/).comb; 
             END put .keys for %chars.sort;'  file

Raku (a.k.a. Perl_6) has functions like ord, ords, unique, printf, sprintf, but the above code is adapted directly from the Docs--and therefore is presumably recommended.

Sample Input (extra blank line at bottom):

TEXT_1=Foo
TEXT_2=Bar
TEXT_3=Baz
TEXT_4=Ünicødé
TEXT_5=ελληνικά

Sample Output:

utf8:0x<42>
utf8:0x<46>
utf8:0x<61>
utf8:0x<63>
utf8:0x<64>
utf8:0x<69>
utf8:0x<6E>
utf8:0x<6F>
utf8:0x<72>
utf8:0x<7A>
utf8:0x<C3 9C>
utf8:0x<C3 A9>
utf8:0x<C3 B8>
utf8:0x<CE AC>
utf8:0x<CE B5>
utf8:0x<CE B7>
utf8:0x<CE B9>
utf8:0x<CE BA>
utf8:0x<CE BB>
utf8:0x<CE BD>

Raku uses utf8 by default, and that's what you see above. But you can .encode("utf-16") and return those .values to get results identical to the Perl(5) answer from @StephaneChazelas:

~$ raku -e 'my @a = lines>>.subst(:global, / ^^ <(TEXT_ \d+ \= )> /).join.comb.unique;  \
            print join ",", map {sprintf("0x%X", .encode("utf-16").values) }, @a[].sort;'
0x42,0x46,0x61,0x63,0x64,0x69,0x6E,0x6F,0x72,0x7A,0xDC,0xE9,0xF8,0x3AC,0x3B5,0x3B7,0x3B9,0x3BA,0x3BB,0x3BD
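As a side illustration (mine, not from the answer): the .encode("utf-16") trick matches the codepoints because, for characters in the Basic Multilingual Plane, the single UTF-16 code unit is numerically equal to the codepoint. In Python terms:

```python
# For BMP characters, the lone UTF-16 code unit equals ord(ch);
# "utf-16-le" is used so no BOM is prepended.
import struct

for ch in "Bλά":
    (unit,) = struct.unpack("<H", ch.encode("utf-16-le"))
    assert unit == ord(ch)
    print("0x%X" % unit)
```

Outside the BMP this equivalence breaks down, since such characters encode as surrogate pairs (two code units).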

Should it be required, there's a fairly extensive discussion of Unicode normalization (in Raku) at the second link below. From that page, if you need to recover the original bytes--you use an encoding of utf8-c8 (i.e. "utf8-clean8").

https://docs.raku.org/language/faq.html#String:_How_can_I_get_the_hexadecimal_representation_of_a_string%3F
https://docs.raku.org/language/unicode
https://raku.org


Answers to my original problem were given (and I'll accept one of them), but for completeness' sake I'll add that, for several reasons and after some changes, I finally ended up with a Python script (the format definition changed: adjacent glyphs shall be grouped into ranges like 0x30-0x39):

#!/usr/bin/env python3
# glycol is the GLYph COLlector:
# it collects all used glyphs from the translation JSON files given as command-line arguments
# and prints them as a string formatted to be used as the -r argument for lv_font_conv.
# Usage: lv_font_conv -r $(glycol de.json en.json fr.json) ...

import sys
Glyphs=[]
# Loop over all files
sys.argv.pop(0)
for file in sys.argv:
    # Sorry, low level tool without error handling
    with open(file, 'r', encoding="utf-8") as f:
        for line in f:
            parts = line.split('"')
            if len(parts) == 5:
                # expect format _"key":"string" -- No json parsing
                Glyphs.extend(ord(c) for c in parts[3])
Glyphs.sort()
# Now loop over the sorted glyph list, skip duplicates, join regions
last=0
region=0
for glyph in Glyphs:
    if (last == 0):
        print(hex(glyph), end='')
    elif (glyph == last + 1):
        region = glyph
    elif (glyph > last):
        if (region == last):
            print('-'+ hex(region), end='')
        print(','+ hex(glyph), end='')
    last = glyph
if (region == last):
    print('-'+ hex(region), end='')
print()
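The range-grouping step above can be restated as a small standalone function, handy for testing it in isolation (my sketch, not part of the script):

```python
# Group sorted, deduplicated codepoints into lv_font_conv-style ranges:
# consecutive runs collapse to "start-end", singletons stay as-is.
def group_ranges(codepoints):
    runs = []
    for cp in sorted(set(codepoints)):
        if runs and cp == runs[-1][1] + 1:
            runs[-1][1] = cp             # extend the current run
        else:
            runs.append([cp, cp])        # start a new run
    return ",".join(hex(a) if a == b else f"{hex(a)}-{hex(b)}"
                    for a, b in runs)

print(group_ranges([0x30, 0x31, 0x32, 0x41, 0x43]))  # → 0x30-0x32,0x41,0x43
```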
