How can a text string be turned into UTF-8 encoded bytes using Bash and/or common Linux command line utilities? For example, in Python one would do:

"Six of one, ½ dozen of the other".encode('utf-8')
b'Six of one, \xc2\xbd dozen of the other'

Is there a way to do this in pure Bash:

STR="Six of one, ½ dozen of the other"
<utility_or_bash_command_here> --encoding='utf-8' "$STR"
'Six of one, \xc2\xbd dozen of the other'
  • Avoid answering questions in comments. Commented May 16, 2018 at 15:17
  • bash doesn't have a clear "text string" vs. "bytes" distinction. When you use STR="Six of one, ½ dozen of the other", it's already basically a list of bytes (more accurately, a C string), maybe in UTF-8 encoding, maybe in something else. Try echo "$STR" | od -x, and you'll probably see "bdc2" in the results. So I'm not really clear what you're trying to accomplish here. Commented May 16, 2018 at 15:47
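The point about od in the comment above can be checked directly. This is a minimal sketch, assuming a UTF-8 terminal and a POSIX od; show_bytes is a hypothetical helper name, not a standard utility. It uses od -tx1 rather than od -x, because -x groups the input into 16-bit words and on a little-endian machine shows the two bytes of ½ swapped (bdc2), while -tx1 prints one byte at a time in their original order.

```shell
# show_bytes is a hypothetical helper name, not a standard utility.
# It prints the bytes of its first argument as two-digit hex numbers.
show_bytes() {
  # -An: no offset column, -v: do not collapse repeated lines,
  # -tx1: one byte per hex number, in input order.
  printf '%s' "$1" | od -An -v -tx1
}

# In a UTF-8 locale the shell already stores this character as two bytes.
show_bytes "½"
```

If the string is stored as UTF-8, the two bytes c2 and bd should appear in that order, which is the same pair Python shows as \xc2\xbd.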

3 Answers

Python to the rescue!

alias encode='python3 -c "from sys import stdin; print(stdin.read().encode(\"utf-8\"))"'
root@kali-linux:~# echo "½ " | encode
b'\xc2\xbd \n'

You can also strip the b'' wrapper with sed or awk if you want.

2 Comments

I appreciate the effort, and upvoted for trying, but IIRC I specifically couldn't use Python to solve this problem and had to rely on bash utilities, which I guess I didn't make clear in the question. If Python is available, another option is encode() { python3 -c "print('$1'.encode())" ; } and calling encode "½ "
I didn't realize either that you specifically asked for bash utilities or pure bash. I saw the Perl answer, so I gave a Python one. My bad.

Perl to the rescue!

echo "$STR" | perl -pe 's/([^\x00-\x7f])/"\\x" . sprintf "%x", ord $1/ge'

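The /e modifier evaluates the replacement part of the s/// substitution as code, which in this case converts the ord value of each matched character to hex via sprintf. Because perl reads its input as bytes by default, each byte of a multi-byte UTF-8 character is matched and escaped separately.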
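For comparison, here is a minimal sketch of the same byte escaping without Perl, using only printf and od. It is an illustration rather than a tested drop-in: encode_utf8 is a hypothetical function name, and it assumes the string is already stored as UTF-8 bytes, which is the usual case in a UTF-8 locale. The shell does not convert anything here; it only makes the existing bytes visible.

```shell
# encode_utf8 is a hypothetical helper name. It assumes "$1" already holds
# UTF-8 bytes and only rewrites the non-ASCII ones as \xNN escapes.
# The body runs in a subshell ( ... ) so the loop variable does not leak.
encode_utf8() (
  # od -An -v -tx1 prints every input byte as a two-digit hex number.
  for byte in $(printf '%s' "$1" | od -An -v -tx1); do
    case $byte in
      # 00-7f: plain ASCII, print the character itself via its octal escape.
      [0-7]?) printf '%b' "\\0$(printf '%03o' "0x$byte")" ;;
      # 80-ff: part of a multi-byte sequence, print it as \xNN.
      *) printf '\\x%s' "$byte" ;;
    esac
  done
  printf '\n'
)

STR="Six of one, ½ dozen of the other"
encode_utf8 "$STR"
```

The loop relies on the default IFS to split the od output into one word per byte. Spawning printf for every byte is slow, so for large inputs the Perl one-liner is the better fit.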

I adapted Machinexa's nice answer a little for my needs:

  • encoding="utf-8" is the default for str.encode(), so there is no need to pass it
  • it is more concise to just import sys and use it directly
  • here I want a unique set of byte strings (one per input line), not a list or a concatenated bytestring
alias encode='python3 -c "import sys; enc = sys.stdin.read().encode(); print(set(enc.splitlines()))"'

So then I can get a set without repetition:

printf "hell0\x0\nworld\n:-)\x0:-(\n" | \
  grep -a "[[:cntrl:]]" -o | \
  perl -pe 's/([^\x00-\x7f])/"\\x" . sprintf "%x", ord $1/ge' | \
  encode

{b'\x00'}

And if you want to drop the Python bytes repr b'' and the backslash:

alias encode='python3 -c "from sys import stdin; encoded = stdin.read().encode(\"utf-8\"); s = set(encoded.splitlines()[:-1]); print({repr(char)[3:-1] for char in s})"'

For the previous command, this gives {'x00'} instead.
