Finding second encoding of base64 XML string

Question

I have some base64 encoded text fields in some XML data.

To display all the characters correctly, I think I need to identify an additional encoding applied to this text, which does not look like UTF-8. There may be some other encoding aspect involved too; I am not sure.

I am not sure in what order I should be encoding and decoding here. Following https://www.geeksforgeeks.org/encoding-and-decoding-base64-strings-in-python/ I first tried to:

Encode the whole string with every possible Python 2.7 encoding, then
decode with base64

(same result each time, with no standard representation of the problem characters)

Then I tried to:

encode the string with utf8
decode with base64
decode the resulting byte string with every possible Python 2.7 encoding

However, none of the resulting strings gives a standard representation of the problem characters, which should display as 'é' and 'ü'.

I enclose this example string, for which I am sure what the final correct text should be. Original base64 string:

b64_encoded_bytes = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='

Text string with the correct 'é' and 'ü' characters at the beginning, deduced from European language knowledge:

'Gründer Frédéric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'

Note that '&#13;' is the HTML character reference for a carriage return, apparently the line break character used on Windows. The '?' might also resolve to another, correct character once the right encoding is found, or possibly '?' is what the original data actually contains.

snakecharmerb · Accepted Answer · 2021-05-10 16:31:37Z

The bytes seem to be encoded with mac_roman. Decode the base64 first, then decode the resulting bytes:

>>> import base64
>>> b64 = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='
>>> bs = base64.b64decode(b64)
>>> bs
b'Gr\x9fnder Fr\x8ed\x8eric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'
>>> print(bs.decode('mac_roman'))
Gründer Frédéric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne

The question marks in "Nata?a Petre?in-Bachelez" are present in the original data, presumably the result of a previous encoding/decoding problem.
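
The same steps can be put together as a small Python 3 script. This is only a sketch: the function name `decode_field` is hypothetical, and the final `html.unescape` call, which turns the `&#13;` references into real carriage returns, is an assumption about what the cleaned-up text should look like rather than part of the decoding itself.

```python
import base64
import html

# The base64 string from the question, unchanged.
B64 = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='


def decode_field(b64_text, encoding='mac_roman'):
    """Hypothetical helper: base64 text -> raw bytes -> str."""
    raw = base64.b64decode(b64_text)  # undo the base64 layer first
    return raw.decode(encoding)       # then interpret the bytes as text


text = decode_field(B64)

# Optional extra step (an assumption): resolve HTML character references
# such as '&#13;' into the characters they stand for.
clean = html.unescape(text)

print(text)
print(repr(clean[:40]))
```

The order matters: base64 is the outer layer, so it is removed first, and the character encoding is applied to the bytes that come out of it.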

Will Croxford Over a year ago

Ah, thank you kindly. I had printed out just the unicode strings in Python 2, saw ü as u'\xfc' and didn't realise this was the same character. As I understand from checking Wikipedia, Mac Roman was the default Mac OS encoding before Mac OS X, so maybe this data was dumped from an old Mac system. In theory the same result could also be obtained using 'mac-greek', 'mac-latin2' etc., but Occam's razor suggests Mac Roman is the best choice for the rest of this data. Most grateful to learn something new about encoding.
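
When the expected text is known, as it is here, one way to check which codecs are consistent with the data is to decode the raw bytes with each candidate and compare against the expected string. The sketch below (Python 3) uses the bytes shown in the answer; the function name `matching_codecs` and the shortlist of candidate codecs are illustrative assumptions, not an exhaustive search.

```python
# Start of the base64-decoded field, as printed in the answer.
sample = b'Gr\x9fnder Fr\x8ed\x8eric Jousset'
expected = 'Gründer Frédéric Jousset'

# Illustrative shortlist only; Python ships many more codecs.
candidates = [
    'utf_8', 'latin_1', 'cp1252', 'cp437',
    'mac_roman', 'mac_greek', 'mac_latin2', 'mac_cyrillic',
]


def matching_codecs(data, target, codecs):
    """Return the codec names that decode `data` to exactly `target`."""
    matches = []
    for name in codecs:
        try:
            if data.decode(name) == target:
                matches.append(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this codec; skip it.
            continue
    return matches


matches = matching_codecs(sample, expected, candidates)
print(matches)
```

More than one codec may match on a short sample, since several of the legacy Mac code pages share the same byte values for common accented letters. A longer sample with more varied non-ASCII characters narrows the list.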
