Shell-script-internal encoding differs from redirected output

Question

Currently I'm dealing with a bunch of old files that have seen a lot of machines, OSes and file systems during their lifetime. A couple of them contain German umlauts (ä, ö, ü), and apparently these have caused some of the filenames to break during one of the moves. A file originally named

München.txt

appears as

M?nchen.txt (invalid encoding)

on the Ubuntu system where they are currently hosted.

So now I'm trying to bulk repair them. On looping through the files with the initial draft, I stumbled across this phenomenon:

Echoing to the screen gives me the filename with the question mark, which I understand is a sign of interpretation of an illegal character within the filename:
```
 ./list_files.sh path_to_files

 M?nchen.txt
 K?ln.txt
```
If however I save the output to a file, it will give me a binary file that still contains the invalid characters:
```
 ./list_files.sh path_to_files > file_list

 less file_list
 M<FC>nchen.txt
 K<F6>ln.txt
```

This is the code:

```
#!/bin/bash

rootdir=$1

find "$rootdir" -print0 | while IFS= read -r -d '' broken_file_name; do
    echo $broken_file_name
done
```

I'm trying to understand:

Why is the screen output different from the one in the file? Where does the character replacement happen and where is the question-mark-thing created?
How can I prevent the interpretation of illegal characters with the question-mark-thing within the process of the script? It prevents me from selectively replacing an illegal character with the corresponding correct one.

The different behavior with less is probably a less thing. From the manual: "Otherwise, the character is displayed as a hex number in angle brackets. This format can be changed by setting the LESSBINFMT environment variable." (The "Otherwise" covers binary characters that are not control characters.) Output from echo, printf, cat and similar, on the other hand, shows unprintable characters as ? on the terminal.
– Renaud Pacalet, Commented Nov 29, 2021 at 10:34
Are the question marks still there for ./list_files.sh path_to_files | cat? If so, they may be printed by the terminal for an unrecognised byte. If not, I would guess that iconv is being used, but only if stdout is a terminal (it prints ? for unrecognised). The bytes \xFC and \xF6 are not valid ASCII, which ends at 127, and on their own they are not valid UTF-8 either, because bytes in the range 128-255 only occur there as part of multi-byte sequences. less prints a binary byte such as \xF6 as <F6>. You can use hexdump (or hd) for a clearer view. Also look at iconv and its //TRANSLIT flag for help with converting.
– dan, Commented Nov 29, 2021 at 11:09
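
To check what a name really contains, independent of how the terminal or less renders it, the bytes can be dumped in hex along the lines dan suggests. A minimal sketch, assuming the broken name holds ü as the single byte 0xFC (octal 374); the sample name and the helper `dump_bytes` are made up for illustration:

```shell
#!/bin/bash
# Hypothetical sample: "München.txt" with ü stored as the single byte 0xFC (octal 374).
name=$(printf 'M\374nchen.txt')

# Illustrative helper: print the bytes of a string as hex digits,
# bypassing any terminal or pager rendering.
dump_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

dump_bytes "$name"
```

If the same `fc` byte shows up both when the script's output is piped through od and in the redirected file, the question mark is being produced at display time rather than inside the script.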

tripleee · Accepted Answer · 2021-11-29 11:14:44Z

2

The question mark replacement probably happens in Bash itself, as long as you are using Bash echo and try to output characters which cannot be represented in the current locale. It could also be a feature of the terminal driver.

We can only speculate about the original encoding, but the symptoms are consistent with Latin-1 (ISO-8859-1).

Assuming I guessed the encoding correctly, and assuming your current locale is a UTF-8 one, try something like

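```
# Assumes the names in file_list are ISO-8859-1 and the target is UTF-8.
while IFS= read -r original; do
    dest=$(iconv -f iso-8859-1 -t utf-8 <<<"$original")
    # Skip names that do not change (pure ASCII), since mv would refuse them.
    [ "$dest" = "$original" ] || mv -- "$original" "$dest"
done <file_list
```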
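
A variant that does not depend on file_list and leaves already-correct names alone could look like the sketch below. It rests on the same guess (names are ISO-8859-1, target is UTF-8); the function names `is_valid_utf8`, `to_utf8` and `repair_tree` are made up for illustration, and the loop only prints the `mv` commands instead of running them.

```shell
#!/bin/bash
# Sketch only: assumes the broken names are ISO-8859-1 and the target is UTF-8.
# is_valid_utf8, to_utf8 and repair_tree are illustrative names.

# Succeeds if the argument is already valid UTF-8 (iconv fails on invalid input).
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Re-encode a Latin-1 string as UTF-8.
to_utf8() {
    printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8
}

# Walk the tree depth-first so entries are handled before their parent directory.
repair_tree() {
    find "$1" -depth -print0 | while IFS= read -r -d '' path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        is_valid_utf8 "$base" && continue
        # Dry run: prints the command; drop "echo" to actually rename.
        echo mv -- "$path" "$dir/$(to_utf8 "$base")"
    done
}
```

Calling `repair_tree path_to_files` lists the planned renames. Only the last path component is converted, so a broken directory name is fixed in its own iteration after its contents have been handled.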

edited Nov 29, 2021 at 11:14

answered Nov 29, 2021 at 10:55

tripleee


7 Comments

tripleee Over a year ago

If Latin-1 was not a correct guess and you have more samples to infer from, maybe look them up at tripleee.github.io/8bit

dan Over a year ago

Do you think iconv's //TRANSLIT flag would be helpful with this character set?

tripleee Over a year ago

@dan I don't understand why you would want that. The destination character set is (presumably) UTF-8 so there should be no characters in the original character set which cannot be represented.

dan Over a year ago

Ok cool. I was thinking there might be some obscure characters with no equivalent, but I guess umlauts are covered.

tripleee Over a year ago

Not only that; Latin-1 has the curious property that its code points are identical to Unicode in the 8-bit range.
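
That property can be checked from the shell. A small sketch, assuming an iconv that supports UTF-16BE (the helper name `latin1_codepoint` is made up): for characters in this range the UTF-16BE output is just the code point as two bytes, so the low byte should equal the original Latin-1 byte.

```shell
#!/bin/bash
# Illustrative helper: take a Latin-1 byte given as an octal number and print
# the bytes of its UTF-16BE encoding as hex digits.
latin1_codepoint() {
    printf "\\$1" | iconv -f ISO-8859-1 -t UTF-16BE | od -An -tx1 | tr -d ' \n'
}

latin1_codepoint 374; echo   # ü, Latin-1 byte 0xFC
latin1_codepoint 366; echo   # ö, Latin-1 byte 0xF6
```

The two lines printed should be `00fc` and `00f6`, i.e. U+00FC and U+00F6.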


Renaud Pacalet · Answer · 2021-11-29 10:52:57Z

0

The different behavior with less is probably a less thing. From the manual:

Control and binary characters are displayed in standout (reverse video). Each such character is displayed in caret notation if possible (e.g. ^A for control-A). Caret notation is used only if inverting the 0100 bit results in a normal printable character. Otherwise, the character is displayed as a hex number in angle brackets. This format can be changed by setting the LESSBINFMT environment variable.

But since what you want is to rename your files, the way the names are displayed by various utilities is not that important. In your script you can use, e.g., tr to compute the new name by replacing the characters you do not like with others. For example, to replace ö and ü with o and u, respectively:

```
new=$(tr '\366\374' 'ou' <<< "$old")
if [ "$new" != "$old" ]; then
  mv -- "$old" "$new"
fi
```

(366 and 374 are the ascii codes of ö and ü, in octal).

answered Nov 29, 2021 at 10:52

Renaud Pacalet


8 Comments

tripleee Over a year ago

This will drop the accents rather than actually try to fix them.

Renaud Pacalet Over a year ago

Yes. I thought it was what the OP wants. Am I wrong?

tripleee Over a year ago

I would assume they want the original character back.

tripleee Over a year ago

Oh, and 366 and 374 are not ASCII codes at all, the range of ASCII ends at octal 177. The character codes you have are valid in Latin-1 and some other 8-bit encodings.

Renaud Pacalet Over a year ago

Oh, I got it, thanks! It was a character encoding issue. Let's wait until the OP confirms and I'll delete my answer. Suggestion: you could copy-paste my explanation why less behaves differently in your own answer, for completeness.
