
I'm currently dealing with a bunch of old files that have passed through many machines, operating systems and file systems during their lifetime. A few of the names contain German umlauts (ä, ö, ü), and apparently these were mangled during one of the moves. A file originally named

München.txt

appears as

M?nchen.txt (invalid encoding)

on the Ubuntu system where the files are currently hosted.

So now I'm trying to bulk-repair them. While looping through the files with a first draft of the script, I stumbled across this phenomenon:

  • Echoing to the screen gives me the filename with the question mark, which I understand is a sign of interpretation of an illegal character within the filename:

     ./list_files.sh path_to_files
    
     M?nchen.txt
     K?ln.txt
    
  • If however I save the output to a file, it will give me a binary file that still contains the invalid characters:

     ./list_files.sh path_to_files > file_list
    
     less file_list
     M<FC>nchen.txt
     K<F6>ln.txt
    

This is the code:

#!/bin/bash

rootdir=$1

find "$rootdir" -print0 | while IFS= read -r -d '' broken_file_name; do
    echo "$broken_file_name"
done

I'm trying to understand:

  1. Why is the screen output different from the one in the file? Where does the character replacement happen and where is the question-mark-thing created?
  2. How can I prevent the interpretation of illegal characters with the question-mark-thing within the process of the script? It prevents me from selectively replacing an illegal character with the corresponding correct one.
  • The different behavior with less is probably a less thing. From the manual: "Otherwise, the character is displayed as a hex number in angle brackets. This format can be changed by setting the LESSBINFMT environment variable." (the Otherwise is for binary characters that are not control characters). While with echo, printf, cat... unprintable characters are shown as ?. Commented Nov 29, 2021 at 10:34
  • Are the question marks still there for ./list_files.sh path_to_files | cat? If so, they are probably printed by the terminal for a byte it does not recognise. If not, I would guess that iconv is being used, but only if stdout is a terminal (it prints ? for unrecognised). The bytes \xFC and \xF6 are outside the ASCII range (0-127) and, standing alone, are not valid UTF-8 sequences either. less prints the binary byte \xF6 as <F6>. You can use hexdump (or hd) for a clearer view. Also look at iconv and its //TRANSLIT flag for help with converting. Commented Nov 29, 2021 at 11:09
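
To see which bytes a name really contains, independent of how the terminal or less renders them, a hex dump helps. The following is a minimal sketch that uses od rather than hexdump (an assumption, chosen only because od is specified by POSIX); show_bytes is a hypothetical helper name, and the sample name is built from the byte shown in the question's less output.

```shell
#!/bin/bash
# Print every byte of a string as two hex digits, separated by single spaces.
# show_bytes is a hypothetical helper, not an existing tool.
show_bytes() {
    printf '%s' "$1" | od -An -v -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# $'...' lets Bash build a string from raw byte values.
# 0xFC is "ü" in Latin-1; standing alone it is not a valid UTF-8 sequence.
show_bytes $'M\xfcnchen.txt'
# prints: 4d fc 6e 63 68 65 6e 2e 74 78 74
```

A single byte above 7f where an umlaut is expected points to an 8-bit encoding; a correctly encoded UTF-8 "ü" would show up as the two bytes c3 bc instead.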

2 Answers

The question mark replacement probably happens in Bash itself, as long as you are using Bash echo and try to output characters which cannot be represented in the current locale. It could also be a feature of the terminal driver.

We can only speculate about the original encoding, but the symptoms are consistent with Latin-1 (ISO-8859-1).

Assuming I guessed the encoding correctly, and assuming your current locale is a UTF-8 one, try something like

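# file_list holds one broken name per line, as produced by the question's script
while IFS= read -r original; do
    # Re-encode the name from Latin-1 (assumed) to UTF-8
    dest=$(iconv -f iso-8859-1 -t utf-8 <<<"$original")
    # Skip names that do not change (pure ASCII); mv would refuse them
    [ "$dest" = "$original" ] && continue
    mv -- "$original" "$dest"
done <file_list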
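
Before renaming anything, it may be worth checking the conversion on a single name. This is a minimal sketch, assuming the names really are Latin-1 and that an iconv supporting the ISO-8859-1 and UTF-8 names is installed; latin1_to_utf8 is a hypothetical helper introduced only for illustration.

```shell
#!/bin/bash
# Convert one Latin-1 encoded string to UTF-8 and print it without a trailing newline.
# latin1_to_utf8 is a hypothetical helper name.
latin1_to_utf8() {
    printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8
}

# Build the broken name from its raw bytes, convert it, and inspect the result.
latin1_to_utf8 $'M\xfcnchen.txt' | od -An -tx1
# The single byte fc ("ü" in Latin-1) should now appear as the two bytes c3 bc.
```

If the dump shows c3 bc where the fc byte used to be, the Latin-1 guess is consistent with the data and the name should display as München.txt in a UTF-8 locale.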

7 Comments

If Latin-1 was not a correct guess and you have more samples to infer from, maybe look them up at tripleee.github.io/8bit
Do you think iconv's //TRANSLIT flag would be helpful with this character set?
@dan I don't understand why you would want that. The destination character set is (presumably) UTF-8 so there should be no characters in the original character set which cannot be represented.
Ok cool. I was thinking there might be some obscure characters with no equivalent, but I guess umlauts are covered.
Not only that; Latin-1 has the curious property that its code points are identical to Unicode in the 8-bit range.

The different behavior with less is probably a less thing. From the manual:

Control and binary characters are displayed in standout (reverse video). Each such character is displayed in caret notation if possible (e.g. ^A for control-A). Caret notation is used only if inverting the 0100 bit results in a normal printable character. Otherwise, the character is displayed as a hex number in angle brackets. This format can be changed by setting the LESSBINFMT environment variable.

But since what you want is to rename your files, the way various utilities display the names is not that important. In your script you can use, e.g., tr to compute the new name by replacing the unwanted characters with others. For example, to replace ö and ü with o and u, respectively:

new=$(tr '\366\374' 'ou' <<< "$old")
if [ "$new" != "$old" ]; then
  mv -- "$old" "$new"
fi

(366 and 374 are the ascii codes of ö and ü, in octal).
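
If the goal is to restore the umlauts rather than drop them, the find loop from the question can compute the new name with iconv instead of tr. The following is only a sketch under several assumptions: the broken names are Latin-1, the target is UTF-8, and iconv rejects invalid UTF-8 input with a non-zero exit status (glibc iconv does). is_valid_utf8 and repair_names are hypothetical helper names.

```shell
#!/bin/bash
# Succeeds if the argument is already valid UTF-8 (hypothetical helper).
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Rename every entry below "$1" whose base name is not valid UTF-8,
# treating the broken name as Latin-1 (an assumption). Hypothetical helper.
repair_names() {
    local path dir base fixed
    # -depth lists a directory's contents before the directory itself,
    # so renaming a directory never invalidates paths still to be processed.
    find "$1" -depth -print0 | while IFS= read -r -d '' path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        # Leave names alone that are already fine (ASCII or proper UTF-8).
        is_valid_utf8 "$base" && continue
        fixed=$(printf '%s' "$base" | iconv -f ISO-8859-1 -t UTF-8) || continue
        # -n: do not overwrite an existing file with the same target name.
        mv -n -- "$path" "$dir/$fixed"
    done
}

# Usage (commented out so that nothing is renamed by accident):
# repair_names path_to_files
```

Only the base name is converted, so a broken directory name higher up in the path is handled when find reaches that directory. Names ending in a newline are not handled correctly, because command substitution strips trailing newlines.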

8 Comments

This will drop the accents rather than actually try to fix them.
Yes. I thought that was what the OP wanted. Am I wrong?
I would assume they want the original character back.
Oh, and 366 and 374 are not ASCII codes at all, the range of ASCII ends at octal 177. The character codes you have are valid in Latin-1 and some other 8-bit encodings.
Oh, I got it, thanks! It was a character encoding issue. Let's wait until the OP confirms and I'll delete my answer. Suggestion: you could copy-paste my explanation why less behaves differently in your own answer, for completeness.