Currently I'm dealing with a bunch of old files that have seen a lot of machines, OSes and file systems during their lifetime. A couple of them contain German umlauts (ä, ö, ü), and apparently these caused some of the filenames to break during one of the moves. A file originally named
München.txt
appears as
M?nchen.txt (invalid encoding)
on the Ubuntu system where the files are currently hosted.
So now I'm trying to bulk repair them. While looping through the files with a first draft of the script, I stumbled across this phenomenon:
Echoing to the screen gives me the filename with the question mark, which I understand is how an illegal character within the filename gets displayed:
./list_files.sh path_to_files
M?nchen.txt
K?ln.txt
If, however, I save the output to a file, I get a binary file that still contains the invalid characters:
./list_files.sh path_to_files > file_list
less file_list
M<FC>nchen.txt
K<F6>ln.txt
This is the code:
#!/bin/bash
rootdir=$1
find "$rootdir" -print0 | while IFS= read -r -d '' broken_file_name; do
    echo "$broken_file_name"
done
I'm trying to understand:
- Why is the screen output different from the one in the file? Where does the character replacement happen and where is the question-mark-thing created?
- How can I prevent the interpretation of illegal characters with the question-mark-thing within the process of the script? It prevents me from selectively replacing an illegal character with the corresponding correct one.
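One way to check where the replacement happens is to bypass the terminal and look at the raw bytes of a name. The following is only a sketch: show_bytes is a hypothetical helper name, not part of the script above, and it assumes od is installed. If the dump still shows fc, the byte reached the output untouched, which would suggest the question mark is produced at display time rather than inside the script.

```shell
#!/bin/bash
# show_bytes is a hypothetical helper: it prints every byte of its
# argument as two hex digits, so nothing depends on how a terminal
# chooses to render an invalid character.
show_bytes() {
    printf '%s' "$1" | od -An -tx1
}

# Rebuild the broken name from the question. \374 is the octal form of
# the single byte 0xFC (the Latin-1 encoding of ü).
name=$(printf 'M\374nchen.txt')

# Prints the hex bytes of the name; the second byte is fc.
show_bytes "$name"
```

Inside the loop from the question, the same call would be show_bytes "$broken_file_name".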
The <FC> in less is probably a less thing. From the manual: "Otherwise, the character is displayed as a hex number in angle brackets. This format can be changed by setting the LESSBINFMT environment variable." (The "Otherwise" refers to binary characters that are not control characters.) With echo, printf, cat and so on, unprintable characters are shown as ?.
Do you still see the ? with ./list_files.sh path_to_files | cat? If so, it may be printed by the terminal for an unrecognised byte. If not, I would guess that iconv is being used, but only if stdout is a terminal (it prints ? for unrecognised characters).
The bytes \xFC and \xF6 are not valid ASCII, as they fall in the range 128-255 (decimal), and on their own they are not valid UTF-8 either; in ISO-8859-1 (Latin-1) they are ü and ö. less prints the binary byte \xF6 as <F6>. You can use hexdump (or hd) for a clearer view. Also look at iconv and its //TRANSLIT suffix for help converting.
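As a rough sketch of the iconv suggestion: the helper below reinterprets a name as ISO-8859-1 and re-encodes it as UTF-8, but only when the name is not already valid UTF-8. The function name fix_name is hypothetical, and the assumption that every broken name was originally Latin-1 is mine; it fits the \xFC and \xF6 bytes shown above but may not hold for every file in the collection.

```shell
#!/bin/bash
# fix_name is a hypothetical helper, not part of the original script.
# Assumption: any name that is not valid UTF-8 was written as ISO-8859-1.
fix_name() {
    if printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        # Already valid UTF-8 (plain ASCII included): print it unchanged.
        printf '%s\n' "$1"
    else
        # Treat each byte as Latin-1 and re-encode it as UTF-8.
        printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8
        printf '\n'
    fi
}

# Demo with the two names from the question. \374 and \366 are the octal
# forms of the bytes 0xFC and 0xF6.
fix_name "$(printf 'M\374nchen.txt')"
fix_name "$(printf 'K\366ln.txt')"
```

In the loop from the question, the result could feed something like mv -- "$old" "$new", applied to the last path component only. Printing the old and new names first, without renaming anything, is a cheap way to check the Latin-1 assumption.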