There are about 28,000 articles in our institution, and their encoding is not UTF-8. I was asked to find a way to convert them to UTF-8. Is there any Linux or Windows command that changes the encoding of a file without opening it? Clearly, opening 28,000 files and changing them one by one is not a good idea!
- If you don't even open the file, you can't read the data, much less rewrite it… – abarnert, Oct 6, 2013 at 6:57
- But I know what their encoding is. – M a m a D, Oct 6, 2013 at 6:59
- This is not a programming question, and is off-topic here. "Is there any linux or windows command" is a question for Super User. Voting to migrate there. Good luck. – Ken White, Oct 6, 2013 at 7:08
- This is about shell programming, so it is programming. – M a m a D, Oct 6, 2013 at 7:09
- And you also know the contents of all the files you want to recode without opening and reading the files? – abarnert, Oct 6, 2013 at 7:09
2 Answers
iconv can be used to convert text files from one encoding to another. Most Linux distros should have it, usually as part of glibc; if not, then as a separate installable package.
So, if they're, say, Latin-1 (ISO-8859-1), you can do something like this:
$ iconv -f ISO-8859-1 -t UTF-8 foo.txt >foo-utf8.txt
You can wrap this up in a one-liner with find, something like:
$ tmpdir=$(mktemp -d); find . -type f -exec sh -c 'iconv -f ISO-8859-1 -t UTF-8 "$1" >"$2/temp" && mv "$2/temp" "$1"' sh {} "$tmpdir" \; ; rmdir "$tmpdir"
But you can probably make it more readable and more robust in a half-dozen lines of bash/python/perl/whatever.
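As one way to do that, here is a minimal Python sketch of the same in-place conversion. It assumes every file under the starting directory is text in one known encoding (ISO-8859-1 here, as in the example above); the function names `convert_in_place` and `convert_tree` are made up for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path


def convert_in_place(path, source_encoding="iso-8859-1"):
    """Rewrite one file as UTF-8, assuming it is currently in source_encoding."""
    path = Path(path)
    # Work on bytes so that line endings are left exactly as they were.
    # Stricter encodings than ISO-8859-1 may raise UnicodeDecodeError here.
    text = path.read_bytes().decode(source_encoding)
    # Write to a temporary file in the same directory, then swap it in,
    # so an interrupted run does not leave a half-written article behind.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(text.encode("utf-8"))
        shutil.copymode(path, tmp_name)  # keep the original permission bits
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)
        raise


def convert_tree(root, source_encoding="iso-8859-1"):
    """Convert every regular file under root; return how many were converted."""
    # sorted() materializes the listing first, so the temporary files
    # created during conversion are never picked up by the walk.
    count = 0
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            convert_in_place(path, source_encoding)
            count += 1
    return count
```

Calling `convert_tree(".")` would correspond to the `find .` one-liner. Running it twice on the same files would double-encode them, so it is worth trying it on a copy of the articles first.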
You can change the encoding of files easily with a few basic PowerShell commands:
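# Source and destination roots.
$filesDir = Get-ChildItem "D:\Code"
$OutputDir = "D:\programability\"
for ($j = 0; $j -lt $filesDir.Count; $j++)
{
    $SubDir = $filesDir[$j].FullName
    # Create a matching subfolder under the output root.
    [System.IO.Directory]::CreateDirectory($OutputDir + $filesDir[$j].Name)
    $files = Get-ChildItem $SubDir
    for ($i = 0; $i -lt $files.Count; $i++) {
        $outfile = $OutputDir + $filesDir[$j].Name + "\" + $files[$i].Name
        $files[$i].Name   # print the file name as a progress indicator
        # Get-Content uses the shell's default encoding unless -Encoding is given.
        Get-Content $files[$i].FullName | Set-Content -Encoding UTF8 $outfile
    }
}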
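This writes UTF-8 copies of the files in each first-level subfolder of D:\Code to a matching subfolder of D:\programability. It does not descend into deeper subfolders, and the originals are left unchanged.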
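For a tree of arbitrary depth, the same copy-to-another-folder approach can be sketched in Python. This is only an illustration: the function name `convert_to_mirror` is hypothetical, and the default source encoding of cp1252 is an assumption that should be replaced with whatever the articles actually use.

```python
from pathlib import Path


def convert_to_mirror(source_root, output_root, source_encoding="cp1252"):
    """Write a UTF-8 copy of every file under source_root into output_root."""
    source_root, output_root = Path(source_root), Path(output_root)
    written = []
    # sorted() materializes the listing before any output is written.
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        # Rebuild the same relative path under the output root.
        dest = output_root / src.relative_to(source_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        text = src.read_bytes().decode(source_encoding)
        dest.write_bytes(text.encode("utf-8"))  # no byte order mark is added
        written.append(dest)
    return written
```

Because the originals are never modified, the output can be checked and the run repeated safely, for example with `convert_to_mirror(r"D:\Code", r"D:\programability")`.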