1

I have a few text files in a folder that contain text in Hindi. These files are encoded as UTF-16 LE. I want to change the encoding to UTF-8 without changing the text. How can I do that?

I wrote two Python scripts, but neither of them works properly. When I run either one, it changes the encoding but also clears the file content. This is the code in my Python files:

File 1:

import os
for root, dirs, files in os.walk("."):  
    for filename in files:
        #print(filename[-4:])
        if(filename[-3:] == "txt"):
            f= open(filename,"w+")
            x = f.read()
            print(x)
            f.close()
            f1= open(filename, "w+", encoding="utf-8")
            f1.write(x)
            f1.close()

File 2:

import codecs
BLOCKSIZE = 1048576
with codecs.open("ee.txt", "r", "utf-16-le") as sourceFile:
    with codecs.open("ee.txt", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            print(contents)
            if not contents:
                break
            targetFile.write(contents)

2 Answers

2

You are not specifying that the files are in UTF-16 LE when reading their contents. On top of that, the code tries to read and write the same file at the same time, which won't work.
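To make the truncation concrete, here is a minimal sketch. The temporary file and the sample string are made up for the demonstration; the only thing taken from the question is the "w+" mode, which empties the file as soon as it is opened.

```python
import os
import tempfile

# Hypothetical throwaway file, created only for this demonstration.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

hindi_text = "नमस्ते"  # sample text, not from the question
with open(demo_path, "w", encoding="utf-16-le") as f:
    f.write(hindi_text)

size_before = os.path.getsize(demo_path)

# "w+" truncates the file on open, so there is nothing left to read.
with open(demo_path, "w+", encoding="utf-16-le") as f:
    read_back = f.read()

size_after = os.path.getsize(demo_path)
os.remove(demo_path)

print(size_before, size_after, repr(read_back))
```

The second size is zero and the string read back is empty, which matches the "file content is cleared" symptom described in the question.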

Also, unless you are running this code on a server where someone could attack it by sending an inordinately large text file, you should not worry about file size; just read each file's contents in one go. To give you an idea, the Bible, which is a big book, is on the order of 3 MB in size (with an 8-bit encoding), and even small VPS servers will have on the order of 200 MB of memory available to your program. That is, you could convert a book the size of 30+ Bibles at once. Typical desktop computers have several times this amount of memory.

Also, the relatively recent "pathlib" Python library makes it easier to iterate through all your text files, and its Path.read_text and Path.write_text methods each open a file, read or write the contents in the given encoding, and close it, all in a single expression. Since the read has already finished by the time the file is written, two calls are enough:

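import pathlib

# Walk the current directory recursively and convert every .txt file in place.
for filepath in pathlib.Path(".").glob("**/*.txt"):
    # The whole file is read and closed before it is reopened for writing.
    data = filepath.read_text(encoding="utf-16-le")
    filepath.write_text(data, encoding="utf-8")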

If you prefer to be on the safe side, in the very, very unlikely event of a catastrophic computer crash in the middle of a file conversion, you could write to a differently named file and only replace the original afterwards, so the code looks like this:

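import pathlib

for filepath in pathlib.Path(".").glob("**/*.txt"):
    data = filepath.read_text(encoding="utf-16-le")
    # Write the converted text next to the original, under a temporary name.
    tmp_path = filepath.with_name(filepath.name + ".tmp")
    tmp_path.write_text(data, encoding="utf-8")
    # replace() overwrites the original and keeps the file in its own directory;
    # renaming to the bare filepath.name would move it to the working directory.
    tmp_path.replace(filepath)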

2 Comments

Getting this while compiling, LookupError: unknown encoding: utf-8 LE
Ah, so your files were not UTF-16 - the code above is the simplest possible snippet. A real app would first try to detect the encoding (trying several and picking the one that does not raise an error is one way to do so).
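One caveat on detection by trial decoding, for Hindi text specifically: every Devanagari character in the U+0900 to U+097F block encodes in UTF-16 LE as two bytes that are both below 0x80, so such a file also decodes as UTF-8 without any error (it just produces the wrong characters). A somewhat safer sketch is to trust a byte order mark when one is present and otherwise fall back to an encoding the caller supplies. The function name decode_text and its default argument are assumptions made for this illustration, not part of the answer above.

```python
import codecs


def decode_text(raw, default="utf-16-le"):
    """Decode raw bytes, trusting a byte order mark if present, else `default`."""
    if raw.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the UTF-8 byte order mark.
        return raw.decode("utf-8-sig")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The plain "utf-16" codec reads the mark to pick the byte order.
        return raw.decode("utf-16")
    # No mark: nothing in the bytes settles the question, so use the default.
    return raw.decode(default)


sample = "नमस्ते"  # sample text, not from the question
with_bom = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
without_bom = sample.encode("utf-16-le")

print(decode_text(with_bom) == sample)
print(decode_text(without_bom) == sample)
# The pitfall: the same bytes decode as UTF-8 without raising.
print(repr(without_bom.decode("utf-8")))
```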
0

Before explaining what is wrong, two useful tips:

I think you should remove the print calls. They will just confuse you, because the encoding used for printing depends on the operating system and environment.

Try with a very short file (a few characters) and check the input and output files both as text and as bytes.
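The second tip could look roughly like this. It is only a sketch: the file name short.txt and the sample word are invented for the test, and the file lives in a temporary directory so nothing real is touched.

```python
import pathlib
import tempfile

sample = "नमस्ते"  # a few characters of sample text, not from the question

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "short.txt"  # hypothetical test file
    src.write_bytes(sample.encode("utf-16-le"))

    before = src.read_bytes()  # raw bytes before the conversion
    text = src.read_text(encoding="utf-16-le")
    src.write_text(text, encoding="utf-8")
    after = src.read_bytes()  # raw bytes after the conversion

# Compare the raw bytes rather than relying on what the console prints.
print(before.hex())
print(after.hex())
print(after.decode("utf-8") == sample)
```

Looking at the hex dumps avoids the console-encoding problem from the first tip: the two byte sequences differ, while the decoded text stays the same.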

Now the solution:

In the first example: you should open the file for reading ("r") the first time.

In the second example: you open the same file twice, first to read and then, before you have read anything, to write. Opening for writing truncates the file, so there are no characters left to read.

Write to an ee.txt.tmp file instead and, at the end, if there were no errors, move the temporary file over the original by removing the .tmp suffix.

In general: never read and write the same file at the same time.
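Applied to the second script from the question, the advice above might be sketched as follows. The function name convert_file and its parameters are assumptions for the illustration; BLOCKSIZE and the .tmp naming come from the question and this answer. The sketch uses the built-in open rather than codecs.open, and os.replace to move the temporary file over the original.

```python
import os

BLOCKSIZE = 1048576  # characters per read, as in the question


def convert_file(path, source_encoding="utf-16-le"):
    """Rewrite the file at `path` (a str) as UTF-8 via a temporary file."""
    tmp_path = path + ".tmp"
    # newline="" disables newline translation, so line endings stay as they are.
    with open(path, "r", encoding=source_encoding, newline="") as source_file, \
            open(tmp_path, "w", encoding="utf-8", newline="") as target_file:
        while True:
            contents = source_file.read(BLOCKSIZE)
            if not contents:
                break
            target_file.write(contents)
    # Reached only if no exception was raised: swap the new file in.
    os.replace(tmp_path, path)
```

Calling convert_file("ee.txt") would then convert the file from the question. If an error occurs partway through, the original ee.txt is left untouched, although a leftover ee.txt.tmp may remain.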

3 Comments

In the first file, I changed the mode to "r" and removed the print statement. It didn't change the encoding; instead, the text was replaced by text in some random language.
Are you sure your original text is in UTF16-LE?
Yes, the original text is in UTF16-LE.
