How to convert a file to utf-8 in Python?

Question

I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part.

I'd like to do the equivalent of:

iconv -t utf-8 $file > converted/$file # this is shell code

Thanks!

Deelaka · Accepted Answer · 2016-09-17 13:21:16Z

66

You can use the codecs module, like this:

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

EDIT: added BLOCKSIZE parameter to control file chunk size.
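
On Python 3, the built-in open accepts an encoding argument, so the same chunked copy can be written without codecs. This is a minimal sketch, assuming the source encoding is already known; the function name convert_to_utf8 and its parameters are illustrative and not part of the answer above.

```python
BLOCKSIZE = 1048576  # number of characters per read in text mode


def convert_to_utf8(source_path, target_path, source_encoding, blocksize=BLOCKSIZE):
    """Copy source_path to target_path, re-encoding the text as UTF-8."""
    # newline="" disables newline translation, so line endings are copied unchanged.
    with open(source_path, "r", encoding=source_encoding, newline="") as source_file, \
         open(target_path, "w", encoding="utf-8", newline="") as target_file:
        while True:
            contents = source_file.read(blocksize)
            if not contents:
                break
            target_file.write(contents)
```

A call would look like convert_to_utf8("in.txt", "converted/in.txt", "iso-8859-1"), where both paths and the encoding are placeholders.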

edited Sep 17, 2016 at 13:21

Deelaka


answered Oct 10, 2008 at 13:59

Dzinx



3 Comments

Brian Over a year ago

read() will always read the whole file - you probably want .read(BLOCKSIZE), where BLOCKSIZE is some suitable amount to read/write at once.

Rafael-WO Over a year ago

In Python 3, consider using the built-in open (which accepts an encoding argument) instead of codecs.open.

Just Me Over a year ago

I run the code, into my test folder. I get this error: Traceback (most recent call last): File "D:\2022_12_02\TEST\convert txt to UTF-8 - versiune 2.py", line 3, in <module> with codecs.open(sourceFileName, "r", "d:\\2022_12_02\\TEST") as sourceFile: NameError: name 'sourceFileName' is not defined

Staale · Accepted Answer · 2008-10-10 14:07:07Z

35

This worked for me in a small test:

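# Python 2 only: unicode() does not exist in Python 3
sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
source = open("source")
target = open("target", "w")

target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
# close both files so the output is flushed to disk
source.close()
target.close()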
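
For Python 3, and with both files opened in binary mode, the same idea might look like the sketch below. It reads in chunks and uses an incremental decoder so that a multi-byte character split across two chunks is still decoded correctly. The function name convert_binary and the default chunk size are assumptions made for illustration.

```python
import codecs


def convert_binary(source_path, target_path, source_encoding, chunk_size=65536):
    """Re-encode a file as UTF-8 using binary I/O only."""
    # An incremental decoder buffers incomplete multi-byte sequences
    # that happen to be split across chunk boundaries.
    decoder = codecs.getincrementaldecoder(source_encoding)()
    with open(source_path, "rb") as source, open(target_path, "wb") as target:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            target.write(decoder.decode(chunk).encode("utf-8"))
        # Flush the decoder; this raises UnicodeDecodeError on a truncated sequence.
        target.write(decoder.decode(b"", final=True).encode("utf-8"))
```

Because nothing is opened in text mode, line endings pass through untouched on every platform.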

answered Oct 10, 2008 at 14:07

Staale


4 Comments

Arafangion Over a year ago

Even better would be to specify binary mode.

Honghe.Wu Over a year ago

@Arafangion Why would binary mode be better? Thanks!

Arafangion Over a year ago

@Honghe.Wu: On windows, text mode is the default, and that means that your line endings will be mangled by the operating system, something you don't want if you're unsure about the encoding on disk.

The Bndr Over a year ago

@Arafangion What would the example look like if I wanted to specify binary mode? target = open("target", "wb"): are there any more changes?

Foon · Accepted Answer · 2015-07-31 23:30:38Z

17

Thanks for the replies, it works!

And since the source files are in mixed formats, I added a list of source formats to be tried in sequence (sourceFormats), and on UnicodeDecodeError I try the next format:

from __future__ import with_statement

import os
import sys
import codecs
from chardet.universaldetector import UniversalDetector

targetFormat = 'utf-8'
outputDir = 'converted'
detector = UniversalDetector()

def get_encoding_type(current_file):
    detector.reset()
    # file() is the Python 2 built-in; in Python 3 use open(current_file, 'rb')
    for line in file(current_file):
        detector.feed(line)
        if detector.done: break
    detector.close()
    return detector.result['encoding']

def convertFileBestGuess(fileName):
    # try each candidate encoding in turn until one decodes without error
    sourceFormats = ['ascii', 'iso-8859-1']
    for format in sourceFormats:
        try:
            with codecs.open(fileName, 'rU', format) as sourceFile:
                writeConversion(sourceFile, fileName)
                print('Done.')
                return
        except UnicodeDecodeError:
            pass

def convertFileWithDetection(fileName):
    print("Converting '" + fileName + "'...")
    format = get_encoding_type(fileName)
    try:
        with codecs.open(fileName, 'rU', format) as sourceFile:
            writeConversion(sourceFile, fileName)
            print('Done.')
            return
    except UnicodeDecodeError:
        pass

    print("Error: failed to convert '" + fileName + "'.")


def writeConversion(sourceFile, fileName):
    # fileName is passed in explicitly so the output path does not rely on a global
    with codecs.open(outputDir + '/' + fileName, 'w', targetFormat) as targetFile:
        for line in sourceFile:
            targetFile.write(line)

# Off topic: get the file list and call convertFile on each file
# ...

(EDIT by Rudro Badhon: this incorporates the original approach of trying multiple formats until one decodes without an exception, as well as an alternate approach that uses chardet.universaldetector)
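
The "try each encoding in turn" idea can be sketched for Python 3 with the standard library only. The function name read_with_fallback and the default candidate list are assumptions for illustration. Note that iso-8859-1 accepts any byte sequence, so it only makes sense as the last candidate.

```python
def read_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode %r" % (path,))
```

The returned text can then be written out with open(target, "w", encoding="utf-8").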

edited Jul 31, 2015 at 23:30

Foon


answered Oct 10, 2008 at 16:14

Sébastien RoccaSerra


4 Comments

itsadok Over a year ago

For tough cases you can try to detect the encoding with the chardet module from feedparser.org, but in your case it's overkill.

physicalattraction Over a year ago

My Python 3.5 doesn't recognize the function file. Where does that come from?

Sébastien RoccaSerra Over a year ago

Yes, this answer was posted 8 years ago, so it's a piece of old Python 2 code.

Just Me Over a year ago

I tried this code and ran it, but it doesn't convert ANSI text files to UTF-8...

Sole Sensei · Accepted Answer · 2018-12-19 13:04:47Z

17

Answer for unknown source encoding type

based on @Sébastien RoccaSerra

python3.6

import os
from chardet import detect

# placeholder paths: replace with your own source file and a temporary output file
srcfile = 'source.txt'
trgfile = 'source.txt.tmp'

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

from_codec = get_encoding_type(srcfile)

# add try: except block for reliability
try:
    with open(srcfile, 'r', encoding=from_codec) as f, open(trgfile, 'w', encoding='utf-8') as e:
        text = f.read() # for small files, for big use chunks
        e.write(text)

    os.remove(srcfile) # remove old encoding file
    os.rename(trgfile, srcfile) # rename new encoding
except UnicodeDecodeError:
    print('Decode Error')
except UnicodeEncodeError:
    print('Encode Error')
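
For large files, and to avoid the gap between os.remove and os.rename, one option is to stream into a temporary file and swap it in with os.replace. The sketch below assumes the source encoding has already been determined (by chardet or otherwise); the name convert_in_place is hypothetical.

```python
import os
import tempfile


def convert_in_place(path, source_encoding):
    """Rewrite path as UTF-8, replacing the original only on success."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that os.replace
    # never has to move data across filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with open(fd, "w", encoding="utf-8", newline="") as dst, \
             open(path, "r", encoding=source_encoding, newline="") as src:
            for line in src:  # streams line by line instead of reading everything
                dst.write(line)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)  # leave the original untouched on any failure
        raise
```

One side effect to be aware of: the replaced file takes on the temporary file's permissions, which mkstemp restricts to the current user.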

edited Dec 19, 2018 at 13:04

answered Dec 19, 2018 at 12:59

Sole Sensei



Cesc · Accepted Answer · 2021-04-26 14:43:59Z

10

You can use this one liner (assuming you want to convert from utf16 to utf8)

    python -c "from pathlib import Path; path = Path('yourfile.txt') ; path.write_text(path.read_text(encoding='utf16'), encoding='utf8')"

Where yourfile.txt is a path to your $file.

For this to work you need Python 3.5 or newer (pathlib arrived in 3.4, but Path.read_text and Path.write_text were added in 3.5; nowadays you probably have it).

Below is a more readable version of the code above:

from pathlib import Path
path = Path("yourfile.txt")
path.write_text(path.read_text(encoding="utf16"), encoding="utf8")
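
Text mode may translate line endings depending on the operating system. If the line endings must survive unchanged, a variant that works on bytes and does the decoding and encoding explicitly is sketched below; the function name reencode and its defaults are illustrative assumptions.

```python
from pathlib import Path


def reencode(path, source_encoding="utf-16", target_encoding="utf-8"):
    """Re-encode a file in place without touching its line endings."""
    path = Path(path)
    # read_bytes/write_bytes bypass text mode, so no newline translation happens
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode(target_encoding))
```

Like the one-liner, this reads the whole file into memory, so it suits small and medium files.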

answered Apr 26, 2021 at 14:43

Cesc


1 Comment

david Over a year ago

Depending on your operating system this may change the line break control characters. Great answer nevertheless, thank you. It needs more upvotes. Simple as that and no need to care about managing resources according to the documentation of Path.write_text: Open the file in text mode, write to it, and close the file.

MojiProg · Accepted Answer · 2017-01-08 17:58:58Z

5

This is a Python3 function for converting any text file into the one with UTF-8 encoding. (without using unnecessary packages)

def correctSubtitleEncoding(filename, newFilename, encoding_from, encoding_to='UTF-8'):
    with open(filename, 'r', encoding=encoding_from) as fr:
        with open(newFilename, 'w', encoding=encoding_to) as fw:
            for line in fr:
                fw.write(line[:-1]+'\r\n')

You can use it easily in a loop to convert a list of files.
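
One thing to watch in the function above: line[:-1] removes the last character of a final line that has no trailing newline. A variant that strips only line-ending characters is sketched below. The name convert_with_crlf is hypothetical, and newline="" on the output is an added assumption to keep Windows from expanding the written "\n" a second time.

```python
def convert_with_crlf(filename, new_filename, encoding_from, encoding_to="utf-8"):
    """Re-encode a text file and normalise every line ending to CRLF."""
    with open(filename, "r", encoding=encoding_from) as fr:
        # newline="" writes the "\r\n" below exactly as given on every platform
        with open(new_filename, "w", encoding=encoding_to, newline="") as fw:
            for line in fr:
                # strip only line-ending characters, never real content
                fw.write(line.rstrip("\r\n") + "\r\n")
```

Converting several files is then a plain loop over (source, target) pairs that calls convert_with_crlf for each.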

answered Jan 8, 2017 at 17:58

MojiProg


2 Comments

anon Over a year ago

This worked great for converting from iso-8859-1 to utf-8!

fskoras Over a year ago

Instead of "line[:-1]" it would be better to use line.rstrip('\r\n'). This way, no matter what line ending you encounter, you will get correct results.

Ricardo · Accepted Answer · 2012-02-08 19:44:05Z

2

To guess the source encoding, you can use the *nix file command.

Example:

$ file --mime jumper.xml

jumper.xml: application/xml; charset=utf-8
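
To use that guess from Python, the command's output can be parsed for the charset value. This is a rough sketch: it assumes the file command is installed and on the PATH, the helper names parse_charset and guess_charset are made up for illustration, and not every value that file reports (for example binary) is a valid Python codec name.

```python
import subprocess


def parse_charset(file_output):
    """Extract the charset value from a line of `file --mime` output."""
    marker = "charset="
    index = file_output.find(marker)
    if index == -1:
        return None
    return file_output[index + len(marker):].strip()


def guess_charset(path):
    """Run `file --mime` on path and return the reported charset, if any."""
    result = subprocess.run(
        ["file", "--mime", path],
        capture_output=True, text=True, check=True,
    )
    return parse_charset(result.stdout)
```

The result could then be passed as the encoding argument when opening the file for conversion.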

answered Feb 8, 2012 at 19:44

Ricardo


1 Comment

Arthur Julião Over a year ago

It does not answer the question.

jamlee · Accepted Answer · 2021-12-18 15:07:39Z

Convert all files in a directory to UTF-8 encoding. It is recursive and can filter files by suffix. Thanks @Sole Sensei.

# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple chardet
import os
import re
from chardet import detect


def get_file_list(d):
    result = []
    for root, dirs, files in os.walk(d):
        dirs[:] = [d for d in dirs if d not in ['venv', 'cmake-build-debug']]
        for filename in files:
            # your filter
            if re.search(r'(\.c|\.cpp|\.h|\.txt)$', filename):
                result.append(os.path.join(root, filename))
    return result


# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        raw_data = f.read()
    return detect(raw_data)['encoding']


if __name__ == "__main__":
    file_list = get_file_list('.')
    for src_file in file_list:
        print(src_file)
        trg_file = src_file + '.swp'
        from_codec = get_encoding_type(src_file)
        try:
            with open(src_file, 'r', encoding=from_codec) as f, open(trg_file, 'w', encoding='utf-8') as e:
                text = f.read()
                e.write(text)
            os.remove(src_file)
            os.rename(trg_file, src_file)
        except UnicodeDecodeError:
            print('Decode Error')
        except UnicodeEncodeError:
            print('Encode Error')

DEX Data Explorers · Accepted Answer · 2018-11-30 07:35:07Z

This is my brute force method. It also takes care of mingled \n and \r\n in the input.

# Fragment of a larger method: filelocation, outputfilelocation, delimitervalue,
# quotevalue, cfg and self are defined elsewhere.
try:
    # open the CSV file
    inputfile = open(filelocation, 'rb')
    outputfile = open(outputfilelocation, 'w', encoding='utf-8')
    for line in inputfile:
        if line[-2:] == b'\r\n' or line[-2:] == b'\n\r':
            output = line[:-2].decode('utf-8', 'replace') + '\n'
        elif line[-1:] == b'\r' or line[-1:] == b'\n':
            output = line[:-1].decode('utf-8', 'replace') + '\n'
        else:
            output = line.decode('utf-8', 'replace') + '\n'
        outputfile.write(output)
    inputfile.close()
    outputfile.close()
except BaseException as error:
    cfg.log(self.outf, "Error(18): opening CSV-file " + filelocation + " failed: " + str(error))
    self.loadedwitherrors = 1
    return ([])
try:
    # open the CSV-file of this source table
    # ("rU" mode was removed in Python 3.11; the csv module expects newline="")
    csvreader = csv.reader(open(outputfilelocation, "r", encoding="utf-8", newline=""), delimiter=delimitervalue, quoting=quotevalue, dialect=csv.excel_tab)
except BaseException as error:
    cfg.log(self.outf, "Error(19): reading CSV-file " + filelocation + " failed: " + str(error))
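
The core of this approach, decoding with the 'replace' error handler and normalising line endings, can be isolated into a standalone function. The sketch below is a simplification rather than an exact equivalent of the fragment above: it reads the whole file at once, and the name normalise_to_utf8 is hypothetical.

```python
def normalise_to_utf8(input_path, output_path):
    """Decode as UTF-8 with replacement and write LF line endings."""
    with open(input_path, "rb") as inputfile:
        raw = inputfile.read()
    # Invalid byte sequences become U+FFFD instead of raising an error.
    text = raw.decode("utf-8", "replace")
    # Collapse CRLF and lone CR into LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if text and not text.endswith("\n"):
        text += "\n"
    # newline="" keeps the LF endings as written, even on Windows.
    with open(output_path, "w", encoding="utf-8", newline="") as outputfile:
        outputfile.write(text)
```

Because invalid bytes are replaced rather than converted, this only makes sense when the input is mostly UTF-8 already.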

Just Me · Accepted Answer · 2023-03-05 10:48:03Z

import codecs
import glob

import chardet

ALL_FILES = glob.glob('*.txt')

def kira_encoding_function():
    """Check encoding and convert to UTF-8, if encoding is not UTF-8."""
    for filename in ALL_FILES:

        # Not 100% accuracy:
        # https://stackoverflow.com/a/436299/5951529
        # Check:
        # https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function
        # https://stackoverflow.com/a/37531241/5951529
        with open(filename, 'rb') as opened_file:
            bytes_file = opened_file.read()
            chardet_data = chardet.detect(bytes_file)
            fileencoding = (chardet_data['encoding'])
            print('fileencoding', fileencoding)

            if fileencoding in ['utf-8', 'ascii']:
                print(filename + ' in UTF-8 encoding')
            else:
                # Convert file to UTF-8:
                # https://stackoverflow.com/q/19932116/5951529
                cyrillic_file = bytes_file.decode('cp1251')
                with codecs.open(filename, 'w', 'utf-8') as converted_file:
                    converted_file.write(cyrillic_file)
                print(filename +
                      ' in ' +
                      fileencoding +
                      ' encoding automatically converted to UTF-8')


kira_encoding_function()
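
Note that the code above decodes every non-UTF-8 file as cp1251 regardless of what chardet reports. If that assumption holds for your files, the detection step can be replaced by a strict UTF-8 decode attempt using only the standard library. A sketch, with the hypothetical name ensure_utf8 and cp1251 kept as an explicit, overridable fallback:

```python
def ensure_utf8(filename, fallback_encoding="cp1251"):
    """Return True if the file was rewritten, False if it was already UTF-8."""
    with open(filename, "rb") as opened_file:
        bytes_file = opened_file.read()
    try:
        bytes_file.decode("utf-8")
        return False  # valid UTF-8 (which includes plain ASCII): nothing to do
    except UnicodeDecodeError:
        pass
    # Not valid UTF-8: assume the fallback encoding and rewrite the file.
    text = bytes_file.decode(fallback_encoding)
    with open(filename, "wb") as converted_file:
        converted_file.write(text.encode("utf-8"))
    return True
```

It could be applied to each name returned by glob.glob('*.txt'), as in the loop above.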
