Encoding any file type with Python

Question

I am trying to go through all the files in a folder, read each file's data decoded as UTF-8, and then write that data to a new file, which should create a copy of the original. However, the new copy ends up corrupted.

- Should I be using UTF-8 text encoding to encode all file types (.py, .txt, .docx, .jpg)?

- Is there one standard text encoding format that works for all file types?

def read_files():
    files = ["program.py", "letter.docx", "cat.jpg", "hello_world.py"]
    for file in files:
        #open existing file
        f = open(file, encoding="utf-8")
        file_content = f.read()

        #get file name info
        file_extension = file.split(".")[1]
        file_name = file.split(".")[0]

        #write encoded data to new file
        f = open(file_name + "_converted." + file_extension , "wb")
        f.write(bytes(file_content, encoding="utf-8"))
        f.close()

read_files()

UTF-8 is a text encoding. So, no, you should definitely not be using it to decode or encode binary data. This will fail.
– Konrad Rudolph, Commented Oct 12, 2022 at 13:22

Ahmed AEK · Accepted Answer · 2022-10-12 13:11:55Z

1

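The proper way to copy files is with shutil:

import shutil

# file, file_name and file_extension are the variables from the question's loop
source = file
destination = file_name + "_converted." + file_extension
shutil.copy(source, destination)  # copies the raw bytes; nothing is decoded or encoded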
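As a minimal sketch of how this could be applied to each file from the question, the helper below builds the `_converted` name with `pathlib` and leaves the actual copy to `shutil.copy`. The function name `copy_with_suffix` and its `suffix` parameter are hypothetical, introduced only for illustration; they are not part of `shutil`.

```python
import shutil
from pathlib import Path


def copy_with_suffix(source, suffix="_converted"):
    """Copy `source` next to itself as <stem><suffix><ext> and return the new path.

    Hypothetical helper for illustration; shutil.copy does the real work.
    """
    src = Path(source)
    # stem and suffix split on the last dot only, so "cat.v2.jpg" keeps ".jpg"
    destination = src.with_name(src.stem + suffix + src.suffix)
    shutil.copy(src, destination)  # bytes are copied verbatim, no text decoding
    return destination
```

It could then be called in a loop, for example `for file in ["program.py", "letter.docx", "cat.jpg"]: copy_with_suffix(file)`, and behaves the same for text and binary files.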

A bad and slow way to copy files (it reads each file fully into memory and writes it back out):

import os

def read_files():
    files = ["program.py", "letter.docx", "cat.jpg", "hello_world.py"]
    for file in files:
        # open existing file in binary mode so the bytes are read unchanged
        # (the with block closes the file automatically)
        with open(file, "rb") as f:
            file_content = f.read()

        # get file name info; splitext splits on the last dot only
        file_name, file_extension = os.path.splitext(file)

        # write raw data to new file (file_extension already includes the dot)
        with open(file_name + "_converted" + file_extension, "wb") as f:
            f.write(file_content)

read_files()

If you don't need to decode the files to text, open them only in binary mode. Files such as jpg and docx are not text, so reading them in text mode will fail or corrupt the data.
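To illustrate why text mode breaks on such files, here is a small sketch that works on bytes in memory rather than real files. The byte values are made up to resemble the start of a JPEG, and `try_decode` is a hypothetical helper name used only for this example.

```python
# Bytes resembling the start of a JPEG file; 0xFF is never valid in UTF-8.
jpeg_like = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])


def try_decode(data, encoding="utf-8"):
    """Return the decoded text, or None if the bytes are not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


print(try_decode(jpeg_like))                  # None: not valid UTF-8
print(try_decode("héllo".encode("utf-8")))    # héllo: real text decodes fine

# Suppressing the error does not help: invalid bytes are replaced with U+FFFD,
# so encoding the result again does not give back the original bytes.
lossy = jpeg_like.decode("utf-8", errors="replace").encode("utf-8")
print(lossy == jpeg_like)                     # False
```

Reading with `open(file, "rb")` avoids the problem because no decoding step is involved at all.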

Alternatively, if you actually need to work with the contents of the docx or jpg files, use the proper modules to do so, such as Pillow for jpg files and python-docx for docx files.

edited Oct 12, 2022 at 13:11

answered Oct 12, 2022 at 12:42


5 Comments

logan_9997 Over a year ago

Your method works, but I would like to use a method based on text encoding, just to learn more about how it works, if that is actually possible.

Ahmed AEK Over a year ago

@logan_9997 Decoding the jpg file WILL raise a decode error, so don't even attempt it. It's not text, it's binary.

logan_9997 Over a year ago

Is there a way to just get the binary data?

Ahmed AEK Over a year ago

@logan_9997 It's in the second part of the answer.

Ahmed AEK Over a year ago

@logan_9997 Text files can be encoded using many different encodings, such as UTF-8/16/32, Latin-1 and ASCII. Decoding a file with the wrong encoding can raise a decode error or produce wrong characters, because the bytes do not form valid text in that encoding. Binary files such as docx and jpg are not text in any encoding, so decoding them as text will hit a lot of these invalid byte sequences.
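A short sketch of what using the wrong encoding looks like in practice, with the sample string `"café"` chosen only for illustration:

```python
text = "café"

# The same text becomes different bytes under different encodings.
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")    # b'caf\xe9'

# Wrong codec, silent damage: UTF-8 bytes read as Latin-1 give wrong characters.
mojibake = utf8_bytes.decode("latin-1")  # 'cafÃ©'

# Wrong codec, loud failure: a lone 0xE9 byte is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(utf8_bytes, latin1_bytes, mojibake, decode_failed)
```

Either outcome changes or rejects the data, which is why a byte-for-byte copy should not pass through a text decoding step.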
