I am trying to go through all the files in a folder, read each file's data decoded as UTF-8, and then write that data to a new file, which should create a copy of the original. However, when I do this the new copy of the file ends up corrupted.

- Should I be using UTF-8 text encoding for all file types (.py, .txt, .docx, .jpg)?

- Is there one standard text encoding that works for all file types?

def read_files():
    files = ["program.py", "letter.docx", "cat.jpg", "hello_world.py"]
    for file in files:
        #open existing file
        f = open(file, encoding="utf-8")
        file_content = f.read()

        #get file name info
        file_extension = file.split(".")[1]
        file_name = file.split(".")[0]

        #write encoded data to new file
        f = open(file_name + "_converted." + file_extension , "wb")
        f.write(bytes(file_content, encoding="utf-8"))
        f.close()

read_files()
  • UTF-8 is a text encoding. So, no, you should definitely not be using it to decode or encode binary data. This will fail. Commented Oct 12, 2022 at 13:22

1 Answer

The proper way to copy files is with shutil:

import shutil

# file, file_name and file_extension are the variables from the loop in the question
source = file
destination = file_name + "_converted." + file_extension
shutil.copy(source, destination)
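
As a self-contained illustration of the shutil approach, here is a minimal sketch. The helper name `copy_with_suffix` and its `suffix` parameter are hypothetical, not part of the original answer; it builds the destination name with pathlib rather than splitting on "." by hand.

```python
import shutil
from pathlib import Path


def copy_with_suffix(source, suffix="_converted"):
    """Copy `source` next to itself, inserting `suffix` before the extension.

    Hypothetical helper for illustration; shutil.copy works on raw bytes,
    so it does not matter whether the file holds text or binary data.
    """
    source = Path(source)
    # "cat.jpg" -> stem "cat", suffix ".jpg" -> "cat_converted.jpg"
    destination = source.with_name(source.stem + suffix + source.suffix)
    shutil.copy(source, destination)
    return destination
```

It could then be called once per file, for example `for file in files: copy_with_suffix(file)`, assuming the files in the list exist in the current directory.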

A worse and slower way to copy files, reading and writing the raw bytes by hand:

def read_files():
    files = ["program.py", "letter.docx", "cat.jpg", "hello_world.py"]
    for file in files:
        # open the existing file in binary mode; "with" closes it automatically
        with open(file, "rb") as f:
            file_content = f.read()

        # get file name info, splitting only on the last dot
        file_name, file_extension = file.rsplit(".", 1)

        # write the raw bytes to the new file
        with open(file_name + "_converted." + file_extension, "wb") as f:
            f.write(file_content)

read_files()

If you don't need to decode the files to text, you should only open them in binary mode. Formats such as jpg and docx are not text, so they will break if read in text mode.
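
To illustrate the difference, here is a small sketch that needs no real image. `JPEG_HEADER`, `copy_binary` and `is_valid_utf8` are names made up for this example; the header is just the kind of byte sequence a JPEG file typically starts with.

```python
import tempfile
from pathlib import Path

# Bytes like those at the start of a JPEG file; 0xFF is never valid in UTF-8.
JPEG_HEADER = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"


def copy_binary(source, destination):
    """Copy a file by reading and writing its raw bytes, with no decoding."""
    with open(source, "rb") as src, open(destination, "wb") as dst:
        dst.write(src.read())


def is_valid_utf8(data):
    """Return True if the bytes can be decoded as UTF-8 text."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "cat.jpg"
    original.write_bytes(JPEG_HEADER)
    copy = Path(tmp) / "cat_converted.jpg"
    copy_binary(original, copy)
    # The binary copy is byte-for-byte identical to the original.
    binary_copy_matches = copy.read_bytes() == JPEG_HEADER

# The same bytes cannot be read as UTF-8 text at all.
header_is_text = is_valid_utf8(JPEG_HEADER)

print(binary_copy_matches, header_is_text)
```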

Alternatively, if you actually need to work on the contents of the docx or jpg files, you should use the proper modules to do so, such as Pillow for jpg files and the docx module for docx files.

5 Comments

Your method works, but I would like to use a method based on text encoding, just to learn more about how it works, if that is actually possible.
@logan_9997 decoding the jpg file WILL raise a decode error, so don't even attempt it; it's not text, it's binary.
Is there a way to just get the binary data?
@logan_9997 it's in the second part of the answer.
@logan_9997 text files can be encoded using many different encodings, such as UTF-8/16/32, Latin-1 and ASCII. Binary data is not a text encoding at all. Decoding a file with the wrong encoding can raise a decode error or produce wrong characters, and binary files such as docx and jpg contain many byte sequences that are invalid as text, so they should only be read as binary.
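
For anyone who wants to experiment with text encodings as asked above, the following sketch uses an in-memory string instead of files. The sample word and variable names are chosen purely for illustration.

```python
text = "café"

# The same text becomes different bytes under different encodings.
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

# Decoding with the encoding that was used to encode gives the text back.
round_trip = utf8_bytes.decode("utf-8")

# Decoding UTF-8 bytes as Latin-1 does not fail, but yields wrong characters.
mojibake = utf8_bytes.decode("latin-1")  # 'cafÃ©'

# Decoding Latin-1 bytes as UTF-8 fails, because a lone 0xE9 is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

print(round_trip, mojibake, wrong_decode_failed)
```

Text encodings map characters to bytes and back, so they only make sense for data that started out as text.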
