When I read the whole file at once, my script works without a problem:

fst = 0
with open(in_ckfile, 'rb', 0) as file:
    with open(outfile_namepath, mode='wb') as outfile:
        while True:
            #buf = file.read(204800)
            buf = file.read()
                    
            if buf: 
                fst += 1
                print('read no., len of buf ......: ', fst, len(buf))

                buf = buf.decode()
                xbytes = bytearray()
                xbytes.extend(map(ord, buf))  
                buf = xbytes

                print('read no., len of decode buf: ', fst, len(buf))

The output of that run is shown below:

read no., len of buf ......:  1 26848013
read no., len of decode buf:  1 18546777
len of in string ..........:  18546777
len of output str, checked :  18546777 370130 

However, when I read the file in fixed-size blocks with buf = file.read(204800), it raises an error:

read no., len of buf ......:  1 204800
read no., len of decode buf:  1 141406
len of in string ..........:  141406
len of output str, checked :  141406 2827 

read no., len of buf ......:  2 204800
read no., len of decode buf:  2 141606
len of in string ..........:  141606
len of output str, checked :  141606 2800 

read no., len of buf ......:  3 204800
Traceback (most recent call last):
  File "<pyshell#155>", line 1, in <module>
  ...
  buf = buf.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 204799: unexpected end of data

How do I fix this?

  • Why do you need to pass an argument to the read function? Maybe your file doesn't have enough byte data to read? That's probably why you're getting the "unexpected end of data" message. Commented Sep 5, 2021 at 14:14
  • My file size is 26 MB. I don't want to read the entire file at once; I want to read files of different sizes, which may be up to 400 MB. Commented Sep 5, 2021 at 14:19
  • Try passing it 204798; it shouldn't be too problematic for your file. Then tell us the result. Commented Sep 5, 2021 at 14:20
  • This time the same error appeared on block no. 4: read no., len of buf ......: 4 204798 Commented Sep 6, 2021 at 0:09

1 Answer

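In UTF-8, many characters are encoded as multi-byte sequences. When you read blocks with a fixed number of bytes, you will sometimes end up with the beginning of a sequence in one block and the remainder in the next one. That is what happened in the error you posted: byte 0xc3 at position 204799, the last byte of the block, is the first byte of a two-byte sequence whose second byte landed in the next block.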
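To make this concrete, here is a small sketch that reproduces the same failure on a five-byte string. The sample word and the cut position are chosen purely for illustration.

```python
# "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9.
data = "café".encode("utf-8")        # b'caf\xc3\xa9', 5 bytes in total

# Cut the data so that 0xC3 ends the first block and 0xA9 starts the second.
first, second = data[:4], data[4:]

try:
    first.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # The lead byte 0xC3 has no continuation byte inside this block.
    error_message = str(exc)

print(error_message)                 # mentions "unexpected end of data"
print(data.decode("utf-8"))          # decoding the undivided data works
```

The bytes themselves are valid; only the block boundary is in the wrong place.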

How to solve it - two options:

  • Use one of the built-in ways to handle it, e.g. opening the file as a UTF-8-encoded text file, or using a stream decoder, and let the standard library deal with split sequences. This is usually the better approach.
  • If you need to handle it manually: on every block except the last, check the end of the block and remove any incomplete multi-byte sequence (or, more simply, any trailing multi-byte sequence, which is easier to detect), then put the removed bytes at the beginning of the next block.
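A minimal sketch of the first option, assuming you want to keep reading in binary mode with a fixed block size. The function name decode_in_chunks and its parameters are made up for this example; codecs.getincrementaldecoder is the standard-library stream decoder, and it holds back an incomplete trailing sequence until the next call supplies the rest.

```python
import codecs


def decode_in_chunks(path, chunk_size=204800):
    """Yield decoded text from a UTF-8 file read in fixed-size byte blocks."""
    # The decoder object remembers incomplete sequences between calls.
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            # May return fewer characters than the block holds if the block
            # ends mid-sequence; the held-back bytes are used on the next call.
            yield decoder.decode(buf)
    # final=True makes the decoder raise if the file itself ended mid-sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

If you do not need byte-sized blocks, opening the file in text mode with open(path, encoding="utf-8") is simpler still: read(n) then counts characters instead of bytes, so a character is never split.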

2 Comments

After a lot of testing, the first option is better, as it requires some changes to be made. The second option is great if it works without any errors. Thanks.
Yeah, for the second option you'd have to read up on the UTF-8 start characters vs continuation characters and fiddle with it to get it right; quite a lot of hassle for something that built-in functions already handle satisfactorily
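For reference, a rough sketch of what the manual approach involves. The helper names split_at_char_boundary and decode_blocks are invented for this example, and it only handles the "incomplete sequence at the end of a block" case; otherwise invalid input is left for decode() to report. In UTF-8, continuation bytes have the bit pattern 10xxxxxx, and the lead byte tells you how long the sequence should be.

```python
def split_at_char_boundary(block):
    """Return (head, tail): tail is a trailing incomplete sequence, or b""."""
    # A sequence is at most 4 bytes, so inspect at most the last 4 bytes.
    for back in range(1, min(4, len(block)) + 1):
        byte = block[-back]
        if byte & 0xC0 == 0x80:
            continue                  # continuation byte: keep walking back
        if byte >= 0xF0:
            needed = 4                # lead byte of a 4-byte sequence
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            needed = 1                # plain ASCII
        if needed > back:             # sequence continues past the block end
            return block[:-back], block[-back:]
        break
    return block, b""


def decode_blocks(blocks):
    """Decode an iterable of byte blocks, carrying incomplete tails forward."""
    carry = b""
    for block in blocks:
        head, carry = split_at_char_boundary(carry + block)
        yield head.decode("utf-8")
    if carry:
        # Input ended mid-sequence; decode() raises UnicodeDecodeError here.
        yield carry.decode("utf-8")
```

This is roughly the bookkeeping that the built-in stream decoder already does for you.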
