UnicodeDecodeError in reading a file

Question

When I read the whole file at once, my script works without a problem:

fst = 0
with open(in_ckfile, 'rb', 0) as file:
    with open(outfile_namepath, mode='wb') as outfile:
        while True:
            #buf = file.read(204800)
            buf = file.read()
                    
            if buf: 
                fst += 1
                print('read no., len of buf ......: ', fst, len(buf))

                buf = buf.decode()
                xbytes = bytearray()
                xbytes.extend(map(ord, buf))  
                buf = xbytes

                print('read no., len of decode buf: ', fst, len(buf))

The result of the process is shown below:

read no., len of buf ......:  1 26848013
read no., len of decode buf:  1 18546777
len of in string ..........:  18546777
len of output str, checked :  18546777 370130

However, when I read the file in fixed-size units with buf = file.read(204800), it gives an error:

read no., len of buf ......:  1 204800
read no., len of decode buf:  1 141406
len of in string ..........:  141406
len of output str, checked :  141406 2827 

read no., len of buf ......:  2 204800
read no., len of decode buf:  2 141606
len of in string ..........:  141606
len of output str, checked :  141606 2800 

read no., len of buf ......:  3 204800
Traceback (most recent call last):
  File "<pyshell#155>", line 1, in <module>
  ...
  buf = buf.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 204799: unexpected end of data

How do I fix this issue?

Why do you need to pass an argument to the read function? Maybe your file doesn't have enough bytes left to read? That's probably why you're getting the "unexpected end of data" message.
– SLDem, Commented Sep 5, 2021 at 14:14
My file is 26 MB. I don't want to read the entire file at once; I want to read files of different sizes that may be up to 400 MB.
– user8597915, Commented Sep 5, 2021 at 14:19
Try passing it 204798; it shouldn't be too problematic for your file. Tell me the result.
– SLDem, Commented Sep 5, 2021 at 14:20
This time the same error appeared on unit no. 4: read no., len of buf ......: 4 204798
– user8597915, Commented Sep 6, 2021 at 0:09

Jiří Baum · Accepted Answer · 2021-09-05 14:28:24Z

2

In UTF-8, many characters are encoded as multi-byte sequences. When you read blocks with a fixed number of bytes, you will sometimes end up with the beginning of a sequence in one block and the remainder in the next one. This is the situation in the error you posted: byte 0xc3 at position 204799 is the first byte of a two-byte sequence, and the block ends before its second byte.
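To make the split concrete, here is a minimal sketch that reproduces the same error on a tiny input. The sample string and the variable names are chosen purely for illustration and do not come from the question.

```python
# 'é' (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9.
data = "café".encode("utf-8")        # b'caf\xc3\xa9': 5 bytes for 4 characters

# Cut the data after 4 bytes, so the cut falls between 0xC3 and 0xA9.
first, second = data[:4], data[4:]

try:
    first.decode("utf-8")
except UnicodeDecodeError as exc:
    error_reason = exc.reason        # same reason as in the traceback above
    print(exc)

# Decoding the bytes as one piece works, because the sequence is complete.
whole = data.decode("utf-8")
```

The first block fails to decode only because its last byte starts a sequence that finishes in the second block; the data itself is valid.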

How to solve it - two options:

1. Use one of the built-in ways to handle it, e.g. opening the file as a UTF-8-encoded text file or using a stream decoder, and let the standard library handle it. This is usually the better approach.
2. If you need to handle it manually: on every block other than the last, check the end of the block, remove any incomplete multi-byte sequence (or simply any trailing multi-byte sequence, which is easier to detect), and put it at the beginning of the next block.
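As a sketch of the first option, the standard library's `codecs.getincrementaldecoder` keeps any unfinished sequence in an internal buffer between calls. The function name `decode_in_chunks` and its parameters are hypothetical, chosen for this example; the 204800 default mirrors the block size in the question.

```python
import codecs


def decode_in_chunks(path, chunk_size=204800, encoding="utf-8"):
    """Yield decoded text from a binary file, one block at a time.

    Hypothetical helper for illustration; not part of the original script.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    with open(path, "rb") as infile:
        while True:
            raw = infile.read(chunk_size)
            if not raw:
                # final=True raises UnicodeDecodeError if the file itself
                # ends in the middle of a multi-byte sequence.
                decoder.decode(b"", final=True)
                break
            # Bytes of an unfinished sequence stay buffered in the decoder
            # and are completed by the next block.
            text = decoder.decode(raw)
            if text:
                yield text
```

If the byte-level block size does not matter, opening the file in text mode with `open(path, encoding="utf-8")` and calling `read(n)` is simpler still; in that mode `n` counts characters rather than bytes, so a block can never end inside a sequence.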

edited Sep 5, 2021 at 14:28

answered Sep 5, 2021 at 14:22


2 Comments

user8597915 Over a year ago

After a lot of testing, the first option is better, as it only requires some changes to be made. The second option is great if it works without any errors. Thanks.

Jiří Baum Over a year ago

Yeah, for the second option you'd have to read up on UTF-8 start bytes vs. continuation bytes and fiddle with it to get it right; quite a lot of hassle for something that built-in functions already handle satisfactorily.
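For reference, a rough sketch of what the manual approach involves. It relies on the UTF-8 byte layout: continuation bytes have the form 10xxxxxx, and the start byte's high bits give the sequence length (110xxxxx for 2 bytes, 1110xxxx for 3, 11110xxx for 4). The names `split_incomplete_tail` and `decode_blocks` are hypothetical, and the code assumes the file is otherwise valid UTF-8.

```python
def split_incomplete_tail(block):
    """Return (complete, tail); tail is a trailing unfinished UTF-8 sequence."""
    # A sequence is at most 4 bytes long, so look at the last 4 bytes only.
    for back in range(1, min(4, len(block)) + 1):
        byte = block[-back]
        if byte & 0xC0 == 0x80:
            continue                  # continuation byte: keep looking back
        if byte >= 0xF0:
            need = 4                  # start of a 4-byte sequence
        elif byte >= 0xE0:
            need = 3                  # start of a 3-byte sequence
        elif byte >= 0xC0:
            need = 2                  # start of a 2-byte sequence
        else:
            need = 1                  # plain ASCII byte
        if need > back:
            return block[:-back], block[-back:]
        return block, b""
    return block, b""                 # empty or malformed: let decode() report it


def decode_blocks(path, chunk_size=204800):
    """Yield decoded text per block, carrying unfinished bytes forward."""
    carry = b""
    with open(path, "rb") as infile:
        while True:
            raw = infile.read(chunk_size)
            if not raw:
                if carry:
                    carry.decode("utf-8")   # raises: file ended mid-sequence
                break
            complete, carry = split_incomplete_tail(carry + raw)
            yield complete.decode("utf-8")
```

Even this short version needs care at the block boundaries and at end of file, which is the hassle the built-in decoders avoid.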
