'utf-8' codec can't decode byte & IndexError: list index out of range errors

Question

import sys
dataset = open('file-00.csv','r')
dataset_l = dataset.readlines()

When opening the above file, I get the following error:

**UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 156: invalid start byte**

So I changed the code to the following:

import sys
dataset = open('file-00.csv','r', errors='replace')
dataset_l = dataset.readlines()

I also tried errors='ignore'. With either option the initial error disappears, but later in my code I get another error:

def find_class_1(row):
    global file_l_sp
    for line in file_l_sp:
        if line[0] == row[2] and line[1] == row[4] and line[2] == row[5]:
            return line[3].strip()
    return 'other'

File "Label_Classify_Dataset.py", line 56, in

dataset_w_label += dataset_l[it].strip() + ',' + find_class_1(l) + ',' + find_class_2(l) + '\n'

File "Label_Classify_Dataset.py", line 40, in find_class_1

if line[0] == row[2] and line[1] == row[4] and line[2] == row[5]:

IndexError: list index out of range

How can I fix either the first or the second error?

UPDATE....

I enumerated and printed each line of the file and have managed to work out which line is causing the error. It is indeed some random character that tshark must have substituted. Deleting it removes the error, but obviously I would rather skip over such lines than delete them.

with open('file.csv') as f:
    for i, line in enumerate(f):
        print('{} = {}'.format(i+1, line.strip()))

I'm sure there is a better way to enumerate the lines, lol.
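One way to skip the offending lines instead of deleting them is to read the file in binary mode and decode each line separately. This is only a minimal sketch: the helper name `read_decodable_lines` and the demo file contents are made up for illustration, and it assumes the rest of the file really is UTF-8.

```python
import os
import tempfile


def read_decodable_lines(path, encoding="utf-8"):
    """Return (good_lines, skipped_line_numbers) for the file at `path`."""
    good, skipped = [], []
    # Binary mode defers decoding, so one bad byte cannot abort the whole read.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                good.append(raw.decode(encoding).rstrip("\r\n"))
            except UnicodeDecodeError:
                skipped.append(lineno)  # remember which lines were dropped
    return good, skipped


# Demo on a throwaway file; the contents are hypothetical.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b,c\n1,\xfe,3\nx,y,z\n")
    demo_path = tmp.name

lines, skipped = read_decodable_lines(demo_path)
os.remove(demo_path)
print(lines, skipped)
```

Passing `start=1` to `enumerate` also removes the need for `i+1` when printing line numbers.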

Try to open the file with 'rb', like dataset = open('file-00.csv','rb')
– Raunaq Jain, Commented Aug 27, 2018 at 11:29
Without seeing the data it's quite hard to guess. What encoding does it really have?
– Thomas Weller, Commented Aug 27, 2018 at 11:29
Don't ignore encoding errors. Open the file with the right encoding. Obviously utf8 is not the right encoding. Also, don't use .readlines() and .split() for CSV files, use the csv module. Thirdly, avoid global variables. They are not necessary for what you do here.
– Tomalak, Commented Aug 27, 2018 at 11:29
@RaunaqJain thanks, I tried that. See my comment below for the new error, lol.
– Bat, Commented Aug 27, 2018 at 12:30
@ThomasWeller the data should be utf8, as it was a pcap file which was converted to csv using LibreOffice with utf8 specified.
– Bat, Commented Aug 27, 2018 at 12:31
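Tomalak's suggestions (use the csv module, avoid globals) could look roughly like the sketch below. It is an illustration under assumptions, not the asker's actual code: the sample rows are invented, the label list is passed in as a parameter named `labels`, and the length checks are one possible guard against the IndexError when a damaged line yields too few columns.

```python
import csv
import io


def find_class_1(row, labels):
    """Return the label whose first three fields match row[2], row[4], row[5]."""
    if len(row) < 6:  # too few columns to compare; avoids IndexError
        return "other"
    for label in labels:
        if len(label) >= 4 and label[:3] == [row[2], row[4], row[5]]:
            return label[3].strip()
    return "other"


# Made-up sample data; a real file would be opened with open(path, newline="")
# plus an explicit encoding= argument naming the file's actual encoding.
labels = list(csv.reader(io.StringIO("10.0.0.1,80,tcp,web\n")))
data = io.StringIO("1,x,10.0.0.1,y,80,tcp\n2,short,row\n")

results = [find_class_1(row, labels) for row in csv.reader(data)]
print(results)
```

The short second row falls through to 'other' instead of raising, which may or may not be the behaviour wanted for damaged lines.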

Nordle · Accepted Answer · 2018-08-27 11:29:30Z

Try the following:

dataset = open('file-00.csv','rb')

The b in the mode argument of open() means the file is treated as binary, so its contents are returned as bytes and no decoding is performed.
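Because binary mode hands back undecoded bytes, it can also be used to locate the byte that UTF-8 rejects. A small sketch, where the `raw` value is an invented stand-in for what `dataset.read()` would return:

```python
raw = b"ok line\nbad \xfe byte\n"  # made-up stand-in for the file's contents

# Nothing has been decoded yet; decoding by hand exposes where it fails.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_offset = exc.start                        # index of the first bad byte
    bad_byte = raw[exc.start:exc.end]             # the offending byte(s)
    line_no = raw.count(b"\n", 0, exc.start) + 1  # 1-based line number
    print(bad_offset, bad_byte, line_no)
```

The `start` and `end` attributes of `UnicodeDecodeError` give the position of the undecodable input, the same position reported in the error message.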

1 Comment

Bat Over a year ago

OK, thanks. I tried 'rb', but this then causes a later error, since a new column (a string) is added once each line has been classified, so I get the error: TypeError: can't concat str to bytes
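That TypeError arises because lines read with 'rb' are bytes while the appended label is a str. One possible way around it, sketched with invented sample lines and a hypothetical `label` value, is to decode each line before concatenating:

```python
raw_lines = [b"1,a,b\n", b"2,\xfe,c\n"]  # stand-in for lines read with 'rb'
label = "web"                             # hypothetical classification result

out = ""
for raw in raw_lines:
    # Decode first so both operands of + are str; errors="replace" swaps
    # undecodable bytes for U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    out += text.strip() + "," + label + "\n"
print(out)
```

This keeps every line, with the bad byte replaced by U+FFFD, so it is equivalent to opening in text mode with errors='replace' and does not address the IndexError on its own.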
