import sys
dataset = open('file-00.csv','r')
dataset_l = dataset.readlines()

When opening the above file, I get the following error:

**UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 156: invalid start byte**

So I changed the code to the following:

import sys
dataset = open('file-00.csv','r', errors='replace')
dataset_l = dataset.readlines() 

I also tried errors='ignore'. With either setting the initial error disappears, but later in my code I get another error:

def find_class_1(row):
    global file_l_sp
    for line in file_l_sp:
        if line[0] == row[2] and line[1] == row[4] and line[2] == row[5]:
            return line[3].strip()
    return 'other'

File "Label_Classify_Dataset.py", line 56, in

dataset_w_label += dataset_l[it].strip() + ',' + find_class_1(l) + ',' + find_class_2(l) + '\n'

File "Label_Classify_Dataset.py", line 40, in find_class_1

if line[0] == row[2] and line[1] == row[4] and line[2] == row[5]:



IndexError: list index out of range

How can I fix either the first or the second error?

UPDATE....

I have used enumerate to print each line and have managed to work out which line is causing the error. It does indeed contain a stray character, which tshark must have substituted. Deleting the line removes the error, but obviously I would rather skip over such lines than delete them.

with open('file.csv') as f:
    for i, line in enumerate(f):
        print('{} = {}'.format(i+1, line.strip()))

I'm sure there is a better way to enumerate the lines, lol.
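One possible way to skip the offending lines instead of deleting them is sketched below. It assumes that skipping a whole line is acceptable whenever any byte in it fails to decode; `read_decodable_lines` is a hypothetical helper name, not part of the original script.

```python
def read_decodable_lines(path, encoding="utf-8"):
    """Return (good_lines, skipped_line_numbers) for a text file.

    The file is read as bytes so that one bad byte cannot abort the whole
    read; each line is decoded on its own and skipped if decoding fails.
    """
    good_lines = []
    skipped = []
    with open(path, "rb") as f:
        # start=1 gives human-friendly line numbers without writing i + 1
        for lineno, raw in enumerate(f, start=1):
            try:
                good_lines.append(raw.decode(encoding))
            except UnicodeDecodeError:
                skipped.append(lineno)
    return good_lines, skipped
```

Calling `read_decodable_lines('file-00.csv')` would then give the usable lines plus the numbers of the lines that were skipped, which also makes it easy to inspect what was dropped.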

  • Try to open the file with 'rb', like dataset = open('file-00.csv','rb') Commented Aug 27, 2018 at 11:29
  • Without seeing the data it's quite hard to guess. What encoding does it really have? Commented Aug 27, 2018 at 11:29
  • Don't ignore encoding errors. Open the file with the right encoding. Obviously utf8 is not the right encoding. Also, don't use .readlines() and .split() for CSV files, use the csv module. Thirdly, avoid global variables. They are not necessary for what you do here. Commented Aug 27, 2018 at 11:29
  • @RaunaqJain Thanks, I tried that; see my comment below for the new error, lol. Commented Aug 27, 2018 at 12:30
  • @ThomasWeller The data should be UTF-8, as it was a pcap file that was converted to CSV using LibreOffice with UTF-8 specified. Commented Aug 27, 2018 at 12:31
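The advice in the comments (explicit encoding, the csv module, no globals) could look roughly like the sketch below. The `latin-1` default is only an assumption chosen because it never raises on any byte; the file's real encoding is unknown here and should be substituted once identified. `load_rows` is a hypothetical helper, and `find_class_1` is rewritten to take the label rows as a parameter and to return 'other' for rows that are too short, which is one way to avoid the IndexError.

```python
import csv


def load_rows(path, encoding="latin-1"):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


def find_class_1(row, label_rows):
    """Return the label whose first three fields match row[2], row[4], row[5]."""
    if len(row) < 6:
        # Assumption: a row too short to compare is treated as unlabelled.
        return "other"
    key = [row[2], row[4], row[5]]
    for label in label_rows:
        # Skip malformed label rows instead of indexing past their end.
        if len(label) >= 4 and label[:3] == key:
            return label[3].strip()
    return "other"
```

Passing the label rows in as an argument replaces the `global file_l_sp` lookup, so the function can be tested on its own.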

1 Answer


Try the following:

dataset = open('file-00.csv','rb')

The b in the mode argument of open() means the file is treated as binary, so its contents stay as bytes. No decoding is performed in this mode.
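To illustrate, here is a small self-contained sketch using a temporary file with made-up contents (the sample bytes are an assumption, not the real data). It shows that reading in binary mode raises no decoding error and that each line comes back as a bytes object.

```python
import os
import tempfile

# Hypothetical sample data: the second line contains the byte 0xfe.
sample = b"a,b,c\nx,\xfe,z\n"

with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# Binary mode: no codec is involved, so 0xfe cannot trigger UnicodeDecodeError.
with open(path, "rb") as dataset:
    dataset_l = dataset.readlines()
os.remove(path)

print(type(dataset_l[0]))  # <class 'bytes'>
print(dataset_l[1])        # b'x,\xfe,z\n'
```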


1 Comment

OK, thanks. I tried 'rb', but this causes a later error: a new column (a string) is added to each line once it has been classified, so I get TypeError: can't concat str to bytes
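That TypeError comes from adding a str to a bytes line. One way around it, sketched here under the assumption that replacing undecodable bytes with U+FFFD is acceptable, is to decode each bytes line before appending the labels. `append_labels` is a hypothetical helper, not part of the original script.

```python
def append_labels(raw_line, labels, encoding="utf-8"):
    """Decode one bytes line and append string labels as extra CSV columns."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = raw_line.decode(encoding, errors="replace").strip()
    return ",".join([text, *labels]) + "\n"
```

For example, `append_labels(dataset_l[it], [find_class_1(l), find_class_2(l)])` would produce the same kind of line as the original concatenation, but as a str throughout.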
