
I have converted a kdb query result into a dataframe and then written that dataframe to a CSV file. This caused an encoding error, which I easily fixed by decoding to UTF-8. However, there is one column for which this did not work.

"nameFid" is the column that isn't working correctly; it appears in the CSV file as " b'STRING' "

I am running Python 3.7, any other information needed I will be happy to provide.

Here is the code that decodes the data in the dataframe I get from kdb:

    def decode_columns(df):
        for ba in df.dtypes.keys():
            if df.dtypes[ba] == 'O':
                try:
                    df[ba] = df[ba].apply(lambda x: x.decode('UTF-8'))
                except Exception as e:
                    print(e)
        return df

This worked for every column except "nameFid".

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 6: invalid continuation byte

This is one error I get, but I thought it suggests the data isn't encoded using UTF-8, which would surely mean none of the columns would work?

When using the try/except, it instead prints "'Series' object has no attribute 'decode'".
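For what it's worth, that second error comes from calling decode on the whole Series rather than on each element; pandas provides Series.str.decode for element-wise decoding. A minimal sketch, using a made-up bytes column:

```python
import pandas as pd

# Hypothetical bytes column, similar in shape to nameFid
s = pd.Series([b'ABC', b'DEF'])

# s.decode('utf-8') would raise: 'Series' object has no attribute 'decode'
# The .str accessor applies bytes.decode element-wise instead:
decoded = s.str.decode('utf-8')
print(decoded.tolist())  # → ['ABC', 'DEF']
```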

My goal is to remove the "b''" from the column values, which currently show

" b'STRING' "

I'm not sure what else I need to add. Let me know if you need anything.

Also sorry I am quite new to all of this.

  • "which would surely mean all the columns wouldn't work?" Not really. Many encodings are partially compatible with one another. Specifically, the ASCII range is often supported as-is for backward compatibility, so all your other columns may be ASCII-only and not generate any error, while only that column contains characters that are not valid UTF-8. You can try to guess the encoding of the file using the chardet package. Commented Jul 31, 2019 at 9:54
  • Ahhhh, thank you so much; managed to sort it because of this comment. Really appreciate it. Commented Jul 31, 2019 at 10:47

1 Answer


Many encodings are partially compatible with one another. This is mostly due to the prevalence of ASCII, so a lot of them are backward compatible with ASCII but extend it differently. Hence, if your other columns only contain things like numbers, they are likely ASCII-only and will decode correctly under many different encodings.

The column that raises an error, however, contains some character outside the normal ASCII range, and thus the encoding starts to matter. If you don't know the encoding of the file, you can use chardet to try to guess it. Keep in mind that this is just guessing: decoding with a different encoding may not raise any error, but it can silently produce the wrong characters in the final text, so you should always know which encoding to use.
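A sketch of that difference (chardet is a third-party package, so the import is guarded here; the sample byte mirrors the 0xdc from the question's traceback):

```python
raw = b'M\xdcLLER'  # 0xdc is 'Ü' in latin-1 but an invalid continuation byte in UTF-8

try:
    import chardet  # third-party: pip install chardet
    print(chardet.detect(raw))  # returns a dict with 'encoding' and 'confidence'
except ImportError:
    pass

# Stdlib-only alternative: try a shortlist of candidate encodings
for enc in ('utf-8', 'latin-1'):
    try:
        print(enc, '->', raw.decode(enc))
    except UnicodeDecodeError as exc:
        print(enc, '-> failed:', exc)
```

Here utf-8 fails with the same "invalid continuation byte" error as in the question, while latin-1 succeeds, though a successful decode alone doesn't prove the encoding is the right one.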

That said, if you are on Linux, the standard file utility can often give you a rough guess of the encoding used; for more advanced use cases something like chardet is necessary.

Once you have found the correct encoding (say you found it is latin-1), simply replace the decode('utf-8') with decode('latin-1').
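Applied to the loop from the question, that might look like the following sketch (the toy dataframe and latin-1 are assumptions; substitute whatever encoding you actually determine):

```python
import pandas as pd

# Toy frame standing in for the kdb result; the bytes here are latin-1, not UTF-8
df = pd.DataFrame({
    'price': [1.0, 2.0],                    # non-object column, left alone
    'sym': [b'AAPL', b'MSFT'],              # ASCII bytes decode the same either way
    'nameFid': [b'M\xdcLLER', b'D\xdcRR'],  # 0xdc would break a UTF-8 decode
})

for col in df.columns:
    if df[col].dtype == object:
        # element-wise decode; 'latin-1' is the assumed encoding here
        df[col] = df[col].str.decode('latin-1')

print(df['nameFid'].tolist())  # → ['MÜLLER', 'DÜRR']
```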
