
I have a TSV file with a column containing UTF-8 encoded byte strings (e.g., b'La croisi\xc3\xa8re'). When I read this file with the pandas function read_csv, I get a column of regular strings rather than byte strings (e.g., "b'La croisi\xc3\xa8re'").

How can I read that column as byte strings instead of regular strings in Python 3? I tried passing dtype={'my_bytestr_col': bytes} to read_csv, with no luck.

Another way to put it: How can I go from something like "b'La croisi\xc3\xa8re'" to b'La croisi\xc3\xa8re'?

  • To go from something like "b'La croisi\xc3\xa8re'" to b'La croisi\xc3\xa8re', you can use data[2:-1].encode(). Commented Mar 12, 2018 at 20:26
  • Not really; this returns the following wrongly encoded byte string: b'La croisi\xc3\x83\xc2\xa8re'. Commented Mar 12, 2018 at 20:36
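
For readers following the exchange above, here is a small sketch of why slicing and calling encode() does not recover the original bytes. It assumes two possible inputs: a string holding literal backslash escapes (what a file on disk would contain) and a string typed into the interpreter without a raw prefix (which would explain the double-encoded result reported in the comment). The variable names are illustrative only.

```python
import ast

# Case 1: the text as it would sit in a file, with literal backslashes.
raw = r"b'La croisi\xc3\xa8re'"

# Slicing and encoding keeps the escape sequences as plain text.
naive = raw[2:-1].encode()
print(naive)  # b'La croisi\\xc3\\xa8re'

# literal_eval parses the text as a Python bytes literal instead.
parsed = ast.literal_eval(raw)
print(parsed)                  # b'La croisi\xc3\xa8re'
print(parsed.decode("utf-8"))  # La croisière

# Case 2: a non-raw string, where \xc3 and \xa8 become the characters
# U+00C3 and U+00A8. Encoding those as UTF-8 double-encodes them.
interpreted = "b'La croisi\xc3\xa8re'"
double_encoded = interpreted[2:-1].encode()
print(double_encoded)  # b'La croisi\xc3\x83\xc2\xa8re'

# Latin-1 maps each such character back to a single byte.
fixed = interpreted[2:-1].encode("latin-1")
print(fixed)  # b'La croisi\xc3\xa8re'
```

Which case applies depends on how the string was produced; for text read from a file, the first case is the likely one.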

1 Answer

Sample file, shown as pandas displays it after loading:

    First Name  Last Name   bytes
0   foo          bar        b'La croisi\xc3\xa8re' 

Then parse the column with ast.literal_eval:

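import ast
import pandas as pd

# The 'bytes' column is read as plain strings that look like bytes literals.
df = pd.read_csv('file.tsv', sep='\t')
# Parse each string as a Python literal to get real bytes objects.
df['bytes'].apply(ast.literal_eval)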

Out:

0    b'La croisi\xc3\xa8re'
Name: bytes, dtype: object
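
A self-contained variant, in case it helps to try this without a file on disk. This is only a sketch: the in-memory tsv_text stands in for the real file (an assumption, with the same column names as the sample above), and it uses the converters argument of read_csv so the column is parsed while loading rather than afterwards.

```python
import ast
import io

import pandas as pd

# In-memory stand-in for the TSV file. The raw string keeps the
# backslashes literal, as they would appear on disk.
tsv_text = (
    "First Name\tLast Name\tbytes\n"
    "foo\tbar\t" + r"b'La croisi\xc3\xa8re'" + "\n"
)

# converters applies the given function to every cell of the named column.
df = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    converters={"bytes": ast.literal_eval},
)

value = df.loc[0, "bytes"]
print(type(value))            # <class 'bytes'>
print(value.decode("utf-8"))  # La croisière
```

ast.literal_eval only evaluates Python literals, so it will raise an error on a cell that is not a valid literal; if the column may contain empty or malformed cells, a small wrapper function that handles those cases would be needed.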

4 Comments

This doesn't work because it opens the utf-8 encoded byte string column from my tsv file into a simple string. Maybe there's a set of arguments I could pass to this function that would fix it?
bytes(str(df['my_bytestr_col']),'utf-8')
This returns b"b'La croisi\\xc3\\xa8re'" in Python 3
OK, this should work for you: df['bytes'].apply(ast.literal_eval)
