I have a pandas DataFrame with byte strings as the elements of one column, e.g. b'hey'.

When I write this DataFrame to a CSV file and read it back, pandas returns a string of the form "b'hey'". This is a problem because tf.data.Dataset.from_tensor_slices then casts the string to a byte string again, giving b"b'hey'". Specifying the dtype when reading the CSV with dtype = {"COLUMN_NAME": bytes} didn't do anything.
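A minimal reproduction of the round trip described above, as a sketch: the column name "text" and the in-memory buffer (used instead of a file on disk) are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical one-column frame; the column name "text" is an assumption.
df = pd.DataFrame({"text": [b"hey"]})

# Write to an in-memory buffer rather than a real file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back: the cell is now the text of the bytes literal, not bytes.
restored = pd.read_csv(buffer)
value = restored.loc[0, "text"]
print(type(value), repr(value))
```

The value that comes back is a str whose content includes the leading b and the quotes, which is why a later conversion to bytes wraps it a second time.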

Does anyone have a solution to this that doesn't involve manually editing the string to remove the b?

1 Answer

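The solution is to apply ast.literal_eval first, which parses the string "b'hey'" back into the bytes object b'hey', and then decode the result with 'utf-8'. The decode step is only needed if you want a str rather than bytes.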

To read the file and convert a whole column of byte strings:

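import ast
import pandas as pd

# Replace the placeholder with the path to your file; sep='\t' assumes a tab-separated file.
df = pd.read_csv("<YOUR_DATA_FILE>", sep='\t')
# The column is assumed to be named 'text'. literal_eval turns the string "b'hey'" into b'hey',
# and .decode("utf-8") converts those bytes to a str. Drop .decode(...) to keep bytes.
df['text'] = df['text'].apply(lambda x: ast.literal_eval(x).decode("utf-8"))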
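The conversion can also be tried end to end without a file on disk. The following is a sketch under stated assumptions: the helper name parse_bytes_literal, the column name "text", and the sample CSV text are all hypothetical and not part of the answer above. The helper adds a type check so that a cell which is not a bytes literal fails loudly instead of passing through silently.

```python
import ast
import io

import pandas as pd


def parse_bytes_literal(cell):
    """Turn a string such as "b'hey'" back into the bytes object b'hey'.

    Hypothetical helper: raises ValueError if the cell is not a bytes literal.
    """
    value = ast.literal_eval(cell)
    if not isinstance(value, bytes):
        raise ValueError(f"expected a bytes literal, got {type(value).__name__}")
    return value


# Sample CSV text as pandas would have written it for a bytes column.
# The second row holds the UTF-8 encoding of "café" in escaped form.
csv_text = "text\nb'hey'\nb'caf\\xc3\\xa9'\n"

df = pd.read_csv(io.StringIO(csv_text))

# Restore real bytes objects from their textual form.
df["text"] = df["text"].apply(parse_bytes_literal)
as_bytes = df["text"].tolist()

# Optionally decode to str, as in the answer above.
as_str = df["text"].apply(lambda b: b.decode("utf-8")).tolist()
print(as_bytes, as_str)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice here than eval for text read from a file. Keeping the bytes and the decoded str as separate steps lets you stop at whichever form the downstream code expects.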