I'm working with a CSV file that has the following format:

"Id","Sequence"
3,"1,3,13,87,1053,28576,2141733,508147108,402135275365,1073376057490373,9700385489355970183,298434346895322960005291,31479360095907908092817694945,11474377948948020660089085281068730"
7,"1,2,1,5,5,1,11,16,7,1,23,44,30,9,1,47,112,104,48,11,1,95,272,320,200,70,13,1,191,640,912,720,340,96,15,1,383,1472,2464,2352,1400,532,126,17,1,767,3328,6400,7168,5152,2464,784,160,19,1,1535,7424"
8,"1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,512,640,1024,1280,2048,2560,4096,5120,8192,10240,16384,20480,32768,40960,65536,81920,131072,163840,262144,327680,524288,655360,1048576,1310720,2097152"
11,"1,8,25,83,274,2275,132224,1060067,3312425,10997342,36304451,301432950,17519415551,140456757358,438889687625,1457125820233,4810267148324,39939263006825,2321287521544174,18610239435360217"

I'd like to read this into a data frame in which df['Id'] is integer-like and each entry of df['Sequence'] is list-like.

I currently have the following kludgy code:

import pandas as pd

def clean(seq_string):
    # Split the quoted field on commas and parse each piece as an int
    return list(map(int, seq_string.split(',')))

# Read data
training_data_file = "data/train.csv"
train = pd.read_csv(training_data_file)
train['Sequence'] = list(map(clean, train['Sequence'].values))

This appears to work, but I feel like the same could be achieved natively using pandas and numpy.

Does anyone have a recommendation?

3 Answers

You can specify a converter for the Sequence column:

converters: dict, default None

Dict of functions for converting values in certain columns. Keys can either be integers or column labels

train = pd.read_csv(training_data_file, converters={'Sequence': clean})
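
For anyone who wants to try the converter approach without the real file, here is a minimal sketch. It assumes a shortened, in-memory copy of the data (the `csv_text` string is a stand-in, not the actual train.csv) and reuses the `clean` function from the question.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for data/train.csv, with shortened rows.
csv_text = '''"Id","Sequence"
3,"1,3,13,87,1053"
7,"1,2,1,5,5,1,11,16"
'''

def clean(seq_string):
    # Split the quoted field on commas and parse each piece as an int.
    return list(map(int, seq_string.split(',')))

# The converter receives each raw cell of the Sequence column as a string.
train = pd.read_csv(io.StringIO(csv_text), converters={'Sequence': clean})

print(train['Id'].dtype)         # an integer dtype
print(train.loc[0, 'Sequence'])  # a Python list of ints
```

The Sequence column ends up with object dtype, since each cell holds a Python list.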

1 Comment

Beautiful. Thought it would be something simple like this. :) Cheers!
This also works, except that each Sequence entry is a list of strings instead of a list of ints:

df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',')

To convert each element to int:

df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))
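
One thing worth checking with this approach is that the very large terms survive. A small sketch, assuming a one-row frame built by hand (the last value is copied from the Id 3 row in the question): because each element goes through Python's built-in int, values beyond the 64-bit range are kept exactly.

```python
import pandas as pd

# Hand-built frame standing in for the result of pd.read_csv.
df = pd.DataFrame({
    'Id': [3],
    'Sequence': ['1,3,13,11474377948948020660089085281068730'],
})

# Split each string on commas, then parse every piece with Python's int.
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))

last = df.loc[0, 'Sequence'][-1]
print(type(last))          # Python int, not a NumPy integer
print(last > 2**63 - 1)    # larger than the int64 maximum
```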

2 Comments

And if I wanted to convert it to a list of int, I could just append .convert_objects(convert_numeric=True), right?
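It seems that command has been deprecated, so you need to loop through the list and convert manually. But that more or less gets back to the original solution.

An alternative solution is to use literal_eval from the ast module. literal_eval does not execute the string; it parses it as a Python literal. A bare comma-separated string such as "1,3,13" parses as a tuple, and a single value such as "5" parses as a plain int, so wrapping the string in square brackets first is a simple way to always get a list back.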

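from ast import literal_eval

def clean(x):
    # Wrap in brackets so "1,3,13" and a lone "5" both parse as a list
    return literal_eval('[' + x + ']')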

train = pd.read_csv(training_data_file, converters={'Sequence': clean})
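
To see what literal_eval actually returns for strings in this format, here is a quick sketch. The helper name `to_list` is made up for illustration; it normalises both cases to a list.

```python
from ast import literal_eval

# A bare comma-separated string parses as a tuple, not a list.
print(type(literal_eval("1,3,13")))

# A single value parses as a plain int.
print(type(literal_eval("5")))

def to_list(x):
    # Hypothetical helper: normalise both cases to a list.
    value = literal_eval(x)
    return list(value) if isinstance(value, tuple) else [value]

print(to_list("1,3,13"))
print(to_list("5"))
```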
