I'm trying to create code that checks if the value in the index column of a CSV is equivalent in different rows, and if so, find the most occurring values in the other columns and use those as the final data. Not a very good explanation, basically I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn that if there are values with the same number of occurrences (Month and B for customer 1004) how can I choose which one I want to be outputted?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print df
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create this (I was ignoring month previously, but now I'd like to incorporate it so that I can learn how to not only find the mode of a column of numbers, but also the most occurring string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it is outputting the .0 at the end of the ABC, it seems to be in the wrong variable format. I want each column to be outputted as just the 3 digit number.
Edit: I'm also having an issue that if the value in column A is 0 then the output becomes 2 digits and does not incorporate the leading 0.