I'm trying to write code that checks whether several rows of a CSV share the same value in the index column and, if they do, finds the most frequently occurring value in each of the other columns and uses those as the final data. To put it more concretely, I want to take this data.csv:

customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1

And create a new answer.csv that collapses multiple rows for the same customer into a single row, using the value that occurs most often in each column:

customer_ID,month,ABC
1003,Jan,114
1004,Feb,251

I'd also like to learn how to choose which value is output when several values occur the same number of times (month and B for customer 1004).

I've currently written (thanks to Andy Hayden on a previous question I just asked):

import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print(df)
res.to_frame(name='answer').to_csv('answer.csv')

All this does, however, is create the output below. (I was ignoring month previously, but now I'd like to include it so that I can learn how to find not only the mode of a column of numbers, but also the most frequently occurring string.)

customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0

Note: I don't know why it appends .0 to the ABC values; the data seems to be converted to the wrong type. I want each value to be written as just the three-digit number.

Edit: I also have a problem when the value in column A is 0: the output becomes two digits and drops the leading 0.

  • Why do you need such specific formatting? What is the end goal? Commented Mar 2, 2014 at 2:54
  • The end goal is a submissions file for a contest. The format they want is customer_ID,ABC. They only want one row for each customer_ID so I was wondering if there was a way to combine multiple rows with the same customer_ID and use the most occurring data in those rows as the final single row output for that customer Commented Mar 2, 2014 at 2:58
  • What do you mean by "most occurring data"? Commented Mar 2, 2014 at 3:10
  • The final outcome I want is the 2nd code block in my question. By most occurring data I mean I want it to recognize "customer_ID 1003 is on 3 rows. For month, the data is Jan,Jul,Jan." It recognizes that Jan occurred twice and Jul occurred once so it outputs 1003,Jan. Commented Mar 2, 2014 at 3:13
  • I think the csv bit here is noise, you should really try and ask what you want to do with the pandas DataFrame! Commented Mar 2, 2014 at 5:59

1 Answer

What about something like this? It doesn't use pandas, though; I am not a pandas expert.

from collections import Counter

dataDict = {}

# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:

    # Skip the header row so it is not counted as a customer
    next(dataFile, None)

    for line in dataFile:

        # Strip the trailing newline, then split on ',' since it is a csv file
        entry = line.strip().split(',')

        # Check to make sure that there is data in the line
        if entry and len(entry[0]) > 0:

            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month': [entry[1]],
                                      'time': [entry[2]],
                                      # A, B and C are joined, so the mode is
                                      # taken over the combined string
                                      'ABC': [''.join(entry[3:])],
                                      }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))


# Now write the output file
with open('out.csv', 'w') as f:

    # Write the header row first
    f.write('customer_ID,month,time,ABC\n')

    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):

        # use Counter to find the most common entries; on Python 3.7+,
        # equal counts are listed in the order they were first seen
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]

        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC]) + '\n')

It generates a file called out.csv that looks like this:

customer_ID,month,time,ABC
1003,Jan,2:00,114
1004,Feb,8:00,251
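
If the mode should be taken per column (A, B and C separately) rather than over the combined ABC string, and you want to control which value wins a tie, a standard-library sketch along these lines might work. The names pick_mode, summarize and prefer_last are made up for illustration, and the sample rows are inlined from the question so that the snippet runs on its own. Every cell stays a string, so a leading 0 in column A should survive and no .0 is appended.

```python
import csv
import io
from collections import Counter, defaultdict

# Sample rows copied from the question; for a real file, pass
# open('data.csv', newline='') to summarize() instead of io.StringIO(SAMPLE).
SAMPLE = """customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
"""


def pick_mode(values, prefer_last=False):
    """Hypothetical helper: most frequent value, ties broken by position."""
    counts = Counter(values)
    top = max(counts.values())
    # Keep only the values that reach the highest count, in row order
    tied = [v for v in values if counts[v] == top]
    return tied[-1] if prefer_last else tied[0]


def summarize(source, prefer_last=False):
    """Return one (customer_ID, month, ABC) tuple per customer."""
    # columns[customer_ID][column_name] -> list of values in row order
    columns = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(source):
        for name in ('month', 'A', 'B', 'C'):
            columns[row['customer_ID']][name].append(row[name])

    result = []
    for customer in sorted(columns):
        cols = columns[customer]
        # Take the mode of A, B and C independently, then concatenate
        abc = ''.join(pick_mode(cols[n], prefer_last) for n in 'ABC')
        result.append((customer, pick_mode(cols['month'], prefer_last), abc))
    return result


rows = summarize(io.StringIO(SAMPLE))
for row in rows:
    print(','.join(row))
```

With the default settings the first value seen wins a tie, so customer 1004 comes out as 1004,Feb,251. Calling summarize(io.StringIO(SAMPLE), prefer_last=True) makes the later value win instead, which turns that row into 1004,Jul,241.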