3

I am trying to make a dictionary from a CSV file in Python. Let's say the CSV contains:

Student   food      amount
John      apple       15
John      banana      20
John      orange      1
John      grape       3
Ben       apple       2
Ben       orange      4
Ben       strawberry  8
Andrew    apple       10
Andrew    watermelon  3

What I'm envisioning is a dictionary whose key is the student name and whose value is a list in which each entry corresponds to a different food. I would have to count the number of unique food items in the second column, and that would be the length of the vector. For example:

The value of [15,20,1,3,0,0] would correspond to [apple, banana, orange, grape, strawberry, watermelon] for 'John'.
The value of [2,0,4,0,8,0] would correspond to [apple, banana, orange, grape, strawberry, watermelon] for 'Ben'.
The value of [10,0,0,0,0,3] would correspond to [apple, banana, orange, grape, strawberry, watermelon] for 'Andrew'.

The expected output of the dict would look like this:

data = {'John': [15, 20, 1, 3, 0, 0], 'Ben': [2, 0, 4, 0, 8, 0], 'Andrew': [10, 0, 0, 0, 0, 3]}

I'm having trouble creating the dictionary to begin with, and I'm not sure whether a dictionary is even the right approach. Here is what I have so far:

import csv
data_file=open('data.csv','rU')
reader=csv.DictReader(data_file)
data={}
for row in reader:
    data[row['Student']]=row
data_file.close()

Thanks for taking the time to read. Any help would be greatly appreciated.

4 Answers

3

Here is a version using a regular dictionary. A defaultdict is definitely better, though.

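import csv

data = {}
# newline='' is the mode the csv module recommends for files it reads
with open('data.csv', newline='') as data_file:
    reader = csv.DictReader(data_file)
    for row in reader:
        # Append to the student's list, creating it on first sight
        if row['Student'] in data:
            data[row['Student']].append(row['amount'])
        else:
            data[row['Student']] = [row['amount']]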

EDIT:

For matching indices:

import csv
from collections import defaultdict

# One zero-filled slot per known fruit
data=defaultdict(lambda:[0,0,0,0])
# Fruits missing from this mapping look up as None and are skipped
fruit_to_index = defaultdict(lambda:None,{'apple':0,'banana':1,'orange':2,'grape':3})
with open('data.csv', newline='') as data_file:
    reader=csv.DictReader(data_file)
    for row in reader:
        if fruit_to_index[row['food']] is not None:
            data[row['Student']][fruit_to_index[row['food']]] = int(row['amount'])

Printing data would give:

defaultdict(<function <lambda> at address>, 
{'John':  [15, 20, 1, 3], 
'Ben':    [2 , 0 , 4, 0],
'Andrew': [10, 0 , 0, 0]})

I think this is what you want.

EDIT2: I wrote this before the list of fruits included strawberry and watermelon, but they should be very easy to add. If the list is too large to hardcode, you can generate the fruit-to-index mapping from the file itself:

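# First pass over the file: collect every distinct food
set_of_fruits = set()
for row in reader:
    set_of_fruits.add(row['food'])
# Assign each food an index (every food in the file gets an entry)
fruit_to_index = {}
for c, e in enumerate(set_of_fruits):
    fruit_to_index[e] = c
# This pass exhausts reader, so rewind the file (data_file.seek(0)) and
# create a new DictReader before the loop that fills data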

Note that the order of set_of_fruits is not guaranteed. Sort it (for example with sorted(set_of_fruits)) if you need the same index order on every run.

data = defaultdict(lambda:[0,0,0,0]) becomes

data = defaultdict(lambda: [0] * len(set_of_fruits))
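
Putting the pieces of this answer together, here is a minimal sketch of the two-pass approach. It assumes the file is comma-separated with the header Student,food,amount and that Python 3.7+ is used (so dicts keep insertion order). The inline SAMPLE text, the io.StringIO stand-in for the file, and the names rows, foods, and vectors are illustrative assumptions, not part of the original code.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for data.csv (assumed to be comma-separated).
SAMPLE = """Student,food,amount
John,apple,15
John,banana,20
John,orange,1
John,grape,3
Ben,apple,2
Ben,orange,4
Ben,strawberry,8
Andrew,apple,10
Andrew,watermelon,3
"""

# Read every row once so the data can be traversed twice.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Pass 1: collect the unique foods in first-appearance order
# (dict.fromkeys drops duplicates while keeping insertion order).
foods = list(dict.fromkeys(row['food'] for row in rows))
fruit_to_index = {food: i for i, food in enumerate(foods)}

# Pass 2: fill one zero-initialised vector per student.
vectors = defaultdict(lambda: [0] * len(foods))
for row in rows:
    vectors[row['Student']][fruit_to_index[row['food']]] = int(row['amount'])

print(foods)
print(dict(vectors))
```

With the sample data above, foods comes out as apple, banana, orange, grape, strawberry, watermelon, so John's vector is [15, 20, 1, 3, 0, 0], which matches the layout asked for in the question. For a real file, replace io.StringIO(SAMPLE) with the opened file object.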


3 Comments

Thanks. However, this only appends to the list and does not match indices to the food names. For example, since Ben did not eat a banana, that amount should be populated with a 0.
I want to avoid hardcoding the index of each fruit because, unfortunately, there are ~200 unique fruits in my CSV file.
Read edit 2. You can just use row['food'] to generate the list of fruits.
1

Try this; I think this is what you want. Notice the use of defaultdict: it could be done with a regular dictionary, but defaultdict is very handy in such cases:

import csv
from collections import defaultdict
data=defaultdict(list)
# Text mode with newline='' is what the csv module expects in Python 3
with open('data.csv', newline='') as data_file:
    reader=csv.DictReader(data_file)
    for row in reader:
        data[row['Student']].append(row['amount'])

2 Comments

Thanks. This only adds to the list but does not match indices to the food names.
That's because you were not very precise describing your problem. Please correct the example expected output.
0

You probably actually want a nested dictionary structure; keeping a list and then trying to match indices to food names will get hairy fast.

import csv
from collections import defaultdict
data = defaultdict(dict)
with open('data.csv', newline='') as data_file:
    reader = csv.DictReader(data_file)
    for row in reader:
        # csv yields strings, so convert the amount to a number
        data[row['Student']][row['food']] = int(row['amount'])

This will give you a structure like so:

{'John': {'apple': 15, 'banana': 20, 'orange': 1}, 
 'Ben': {'apple': 2, 'orange': 4}, #etc.
}

That lets you look up particular foods without having to try to cross-reference another list to figure out where to find the counts, and supports any number of food items without having to fill your lists with zeros for all the missing ones.

If you want to be extra fancy, you can use a nested defaultdict, so that looking up foods that were never entered returns zero automatically instead of raising a KeyError; just change the line that creates data to:

data = defaultdict(lambda: defaultdict(int))
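
As a rough illustration of the nested defaultdict, the sketch below fills it from a few hand-written (student, food, amount) tuples instead of a CSV file. The tuples and the name records are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records standing in for the parsed CSV rows.
records = [
    ('John', 'apple', 15),
    ('John', 'banana', 20),
    ('Ben', 'apple', 2),
    ('Ben', 'orange', 4),
]

data = defaultdict(lambda: defaultdict(int))
for student, food, amount in records:
    data[student][food] = int(amount)

print(data['John']['apple'])   # a recorded amount
print(data['Ben']['banana'])   # never recorded, so int() supplies 0
```

One caveat: looking up a missing key on a defaultdict also inserts it, so data['Ben'] gains a 'banana' entry after the lookup above. Use data['Ben'].get('banana', 0) if you want to read without inserting.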

1 Comment

Thanks. I guess I should mention the end goal: I'm trying to compute the cosine similarity between the amount vectors of various students. All I need is for the indices to match the same food names for every student, with a 0 wherever a student doesn't have that food.
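
For the cosine-similarity goal described in this comment, aligned lists are not strictly required: the similarity can be computed from per-student food dictionaries directly, treating any missing food as 0. The following is a minimal sketch under that assumption. The function name cosine_similarity and the sample dictionaries are illustrative and do not come from the thread.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two {food: amount} dicts; missing foods count as 0."""
    dot = sum(amount * b.get(food, 0) for food, amount in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

john = {'apple': 15, 'banana': 20, 'orange': 1, 'grape': 3}
ben = {'apple': 2, 'orange': 4, 'strawberry': 8}

print(cosine_similarity(john, ben))
```

Only foods present in both dictionaries contribute to the dot product, which is the same result you would get from zero-padded vectors with matching indices.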
0

Use the setdefault method of the dict.

import csv

data = {}
with open('data.csv', newline='') as data_file:
    reader = csv.DictReader(data_file)
    for row in reader:
        data.setdefault(row['Student'], []).append(row['amount'])

If the key, e.g. "John", doesn't exist, setdefault creates it with the supplied default value. In this case, the default is an empty list.
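
To make the behaviour of setdefault concrete, here is a small sketch that uses hand-written pairs rather than the CSV file. The pairs list is an assumption for illustration.

```python
# Hypothetical (student, amount) pairs standing in for the CSV rows.
pairs = [('John', '15'), ('John', '20'), ('Ben', '2')]

data = {}
for student, amount in pairs:
    # Returns the existing list for this student, or stores and returns a new [].
    data.setdefault(student, []).append(amount)

print(data)  # {'John': ['15', '20'], 'Ben': ['2']}
```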
