Python analyzing csv file

Question

I'm trying to find the three most populated city parts (BYDEL) in 1992.

I have a csv file looking like this: http://data.kk.dk/dataset/9070067f-ab57-41cd-913e-bc37bfaf9acd/resource/9fbab4aa-1ee0-4d25-b2b4-b7b63537d2ec/download/befkbhalderkoencivst.csv

The csv file can be explained as:

AAR: Which year the observation was made

BYDEL: Which part of the city, identified by an integer according to the following mapping: 1=Indre By, 2=Østerbro, 3=Nørrebro, 4=Vesterbro/Kgs. Enghave, 5=Valby, 6=Vanløse, 7=Brønshøj-Husum, 8=Bispebjerg, 9=Amager Øst, 10=Amager Vest, 99=Udenfor inddeling

ALDER: The age of the observed people

PERSONER: Number of observations with the given features of the row

I have a solution, but it is very repetitive, and I think it could be done in a smarter way. I don't have enough experience with Python to see how. Could anyone point me in the right direction?

My code/solution looks like this:

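import collections
import pandas as pd

df = pd.read_csv('befkbh.csv',quotechar='"',skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
# as_matrix() was removed in pandas 1.0; .values returns the same array
data = df.values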
Q31 = collections.defaultdict(list)
Q32 = collections.defaultdict(list)
Q33 = collections.defaultdict(list)
Q34 = collections.defaultdict(list)
Q35 = collections.defaultdict(list)
Q36 = collections.defaultdict(list)
Q37 = collections.defaultdict(list)
Q38 = collections.defaultdict(list)
Q39 = collections.defaultdict(list)
Q310 = collections.defaultdict(list)
Q399 = collections.defaultdict(list)
for row in data:
    key = row[0]
    if key == "" or key == 0: continue
    if key == 1992:
        if row[2] == 1:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q31.setdefault(key,[]).append(val)
        if row[2] == 2:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q32.setdefault(key,[]).append(val)
        if row[2] == 3:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q33.setdefault(key,[]).append(val)
        if row[2] == 4:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q34.setdefault(key,[]).append(val)
        if row[2] == 5:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q35.setdefault(key,[]).append(val)
        if row[2] == 6:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q36.setdefault(key,[]).append(val)
        if row[2] == 7:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q37.setdefault(key,[]).append(val)
        if row[2] == 8:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q38.setdefault(key,[]).append(val)
        if row[2] == 9:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q39.setdefault(key,[]).append(val)
        if row[2] == 10:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q310.setdefault(key,[]).append(val)
        if row[2] == 99:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q399.setdefault(key,[]).append(val)

Q312 = {}
for k, v in Q31.items(): Q312[k] = sum(v)
for k, v in Q312.items(): print ("{}:{}".format(k,v))
Q322 = {}
for k, v in Q32.items(): Q322[k] = sum(v)
for k, v in Q322.items(): print ("{}:{}".format(k,v))
Q332 = {}
for k, v in Q33.items(): Q332[k] = sum(v)
for k, v in Q332.items(): print ("{}:{}".format(k,v))
Q342 = {}
for k, v in Q34.items(): Q342[k] = sum(v)
for k, v in Q342.items(): print ("{}:{}".format(k,v))
Q352 = {}
for k, v in Q35.items(): Q352[k] = sum(v)
for k, v in Q352.items(): print ("{}:{}".format(k,v))
Q362 = {}
for k, v in Q36.items(): Q362[k] = sum(v)
for k, v in Q362.items(): print ("{}:{}".format(k,v))
Q372 = {}
for k, v in Q37.items(): Q372[k] = sum(v)
for k, v in Q372.items(): print ("{}:{}".format(k,v))
Q382 = {}
for k, v in Q38.items(): Q382[k] = sum(v)
for k, v in Q382.items(): print ("{}:{}".format(k,v))
Q392 = {}
for k, v in Q39.items(): Q392[k] = sum(v)
for k, v in Q392.items(): print ("{}:{}".format(k,v))
Q3102 = {}
for k, v in Q310.items(): Q3102[k] = sum(v)
for k, v in Q3102.items(): print ("{}:{}".format(k,v))
Q3992 = {}
for k, v in Q399.items(): Q3992[k] = sum(v)
for k, v in Q3992.items(): print ("{}:{}".format(k,v))

DSM · Accepted Answer · 2017-04-01 01:36:39Z

It's actually a pretty good sign that you've recognized that there has to be an easier way! Whenever you find yourself violating the DRY principle (Don't Repeat Yourself) you should ask if you've taken a misstep.

While you could remove a lot of your duplication simply by using a dictionary of dictionaries instead of all those named variables, since you're using pandas, I would take advantage of groupby and nlargest instead, which gives me:

In [47]: dg = df.groupby(["AAR", "BYDEL"], as_index=False)["PERSONER"].sum()

In [48]: dg[dg.AAR == 1992].nlargest(3, "PERSONER")
Out[48]: 
    AAR  BYDEL  PERSONER
2  1992      3     67251
1  1992      2     62221
3  1992      4     47854

First, we group on the AAR and BYDEL columns, and in each group, we take the PERSONER values and sum them. This gives us a frame which begins

In [51]: dg.head(15)
Out[51]: 
     AAR  BYDEL  PERSONER
0   1992      1     40595
1   1992      2     62221
2   1992      3     67251
3   1992      4     47854
4   1992      5     43688
5   1992      6     34303
6   1992      7     36746
7   1992      8     41668
8   1992      9     45305
9   1992     10     42748
10  1992     99      2187
11  1993      1     40925
12  1993      2     62583
13  1993      3     67783
14  1993      4     47589

Then we select the rows where AAR == 1992 and, from those, take the three rows with the largest PERSONER values.
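As a rough, self-contained sketch of the steps above, the snippet below builds a tiny frame with made-up numbers (not taken from the real dataset), applies the same groupby and nlargest calls, and then shows one possible way to get the top three for every year at once. The variable names top_1992 and top_by_year are only illustrative.

```python
import pandas as pd

# Synthetic rows with made-up numbers; only the column names match the real file.
df = pd.DataFrame({
    "AAR":      [1992, 1992, 1992, 1992, 1992, 1993, 1993, 1993, 1993, 1993],
    "BYDEL":    [1, 1, 2, 3, 4, 1, 2, 3, 4, 4],
    "PERSONER": [10, 5, 30, 20, 8, 7, 9, 40, 25, 1],
})

# Sum PERSONER within each (AAR, BYDEL) pair.
dg = df.groupby(["AAR", "BYDEL"], as_index=False)["PERSONER"].sum()

# Top three districts for a single year.
top_1992 = dg[dg["AAR"] == 1992].nlargest(3, "PERSONER")

# Top three districts for every year: sort by year and descending total,
# then keep the first three rows of each AAR group.
top_by_year = (
    dg.sort_values(["AAR", "PERSONER"], ascending=[True, False])
      .groupby("AAR")
      .head(3)
)

print(top_1992)
print(top_by_year)
```

Sorting once and calling head(3) per group avoids looping over the distinct years by hand.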

I'd strongly recommend reading through a pandas tutorial if you're going to be doing this type of data processing, otherwise you'll find yourself reinventing wheels.

2 Comments

Rainoa Over a year ago

Exactly the kind of heads up I was looking for. Thanks a ton, pandas will save me a lot of time and energy in the future :)

Rainoa Over a year ago

I'm trying to get the top 3 for each of the different years, but pd.unique() only gives me an array of all the distinct years. What would be the smartest pandas way around this? Thanks in advance!

cco · Accepted Answer · 2017-04-01 22:40:02Z

A more pythonic solution would use a dictionary instead of many (most) of your named variables. You are also using setdefault with defaultdict instances - either one is a good choice, but using both is unnecessary.
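To illustrate the point about setdefault and defaultdict, here is a minimal sketch with hypothetical (year, count) pairs. It shows that the two approaches build the same mapping, so combining them adds nothing.

```python
import collections

# Hypothetical (year, count) pairs, not taken from the real dataset.
rows = [(1992, 10.0), (1992, 5.0), (1993, 7.0)]

# Option 1: a plain dict plus setdefault.
plain = {}
for year, count in rows:
    plain.setdefault(year, []).append(count)

# Option 2: a defaultdict, which creates the empty list on first access.
auto = collections.defaultdict(list)
for year, count in rows:
    auto[year].append(count)

# Both approaches end up with the same contents.
print(plain == dict(auto))
```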

My alternative version (without using pandas, since @DSM covered that well):

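import collections
import pandas as pd

df = pd.read_csv('befkbh.csv',quotechar='"',skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
# as_matrix() was removed in pandas 1.0; .values returns the same array
data = df.values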
areas = { k : collections.defaultdict(list) for k in range(1,11) }
areas[99] = collections.defaultdict(list)

for row in data:
    key = row[0]
    if key == 1992 and row[1] in areas:
        areas[row[1]][key].append(0 if(row[5]) ==""  else float(row[5]))

for area in sorted(areas):
    for k, v in areas[area].items():
        print ("{}:{}".format(k, sum(v)))

I'm assuming that row[2] in the question should have been row[1], since BYDEL is the second column, not the third.

To get the top 3 areas by year, I'd organize things a little differently, with the outer dict being the year, not the area.

That version looks like this:

years = collections.defaultdict(lambda : collections.defaultdict(list))

for row in data:
    years[row[0]][row[1]].append(0 if(row[5]) ==""  else float(row[5]))

for year in sorted(years):
    for n, area in sorted((sum(v), k) for k, v in years[year].items())[:-4:-1]:
        print ("{} {:4} {:9}".format(year, area, n))
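The slice [:-4:-1] above takes the last three items of the ascending sort in reverse order, that is, the three largest sums. As an alternative sketch of the same idea, the version below keeps a running total per area instead of a list and uses heapq.nlargest from the standard library. The records list holds made-up numbers and stands in for the (year, area, persons) values read from the file.

```python
import collections
import heapq

# Hypothetical (year, area, persons) records with made-up numbers.
records = [
    (1992, 1, 15.0), (1992, 2, 30.0), (1992, 3, 20.0), (1992, 4, 8.0),
    (1993, 1, 7.0), (1993, 2, 9.0), (1993, 3, 40.0), (1993, 4, 26.0),
    (1992, 4, 1.0),
]

# totals[year][area] is a running sum, so no intermediate lists are kept.
totals = collections.defaultdict(lambda: collections.defaultdict(float))
for year, area, persons in records:
    totals[year][area] += persons

# For each year, pick the three (area, total) pairs with the largest total.
top3 = {}
for year in sorted(totals):
    top3[year] = heapq.nlargest(3, totals[year].items(), key=lambda kv: kv[1])
    for area, n in top3[year]:
        print("{} {:4} {:9}".format(year, area, n))
```

Using nlargest states the intent directly and avoids sorting every area when only three are needed.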
