I'm trying to find the three most populated city parts (BYDEL) in 1992.

I have a CSV file that looks like this: http://data.kk.dk/dataset/9070067f-ab57-41cd-913e-bc37bfaf9acd/resource/9fbab4aa-1ee0-4d25-b2b4-b7b63537d2ec/download/befkbhalderkoencivst.csv

The csv file can be explained as:

AAR: Which year the observation was made

BYDEL: Which part of the city, given as an integer with the following mapping: 1=Indre By, 2=Østerbro, 3=Nørrebro, 4=Vesterbro/Kgs. Enghave, 5=Valby, 6=Vanløse, 7=Brønshøj-Husum, 8=Bispebjerg, 9=Amager Øst, 10=Amager Vest, 99=Udenfor inddeling

ALDER: The age of the observed people

PERSONER: Number of observations with the given features of the row

I have a solution, but it is very repetitive. I think it could be done in a smarter way, but I don't have enough experience with Python. Could anyone point me in the right direction?

My code/solution looks like this:

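import collections
import pandas as pd

df = pd.read_csv('befkbh.csv',quotechar='"',skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
data = df.to_numpy()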
Q31 = collections.defaultdict(list)
Q32 = collections.defaultdict(list)
Q33 = collections.defaultdict(list)
Q34 = collections.defaultdict(list)
Q35 = collections.defaultdict(list)
Q36 = collections.defaultdict(list)
Q37 = collections.defaultdict(list)
Q38 = collections.defaultdict(list)
Q39 = collections.defaultdict(list)
Q310 = collections.defaultdict(list)
Q399 = collections.defaultdict(list)
for row in data:
    key = row[0]
    if key == "" or key == 0: continue
    if key == 1992:
        if row[2] == 1:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q31.setdefault(key,[]).append(val)
        if row[2] == 2:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q32.setdefault(key,[]).append(val)
        if row[2] == 3:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q33.setdefault(key,[]).append(val)
        if row[2] == 4:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q34.setdefault(key,[]).append(val)
        if row[2] == 5:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q35.setdefault(key,[]).append(val)
        if row[2] == 6:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q36.setdefault(key,[]).append(val)
        if row[2] == 7:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q37.setdefault(key,[]).append(val)
        if row[2] == 8:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q38.setdefault(key,[]).append(val)
        if row[2] == 9:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q39.setdefault(key,[]).append(val)
        if row[2] == 10:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q310.setdefault(key,[]).append(val)
        if row[2] == 99:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q399.setdefault(key,[]).append(val)

Q312 = {}
for k, v in Q31.items(): Q312[k] = sum(v)
for k, v in Q312.items(): print ("{}:{}".format(k,v))
Q322 = {}
for k, v in Q32.items(): Q322[k] = sum(v)
for k, v in Q322.items(): print ("{}:{}".format(k,v))
Q332 = {}
for k, v in Q33.items(): Q332[k] = sum(v)
for k, v in Q332.items(): print ("{}:{}".format(k,v))
Q342 = {}
for k, v in Q34.items(): Q342[k] = sum(v)
for k, v in Q342.items(): print ("{}:{}".format(k,v))
Q352 = {}
for k, v in Q35.items(): Q352[k] = sum(v)
for k, v in Q352.items(): print ("{}:{}".format(k,v))
Q362 = {}
for k, v in Q36.items(): Q362[k] = sum(v)
for k, v in Q362.items(): print ("{}:{}".format(k,v))
Q372 = {}
for k, v in Q37.items(): Q372[k] = sum(v)
for k, v in Q372.items(): print ("{}:{}".format(k,v))
Q382 = {}
for k, v in Q38.items(): Q382[k] = sum(v)
for k, v in Q382.items(): print ("{}:{}".format(k,v))
Q392 = {}
for k, v in Q39.items(): Q392[k] = sum(v)
for k, v in Q392.items(): print ("{}:{}".format(k,v))
Q3102 = {}
for k, v in Q310.items(): Q3102[k] = sum(v)
for k, v in Q3102.items(): print ("{}:{}".format(k,v))
Q3992 = {}
for k, v in Q399.items(): Q3992[k] = sum(v)
for k, v in Q3992.items(): print ("{}:{}".format(k,v))

2 Answers

It's actually a pretty good sign that you've recognized that there has to be an easier way! Whenever you find yourself violating the DRY principle (Don't Repeat Yourself) you should ask if you've taken a misstep.

While you could remove a lot of your duplication simply by using a dictionary of dictionaries instead of all those named variables, since you're using pandas, I would take advantage of groupby and nlargest instead, which gives me:

In [47]: dg = df.groupby(["AAR", "BYDEL"], as_index=False)["PERSONER"].sum()

In [48]: dg[dg.AAR == 1992].nlargest(3, "PERSONER")
Out[48]: 
    AAR  BYDEL  PERSONER
2  1992      3     67251
1  1992      2     62221
3  1992      4     47854

First, we group on the AAR and BYDEL columns, and in each group, we take the PERSONER values and sum them. This gives us a frame which begins

In [51]: dg.head(15)
Out[51]: 
     AAR  BYDEL  PERSONER
0   1992      1     40595
1   1992      2     62221
2   1992      3     67251
3   1992      4     47854
4   1992      5     43688
5   1992      6     34303
6   1992      7     36746
7   1992      8     41668
8   1992      9     45305
9   1992     10     42748
10  1992     99      2187
11  1993      1     40925
12  1993      2     62583
13  1993      3     67783
14  1993      4     47589

Then we select only the rows where AAR == 1992 and, from those, take the three rows with the largest PERSONER values.
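To make the two steps concrete, here is a minimal, self-contained sketch on a made-up frame. The numbers below are invented for illustration and are not taken from the real CSV; only the column names match the question.

```python
import pandas as pd

# Hypothetical miniature of the CSV: several age rows per (year, city part).
df = pd.DataFrame({
    "AAR":      [1992, 1992, 1992, 1992, 1992, 1992, 1993, 1993],
    "BYDEL":    [1, 1, 2, 2, 3, 3, 1, 2],
    "ALDER":    [0, 1, 0, 1, 0, 1, 0, 0],
    "PERSONER": [10, 20, 50, 5, 30, 30, 99, 1],
})

# Step 1: collapse to one row per (year, city part) with the summed count.
dg = df.groupby(["AAR", "BYDEL"], as_index=False)["PERSONER"].sum()

# Step 2: keep a single year, then take the rows with the largest totals.
top = dg[dg["AAR"] == 1992].nlargest(2, "PERSONER")
print(top)
```

With the real data you would ask for 3 rows instead of 2; the toy frame only has three city parts in 1992, so 2 keeps the example meaningful.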

I'd strongly recommend reading through a pandas tutorial if you're going to be doing this type of data processing, otherwise you'll find yourself reinventing wheels.

2 Comments

Exactly the kind of heads-up I was looking for. Thanks a ton, pandas will save me a lot of time and energy in the future :)
I'm trying to get the top 3 for every year, but pd.unique() only gives me an array of the distinct years. What would be the smartest pandas way to do this? Thanks in advance!
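For the per-year follow-up in the comment above, one possible pandas sketch is to sort the grouped totals and then keep the first few rows of each year with groupby(...).head(). This is not part of the original answer, and the frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical totals: three city parts in each of two years.
df = pd.DataFrame({
    "AAR":      [1992, 1992, 1992, 1993, 1993, 1993],
    "BYDEL":    [1, 2, 3, 1, 2, 3],
    "PERSONER": [40, 62, 67, 70, 10, 30],
})

dg = df.groupby(["AAR", "BYDEL"], as_index=False)["PERSONER"].sum()

# Sort so the largest totals come first within each year,
# then keep the first n rows of every year group.
top_per_year = (
    dg.sort_values(["AAR", "PERSONER"], ascending=[True, False])
      .groupby("AAR")
      .head(2)
)
print(top_per_year)
```

On the real data, head(3) would give the top three city parts for every year in a single frame, without looping over the years.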
A more pythonic solution would use a dictionary instead of many (most) of your named variables. You are also using setdefault with defaultdict instances - either one is a good choice, but using both is unnecessary.
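As a small illustration of why combining the two is redundant, the sketch below builds the same grouping once with a plain dict and setdefault, and once with a defaultdict. The rows are made-up (year, count) pairs, not values from the CSV.

```python
import collections

# Hypothetical (year, count) pairs.
rows = [(1992, 10.0), (1992, 5.0), (1993, 7.0)]

# Option A: plain dict, setdefault creates the list on first use.
plain = {}
for year, count in rows:
    plain.setdefault(year, []).append(count)

# Option B: defaultdict(list) creates the list automatically on lookup.
auto = collections.defaultdict(list)
for year, count in rows:
    auto[year].append(count)

# Both approaches end up with the same contents.
print(plain == dict(auto))
```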

My alternative version (without using pandas, since @DSM covered that well):

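import collections
import pandas as pd

df = pd.read_csv('befkbh.csv',quotechar='"',skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
data = df.to_numpy()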
areas = { k : collections.defaultdict(list) for k in range(1,11) }
areas[99] = collections.defaultdict(list)

for row in data:
    key = row[0]
    if key == 1992 and row[1] in areas:
        areas[row[1]][key].append(0 if row[5] == "" else float(row[5]))

for area in sorted(areas):
    for k, v in areas[area].items():
        print ("{}:{}".format(k, sum(v)))

I'm assuming that row[2] in the question should have been row[1], since BYDEL is the second column, not the third.

To get the top 3 areas by year, I'd organize things a little differently, with the outer dict being the year, not the area.

That version looks like this:

years = collections.defaultdict(lambda : collections.defaultdict(list))

for row in data:
    years[row[0]][row[1]].append(0 if(row[5]) ==""  else float(row[5]))

for year in sorted(years):
    for n, area in sorted((sum(v), k) for k, v in years[year].items())[:-4:-1]:
        print ("{} {:4} {:9}".format(year, area, n))
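The slice [:-4:-1] in the loop above walks the sorted list backwards from the end and stops after three items, so it yields the three largest (total, area) pairs in descending order. The sketch below shows that on five of the 1992 totals from the other answer's output, next to heapq.nlargest from the standard library, which is an alternative I am swapping in for comparison rather than something the code above uses.

```python
import heapq

# 1992 totals for city parts 1-5, copied from the groupby output shown earlier.
totals = {1: 40595.0, 2: 62221.0, 3: 67251.0, 4: 47854.0, 5: 43688.0}

# Sort ascending by total, then take the last three items in reverse order.
pairs = sorted((n, area) for area, n in totals.items())
top3_slice = pairs[:-4:-1]

# Alternative: let heapq pick the three largest pairs directly.
top3_heap = heapq.nlargest(3, ((n, area) for area, n in totals.items()))

print(top3_slice)
print(top3_heap)
```

Both lists come out largest-first, so either form can feed the print loop unchanged.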

1 Comment

Done. Thanks for the push.
