Creating a Function with multiple operations in Python

Question

I am currently doing a project with baby name data. I am looking at the most popular male and female baby names in each decade starting with the 1950s. I am trying to create a function that will print out the top name for the data set that I input.

So far I have successfully created two datasets for each decade (one male and the other female).

This is the code that I have for the function but I can't seem to figure out how to make it work...

def getTopName(data):
    (data
        .drop(columns =['sex', 'prop'])
        .pivot(index = 'name', columns = 'year', values = 'n')
        .sum(axis=1) = data['decade']
        .sort_values(by = 'decade', ascending = False))
    print data[0:1]

Any suggestions on how to accomplish this?

My data looks like this:

It's currently in long form. Can I create an intermediate function that converts it to wide form and builds a new column where the totals from each year (1960, 1961, ... 1969) are added together?

The data has 5 columns (name, sex, year, number, and proportion). There are over a million rows, which is why I want to convert it to a wide data frame.
– D45, Commented Nov 11, 2018 at 20:03

benvdh · Accepted Answer · 2018-11-12 21:52:34Z


Question 1 - Name with highest n per year

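# sum the counts per name, then keep the name with the largest total
# (the chain is wrapped in parentheses so it can span several lines)
(df.groupby(by='name', as_index=False)['number']
   .sum()
   .nlargest(1, 'number')
   .iloc[0]["name"])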
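Wrapped into a reusable function, the same idea might look like the sketch below. It assumes each decade's data set is a long-form DataFrame with name and number columns (pass count_col="n" if that is what your column is called). The function name get_top_name and the sample rows are made up for illustration.

```python
import pandas as pd


def get_top_name(data, count_col="number"):
    """Return the name with the largest summed count in a long-form frame."""
    # total per name across all years in the data set
    totals = data.groupby("name")[count_col].sum()
    # idxmax gives the index label (the name) of the largest total
    return totals.idxmax()


# Hypothetical rows standing in for one decade of one sex
sample = pd.DataFrame({
    "name": ["Mary", "Linda", "Mary", "Linda", "Susan"],
    "year": [1950, 1950, 1951, 1951, 1951],
    "number": [65000, 80000, 66000, 73000, 50000],
})

print(get_top_name(sample))
```

Calling the function once per decade data set then gives the top name for each of them without repeating the chain.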

Sample data

Question 2 - Transform data to wideform

Sample data on which this was tested

DataFrame.pivot in pandas does not aggregate, so I split the work into two steps: totals per year and totals per decade. Finally, I join the two to get the desired result:

import pandas as pd

df = pd.read_csv('set2.csv')

# add decade column
df["decade"] = df["year"] - (df["year"] % 10)

# add decade_total column (e.g. "1960_total") to prevent join clashes
df["decade_total"] = df["decade"]\
    .apply(lambda decade_num: f"{decade_num}_total")

# first pivot with n per year
per_year_df = df.pivot(index="name", columns="year", values="n")

# pivot cannot aggregate so we first aggregate and then pivot
per_decade_df = df\
    .groupby(by=["decade_total", "name"], as_index=False)\
    .agg({"n": 'sum'})\
    .pivot(index="name", columns="decade_total", values="n")

# finally we join the decade totals to the yearly counts
joined_df = per_year_df.join(per_decade_df)
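If the data set already covers a single decade and one extra total column is all that is needed, pivot_table can aggregate while pivoting, and the total is then a row-wise sum. This is a minimal sketch rather than the approach above, assuming columns named name, year and n. The helper to_wide_with_total and the sample rows are hypothetical.

```python
import pandas as pd


def to_wide_with_total(data):
    """Pivot long-form counts to one column per year plus a 'total' column."""
    # pivot_table (unlike pivot) sums duplicate name/year pairs
    wide = data.pivot_table(index="name", columns="year", values="n",
                            aggfunc="sum", fill_value=0)
    # row-wise sum over the year columns gives the total for the period
    wide["total"] = wide.sum(axis=1)
    return wide.sort_values("total", ascending=False)


# Hypothetical rows standing in for part of one decade
sample = pd.DataFrame({
    "name": ["James", "John", "James", "John"],
    "year": [1960, 1960, 1961, 1961],
    "n": [100, 90, 80, 95],
})

wide = to_wide_with_total(sample)
print(wide.index[0])  # name with the largest total
```

Because the result is sorted by the total, the first index label is the top name for the data set.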

edited Nov 12, 2018 at 21:52

answered Nov 11, 2018 at 20:14


8 Comments

D45 Over a year ago

But if I do this I'll still have to do it for each decade. I'm trying to create one general way so that when I specifically use the 1950s data set I can just run the method to get the top name.

benvdh Over a year ago

Ah, I misread the part about the datasets already being split per decade and gender. Will update my answer in a few minutes.

Jon Clements Over a year ago

Can get rid of sorting here by using df.groupby('name', as_index=False).nlargest(1, 'number') which is functionally equivalent but without sorting...

benvdh Over a year ago

@JonClements: Thanks! I have updated the answer accordingly.

D45 Over a year ago

I think I was unclear. I have my data all sorted by decade, but I want a method that further sorts it (by creating a new column with the totals from each individual year).
