3

I am currently doing a project with baby name data. I am looking at the most popular male and female baby names in each decade starting with the 1950s. I am trying to create a function that will print out the top name for the data set that I input.

So far I have successfully created two datasets for each decade (one male and the other female)

This is the code that I have for the function but I can't seem to figure out how to make it work...

def getTopName(data):
    (data
        .drop(columns =['sex', 'prop'])
        .pivot(index = 'name', columns = 'year', values = 'n')
        .sum(axis=1) = data['decade']
        .sort_values(by = 'decade', ascending = False))
    print data[0:1]

Any suggestions on how to accomplish this?

My data looks like this: my data

Its currently in longform. Can i create a middle function that converts it to wide form and builds a new column where the totals from each year (1960, 1961, ... 1969) can be added together?

3
  • 1
    Structure of your dataset? Any sample data? Commented Nov 11, 2018 at 19:47
  • 1
    All the lines of a function have to be indented. Commented Nov 11, 2018 at 19:47
  • The data is 5 columns (name, sex, year, number, and proportion).. there are over a million rows, that's why I want to convert it to a wide data frame Commented Nov 11, 2018 at 20:03

1 Answer 1

1

Question 1 - Name with highest n per year

df.groupby(by='name', as_index=False)
      .count()
      .nlargest(1, 'number')
      .iloc[0]["name"]

Sample data

Question 2 - Transform data to wideform

Sample data on which this was tested

Pivot in pandas does not do aggregations. So I split up the steps in getting totals per year and totals per decade. Finally, I join those two to get the desired result:

import pandas as pd

df = pd.read_csv('set2.csv')

# add decade column
df["decade"] = df["year"] - (df["year"] % 10)

# add decade_title column to prevent join clashes
df["decade_total"] = df["decade"]
                       .apply(lambda decade_num: f"{str(decade_num)}_total")

# first pivot with n per year
per_year_df = df.pivot(index="name", columns="year", values="n")

# pivot cannot aggregate so we first aggregate and then pivot
per_decade_df = df\
    .groupby(by=["decade_total", "name"], as_index=False)\
    .agg({"n": 'sum'})\
    .pivot(index="name", columns="decade_total", values="n")

# finally we join the decade totals to the yearly counts
joined_df = per_year_df.join(per_decade_df)

Joined result table

Sign up to request clarification or add additional context in comments.

8 Comments

But if I do this I'll have do it for each decade still. I'm trying to create one general way so that when I specifically use the 1950s data set I can just run the method to get the top name.
Ah, I misread the apart about the datasets being split already per decade and gender. Will update my answer in a few minutes.
Can get rid of sorting here by using df.groupby('name', as_index=False).nlargest(1, 'number') which is functionally equivalent but without sorting...
@JonClements: Thanks! I have updated the answer accordingly.
I think I was unclear. I have my data all sorted by decade but I want a method that further sorts it (by creating a new column with the totals from each individual year.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.