
I am looking for a faster way to load data from my JSON object into a MultiIndex DataFrame.

My JSON is like:

    {
        "1990-1991": {
            "Cleveland": {
                "salary": "$14,403,000",
                "players": {
                    "Hot Rod Williams": "$3,785,000",
                    "Danny Ferry": "$2,640,000",
                    "Mark Price": "$1,400,000",
                    "Brad Daugherty": "$1,320,000",
                    "Larry Nance": "$1,260,000",
                    "Chucky Brown": "$630,000",
                    "Steve Kerr": "$548,000",
                    "Derrick Chievous": "$525,000",
                    "Winston Bennett": "$525,000",
                    "John Morton": "$350,000",
                    "Milos Babic": "$200,000",
                    "Gerald Paddio": "$120,000",
                    "Darnell Valentine": "$100,000",
                    "Henry James": "$75,000"
                },
                "url": "https://hoopshype.com/salaries/cleveland_cavaliers/1990-1991/"
            },

I am making the dataframe like:

    df = pd.DataFrame(columns=["year", "team", "player", "salary"])
    
    for year in nbaSalaryData.keys():
        for team in nbaSalaryData[year]:
            for player in nbaSalaryData[year][team]['players']:
                df = df.append({
                        "year": year,
                        "team": team,
                        "player": player,
                        "salary": nbaSalaryData[year][team]['players'][player]
                    }, ignore_index=True)
    
    df = df.set_index(['year', 'team', 'player']).sort_index()
    df

Which results in:

                                              salary 
    year       team     player
    1990-1991  Atlanta  Doc Rivers          $895,000
                        Dominique Wilkins   $2,065,000
                        Gary Leonard        $200,000
                        John Battle         $590,000
                        Kevin Willis        $685,000
    ... ... ... ...
    2020-2021   Washington  Robin Lopez     $7,300,000
                        Rui Hachimura       $4,692,840
                        Russell Westbrook   $41,358,814
                        Thomas Bryant       $8,333,333
                        Troy Brown          $3,372,840

This is the form I want: year, team, and player as index levels and salary as a column. I know that calling append in a loop is slow, but I cannot figure out an alternative. I tried building the index from tuples (with a simpler configuration that leaves out players and salaries), but it did not give the result I wanted.

    tuples = []
    index = None

    for year in nbaSalaryData.keys():
        for team in nbaSalaryData[year]:
            t = nbaSalaryData[year][team]
            tuples.append((year, team))

    index = pd.MultiIndex.from_tuples(tuples, names=["year", "team"])
    df = index.to_frame()
    df

Which outputs:

                             year   team
    year    team        
    1990-1991   Cleveland   1990-1991   Cleveland
                New York    1990-1991   New York
                Detroit     1990-1991   Detroit
                LA Lakers   1990-1991   LA Lakers
                Atlanta     1990-1991   Atlanta  

I'm not that familiar with pandas but realize there must be a faster way than append().

2 Answers


You can adapt the answer to a very similar question as follows:

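import json
import pandas as pd

# json_data: the JSON text as a string (skip this step if you already have a dict)
z = json.loads(json_data)

# Flatten the nested dict into {(year, team, player): salary}, then build the frame
out = pd.Series({
    (i,j,m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ['players']
    for m in z[i][j][k]
}).to_frame('salary').rename_axis('year team player'.split())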

# out:

                                           salary
year      team      player                       
1990-1991 Cleveland Hot Rod Williams   $3,785,000
                    Danny Ferry        $2,640,000
                    Mark Price         $1,400,000
                    Brad Daugherty     $1,320,000
                    Larry Nance        $1,260,000
                    Chucky Brown         $630,000
                    Steve Kerr           $548,000
                    Derrick Chievous     $525,000
                    Winston Bennett      $525,000
                    John Morton          $350,000
                    Milos Babic          $200,000
                    Gerald Paddio        $120,000
                    Darnell Valentine    $100,000
                    Henry James           $75,000

Also, if you intend to do some numerical analysis with those salaries, you probably want them as numbers, not strings. If so, also consider:

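# regex=True is required in pandas 2.0+, where str.replace matches literally by default
out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))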
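As a rough illustration of that conversion, here is a minimal sketch on a small stand-alone Series. The name `salaries` and the three sample values are only assumptions for the example (the values are copied from the question's data). `regex=True` is passed explicitly because newer pandas versions treat the pattern as a literal string otherwise.

```python
import pandas as pd

# Hypothetical sample in the same "$1,234,567" format as the question's data.
salaries = pd.Series(["$3,785,000", "$548,000", "$75,000"], name="salary")

# Remove every non-digit character, then parse the remaining digits as numbers.
numeric = pd.to_numeric(salaries.str.replace(r"\D", "", regex=True))

print(numeric.tolist())
```

Once the column is numeric, aggregations such as `numeric.sum()` or a `groupby` on the index levels behave as expected.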

PS: Explanation:

The for lines are just one big comprehension to flatten your nested dict. To understand how it works, try first:

[
    (i,j)
    for i in z
    for j in z[i]
]

The 3rd for would be to list all keys of z[i][j], which would be: ['salary', 'players', 'url'], but we are only interested in 'players', so we say so.

The final bit is, instead of a list, we want a dict. Try the expression without surrounding with pd.Series() and you'll see exactly what's going on.
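The steps described above can be tried one at a time on a cut-down dict. This is only a sketch: the sample `z` below is a hypothetical two-team excerpt shaped like the question's JSON (the empty `url` values are placeholders), and `pairs` and `flat` are names made up for the illustration.

```python
import pandas as pd

# Hypothetical excerpt shaped like the question's JSON; "url" values are placeholders.
z = {
    "1990-1991": {
        "Cleveland": {
            "players": {"Mark Price": "$1,400,000", "Steve Kerr": "$548,000"},
            "url": "",
        },
        "Atlanta": {
            "players": {"Doc Rivers": "$895,000"},
            "url": "",
        },
    }
}

# Step 1: the first two "for" clauses walk years and teams.
pairs = [(i, j) for i in z for j in z[i]]

# Step 2: add the players level and build a dict keyed by (year, team, player).
flat = {
    (i, j, m): z[i][j]["players"][m]
    for i in z
    for j in z[i]
    for m in z[i][j]["players"]
}

# Step 3: a Series built from tuple keys gets a MultiIndex automatically.
out = pd.Series(flat).to_frame("salary").rename_axis(["year", "team", "player"])
print(out)
```

Printing `pairs` and `flat` before the final step shows how each extra `for` clause adds one level to the eventual index.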


2 Comments

Wow, that's a doozy to comprehend at the moment. That's a good post - I will go over it more to try and understand how the series is built. This is very fast, especially compared to what I was doing. Thanks Pierre
sure, I added a tiny bit of an explanation and how to progressively see what the various bits are for.

We can build one small DataFrame per team inside the loop, collect them in a list, and concatenate once at the end. Delaying the concatenation until the end is much faster than appending to a DataFrame inside the loop.

import pandas as pd

box = []
# data refers to the shared json in the question
for year, value in data.items():
    for team, players in value.items():
        # one frame per team: player names as the index, salary as the only column
        content = players["players"]
        content = pd.DataFrame.from_dict(
            content, orient="index", columns=["salary"]
        ).rename_axis(index="player")
        content = content.assign(year=year, team=team)
        box.append(content)

box

[                       salary       year       team
 player                                             
 Hot Rod Williams   $3,785,000  1990-1991  Cleveland
 Danny Ferry        $2,640,000  1990-1991  Cleveland
 Mark Price         $1,400,000  1990-1991  Cleveland
 Brad Daugherty     $1,320,000  1990-1991  Cleveland
 Larry Nance        $1,260,000  1990-1991  Cleveland
 Chucky Brown         $630,000  1990-1991  Cleveland
 Steve Kerr           $548,000  1990-1991  Cleveland
 Derrick Chievous     $525,000  1990-1991  Cleveland
 Winston Bennett      $525,000  1990-1991  Cleveland
 John Morton          $350,000  1990-1991  Cleveland
 Milos Babic          $200,000  1990-1991  Cleveland
 Gerald Paddio        $120,000  1990-1991  Cleveland
 Darnell Valentine    $100,000  1990-1991  Cleveland
 Henry James           $75,000  1990-1991  Cleveland]

Concatenate and reorder index levels:

(
    pd.concat(box)
    .set_index(["year", "team"], append=True)
    .reorder_levels(["year", "team", "player"])
)
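Put together, the loop and the final concatenation might look like the sketch below. The sample `data` is a hypothetical two-team excerpt shaped like the question's JSON, `frames` and `result` are names chosen for the example, and the trailing `sort_index()` is an addition borrowed from the question's own code so the output is ordered the same way.

```python
import pandas as pd

# Hypothetical excerpt shaped like the question's JSON.
data = {
    "1990-1991": {
        "Cleveland": {"players": {"Mark Price": "$1,400,000", "Steve Kerr": "$548,000"}},
        "Atlanta": {"players": {"Doc Rivers": "$895,000"}},
    }
}

frames = []
for year, teams in data.items():
    for team, info in teams.items():
        # One small frame per team, indexed by player name.
        frame = pd.DataFrame.from_dict(
            info["players"], orient="index", columns=["salary"]
        ).rename_axis(index="player")
        frames.append(frame.assign(year=year, team=team))

# A single concat at the end, then move year and team into the index.
result = (
    pd.concat(frames)
    .set_index(["year", "team"], append=True)
    .reorder_levels(["year", "team", "player"])
    .sort_index()
)
print(result)
```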

1 Comment

That is an interesting approach. I wouldn't have thought of that.
