I am new to coding, and I understand that this is a very basic question.

I have a dataframe as:

df

      Unnamed: 0  time                 home_team      away_team       full_time_result                    both_teams_to_score        double_chance
--  ------------  -------------------  -------------  --------------  ----------------------------------  -------------------------  ------------------------------------
 0             0  2021-01-12 18:00:00  Sheff Utd      Newcastle       {'1': 2400, 'X': 3200, '2': 3100}   {'yes': 2000, 'no': 1750}  {'1X': 1360, '12': 1360, '2X': 1530}
 1             1  2021-01-12 20:15:00  Burnley        Man Utd         {'1': 7000, 'X': 4500, '2': 1440}   {'yes': 1900, 'no': 1900}  {'1X': 2620, '12': 1180, '2X': 1100}
 2             2  2021-01-12 20:15:00  Wolverhampton  Everton         {'1': 2450, 'X': 3200, '2': 3000}   {'yes': 1950, 'no': 1800}  {'1X': 1360, '12': 1360, '2X': 1530}
 3             3  2021-01-13 18:00:00  Man City       Brighton        {'1': 1180, 'X': 6500, '2': 14000}  {'yes': 2040, 'no': 1700}  {'1X': 1040, '12': 1110, '2X': 4500}
 4             4  2021-01-13 20:15:00  Aston Villa    Tottenham       {'1': 2620, 'X': 3500, '2': 2500}   {'yes': 1570, 'no': 2250}  {'1X': 1500, '12': 1280, '2X': 1440}
 5             5  2021-01-14 20:00:00  Arsenal        Crystal Palace  {'1': 1500, 'X': 4000, '2': 6500}   {'yes': 1950, 'no': 1800}  {'1X': 1110, '12': 1220, '2X': 2500}
 6             6  2021-01-15 20:00:00  Fulham         Chelsea         {'1': 5750, 'X': 4330, '2': 1530}   {'yes': 1800, 'no': 1950}  {'1X': 2370, '12': 1200, '2X': 1140}
 7             7  2021-01-16 12:30:00  Wolverhampton  West Brom       {'1': 1440, 'X': 4200, '2': 7500}   {'yes': 2250, 'no': 1570}  {'1X': 1100, '12': 1220, '2X': 2620}
 8             8  2021-01-16 15:00:00  Leeds          Brighton        {'1': 2000, 'X': 3600, '2': 3600}   {'yes': 1530, 'no': 2370}  {'1X': 1280, '12': 1280, '2X': 1720}

I want to expand each dictionary column into separate columns, one per key. For example, the full_time_result column would be split into full_time_result_1, full_time_result_X and full_time_result_2, and the same would apply to both_teams_to_score and double_chance, giving a header like this:

      Unnamed: 0  time  home_team  away_team  full_time_result_1  full_time_result_X  full_time_result_2  both_teams_to_score_yes  both_teams_to_score_no  double_chance_1X  double_chance_12  double_chance_2X
--  ------------  ----  ---------  ---------  ------------------  ------------------  ------------------  -----------------------  ----------------------  ----------------  ----------------  ----------------

I am following this example given here but I am unable to get it to work. Here is my code:

import pandas as pd
from tabulate import tabulate
df = pd.read_csv(r'C:\Users\Harshad\Desktop\re.csv')
df['full_time_result'] = df['full_time_result'].apply(pd.Series)
print(tabulate(df, headers='keys'))

      Unnamed: 0  time                 home_team      away_team       full_time_result                    both_teams_to_score        double_chance
--  ------------  -------------------  -------------  --------------  ----------------------------------  -------------------------  ------------------------------------
 0             0  2021-01-12 18:00:00  Sheff Utd      Newcastle       {'1': 2400, 'X': 3200, '2': 3100}   {'yes': 2000, 'no': 1750}  {'1X': 1360, '12': 1360, '2X': 1530}
 1             1  2021-01-12 20:15:00  Burnley        Man Utd         {'1': 7000, 'X': 4500, '2': 1440}   {'yes': 1900, 'no': 1900}  {'1X': 2620, '12': 1180, '2X': 1100}
 2             2  2021-01-12 20:15:00  Wolverhampton  Everton         {'1': 2450, 'X': 3200, '2': 3000}   {'yes': 1950, 'no': 1800}  {'1X': 1360, '12': 1360, '2X': 1530}
 3             3  2021-01-13 18:00:00  Man City       Brighton        {'1': 1180, 'X': 6500, '2': 14000}  {'yes': 2040, 'no': 1700}  {'1X': 1040, '12': 1110, '2X': 4500}
 4             4  2021-01-13 20:15:00  Aston Villa    Tottenham       {'1': 2620, 'X': 3500, '2': 2500}   {'yes': 1570, 'no': 2250}  {'1X': 1500, '12': 1280, '2X': 1440}

Help would be greatly appreciated.

1 Answer

  • Verify the columns are dict type, and not str type.
    • If the columns are str type, convert them with ast.literal_eval.
  • Use pandas.json_normalize() to normalize each column of dicts.
  • Use a list-comprehension to rename the columns.
  • Use pandas.concat() with axis=1 to combine the dataframes.
import pandas as pd
from ast import literal_eval

# test dataframe
data = {'time': ['2021-01-12 18:00:00', '2021-01-12 20:15:00', '2021-01-12 20:15:00', '2021-01-13 18:00:00', '2021-01-13 20:15:00', '2021-01-14 20:00:00', '2021-01-15 20:00:00', '2021-01-16 12:30:00', '2021-01-16 15:00:00'], 'home_team': ['Sheff Utd', 'Burnley', 'Wolverhampton', 'Man City', 'Aston Villa', 'Arsenal', 'Fulham', 'Wolverhampton', 'Leeds'], 'away_team': ['Newcastle', 'Man Utd', 'Everton', 'Brighton', 'Tottenham', 'Crystal Palace', 'Chelsea', 'West Brom', 'Brighton'], 'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}", "{'1': 7000, 'X': 4500, '2': 1440}", "{'1': 2450, 'X': 3200, '2': 3000}", "{'1': 1180, 'X': 6500, '2': 14000}", "{'1': 2620, 'X': 3500, '2': 2500}", "{'1': 1500, 'X': 4000, '2': 6500}", "{'1': 5750, 'X': 4330, '2': 1530}", "{'1': 1440, 'X': 4200, '2': 7500}", "{'1': 2000, 'X': 3600, '2': 3600}"], 'both_teams_to_score': ["{'yes': 2000, 'no': 1750}", "{'yes': 1900, 'no': 1900}", "{'yes': 1950, 'no': 1800}", "{'yes': 2040, 'no': 1700}", "{'yes': 1570, 'no': 2250}", "{'yes': 1950, 'no': 1800}", "{'yes': 1800, 'no': 1950}", "{'yes': 2250, 'no': 1570}", "{'yes': 1530, 'no': 2370}"], 'double_chance': ["{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 2620, '12': 1180, '2X': 1100}", "{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 1040, '12': 1110, '2X': 4500}", "{'1X': 1500, '12': 1280, '2X': 1440}", "{'1X': 1110, '12': 1220, '2X': 2500}", "{'1X': 2370, '12': 1200, '2X': 1140}", "{'1X': 1100, '12': 1220, '2X': 2620}", "{'1X': 1280, '12': 1280, '2X': 1720}"]}
df = pd.DataFrame(data)

# display(df.head(2))
                  time  home_team  away_team                   full_time_result        both_teams_to_score                         double_chance
0  2021-01-12 18:00:00  Sheff Utd  Newcastle  {'1': 2400, 'X': 3200, '2': 3100}  {'yes': 2000, 'no': 1750}  {'1X': 1360, '12': 1360, '2X': 1530}
1  2021-01-12 20:15:00    Burnley    Man Utd  {'1': 7000, 'X': 4500, '2': 1440}  {'yes': 1900, 'no': 1900}  {'1X': 2620, '12': 1180, '2X': 1100}

# convert time to datetime
df.time = pd.to_datetime(df.time)

# determine if columns are str or dict type
print(type(df.iloc[0, 3]))
[out]:
<class 'str'>

# convert columns from str to dict only if the columns are str type
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)
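
If some cells might already be dicts (for example, after a partial conversion), passing them to literal_eval raises an error. Below is a minimal sketch of a guarded conversion. The helper name parse_if_str and the small demo frame are hypothetical and only for illustration. It uses Series.apply instead of DataFrame.applymap, which is deprecated as of pandas 2.1 in favor of DataFrame.map.

```python
from ast import literal_eval

import pandas as pd


def parse_if_str(value):
    """Parse strings with literal_eval; pass any other object through unchanged."""
    return literal_eval(value) if isinstance(value, str) else value


# hypothetical frame mixing a str cell and an already-parsed dict cell
demo = pd.DataFrame({
    'both_teams_to_score': ["{'yes': 2000, 'no': 1750}", {'yes': 1900, 'no': 1900}],
})

# convert one column at a time; every cell ends up as a dict
demo['both_teams_to_score'] = demo['both_teams_to_score'].apply(parse_if_str)
print(demo['both_teams_to_score'].map(type).tolist())
```

The same helper can be applied to each dict column in a loop before normalizing.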

# normalize columns and rename headers
ftr = pd.json_normalize(df.full_time_result)
ftr.columns = [f'full_time_result_{col}' for col in ftr.columns]

btts = pd.json_normalize(df.both_teams_to_score)
btts.columns = [f'both_teams_to_score_{col}' for col in btts.columns]

dc = pd.json_normalize(df.double_chance)
dc.columns = [f'double_chance_{col}' for col in dc.columns]

# concat the dataframes
df_normalized = pd.concat([df.iloc[:, :3], ftr, btts, dc], axis=1)

# display() is available in Jupyter/IPython; use print(df_normalized) in a plain script
display(df_normalized)

                 time      home_team       away_team  full_time_result_1  full_time_result_X  full_time_result_2  both_teams_to_score_yes  both_teams_to_score_no  double_chance_1X  double_chance_12  double_chance_2X
0 2021-01-12 18:00:00      Sheff Utd       Newcastle                2400                3200                3100                     2000                    1750              1360              1360              1530
1 2021-01-12 20:15:00        Burnley         Man Utd                7000                4500                1440                     1900                    1900              2620              1180              1100
2 2021-01-12 20:15:00  Wolverhampton         Everton                2450                3200                3000                     1950                    1800              1360              1360              1530
3 2021-01-13 18:00:00       Man City        Brighton                1180                6500               14000                     2040                    1700              1040              1110              4500
4 2021-01-13 20:15:00    Aston Villa       Tottenham                2620                3500                2500                     1570                    2250              1500              1280              1440
5 2021-01-14 20:00:00        Arsenal  Crystal Palace                1500                4000                6500                     1950                    1800              1110              1220              2500
6 2021-01-15 20:00:00         Fulham         Chelsea                5750                4330                1530                     1800                    1950              2370              1200              1140
7 2021-01-16 12:30:00  Wolverhampton       West Brom                1440                4200                7500                     2250                    1570              1100              1220              2620
8 2021-01-16 15:00:00          Leeds        Brighton                2000                3600                3600                     1530                    2370              1280              1280              1720

Consolidated Code

# convert the columns to dict type if they are str type
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)

# normalize all columns
df_list = list()

for col in df.columns[3:]:
    v = pd.json_normalize(df[col])
    v.columns = [f'{col}_{c}' for c in v.columns]
    df_list.append(v)

# combine into one dataframe
df_normalized = pd.concat([df.iloc[:, :3]] + df_list, axis=1)
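
The consolidated code relies on column positions (iloc[:, 3:]), which shift if the CSV carries an extra column such as Unnamed: 0. As a hedged alternative, here is a sketch that selects the dict columns by name. The function name expand_dict_columns is hypothetical, and the sample uses the first two rows from the question.

```python
from ast import literal_eval

import pandas as pd


def expand_dict_columns(frame, dict_cols):
    """Expand each column of dicts (or dict-like strings) into prefixed columns."""
    # keep the non-dict columns; reset the index so it aligns with json_normalize output
    parts = [frame.drop(columns=dict_cols).reset_index(drop=True)]
    for col in dict_cols:
        # parse strings, leave existing dicts untouched
        parsed = frame[col].apply(lambda v: literal_eval(v) if isinstance(v, str) else v)
        # one column per key, renamed to '<col>_<key>'
        parts.append(pd.json_normalize(parsed.tolist()).add_prefix(f'{col}_'))
    return pd.concat(parts, axis=1)


sample = pd.DataFrame({
    'home_team': ['Sheff Utd', 'Burnley'],
    'away_team': ['Newcastle', 'Man Utd'],
    'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}", "{'1': 7000, 'X': 4500, '2': 1440}"],
    'both_teams_to_score': ["{'yes': 2000, 'no': 1750}", "{'yes': 1900, 'no': 1900}"],
})

result = expand_dict_columns(sample, ['full_time_result', 'both_teams_to_score'])
print(result.columns.tolist())
```

With the dataframe from the question, the call would be expand_dict_columns(df, ['full_time_result', 'both_teams_to_score', 'double_chance']).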