How to quickly create a new DataFrame column based on values from another DataFrame?

Question

I want to create a new column, df_cau2['continent']. I have two DataFrames:

country_continent

               Continent
Country
Afghanistan         Asia
Albania           Europe
Algeria           Africa
American Samoa   Oceania

and
df_cau2

         date home_team away_team  home_score  away_score tournament     city   country  neutral
0  1872-11-30  Scotland   England           0           0   Friendly  Glasgow  Scotland    False
1  1873-03-08   England  Scotland           4           2   Friendly   London   England    False
2  1874-03-07  Scotland   England           2           1   Friendly  Glasgow  Scotland    False

To create the new continent column, I use apply on df_cau2 like this:


def same_continent(home,away):
    if country_continent.loc[home].Continent == country_continent.loc[away].Continent:
        return country_continent.loc[home].Continent
    return 'None'

df_cau2['continent']=df_cau2.apply(lambda x: same_continent(x['home_team'],x['away_team']),axis=1)
df_cau2.head()

With 39,480 rows in df_cau2, this code runs too slowly. How can I change it to run faster? I am thinking about using np.select, but I don't know how to use it in this case.

This is the result that I want:

             date home_team away_team  home_score  away_score tournament       city   country  neutral      continent
7611   1970-09-11      Iran    Turkey           1           1   Friendly    Teheran      Iran    False           None
31221  2009-03-11     Nepal  Pakistan           1           0   Friendly  Kathmandu     Nepal    False           Asia
32716  2010-11-17  Colombia      Peru           1           1   Friendly     Bogotá  Colombia    False  South America

Thanks

Corralien · Accepted Answer · 2022-01-15 16:21:51Z

1

IIUC, you want to set the continent column only if the teams in home_team and away_team are on the same continent:

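# df2.squeeze() turns the one-column lookup DataFrame into a Series indexed by country
home_continent = df1['home_team'].map(df2.squeeze())
away_continent = df1['away_team'].map(df2.squeeze())
# Boolean mask: True where both teams map to the same continent
m = home_continent == away_continent
# Rows outside the mask are left as NaN
df1.loc[m, 'continent'] = home_continent.loc[m]
print(df1)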

# Output
  home_team away_team continent
0    Canada   England       NaN
1    France     Spain    Europe
2     China     Japan      Asia

Set up an MRE:

df1 = pd.DataFrame({'home_team': ['Canada', 'France', 'China'],
                    'away_team': ['England', 'Spain', 'Japan']})
print(df1)

df2 = pd.DataFrame({'Country': ['Canada', 'China', 'England',
                                'France', 'Japan', 'Spain'],
                    'Continent': ['North America', 'Asia', 'Europe',
                                  'Europe', 'Asia', 'Europe']}).set_index('Country')
print(df2)

# Output df1
  home_team away_team
0    Canada   England
1    France     Spain
2     China     Japan

# Output df2
             Continent
Country               
Canada   North America
China             Asia
England         Europe
France          Europe
Japan             Asia
Spain           Europe
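
The question's desired output shows the string 'None' for mismatched continents, whereas the snippet above leaves NaN. A minimal sketch of one way to get that, reusing the MRE data and assuming Series.where is acceptable in place of the boolean-mask assignment (the name lookup is introduced here for illustration only):

```python
import pandas as pd

# Same toy data as the MRE above
df1 = pd.DataFrame({'home_team': ['Canada', 'France', 'China'],
                    'away_team': ['England', 'Spain', 'Japan']})
df2 = pd.DataFrame({'Country': ['Canada', 'China', 'England',
                                'France', 'Japan', 'Spain'],
                    'Continent': ['North America', 'Asia', 'Europe',
                                  'Europe', 'Asia', 'Europe']}).set_index('Country')

# Series mapping country name -> continent (hypothetical helper name)
lookup = df2['Continent']

home_continent = df1['home_team'].map(lookup)
away_continent = df1['away_team'].map(lookup)

# Keep the home continent where both sides agree; otherwise use the string 'None'.
# A team missing from the lookup maps to NaN, and NaN == NaN is False,
# so such rows also end up as 'None' instead of raising a KeyError.
df1['continent'] = home_continent.where(home_continent == away_continent, 'None')
print(df1)
```

Both lookups are vectorized, so the per-row Python function call that makes apply slow is avoided.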


1 Comment

robocon20x Over a year ago

Thanks, I think this is the fastest way I could find. My code needed 26 s but yours only 0.1 s :))

Parfait · Accepted Answer · 2022-01-16 05:44:26Z

1

Consider merging the continent lookup DataFrame twice to create home and away continent columns. Since you will then have both continents, assign the new shared continent column conditionally with numpy.where:

df_cau2 = (
    df_cau2.merge(
        country_continent.reset_index(),
        left_on = "home_team",
        right_on = "Country",
        how = "left"
    ).merge(
        country_continent.reset_index(),
        left_on = "away_team",
        right_on = "Country",
        how = "left",
        suffixes = ["_home", "_away"]
    )
)

df_cau2["shared_continent"] = np.where(
    df_cau2["Continent_home"].eq(df_cau2["Continent_away"]),
    df_cau2["Continent_home"],
    np.nan
)
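
To try the double-merge approach end to end, here is a minimal self-contained sketch on toy data. The frame named matches and its contents are assumptions made for illustration (they stand in for df_cau2), and the final column selection is an optional tidy-up, not part of the answer above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_cau2
matches = pd.DataFrame({'home_team': ['Canada', 'France', 'China'],
                        'away_team': ['England', 'Spain', 'Japan']})

# Lookup frame indexed by country, shaped like country_continent in the question
country_continent = pd.DataFrame(
    {'Country': ['Canada', 'China', 'England', 'France', 'Japan', 'Spain'],
     'Continent': ['North America', 'Asia', 'Europe', 'Europe', 'Asia', 'Europe']}
).set_index('Country')

lookup = country_continent.reset_index()

merged = (
    matches
    # First merge attaches the home team's continent
    .merge(lookup, left_on='home_team', right_on='Country', how='left')
    # Second merge attaches the away team's; the suffixes separate the
    # overlapping Country/Continent columns from the two merges
    .merge(lookup, left_on='away_team', right_on='Country', how='left',
           suffixes=('_home', '_away'))
)

# Shared continent where both sides agree, NaN otherwise
merged['shared_continent'] = np.where(
    merged['Continent_home'].eq(merged['Continent_away']),
    merged['Continent_home'],
    np.nan,
)

# Optional: keep only the columns of interest
result = merged[['home_team', 'away_team', 'shared_continent']]
print(result)
```

Because both merges use how="left", the row order and row count of the left frame are preserved as long as each country appears only once in the lookup.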

edited Jan 16, 2022 at 5:44

answered Jan 15, 2022 at 18:29
