pandas : cannot reindex from a duplicate axis error

Question

I am trying to generate some synthetic data using the code below:

df = pd.DataFrame({
    'year' : pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M','F'],1000,p=[0.55,0.45]),
    'age': np.random.choice(range(20,65),1000),
})

df.loc[df['gender']=='M', 'income'] = ( 2000 + 550*df['age'] - 5.25*df['age']**2 ) * np.random.lognormal() * 1.035**(df['year']-2014)

It was working fine until I added the year variable, and now I am getting the following error:

ValueError: cannot reindex from a duplicate axis

I don't understand why I am getting this error as I am not doing anything with the index. What does this particular error mean, and how would I amend the code to get this to work?

jezrael · Accepted Answer · 2022-05-05 10:33:12Z

Filtering on both sides helps, because the Series on the right-hand side then has the same rows (and the same index) as the rows selected on the left-hand side:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({
    'year' : pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M','F'],1000,p=[0.55,0.45]),
    'age': np.random.choice(range(20,65),1000),
})

m = df['gender']=='M'
df.loc[m, 'income'] = ( 2000 + 550*df.loc[m, 'age'] - 5.25*df.loc[m, 'age']**2 ) * np.random.lognormal() * 1.035**(df.loc[m, 'year']-2014)

print (df)
    year gender  age        income
0   2014      M   37  11556.758435
0   2014      F   49           NaN
0   2014      F   26           NaN
0   2014      M   48  12426.597386
0   2014      M   30  10499.041891
..   ...    ...  ...           ...
1   2018      M   31  12248.836026
1   2018      M   49  14295.447090
1   2018      F   34           NaN
1   2018      M   39  13525.781391
1   2018      F   46           NaN

[1000 rows x 4 columns]

The reason is that, without filtering, the right-hand side is a Series of length 1000 with duplicated index values:

print (( 2000 + 550*df['age'] - 5.25*df['age']**2 ) * np.random.lognormal() * 1.035**(df['year']-2014))
0    11556.758435
0    12457.656258
0     9718.568650
0    12426.597386
0    10499.041891
     ...
1    12248.836026
1    14295.447090
1    12796.566872
1    13525.781391
1    14160.974248
Length: 1000, dtype: float64
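
As an aside on where the duplicated labels come from: Series.repeat keeps each element's original index label, and a DataFrame built from a dict takes its index from that Series. A minimal sketch, assuming a 3-fold repeat instead of 500 (the names year and small are illustrative and not from the question):

```python
import numpy as np
import pandas as pd

# repeat() duplicates the values together with their original index labels
year = pd.Series([2014, 2018]).repeat(3)
print(year.index.tolist())    # [0, 0, 0, 1, 1, 1]

# The DataFrame inherits that index from the Series
small = pd.DataFrame({
    'year': year,
    'age': np.arange(20, 26),
})
print(small.index.is_unique)  # False
```

This is why the code in the question worked before the year column was added: with only NumPy arrays in the dict, the frame gets a default, unique index.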

After filtering, the right-hand side has the same length (and the same index) as the number of True values in the mask, so the assignment works:

print (m.sum())
566

print (( 2000 + 550*df.loc[m, 'age'] - 5.25*df.loc[m, 'age']**2 ) * np.random.lognormal() * 1.035**(df.loc[m, 'year']-2014))
0    11556.758435
0    12426.597386
0    10499.041891
0    12450.987175
0    12072.183268
     ...
1    10384.145945
1    12248.836026
1    12248.836026
1    14295.447090
1    13525.781391
Length: 566, dtype: float64
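
The difference between the two cases can be reproduced on a small frame. This is only a sketch under assumptions: the 6-row toy frame, the simplified income formula and the names toy, mask, broken and fixed are made up for illustration. The exact wording of the error message also depends on the pandas version, so only the exception type is checked.

```python
import pandas as pd

# Toy frame with the same kind of duplicated index as in the question
toy = pd.DataFrame(
    {'gender': ['M', 'F', 'M', 'M', 'F', 'M'],
     'age': [30, 40, 50, 35, 45, 55]},
    index=[0, 0, 0, 1, 1, 1],
)
mask = toy['gender'] == 'M'

# Unfiltered right-hand side: 6 values with duplicated labels for 4 target rows
broken = toy.copy()
unfiltered_error = None
try:
    broken.loc[mask, 'income'] = 100.0 * broken['age']
except ValueError as exc:
    unfiltered_error = exc
print(type(unfiltered_error).__name__)  # ValueError

# Filtered right-hand side: same rows and same index as the target
fixed = toy.copy()
fixed.loc[mask, 'income'] = 100.0 * fixed.loc[mask, 'age']
print(fixed)
```

In the filtered case the 'F' rows keep NaN in income, as in the output shown earlier.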

Another option is to create a default index:

df = df.reset_index(drop=True)
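
A short sketch of why this helps, again on made-up data (the 4-row frame and the simplified formula are assumptions for illustration): once the labels are unique, pandas can align an unfiltered right-hand side with the selected rows by label.

```python
import pandas as pd

toy = pd.DataFrame(
    {'gender': ['M', 'F', 'M', 'M'], 'age': [30, 40, 50, 35]},
    index=[0, 0, 1, 1],  # duplicated labels, as produced by repeat()
)

toy = toy.reset_index(drop=True)  # labels become 0, 1, 2, 3
m = toy['gender'] == 'M'

# The unfiltered right-hand side is now aligned by its unique labels
toy.loc[m, 'income'] = 100.0 * toy['age']
print(toy)
```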

edited May 5, 2022 at 10:33

answered May 5, 2022 at 10:27

5 Comments

brb Over a year ago

Oh. Thanks. Can you explain what the error means or alternatively, why don't I need to filter it without the 'year' column? Why does adding the 'year' column kill the previous code?

jezrael Over a year ago

@brb - added to answer.

brb Over a year ago

Oh, I see what is happening by resetting the index - I didn't see that it had a binary index before.

jezrael Over a year ago

@brb - understood. The reason is that pandas assigns by index. If the indices are duplicated but both sides have the same length, the values are assigned correctly. If the indices are duplicated and the lengths differ, the assignment is ambiguous: which of the 1000 values should be assigned to the 566 rows? So pandas raises an error.

jezrael Over a year ago

@brb - with a default index after df = df.reset_index(drop=True) there are no duplicates, so it works.
