I am trying to generate some synthetic data using the below code:

df = pd.DataFrame({
    'year' : pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M','F'],1000,p=[0.55,0.45]),
    'age': np.random.choice(range(20,65),1000),
})

df.loc[df['gender']=='M', 'income'] = ( 2000 + 550*df['age'] - 5.25*df['age']**2 ) * np.random.lognormal() * 1.035**(df['year']-2014)

It was working fine until I added the year variable, and now I get the following error:

ValueError: cannot reindex from a duplicate axis

I don't understand why I am getting this error as I am not doing anything with the index. What does this particular error mean, and how would I amend the code to get this to work?

1 Answer

Filtering on both sides of the assignment helps, because the Series on the right-hand side then has exactly the same index as the rows selected on the left-hand side:

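import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({
    # repeat() also repeats the index labels, so the index is 0,...,0,1,...,1
    'year' : pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M','F'],1000,p=[0.55,0.45]),
    'age': np.random.choice(range(20,65),1000),
})

# apply the same boolean mask on both sides so the indices match
m = df['gender']=='M'
df.loc[m, 'income'] = ( 2000 + 550*df.loc[m, 'age'] - 5.25*df.loc[m, 'age']**2 ) * np.random.lognormal() * 1.035**(df.loc[m, 'year']-2014)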

print (df)
    year gender  age        income
0   2014      M   37  11556.758435
0   2014      F   49           NaN
0   2014      F   26           NaN
0   2014      M   48  12426.597386
0   2014      M   30  10499.041891
..   ...    ...  ...           ...
1   2018      M   31  12248.836026
1   2018      M   49  14295.447090
1   2018      F   34           NaN
1   2018      M   39  13525.781391
1   2018      F   46           NaN

[1000 rows x 4 columns]

The reason is that, without the filter, the right-hand side is a Series of length 1000 whose index contains only the duplicated labels 0 and 1:

print (( 2000 + 550*df['age'] - 5.25*df['age']**2 ) * np.random.lognormal() * 1.035**(df['year']-2014))
0    11556.758435
0    12457.656258
0     9718.568650
0    12426.597386
0    10499.041891
         ...
1    12248.836026
1    14295.447090
1    12796.566872
1    13525.781391
1    14160.974248
Length: 1000, dtype: float64

After filtering, the length of the right-hand side equals the number of True values in the mask, and its duplicated index matches the selected rows exactly, so the assignment works:

print (m.sum())
566

print (( 2000 + 550*df.loc[m, 'age'] - 5.25*df.loc[m, 'age']**2 ) * np.random.lognormal() * 1.035**(df.loc[m, 'year']-2014))
0    11556.758435
0    12426.597386
0    10499.041891
0    12450.987175
0    12072.183268
         ...
1    10384.145945
1    12248.836026
1    12248.836026
1    14295.447090
1    13525.781391
Length: 566, dtype: float64    
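To see the mechanism in isolation, here is a minimal sketch on a six-row frame. The column values and the simplified income formula (100 * age) are made up for illustration; only the duplicated index mirrors the data above. The wording of the ValueError message may differ between pandas versions, so the sketch only checks the exception type.

```python
import pandas as pd

# Toy data (made up): repeat() duplicates the index labels -> 0, 0, 0, 1, 1, 1
df = pd.DataFrame({
    'year': pd.Series([2014, 2018]).repeat(3),
    'gender': ['M', 'F', 'M', 'F', 'M', 'M'],
    'age': [30, 40, 50, 35, 45, 55],
})

m = df['gender'] == 'M'

# Unfiltered right side: 6 values for 4 target rows, and the labels 0 and 1
# are not unique, so pandas cannot decide which value belongs to which row.
attempt = df.copy()
try:
    attempt.loc[m, 'income'] = 100.0 * attempt['age']
    raised = False
except ValueError:
    raised = True
print('unfiltered right side raised ValueError:', raised)

# Filtered right side: its index is identical to the index of the selected
# rows, so the assignment goes through.
df.loc[m, 'income'] = 100.0 * df.loc[m, 'age']
print(df)
```

Rows with gender 'F' keep NaN in the income column, the same pattern as in the full-size output above.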

Another option is to create a default (unique) index:

df = df.reset_index(drop=True)
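A small sketch of the reset_index route, again on made-up toy data with a simplified formula (100 * age). Once the labels are unique, the unfiltered right-hand side can be aligned row by row. The second frame shows an alternative that is my own assumption rather than part of the answer: building the year column with np.repeat, which returns a plain array and so never introduces duplicated labels in the first place.

```python
import numpy as np
import pandas as pd

# Toy data (made up): index labels are 0, 0, 0, 1, 1, 1 because of repeat()
df = pd.DataFrame({
    'year': pd.Series([2014, 2018]).repeat(3),
    'gender': ['M', 'F', 'M', 'F', 'M', 'M'],
    'age': [30, 40, 50, 35, 45, 55],
})

# Replace the duplicated labels with a fresh 0..n-1 index
df = df.reset_index(drop=True)
m = df['gender'] == 'M'

# The unfiltered right side now aligns on unique labels
df.loc[m, 'income'] = 100.0 * df['age']
print(df)

# Alternative (assumption, not from the answer): np.repeat yields an array
# without an index, so the frame gets the default unique index directly.
df2 = pd.DataFrame({
    'year': np.repeat([2014, 2018], 3),
    'gender': ['M', 'F', 'M', 'F', 'M', 'M'],
    'age': [30, 40, 50, 35, 45, 55],
})
print(df2.index.is_unique)
```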

5 Comments

Oh. Thanks. Can you explain what the error means or alternatively, why don't I need to filter it without the 'year' column? Why does adding the 'year' column kill the previous code?
@brb - added to answer.
Oh, I see what is happening by resetting the index - I didn't see that it had a binary index before.
@brb - understood. The answer is that pandas assigns by index labels. If the indices are duplicated but identical on both sides (same length), the values are assigned correctly. But if the indices are duplicated and the lengths differ, the assignment is ambiguous: which of the 1000 values should go to the 566 rows? So pandas raises an error.
@brb - with a default index after df = df.reset_index(drop=True) there are no duplicates, so it works.
