1

I have a dataframe like this,

>>> data = {
    'year':[2019, 2020, 2020, 2019, 2020, 2019],
    'provider':['X', 'X', 'Y', 'Z', 'Z', 'T'],
    'price':[100, 122, 0, 150, 120, 80],
    'count':[20, 15, 24, 16, 24, 10]
}
>>> df = pd.DataFrame(data)
>>> df
   year provider  price  count
0  2019        X    100     20
1  2020        X    122     15
2  2020        Y      0     24
3  2019        Z    150     16
4  2020        Z    120     24
5  2019        T     80     10

And this is expected output:

  provider  price_rate  count_rate
0        X        0.22       -0.25
1        Z       -0.20        0.50

I want to group prices on providers and find price, count differences between 2019 and 2020. If there is no price or count record at 2020 or 2019, don't want to see related provider.

2
  • Are there always only 1 or 2 rows per provider? Commented Jan 16, 2020 at 15:14
  • Yeap always 1 or 2. Commented Jan 16, 2020 at 15:17

2 Answers 2

3

By the assumption that there are always only 1 or 2 rows per provider, we can first sort_values on year to make sure 2019 comes before 2020.

Then we groupby on provider and divide the rows of price and count and substract 1.

df = df.sort_values('year')
grp = (
    df.groupby('provider')
      .apply(lambda x: x[['price', 'count']].div(x[['price', 'count']].shift()).sub(1))
)

dfnew = df[['provider']].join(grp).dropna()

  provider  price  count
1        X   0.22  -0.25
4        Z  -0.20   0.50

Or only vectorized methods:

dfnew = df[df['provider'].duplicated(keep=False)].sort_values(['provider', 'year'])
dfnew[['price', 'count']] = (
    dfnew[['price', 'count']].div(dfnew[['price', 'count']].shift()).sub(1)
)

dfnew = dfnew[dfnew['provider'].eq(dfnew['provider'].shift())].drop('year', axis=1)

  provider  price  count
1        X   0.22  -0.25
4        Z  -0.20   0.50
Sign up to request clarification or add additional context in comments.

1 Comment

Just for your convenience, I added another method, which might (not 100% sure) run faster on larger datasets. @AlperenTaşkın
3

You can try:

final = (df.set_index(['provider','year']).groupby(level=0)
      .pct_change().dropna().droplevel(1).add_suffix('_count').reset_index())

  provider  price_rate  count_rate
0        X        0.22       -0.25
1        Z       -0.20        0.50

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.