1

I have a pandas DataFrame with a MultiIndex as follows:

>>> import pandas as pd
>>> category = ['bar', 'bar', 'bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz',
                'baz', 'baz', 'foo', 'foo', 'foo']
>>> timestamp = ['2017-01-01 09:00:00', '2017-01-01 09:01:00', '2017-01-01 09:02:00', 
                '2017-01-01 09:03:00', '2017-01-01 09:04:00', '2016-11-18 03:18:00', 
                '2016-11-18 03:19:00', '2016-11-18 03:20:00', '2016-11-18 03:21:00', 
                '2016-11-18 03:22:00', '2016-11-18 03:23:00', '2017-02-03 20:39:00', 
                '2017-02-03 20:40:00', '2017-02-03 20:41:00']
>>> values = [1,1,2,2,2,35,3,3,4,4,4,28,28,28]
>>> tuples = list(zip(*[category,timestamp]))
>>> index = pd.MultiIndex.from_tuples(tuples, names=['category', 'timestamp'])
>>> df = pd.DataFrame(values,index=index,columns=['values'])
>>> df
                                  values
category timestamp                  
bar      2017-01-01 09:00:00       1
         2017-01-01 09:01:00       1
         2017-01-01 09:02:00       2
         2017-01-01 09:03:00       2
         2017-01-01 09:04:00       2
baz      2016-11-18 03:18:00      35
         2016-11-18 03:19:00       3
         2016-11-18 03:20:00       3
         2016-11-18 03:21:00       4
         2016-11-18 03:22:00       4
         2016-11-18 03:23:00       4
foo      2017-02-03 20:39:00      28
         2017-02-03 20:40:00      28
         2017-02-03 20:41:00      28

For each category, I want to find the cumulative sum of the number of times the value column changes, like this:

                              values changed  cum_changes
category timestamp                                       
bar      2017-01-01 09:00:00       1   False            0
         2017-01-01 09:01:00       1   False            0
         2017-01-01 09:02:00       2    True            1
         2017-01-01 09:03:00       2   False            1
         2017-01-01 09:04:00       2   False            1
baz      2016-11-18 03:18:00      35   False            0
         2016-11-18 03:19:00       3    True            1
         2016-11-18 03:20:00       3   False            1
         2016-11-18 03:21:00       4    True            2
         2016-11-18 03:22:00       4   False            2
         2016-11-18 03:23:00       4   False            2
foo      2017-02-03 20:39:00      28   False            0
         2017-02-03 20:40:00      28   False            0
         2017-02-03 20:41:00      28   False            0

I tried doing this:

df["changes"] = False
df.iloc[idx[:,1:],1] = df.iloc[idx[:,1:],0] == df.iloc[idx[:,:-1],0]   #This doesn't work
df["cum_changes"] = df["changed"].groupby(level=[0]).cumsum().astype(int)

But unfortunately the second line doesn't work. It's analogous to the way you would multi-index by value using loc, but apparently iloc doesn't handle a MultiIndex the same way. I can't index by label because the timestamps are different in each group, and I can't use head() because the length of each group is different. Is it possible to do positional indexing on the second level of a MultiIndex?

What I actually need is the "cum_changes" column, the "changed" column is just an intermediate step. If there's a different way to calculate the "cum_changes" column I'm interested to hear it. I know it can be done by iterating over the category column, but it seems like it should be possible to keep this vectorized, so I'm looking for a solution that does not involve looping.

I found this related question, but I don't believe it applies since the solution isn't actually indexing by position but rather finding the labels that correspond to the given positions and indexing by the labels: Slice MultiIndex pandas DataFrame by position

1
  • 2
    Use diff() function and check if the result is 0, you don't need the iloc. Commented Mar 6, 2017 at 20:41

1 Answer 1

1

you can use diff() as @Psidom has already said in the comment:

In [25]: df['x'] = df.groupby(level=0)['values'] \
                     .apply(lambda x: x.diff().fillna(0).ne(0).cumsum())

In [26]: df
Out[26]:
                              values  x
category timestamp
bar      2017-01-01 09:00:00       1  0
         2017-01-01 09:01:00       1  0
         2017-01-01 09:02:00       2  1
         2017-01-01 09:03:00       2  1
         2017-01-01 09:04:00       2  1
baz      2016-11-18 03:18:00      35  0
         2016-11-18 03:19:00       3  1
         2016-11-18 03:20:00       3  1
         2016-11-18 03:21:00       4  2
         2016-11-18 03:22:00       4  2
         2016-11-18 03:23:00       4  2
foo      2017-02-03 20:39:00      28  0
         2017-02-03 20:40:00      28  0
         2017-02-03 20:41:00      28  0
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.