
Given the following dataset DF:

uuid,eventTime,Op.progress,Op.progressPercentage, AnotherAttribute
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:39,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:49,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:18,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:49,P,5.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:54:27,P,5.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:07,P,6.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:27,P,6.0,01:53:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:33:46,W,40.0,01:13:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:10,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:16,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:18,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:55,P,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:15,P,1.0,01:59:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:31,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:51,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:22,P,4.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:51,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:22,S,98.0,00:04:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:27,S,98.0,00:03:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:30:27,S,99.0,00:02:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:31:27,S,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00

I would like to split it into two DataFrames:

df1:

uuid,eventTime,Op.progress,Op.progressPercentage, AnotherAttribute
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:39,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:49,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:18,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:49,P,5.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:54:27,P,5.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:07,P,6.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:27,P,6.0,01:53:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:33:46,W,40.0,01:13:00

df2:

uuid,eventTime,Op.progress,Op.progressPercentage, AnotherAttribute
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:10,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:16,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:18,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:55,P,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:15,P,1.0,01:59:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:31,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:51,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:22,P,4.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:51,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:22,S,98.0,00:04:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:27,S,98.0,00:03:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:30:27,S,99.0,00:02:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:31:27,S,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00

The split should be based on the Op.progressPercentage attribute, which can assume values from 1 to 100.

When I try to apply the solution provided at splitting a pandas Dataframe, as shown below, I do not get the expected result.

import pandas as pd

df_dataset = pd.read_csv(filepath)  # your input data saved here
wash_list = []
shifted = df_dataset['Op.progressPercentage'].shift()
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()
aa = df_dataset.groupby([df_dataset.uuid, a])
for k, gp in aa:
    wash_list.append(gp.sort_values(['uuid', 'eventTime'], ascending=[True, True]))

for wash in wash_list:
    print("")
    print(wash.to_string())
    print("")

Any help would be very much appreciated. Thank you very much in advance. Best regards, Carlo

6 Comments
  • So, all rows with increasing values are in separate groups? Commented Nov 13, 2017 at 21:03
  • Yes, we could consider this a generally valid rule. However, it could also happen that Op.progressPercentage is not always ordered (e.g. the row with 3.0 could also come after the row with 4.0). Commented Nov 13, 2017 at 21:06
  • In that case, what is the logic for a split? Commented Nov 13, 2017 at 21:07
  • In that case I should first recognize those anomalies, correct them, and then apply the general rule we discussed above. Commented Nov 13, 2017 at 21:08
  • Hmm, I really don't think that's possible. For example, when would you determine whether you've hit on an anomaly or the end of a legitimate sequence? You see? I don't think it can be done. Commented Nov 13, 2017 at 21:09

2 Answers


IIUC (without accounting for anomalies), you can use diff + cumsum to label distinct groups and then groupby on those labels:

for _, g in df.groupby(
        (~df['Op.progressPercentage'].diff().fillna(0).ge(0)).cumsum()):
    print(g, '\n')

Details

The groups are found like this:

(~df['Op.progressPercentage'].diff().fillna(0).ge(0)).cumsum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
20    1
21    1
22    1
Name: Op.progressPercentage, dtype: int64
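
If you want each group as its own DataFrame (df1 and df2 in the question), a minimal sketch that collects the groups from the same key; the variable names here are just illustrative:

key = (~df['Op.progressPercentage'].diff().fillna(0).ge(0)).cumsum()
pieces = [g for _, g in df.groupby(key)]  # one DataFrame per run of non-decreasing values
df1, df2 = pieces                         # the sample data yields exactly two groups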

1 Comment

It is working without accounting for anomalies. Thank you.
  • Use np.diff to calculate the difference of each value relative to the previous one.
  • d < 0 shows where the value drops.
  • np.flatnonzero finds the locations of nonzero values; in our case, the True values.
  • Since np.diff shaves one element off the source array, I add 1 to get the positions correct.
  • np.split separates df into pieces at every position where there was a negative diff.
  • I used some fanciness to print it out.

import numpy as np

d = np.diff(df['Op.progressPercentage'].values)
results = np.split(df, np.flatnonzero(d < 0) + 1)

print(*results, sep='\n' * 2)

                                   uuid            eventTime Op.progress  Op.progressPercentage  AnotherAttribute
0  C0972765-8436-0000-0000-000000000000  2017-08-19T12:52:39           P                    3.0          01:57:00
1  C0972765-8436-0000-0000-000000000000  2017-08-19T12:52:49           P                    3.0          01:56:00
2  C0972765-8436-0000-0000-000000000000  2017-08-19T12:53:18           P                    4.0          01:55:00
3  C0972765-8436-0000-0000-000000000000  2017-08-19T12:53:49           P                    5.0          01:55:00
4  C0972765-8436-0000-0000-000000000000  2017-08-19T12:54:27           P                    5.0          01:54:00
5  C0972765-8436-0000-0000-000000000000  2017-08-19T12:55:07           P                    6.0          01:54:00
6  C0972765-8436-0000-0000-000000000000  2017-08-19T12:55:27           P                    6.0          01:53:00
7  C0972765-8436-0000-0000-000000000000  2017-08-19T13:33:46           W                   40.0          01:13:00

                                    uuid            eventTime Op.progress  Op.progressPercentage  AnotherAttribute
8   C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:10           N                    1.0          02:00:00
9   C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:16           N                    1.0          02:00:00
10  C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:18           N                    1.0          02:00:00
11  C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:55           P                    1.0          02:00:00
12  C0972765-8436-0000-0000-000000000000  2017-08-19T13:41:15           P                    1.0          01:59:00
13  C0972765-8436-0000-0000-000000000000  2017-08-19T13:41:31           P                    3.0          01:57:00
14  C0972765-8436-0000-0000-000000000000  2017-08-19T13:41:51           P                    3.0          01:56:00
15  C0972765-8436-0000-0000-000000000000  2017-08-19T13:42:22           P                    4.0          01:56:00
16  C0972765-8436-0000-0000-000000000000  2017-08-19T13:42:51           P                    4.0          01:55:00
17  C0972765-8436-0000-0000-000000000000  2017-08-19T15:29:22           S                   98.0          00:04:00
18  C0972765-8436-0000-0000-000000000000  2017-08-19T15:29:27           S                   98.0          00:03:00
19  C0972765-8436-0000-0000-000000000000  2017-08-19T15:30:27           S                   99.0          00:02:00
20  C0972765-8436-0000-0000-000000000000  2017-08-19T15:31:27           S                  100.0          00:01:00
21  C0972765-8436-0000-0000-000000000000  2017-08-19T15:33:01           F                  100.0          00:01:00
22  C0972765-8436-0000-0000-000000000000  2017-08-19T15:33:01           F                  100.0          00:01:00
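
Since np.split returns a list of DataFrame pieces, each split can then be accessed by position; a small illustrative sketch (the names df1 and df2 are just examples):

df1 = results[0]  # rows before Op.progressPercentage drops
df2 = results[1]  # rows from the reset onward
for i, piece in enumerate(results):
    print(f'split {i}: {len(piece)} rows')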

8 Comments

Many thanks. How can I access the splits one by one? I would like to check this solution too.
So this way also does not take into account the anomalies that we were discussing above.
Please, could you explain what the line np.split(df, np.flatnonzero(d < 0) + 1) computes?
What anomalies? Are they explained in the body of your question?
Thank you. The anomaly is the following: it could also happen that Op.progressPercentage is not always ordered (e.g. the row with 3.0 could also come after the row with 4.0).
