
Given the following dataset DF:

uuid,eventTime,Op.progress,Op.progressPercentage, AnotherAttribute
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:39,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:49,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:18,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:49,P,5.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:54:27,P,5.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:07,P,6.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:27,P,6.0,01:53:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:33:46,W,40.0,01:13:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:10,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:16,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:18,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:55,P,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:15,P,1.0,01:59:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:31,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:51,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:22,P,4.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:51,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:22,S,98.0,00:04:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:27,S,98.0,00:03:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:30:27,S,99.0,00:02:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:31:27,S,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00

I would like to split it into two DataFrames:

df1:

uuid,eventTime,Op.progress,Op.progressPercentage, AnotherAttribute
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:39,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:49,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:18,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:49,P,5.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:54:27,P,5.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:07,P,6.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:27,P,6.0,01:53:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:33:46,W,40.0,01:13:00

df2:

uuid,eventTime,Op.progress,Op.progressPercentage, AnotherAttribute
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:10,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:16,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:18,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:55,P,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:15,P,1.0,01:59:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:31,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:51,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:22,P,4.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:51,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:22,S,98.0,00:04:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:27,S,98.0,00:03:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:30:27,S,99.0,00:02:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:31:27,S,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00

The split should be based on the Op.progressPercentage attribute, which can assume values from 1 to 100.

When I try to apply the solution provided at splitting a pandas Dataframe, as shown below, I do not get the expected result.

import pandas as pd

df_dataset = pd.read_csv(filepath)  # your input data saved here
wash_list = []
shifted = df_dataset['Op.progressPercentage'].shift()
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()
aa = df_dataset.groupby([df_dataset.uuid, a])
for k, gp in aa:
    wash_list.append(gp.sort_values(['uuid', 'eventTime'], ascending=[True, True]))

for wash in wash_list:
    print("")
    print(wash.to_string())
    print("")

Any help would be very much appreciated. Thank you very much in advance. Best regards, Carlo

6 Comments
  • So, all rows with increasing values are in separate groups? Commented Nov 13, 2017 at 21:03
  • Yes, we could consider this a generally valid rule. However, it could also happen that Op.progressPercentage is not always ordered (e.g. the row with 3.0 could also come after the row with 4.0). Commented Nov 13, 2017 at 21:06
  • In that case, what is the logic for a split? Commented Nov 13, 2017 at 21:07
  • In that case I should first recognize those anomalies, correct them, and then apply the general rule we discussed above. Commented Nov 13, 2017 at 21:08
  • Hmm, I really don't think that's possible. For example, when would you determine whether you've hit on an anomaly or the end of a legitimate sequence? You see? I don't think it can be done. Commented Nov 13, 2017 at 21:09

2 Answers


IIUC (without accounting for anomalies), you can use diff + cumsum to label distinct groups and then groupby on those labels:

for _, g in df.groupby(
        (~df['Op.progressPercentage'].diff().fillna(0).ge(0)).cumsum()):
    print(g, '\n')

Details

The groups are found like this:

(~df['Op.progressPercentage'].diff().fillna(0).ge(0)).cumsum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
20    1
21    1
22    1
Name: Op.progressPercentage, dtype: int64
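
If you want each group as its own DataFrame (df1 and df2 in the question), a minimal sketch that collects the groups from the same key; the variable names here are just illustrative:

key = (~df['Op.progressPercentage'].diff().fillna(0).ge(0)).cumsum()
pieces = [g for _, g in df.groupby(key)]  # one DataFrame per run of non-decreasing values
df1, df2 = pieces                         # the sample data yields exactly two groups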

1 Comment

It is working without accounting for anomalies. Thank you.
  • Use np.diff to calculate the difference of each value relative to the previous one.
  • d < 0 shows where the value drops.
  • np.flatnonzero finds the locations of nonzero values; in our case, the True values.
  • Since np.diff shaves one element off the source array, I add 1 to get the positions correct.
  • np.split separates df into pieces at every position where there was a negative diff.
  • I used some fanciness to print it out.

import numpy as np

d = np.diff(df['Op.progressPercentage'].values)
results = np.split(df, np.flatnonzero(d < 0) + 1)

print(*results, sep='\n' * 2)

                                   uuid            eventTime Op.progress  Op.progressPercentage  AnotherAttribute
0  C0972765-8436-0000-0000-000000000000  2017-08-19T12:52:39           P                    3.0          01:57:00
1  C0972765-8436-0000-0000-000000000000  2017-08-19T12:52:49           P                    3.0          01:56:00
2  C0972765-8436-0000-0000-000000000000  2017-08-19T12:53:18           P                    4.0          01:55:00
3  C0972765-8436-0000-0000-000000000000  2017-08-19T12:53:49           P                    5.0          01:55:00
4  C0972765-8436-0000-0000-000000000000  2017-08-19T12:54:27           P                    5.0          01:54:00
5  C0972765-8436-0000-0000-000000000000  2017-08-19T12:55:07           P                    6.0          01:54:00
6  C0972765-8436-0000-0000-000000000000  2017-08-19T12:55:27           P                    6.0          01:53:00
7  C0972765-8436-0000-0000-000000000000  2017-08-19T13:33:46           W                   40.0          01:13:00

                                    uuid            eventTime Op.progress  Op.progressPercentage  AnotherAttribute
8   C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:10           N                    1.0          02:00:00
9   C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:16           N                    1.0          02:00:00
10  C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:18           N                    1.0          02:00:00
11  C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:55           P                    1.0          02:00:00
12  C0972765-8436-0000-0000-000000000000  2017-08-19T13:41:15           P                    1.0          01:59:00
13  C0972765-8436-0000-0000-000000000000  2017-08-19T13:41:31           P                    3.0          01:57:00
14  C0972765-8436-0000-0000-000000000000  2017-08-19T13:41:51           P                    3.0          01:56:00
15  C0972765-8436-0000-0000-000000000000  2017-08-19T13:42:22           P                    4.0          01:56:00
16  C0972765-8436-0000-0000-000000000000  2017-08-19T13:42:51           P                    4.0          01:55:00
17  C0972765-8436-0000-0000-000000000000  2017-08-19T15:29:22           S                   98.0          00:04:00
18  C0972765-8436-0000-0000-000000000000  2017-08-19T15:29:27           S                   98.0          00:03:00
19  C0972765-8436-0000-0000-000000000000  2017-08-19T15:30:27           S                   99.0          00:02:00
20  C0972765-8436-0000-0000-000000000000  2017-08-19T15:31:27           S                  100.0          00:01:00
21  C0972765-8436-0000-0000-000000000000  2017-08-19T15:33:01           F                  100.0          00:01:00
22  C0972765-8436-0000-0000-000000000000  2017-08-19T15:33:01           F                  100.0          00:01:00
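
Since np.split returns a list of DataFrame pieces, each split can then be accessed by position; a small illustrative sketch (the names df1 and df2 are just examples):

df1 = results[0]  # rows before Op.progressPercentage drops
df2 = results[1]  # rows from the reset onward
for i, piece in enumerate(results):
    print(f'split {i}: {len(piece)} rows')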

8 Comments

Many thanks. How can I access the splits one by one? I would like to check this solution too.
So this way also does not take into account the anomalies that we were discussing above.
Please, could you explain what the line np.split(df, np.flatnonzero(d < 0) + 1) computes?
What anomalies? Are they explained in the body of your question?
Thank you. The anomaly is the following: it could also happen that Op.progressPercentage is not always ordered (e.g. the row with 3.0 could also come after the row with 4.0).
