
I would like to filter and split my original dataframe into a number of dataframes, using the condition that progressPercentage runs from 1.0 up to 100, as in the following example:

Input:

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,23.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,24.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0

Outputs:

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,2.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,3.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,4.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:01:00,100.0

I have been using .shift() and groupby as in the following:

 a = dfb['Operation.progressPercentage'].shift().eq(100)
 grouping = dfb.groupby([dfb.wm_id,a])

but it did not produce the expected results. How should I change the code to get this done?

Many Thanks in advance. Best Regards, Carlo

3 Answers


If the Finish value is sometimes missing and you need to rely only on the progressPercentage column, use:

shifted = df['progressPercentage'].shift()
# compare with the next value so that consecutive 100s (e.g. rows 15 and 16)
# start only one new group, not two
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()

aa = df.groupby([df.id_B,a])

for k, gp in aa: 
    print('key=' + str(k))
    print(gp)
    print('A NEW ONE...')

key=('id1', 0)
  id_B                 ts_B  course  weight   Phase remainingTime  \
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00   
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00   
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00   
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00   
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00   
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00   
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00   

   progressPercentage  
0                23.0  
1                24.0  
2                24.0  
3                24.0  
4                99.0  
5               100.0  
6               100.0  
A NEW ONE...
key=('id1', 1)
   id_B                 ts_B  course  weight   Phase remainingTime  \
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00   
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00   
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00   
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00   
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00   
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00   
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00   
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00   
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00   
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage  
7                  1.0  
8                  2.0  
9                  2.0  
10                 3.0  
11                 4.0  
12                 5.0  
13                98.0  
14                99.0  
15               100.0  
16               100.0  
A NEW ONE...
key=('id2', 2)

...
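The mask logic above can be condensed into a self-contained toy example (the column names follow the question; the data here is made up for illustration):

```python
import pandas as pd

# Toy data: two runs for id1, one for id2.  A run ends at 100, and the
# final 100 may be repeated on consecutive rows (the "Finish" row).
df = pd.DataFrame({
    'id_B': ['id1'] * 5 + ['id2'] * 3,
    'progressPercentage': [23.0, 100.0, 100.0, 1.0, 100.0, 1.0, 2.0, 100.0],
})

shifted = df['progressPercentage'].shift()
# Start a new group only after the *last* of consecutive 100s
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()

groups = [gp for _, gp in df.groupby([df.id_B, a])]
# groups[0]: rows 0-2 (23, 100, 100); groups[1]: rows 3-4; groups[2]: rows 5-7
```

Each element of `groups` is one complete run, and the duplicated 100-row stays with the run it belongs to.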

5 Comments

It is working. Many thanks, Jezrael. Actually my solution is very close to yours, based on the previous feedback you gave. Thanks.
Super, glad to help.
This is much neater. Plus one, sir.
Given the above output, may I ask: how can I count how many datasets I got for each id? In our example, id1 = 4 and id2 = 1.
You can use print(df.id_B.value_counts())
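Note that `value_counts()` on `id_B` counts raw rows, not runs. To count how many datasets each id produced, one option (a sketch, reusing the run labels from the answer above) is to count distinct group labels per id:

```python
import pandas as pd

# Toy data: id1 has two runs, id2 has one (no back-to-back 100s here,
# so the simple shift/eq/cumsum labelling is sufficient)
df = pd.DataFrame({
    'id_B': ['id1', 'id1', 'id1', 'id1', 'id2', 'id2'],
    'progressPercentage': [23.0, 100.0, 1.0, 100.0, 5.0, 100.0],
})

# A new run label starts after each 100
a = df['progressPercentage'].shift().eq(100).cumsum()

# Distinct run labels per id = number of datasets per id
runs_per_id = a.groupby(df['id_B']).nunique()
```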

You can split the dataframe at the rows where progressPercentage equals 100. Remove the earlier boundary index when two 100-rows are consecutive, then slice the dataframe and append each slice to a list. Hope this helps.

import numpy as np
import pandas as pd

df = pd.read_csv('input.csv', delimiter=',')  # the input csv provided
df1 = df[df["progressPercentage"] == 100]
x = (np.array(df1.index) + 1).tolist()
x.insert(0, 0)
# Remove consecutive boundaries so back-to-back 100-rows stay in one dataframe.
x = [begin for begin, end in zip(x, x[1:]) if begin != end - 1]
x.append(df.shape[0])
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]

You can print the dataframes with a for loop:

for df in frames:
    print(df)

Output of the dataframes:

  id_B                 ts_B  course  weight   Phase remainingTime  \
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00   
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00   
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00   
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00   
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00   
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00   
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00   

   progressPercentage   
0                 23.0  
1                 24.0  
2                 24.0  
3                 24.0  
4                 99.0  
5                100.0  
6                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00   
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00   
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00   
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00   
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00   
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00   
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00   
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00   
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00   
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
7                   1.0  
8                   2.0  
9                   2.0  
10                  3.0  
11                  4.0  
12                  5.0  
13                 98.0  
14                 99.0  
15                100.0  
16                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
17  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00   
18  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00   
19  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00   
20  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00   
21  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00   
22  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00   
23  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00   
24  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00   
25  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
17                  1.0  
18                  1.0  
19                  2.0  
20                  2.0  
21                  3.0  
22                 98.0  
23                 99.0  
24                100.0  
25                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
26  id1  2017-05-27 01:35:30  cotton     3.5       A      03:15:00   
27  id1  2017-05-27 01:37:30  cotton     3.5       B      03:13:00   
28  id1  2017-05-27 01:38:00  cotton     3.5       B      03:13:00   
29  id1  2017-05-27 01:38:30  cotton     3.5       C      03:13:00   
30  id1  2017-05-27 01:39:00  cotton     3.5       C      00:02:00   
31  id1  2017-05-27 01:39:30  cotton     3.5       C      00:01:00   
32  id1  2017-05-27 01:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
26                 23.0  
27                 24.0  
28                 24.0  
29                 24.0  
30                 99.0  
31                100.0  
32                100.0  
   id_B                 ts_B  course  weight Phase remainingTime  \
33  id1  2017-05-27 02:35:30  cotton     3.5     A      01:15:00   
34  id1  2017-05-27 02:36:00  cotton     3.5     A      01:14:00   
35  id1  2017-05-27 02:36:30  cotton     3.5     A      01:13:00   
36  id1  2017-05-27 02:37:00  cotton     3.5     B      01:12:00   
37  id1  2017-05-27 02:37:30  cotton     3.5     B      01:11:00   
38  id1  2017-05-27 02:38:00  cotton     3.5     B      01:10:00   
39  id1  2017-05-27 02:38:30  cotton     3.5     C      01:09:00   
40  id1  2017-05-27 02:39:00  cotton     3.5     C      00:08:00   
41  id1  2017-05-27 02:39:00  cotton     3.5     C      00:08:00   

    progressPercentage   
33                  1.0  
34                  2.0  
35                  2.0  
36                  3.0  
37                  4.0  
38                  5.0  
39                 98.0  
40                 99.0  
41                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
42  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00   
43  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00   
44  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00   
45  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00   
46  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00   
47  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00   
48  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00   
49  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00   
50  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
42                  1.0  
43                  1.0  
44                  2.0  
45                  2.0  
46                  3.0  
47                 98.0  
48                 99.0  
49                100.0  
50                100.0  
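The slicing approach above can also be reduced to a self-contained sketch on toy data (same made-up frame as earlier, without the CSV dependency):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id_B': ['id1'] * 5 + ['id2'] * 3,
    'progressPercentage': [23.0, 100.0, 100.0, 1.0, 100.0, 1.0, 2.0, 100.0],
})

# Indices just past each 100-row are candidate split points
df1 = df[df['progressPercentage'] == 100]
x = (np.array(df1.index) + 1).tolist()
x.insert(0, 0)
# Consecutive boundaries collapse to one, so repeated 100s stay together
x = [begin for begin, end in zip(x, x[1:]) if begin != end - 1]
x.append(df.shape[0])
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]
```

This yields the same three slices as the shift/cumsum approach: rows 0-2, rows 3-4, and rows 5-7.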

6 Comments

Thank you Bharath. I am going to check your solution against the one I designed: a = dfb['Operation.progressPercentage'].shift().eq(100).cumsum(); df_output = dfb.groupby([dfb.wm_id, a])
However, the word Finish is not always present.
I changed your solution to df1 = dfb[dfb['Operation.progressPercentage'] == 100]
Just finished running it, and it did not work. What do you mean by using '40:00' as the end of the dataframe? As you can see, I need to separate one experiment from the other, and an experiment finishes when the percentage is 100%.
Well, it is not possible, because it can be different in each experiment.

The best way I found is the following:

    a = dfb['progressPercentage'].shift().eq(100).cumsum()
    df_output = dfb.groupby([dfb.id_B, a])

    for k, gp in df_output:
        print('key=' + str(k))
        print(gp.sort_values(['eventTime', 'wm_id'], ascending=[1, 0]).to_string())
        print('A NEW ONE...')
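A runnable version of this variant on toy data (names illustrative). One caveat: when two 100-rows are consecutive, the plain `.shift().eq(100)` starts a new group at the second 100 as well, so the duplicate Finish row lands in a one-row group of its own; the `diff(-1)` mask in the first answer avoids this:

```python
import pandas as pd

# Toy data with a duplicated 100 ("Finish") row
dfb = pd.DataFrame({
    'id_B': ['id1'] * 5,
    'progressPercentage': [1.0, 100.0, 100.0, 2.0, 100.0],
})

a = dfb['progressPercentage'].shift().eq(100).cumsum()
df_output = dfb.groupby([dfb.id_B, a])

keys = [k for k, _ in df_output]
sizes = [len(gp) for _, gp in df_output]
# sizes == [2, 1, 2]: the second 100-row ends up alone in its own group
```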

3 Comments

'eventTime' = ts_B and 'wm_id' = id_B
What's aa in that for loop?
aa is the df_output.
