
I would like to filter and split my original dataframe into a number of dataframes, using the condition that progressPercentage runs from 1.0 up to 100, as in the following example:

Input:

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,23.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,24.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0

Outputs:

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,2.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,3.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,4.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:01:00,100.0

I have been using .shift() and groupby as in the following:

 a = dfb['Operation.progressPercentage'].shift().eq(100)
 grouping = dfb.groupby([dfb.wm_id,a])

but it did not produce the expected results. How should I change the code to get this done?

Many Thanks in advance. Best Regards, Carlo

3 Answers


If the Finish value is sometimes missing and you need to rely only on the progressPercentage column, use:

shifted = df['progressPercentage'].shift()
# compare with the next value so that consecutive 100s (e.g. rows 15 and 16)
# start only one new group, not two
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()

aa = df.groupby([df.id_B,a])

for k, gp in aa: 
    print('key=' + str(k))
    print(gp)
    print('A NEW ONE...')

key=('id1', 0)
  id_B                 ts_B  course  weight   Phase remainingTime  \
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00   
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00   
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00   
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00   
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00   
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00   
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00   

   progressPercentage  
0                23.0  
1                24.0  
2                24.0  
3                24.0  
4                99.0  
5               100.0  
6               100.0  
A NEW ONE...
key=('id1', 1)
   id_B                 ts_B  course  weight   Phase remainingTime  \
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00   
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00   
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00   
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00   
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00   
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00   
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00   
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00   
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00   
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage  
7                  1.0  
8                  2.0  
9                  2.0  
10                 3.0  
11                 4.0  
12                 5.0  
13                98.0  
14                99.0  
15               100.0  
16               100.0  
A NEW ONE...
key=('id2', 2)

...
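The mask logic above can be condensed into a self-contained toy example (the column names follow the question; the data here is made up for illustration):

```python
import pandas as pd

# Toy data: two runs for id1, one for id2.  A run ends at 100, and the
# final 100 may be repeated on consecutive rows (the "Finish" row).
df = pd.DataFrame({
    'id_B': ['id1'] * 5 + ['id2'] * 3,
    'progressPercentage': [23.0, 100.0, 100.0, 1.0, 100.0, 1.0, 2.0, 100.0],
})

shifted = df['progressPercentage'].shift()
# Start a new group only after the *last* of consecutive 100s
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()

groups = [gp for _, gp in df.groupby([df.id_B, a])]
# groups[0]: rows 0-2 (23, 100, 100); groups[1]: rows 3-4; groups[2]: rows 5-7
```

Each element of `groups` is one complete run, and the duplicated 100-row stays with the run it belongs to.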

5 Comments

It is working. Many thanks, Jezrael. Actually my solution is very close to yours, based on the previous feedback you gave. Thanks.
Super, glad to help.
This is much neater. Plus one, sir.
Given the above output, may I ask: how can I count how many datasets I got for each id? In our example, id1 = 4 and id2 = 1.
You can use print(df.id_B.value_counts())
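Note that `value_counts()` on `id_B` counts raw rows, not runs. To count how many datasets each id produced, one option (a sketch, reusing the run labels from the answer above) is to count distinct group labels per id:

```python
import pandas as pd

# Toy data: id1 has two runs, id2 has one (no back-to-back 100s here,
# so the simple shift/eq/cumsum labelling is sufficient)
df = pd.DataFrame({
    'id_B': ['id1', 'id1', 'id1', 'id1', 'id2', 'id2'],
    'progressPercentage': [23.0, 100.0, 1.0, 100.0, 5.0, 100.0],
})

# A new run label starts after each 100
a = df['progressPercentage'].shift().eq(100).cumsum()

# Distinct run labels per id = number of datasets per id
runs_per_id = a.groupby(df['id_B']).nunique()
```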

You can split the dataframe at the rows where progressPercentage equals 100. Remove the earlier boundary index when two 100-rows are consecutive, then slice the dataframe and append each slice to a list. Hope this helps.

import numpy as np
import pandas as pd

df = pd.read_csv('input.csv', delimiter=',')  # the input csv provided
df1 = df[df["progressPercentage"] == 100]
x = (np.array(df1.index) + 1).tolist()
x.insert(0, 0)
# Remove consecutive boundaries so back-to-back 100-rows stay in one dataframe.
x = [begin for begin, end in zip(x, x[1:]) if begin != end - 1]
x.append(df.shape[0])
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]

You can print the dataframes with a for loop:

for df in frames:
    print(df)

Output of the dataframes:

  id_B                 ts_B  course  weight   Phase remainingTime  \
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00   
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00   
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00   
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00   
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00   
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00   
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00   

   progressPercentage   
0                 23.0  
1                 24.0  
2                 24.0  
3                 24.0  
4                 99.0  
5                100.0  
6                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00   
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00   
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00   
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00   
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00   
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00   
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00   
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00   
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00   
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
7                   1.0  
8                   2.0  
9                   2.0  
10                  3.0  
11                  4.0  
12                  5.0  
13                 98.0  
14                 99.0  
15                100.0  
16                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
17  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00   
18  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00   
19  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00   
20  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00   
21  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00   
22  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00   
23  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00   
24  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00   
25  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
17                  1.0  
18                  1.0  
19                  2.0  
20                  2.0  
21                  3.0  
22                 98.0  
23                 99.0  
24                100.0  
25                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
26  id1  2017-05-27 01:35:30  cotton     3.5       A      03:15:00   
27  id1  2017-05-27 01:37:30  cotton     3.5       B      03:13:00   
28  id1  2017-05-27 01:38:00  cotton     3.5       B      03:13:00   
29  id1  2017-05-27 01:38:30  cotton     3.5       C      03:13:00   
30  id1  2017-05-27 01:39:00  cotton     3.5       C      00:02:00   
31  id1  2017-05-27 01:39:30  cotton     3.5       C      00:01:00   
32  id1  2017-05-27 01:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
26                 23.0  
27                 24.0  
28                 24.0  
29                 24.0  
30                 99.0  
31                100.0  
32                100.0  
   id_B                 ts_B  course  weight Phase remainingTime  \
33  id1  2017-05-27 02:35:30  cotton     3.5     A      01:15:00   
34  id1  2017-05-27 02:36:00  cotton     3.5     A      01:14:00   
35  id1  2017-05-27 02:36:30  cotton     3.5     A      01:13:00   
36  id1  2017-05-27 02:37:00  cotton     3.5     B      01:12:00   
37  id1  2017-05-27 02:37:30  cotton     3.5     B      01:11:00   
38  id1  2017-05-27 02:38:00  cotton     3.5     B      01:10:00   
39  id1  2017-05-27 02:38:30  cotton     3.5     C      01:09:00   
40  id1  2017-05-27 02:39:00  cotton     3.5     C      00:08:00   
41  id1  2017-05-27 02:39:00  cotton     3.5     C      00:08:00   

    progressPercentage   
33                  1.0  
34                  2.0  
35                  2.0  
36                  3.0  
37                  4.0  
38                  5.0  
39                 98.0  
40                 99.0  
41                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
42  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00   
43  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00   
44  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00   
45  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00   
46  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00   
47  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00   
48  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00   
49  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00   
50  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
42                  1.0  
43                  1.0  
44                  2.0  
45                  2.0  
46                  3.0  
47                 98.0  
48                 99.0  
49                100.0  
50                100.0  
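The slicing approach above can also be reduced to a self-contained sketch on toy data (same made-up frame as earlier, without the CSV dependency):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id_B': ['id1'] * 5 + ['id2'] * 3,
    'progressPercentage': [23.0, 100.0, 100.0, 1.0, 100.0, 1.0, 2.0, 100.0],
})

# Indices just past each 100-row are candidate split points
df1 = df[df['progressPercentage'] == 100]
x = (np.array(df1.index) + 1).tolist()
x.insert(0, 0)
# Consecutive boundaries collapse to one, so repeated 100s stay together
x = [begin for begin, end in zip(x, x[1:]) if begin != end - 1]
x.append(df.shape[0])
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]
```

This yields the same three slices as the shift/cumsum approach: rows 0-2, rows 3-4, and rows 5-7.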

6 Comments

Thank you Bharath. I am going to check your solution against the one I designed: a = dfb['Operation.progressPercentage'].shift().eq(100).cumsum(); df_output = dfb.groupby([dfb.wm_id, a])
However, the word Finish is not always present.
I changed your solution to df1 = dfb[dfb['Operation.progressPercentage'] == 100]
Just finished running it, and it did not work. What do you mean by using '40:00' as the end of the dataframe? As you can see, I need to separate one experiment from the other, and an experiment finishes when the percentage is 100%.
Well, it is not possible, because it can be different in each experiment.

The best way I found is the following:

    a = dfb['progressPercentage'].shift().eq(100).cumsum()
    df_output = dfb.groupby([dfb.id_B, a])

    for k, gp in df_output:
        print('key=' + str(k))
        print(gp.sort_values(['eventTime', 'wm_id'], ascending=[1, 0]).to_string())
        print('A NEW ONE...')
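A runnable version of this variant on toy data (names illustrative). One caveat: when two 100-rows are consecutive, the plain `.shift().eq(100)` starts a new group at the second 100 as well, so the duplicate Finish row lands in a one-row group of its own; the `diff(-1)` mask in the first answer avoids this:

```python
import pandas as pd

# Toy data with a duplicated 100 ("Finish") row
dfb = pd.DataFrame({
    'id_B': ['id1'] * 5,
    'progressPercentage': [1.0, 100.0, 100.0, 2.0, 100.0],
})

a = dfb['progressPercentage'].shift().eq(100).cumsum()
df_output = dfb.groupby([dfb.id_B, a])

keys = [k for k, _ in df_output]
sizes = [len(gp) for _, gp in df_output]
# sizes == [2, 1, 2]: the second 100-row ends up alone in its own group
```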

3 Comments

'eventTime' = ts_B and 'wm_id' = id_B
What's aa in that for loop?
aa is the df_output.
