
I have a DataFrame like this:

In[2]: import pandas as pd
  ...: flow = {
  ...:     'Date':['09/19','09/19','09/19','09/19','09/19','09/19','10/19','10/19','10/19','10/19','10/19','10/19','10/19'],
  ...:     'Time':['23:00','23:10','23:20','23:30','23:40','23:50','00:00','00:10','00:20','00:30','00:40','00:50','01:00'],
  ...:     'Name':['P10  ','P10  ','P10  ','P10  ','P5   ','P5   ','P5   ','P10  ','P10  ','P10  ','P6   ','P6   ','P6   '],
  ...:     'Data':['10000','10002','10004','10005','10007','10008','10010','10012','10013','10014','10020','10022','10023']
  ...: }
  ...: flowdata = pd.DataFrame(flow)
  ...: flowdata = flowdata[['Date', 'Time', 'Name', 'Data']]  # To preserve the columns order
  ...: 

In[3]: flowdata
Out[3]:   
     Date   Time   Name   Data
0   09/19  23:00  P10    10000
1   09/19  23:10  P10    10002
2   09/19  23:20  P10    10004
3   09/19  23:30  P10    10005
4   09/19  23:40  P5     10007
5   09/19  23:50  P5     10008
6   10/19  00:00  P5     10010
7   10/19  00:10  P10    10012
8   10/19  00:20  P10    10013
9   10/19  00:30  P10    10014
10  10/19  00:40  P6     10020
11  10/19  00:50  P6     10022
12  10/19  01:00  P6     10023

I want to slice it into other DataFrames based on "continuous" runs of values in the 'Name' column. I tried the following code and got this:

In[3]: flowdata[flowdata['Name'] == 'P5   ']
Out[3]: 
    Date   Time   Name   Data
4  09/19  23:40  P5     10007
5  09/19  23:50  P5     10008
6  10/19  00:00  P5     10010

THE PROBLEM comes when I try to slice with the Name 'P10  ' (in this case): I get a jump in the Date and Time (from index 3 to 7).

In[4]: flowdata[flowdata['Name'] == 'P10  ']
Out[4]: 
    Date   Time   Name   Data
0  09/19  23:00  P10    10000
1  09/19  23:10  P10    10002
2  09/19  23:20  P10    10004
3  09/19  23:30  P10    10005
7  10/19  00:10  P10    10012
8  10/19  00:20  P10    10013
9  10/19  00:30  P10    10014

I want to get two DataFrames based on "continuous" runs of values in the 'Name' column, something like this:

DataFrame 1 for First Name "P10":
        Date   Time   Name   Data
    0  09/19  23:00  P10    10000
    1  09/19  23:10  P10    10002
    2  09/19  23:20  P10    10004
    3  09/19  23:30  P10    10005

DataFrame 2 for Second Name "P10":
        Date   Time   Name   Data
    7  10/19  00:10  P10    10012
    8  10/19  00:20  P10    10013
    9  10/19  00:30  P10    10014

I looked for a built-in function or method to do this and didn't find one, so I decided to iterate over the rows, check the conditions and build a list of indexes to slice the main DataFrame with. I ended up with this code:

In[6]: name_list_with_start_end_indexes = []
  ...: current_name = flowdata.iloc[0]['Name']
  ...: current_start_index = flowdata.index[0]
  ...: for i in flowdata.index:
  ...:     next_name = flowdata.loc[i, 'Name']
  ...:     if current_name != next_name:
  ...:         # The previous run ended on the row above; record [name, start, end]
  ...:         current_end_index = i - 1
  ...:         name_list_with_start_end_indexes.append([current_name, current_start_index, current_end_index])
  ...:         current_start_index = i
  ...:         current_name = next_name
  ...: # Append the last run, which ends at the final index
  ...: name_list_with_start_end_indexes.append([current_name, current_start_index, i])
  ...: 
In[7]: name_list_with_start_end_indexes
Out[7]: 
    [['P10  ', 0, 3], 
     ['P5   ', 4, 6], 
     ['P10  ', 7, 9], 
     ['P6   ', 10, 12]]

In[8]: name_A = name_list_with_start_end_indexes[2]
In[9]: name_A
Out[9]: 
['P10  ', 7, 9]
In[10]: flowdata[name_A[1]:name_A[2]+1]
Out[10]: 

    Date   Time   Name   Data
7  10/19  00:10  P10    10012
8  10/19  00:20  P10    10013
9  10/19  00:30  P10    10014

THE PROBLEM is that this code runs slowly with 13,000 rows (the file with this data normally has about that many rows and 11 columns).

Does anyone know a better way to get the same results, but faster?

Thanks in advance.

1 Answer


What about labelling the groups?

If that's ok for you, you can do:

In [20]: flowdata['group'] = (flowdata['Name'] != flowdata['Name'].shift()).astype(int).cumsum()

In [21]: flowdata
Out[21]:
     Date   Time   Name   Data  group
0   09/19  23:00  P10    10000      1
1   09/19  23:10  P10    10002      1
2   09/19  23:20  P10    10004      1
3   09/19  23:30  P10    10005      1
4   09/19  23:40  P5     10007      2
5   09/19  23:50  P5     10008      2
6   10/19  00:00  P5     10010      2
7   10/19  00:10  P10    10012      3
8   10/19  00:20  P10    10013      3
9   10/19  00:30  P10    10014      3
10  10/19  00:40  P6     10020      4
11  10/19  00:50  P6     10022      4
12  10/19  01:00  P6     10023      4

You can then access the groups by doing:

In [24]: flowdata[flowdata['group'] == 1]
Out[24]:
    Date   Time   Name   Data  group
0  09/19  23:00  P10    10000      1
1  09/19  23:10  P10    10002      1
2  09/19  23:20  P10    10004      1
3  09/19  23:30  P10    10005      1
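
If you want each run as its own DataFrame (like the two "P10" frames shown in the question), one option, sketched here on top of the group column built above, is to group by that label:

# A minimal sketch (assuming the flowdata/group setup above): one DataFrame per run
runs = [sub_df for _, sub_df in flowdata.groupby('group')]

# runs[0] is the first P10 block (indexes 0-3), runs[2] the second one (indexes 7-9).
# Keying by group number also works, and the helper column can be dropped if unwanted:
runs_by_group = {g: sub_df.drop(columns='group') for g, sub_df in flowdata.groupby('group')}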

The idea here is to compare each row with the previous one, thanks to shift: if the Name of the row is not the same as the one above, the comparison will be True, which then translates to a 1, thanks to .astype(int). We then use cumsum to incrementally count the number of 1s (that is, True values, as explained above).

To put it another way, we are counting the number of Name changes, incrementing every time we switch from one group to another.
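
For reference, here is the same expression broken into its intermediate steps (a sketch, assuming the flowdata from the question):

# True on the first row of every run (the very first row compares against NaN, so it is True too)
changed = flowdata['Name'] != flowdata['Name'].shift()
# 1/0 instead of True/False, then a running count of run starts
flowdata['group'] = changed.astype(int).cumsum()
# For the sample data, changed is True at indexes 0, 4, 7 and 10,
# so the runs get labelled 1, 2, 3 and 4.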


2 Comments

Thanks @3kt. This is an excellent piece of code. It's the first time I've seen these functions [shift() and cumsum()]; they're definitely the ones needed for this, fast and without iterating over the entire DataFrame. I already tested it, added it to the main code, and it works excellently. pandas is definitely an excellent tool for data analysis.
@jmejias No problem, feel free to "accept the answer" if it solves your problem
