
I have a DataFrame like this:

In[2]: import pandas as pd
  ...: flow = {
  ...:     'Date':['09/19','09/19','09/19','09/19','09/19','09/19','10/19','10/19','10/19','10/19','10/19','10/19','10/19'],
  ...:     'Time':['23:00','23:10','23:20','23:30','23:40','23:50','00:00','00:10','00:20','00:30','00:40','00:50','01:00'],
  ...:     'Name':['P10  ','P10  ','P10  ','P10  ','P5   ','P5   ','P5   ','P10  ','P10  ','P10  ','P6   ','P6   ','P6   '],
  ...:     'Data':['10000','10002','10004','10005','10007','10008','10010','10012','10013','10014','10020','10022','10023']
  ...: }
  ...: flowdata = pd.DataFrame(flow)
  ...: flowdata = flowdata[['Date', 'Time', 'Name', 'Data']]  # To preserve the columns order
  ...: 

In[3]: flowdata
Out[3]:   
     Date   Time   Name   Data
0   09/19  23:00  P10    10000
1   09/19  23:10  P10    10002
2   09/19  23:20  P10    10004
3   09/19  23:30  P10    10005
4   09/19  23:40  P5     10007
5   09/19  23:50  P5     10008
6   10/19  00:00  P5     10010
7   10/19  00:10  P10    10012
8   10/19  00:20  P10    10013
9   10/19  00:30  P10    10014
10  10/19  00:40  P6     10020
11  10/19  00:50  P6     10022
12  10/19  01:00  P6     10023

I want to slice it into other DataFrames based on "continuous" runs of values in the 'Name' column. I tried the following code and got this:

In[3]: flowdata[flowdata['Name'] == 'P5   ']
Out[3]: 
    Date   Time   Name   Data
4  09/19  23:40  P5     10007
5  09/19  23:50  P5     10008
6  10/19  00:00  P5     10010

THE PROBLEM comes when I try to slice with the Name 'P10  ' (in this case): I get a jump in the Date and Time (from index 3 to 7).

In[4]: flowdata[flowdata['Name'] == 'P10  ']
Out[4]: 
    Date   Time   Name   Data
0  09/19  23:00  P10    10000
1  09/19  23:10  P10    10002
2  09/19  23:20  P10    10004
3  09/19  23:30  P10    10005
7  10/19  00:10  P10    10012
8  10/19  00:20  P10    10013
9  10/19  00:30  P10    10014

I want to get two DataFrames based on "continuous" runs of values in the 'Name' column, something like this:

DataFrame 1 for First Name "P10":
        Date   Time   Name   Data
    0  09/19  23:00  P10    10000
    1  09/19  23:10  P10    10002
    2  09/19  23:20  P10    10004
    3  09/19  23:30  P10    10005

DataFrame 2 for Second Name "P10":
        Date   Time   Name   Data
    7  10/19  00:10  P10    10012
    8  10/19  00:20  P10    10013
    9  10/19  00:30  P10    10014

I looked for a built-in function or method to do this and didn't find one, so I decided to iterate over the rows, check the conditions and build a list of indexes to slice the main DataFrame with. I ended up with this code:

In[6]: name_list_with_start_end_indexes = []
  ...: current_name = flowdata.iloc[0]['Name']
  ...: current_start_index = flowdata.index[0]
  ...: for i in flowdata.index:
  ...:     next_name = flowdata.loc[i, 'Name']
  ...:     if current_name != next_name:
  ...:         # The previous run ended on the row above; record [name, start, end]
  ...:         current_end_index = i - 1
  ...:         name_list_with_start_end_indexes.append([current_name, current_start_index, current_end_index])
  ...:         current_start_index = i
  ...:         current_name = next_name
  ...: # Append the last run, which ends at the final index
  ...: name_list_with_start_end_indexes.append([current_name, current_start_index, i])
  ...: 
In[7]: name_list_with_start_end_indexes
Out[7]: 
    [['P10  ', 0, 3], 
     ['P5   ', 4, 6], 
     ['P10  ', 7, 9], 
     ['P6   ', 10, 12]]

In[8]: name_A = name_list_with_start_end_indexes[2]
In[9]: name_A
Out[9]: 
['P10  ', 7, 9]
In[10]: flowdata[name_A[1]:name_A[2]+1]
Out[10]: 

    Date   Time   Name   Data
7  10/19  00:10  P10    10012
8  10/19  00:20  P10    10013
9  10/19  00:30  P10    10014

THE PROBLEM is that this code runs slowly with 13,000 rows (the file with this data normally has about that many rows and 11 columns).

Does anyone know a better way to get the same results, but faster?

Thanks in advance.

1 Answer


What about labelling the groups?

If that's ok for you, you can do:

In [20]: flowdata['group'] = (flowdata['Name'] != flowdata['Name'].shift()).astype(int).cumsum()

In [21]: flowdata
Out[21]:
     Date   Time   Name   Data  group
0   09/19  23:00  P10    10000      1
1   09/19  23:10  P10    10002      1
2   09/19  23:20  P10    10004      1
3   09/19  23:30  P10    10005      1
4   09/19  23:40  P5     10007      2
5   09/19  23:50  P5     10008      2
6   10/19  00:00  P5     10010      2
7   10/19  00:10  P10    10012      3
8   10/19  00:20  P10    10013      3
9   10/19  00:30  P10    10014      3
10  10/19  00:40  P6     10020      4
11  10/19  00:50  P6     10022      4
12  10/19  01:00  P6     10023      4

You can then access the groups by doing:

In [24]: flowdata[flowdata['group'] == 1]
Out[24]:
    Date   Time   Name   Data  group
0  09/19  23:00  P10    10000      1
1  09/19  23:10  P10    10002      1
2  09/19  23:20  P10    10004      1
3  09/19  23:30  P10    10005      1
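
If you want each run as its own DataFrame (like the two "P10" frames shown in the question), one option, sketched here on top of the group column built above, is to group by that label:

# A minimal sketch (assuming the flowdata/group setup above): one DataFrame per run
runs = [sub_df for _, sub_df in flowdata.groupby('group')]

# runs[0] is the first P10 block (indexes 0-3), runs[2] the second one (indexes 7-9).
# Keying by group number also works, and the helper column can be dropped if unwanted:
runs_by_group = {g: sub_df.drop(columns='group') for g, sub_df in flowdata.groupby('group')}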

The idea here is to compare each row with the previous one, thanks to shift: if the Name of the row is not the same as the one above, the comparison will be True, which then translates to a 1, thanks to .astype(int). We then use cumsum to incrementally count the number of 1s (that is, True values, as explained above).

To put it another way, we are counting the number of Name changes, incrementing every time we switch from one group to another.
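
For reference, here is the same expression broken into its intermediate steps (a sketch, assuming the flowdata from the question):

# True on the first row of every run (the very first row compares against NaN, so it is True too)
changed = flowdata['Name'] != flowdata['Name'].shift()
# 1/0 instead of True/False, then a running count of run starts
flowdata['group'] = changed.astype(int).cumsum()
# For the sample data, changed is True at indexes 0, 4, 7 and 10,
# so the runs get labelled 1, 2, 3 and 4.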


2 Comments

Thanks @3kt. This is an excellent piece of code. It's the first time I've seen these functions [shift() and cumsum()]; they're definitely the ones needed for this, fast and without iterating over the entire DataFrame. I already tested it, added it to the main code, and it works excellently. pandas is definitely an excellent tool for data analysis.
@jmejias No problem, feel free to "accept the answer" if it solves your problem
