How to select rows in a DataFrame based on every transition for particular values in a particular column?

Question

I have a DataFrame with an ID column and a Value column that contains only the values 0, 1 and 2. I want to keep only the rows that take part in a transition from 0 to 1 or from 1 to 2 in the Value column. This has to be done for each ID separately.

I tried grouping by ID and applying a difference function so that I could keep the rows where the difference between consecutive values is 1, but it fails under certain conditions.

df = df.loc[df['Value'].isin([0, 1, 2])]
df = df.sort_values(by=['UniqID'])
df['Value'].diff()
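The question does not say which condition fails, so the following is only a guess at one likely cause: a plain `diff()` on the whole column also subtracts across ID boundaries, whereas a grouped `diff()` restarts for every ID. A minimal sketch on made-up data (the four-row frame and the names `plain` and `grouped` are illustrative only):

```python
import pandas as pd

# Tiny made-up frame: group "a" ends with 0, group "b" starts with 1
df = pd.DataFrame({"UniqID": ["a", "a", "b", "b"], "Value": [1, 0, 1, 2]})

# Plain diff ignores the groups, so row 2 is compared with the last row of "a"
plain = df["Value"].diff()

# Grouped diff restarts in every UniqID, so the first row of "b" has no predecessor
grouped = df.groupby("UniqID")["Value"].diff()

print(pd.DataFrame({"plain": plain, "grouped": grouped}))
```

In the plain version the first row of "b" gets a difference of 1 and would look like a 0-1 transition, while the grouped version leaves it as NaN.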

Given DataFrame:

Index UniqID Value
1     a      1
2     a      0
3     a      1
4     a      0
5     a      1
6     a      2
7     b      0
8     b      2
9     b      1
10    b      2
11    b      0
12    b      1
13    c      0
14    c      1
15    c      2
16    c      2

Expected Output:

2     a      0
3     a      1
4     a      0
5     a      1
6     a      2
9     b      1
10    b      2
11    b      0
12    b      1
13    c      0
14    c      1
15    c      2

I only expect the rows that are part of a transition from 0 to 1 or from 1 to 2.

Thank you in advance.

So we can expect the rows to be sorted by index, which makes the transitions you cited meaningful?
– jlandercy, Commented Sep 5, 2019 at 7:28
@jlandercy Actually yes. I sorted by ID as well as by an additional timestamp column that gives me the sequence of events. I modified the data and code to reduce the complexity of the problem and to concentrate on the particular part where I am stuck.
– Shashank Singh Yadav, Commented Sep 5, 2019 at 7:37
When you say transition, does it go both ways (1-2 and 2-1) or only increasing?
– Mayeul sgc, Commented Sep 5, 2019 at 8:10
@ShashankSinghYadav - is row 8 b 2 correct? There is no 1-2 pattern there.
– jezrael, Commented Sep 5, 2019 at 8:45

jezrael · Accepted Answer · 2019-09-06 08:56:51Z

2

Use this solution, which works per group with a list of pattern tuples:

import numpy as np
import pandas as pd

np.random.seed(123)

N = 100
d = {
    'UniqID': np.random.choice(list('abcde'), N),
    'Value': np.random.choice([0,1,2], N),
}
df = pd.DataFrame(d).sort_values('UniqID')
#print (df)

pat = [(0, 1), (1, 2)]

a = np.array(pat)

s = (df.groupby('UniqID')['Value']
       .rolling(2, min_periods=1)
       .apply(lambda x: np.all(x[None, :] == a, axis=1).any(), raw=True))

mask = (s.mask(s == 0)
         .groupby(level=0)
         .bfill(limit=1)
         .fillna(0)
         .astype(bool)
         .reset_index(level=0, drop=True))

df = df[mask]

print (df)
   UniqID  Value
99      a      1
98      a      2
12      a      1
63      a      2
38      a      0
41      a      1
9       a      1
72      a      2
64      b      1
67      b      2
33      b      0
68      b      1
57      b      1
71      b      2
10      b      0
8       b      1
61      c      1
66      c      2
46      c      0
0       c      1
40      c      2
21      d      0
74      d      1
15      d      1
85      d      2
6       d      1
88      d      2
91      d      0
83      d      1
4       d      1
34      d      2
96      d      0
48      d      1
29      d      0
84      d      1
32      e      0
62      e      1
37      e      1
55      e      2
16      e      0
23      e      1
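The core of the rolling step above is the comparison of each 2-element window against every pattern by broadcasting. A minimal standalone sketch of just that test, outside pandas (the helper name `matches_any` and the sample windows are made up for illustration):

```python
import numpy as np

# Patterns to look for, one (previous, current) pair per row
patterns = np.array([(0, 1), (1, 2)])

def matches_any(window):
    # window has shape (2,); window[None, :] has shape (1, 2) and
    # broadcasts against patterns of shape (2, 2), one row per pattern
    return bool(np.all(window[None, :] == patterns, axis=1).any())

checks = {w: matches_any(np.array(w)) for w in [(0, 1), (1, 2), (2, 1), (1, 1)]}
print(checks)
```

Only the windows (0, 1) and (1, 2) are reported as matches; (2, 1) and (1, 1) are not.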

edited Sep 6, 2019 at 8:56

answered Sep 5, 2019 at 8:43

jezrael


10 Comments

Mayeul sgc Over a year ago

you're missing line 8

Shashank Singh Yadav Over a year ago

@jezrael It is failing for large data: when I applied your solution, I got the value sequence 1,2,1,1,0,1. Here the 1,1 was not required.

Shashank Singh Yadav Over a year ago

@jezrael To be more precise, part of the original data was (0,1,2,1,2) and the output from your method was (0,1,2,2). Similarly, for (1,0,1,0,1,2) it gave (1,1,0,1,2). These are the places where I found faults while checking. That means it is excluding some rows which it shouldn't.

Shashank Singh Yadav Over a year ago

@jezrael Yes, they are in the same group.

Shashank Singh Yadav Over a year ago

@jezrael I mean that in some places where the Value column had 0,1,0,1, the code returned 0,1,1 (skipping one 0); similarly, where it had 1,2,1,2, the code returned 1,2,2.


Parth · Accepted Answer · 2019-09-11 08:24:34Z

1

Assuming the transition is strictly from 0 -> 1 or from 1 -> 2 (this assumption appears to be valid):

Similar Sample data:

index,id,value
1,a,1
2,a,0
3,a,1
4,a,0
5,a,1
6,a,2
7,b,0
8,b,2
9,b,1
10,b,2
11,b,0
12,b,1
13,c,0
14,c,1
15,c,2
16,c,2

Load this into a pandas DataFrame named df, then use the code below:

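import pandas as pd

def grp_trns(x):
    # Within one id, a diff of +1 marks the second row of a 0->1 or 1->2 step
    dif = x['value'].diff().fillna(0)
    hits = x.loc[dif == 1, 'index']
    # Also keep the row before each hit (assumes 'index' is consecutive)
    return pd.DataFrame(list(hits - 1) + list(hits))

target_index = df.groupby('id').apply(grp_trns).values.squeeze()
print(df[df['index'].isin(target_index)][['index', 'id', 'value']])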

It gives desired dataframe based on assumption:

     index id  value
1       2  a      0
2       3  a      1
3       4  a      0
4       5  a      1
5       6  a      2
8       9  b      1
9      10  b      2
10     11  b      0
11     12  b      1
12     13  c      0
13     14  c      1
14     15  c      2

Edit: To also include the transition 1->0, the updated function is below:

def grp_trns(x):
    dif = x['value'].diff().fillna(0)
    up = x.loc[dif == 1, 'index']                           # 0->1 and 1->2
    down = x.loc[(dif == -1) & (x['value'] == 0), 'index']  # 1->0
    return pd.DataFrame(list(up - 1) + list(up) + list(down - 1) + list(down))
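The `index - 1` lookup above relies on the index column being consecutive within each id. As a hedged alternative, not part of the original answer, the same idea can be sketched by comparing each row with its neighbours inside the same id through a grouped `shift`, which needs no index arithmetic. The names `prev_val`, `next_val`, `is_start` and `is_end` are made up for this sketch, and it covers only the 0->1 and 1->2 transitions:

```python
import pandas as pd

# Sample data from this answer
df = pd.DataFrame({
    "index": range(1, 17),
    "id": list("aaaaaabbbbbbcccc"),
    "value": [1, 0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 0, 1, 2, 2],
})

grouped = df.groupby("id")["value"]
prev_val = grouped.shift(1)    # previous value within the same id (NaN on a group's first row)
next_val = grouped.shift(-1)   # next value within the same id (NaN on a group's last row)

# Second row of a step: current value is exactly one above the previous one
is_end = (df["value"] - prev_val).eq(1)
# First row of a step: next value is exactly one above the current one
is_start = (next_val - df["value"]).eq(1)

result = df[is_start | is_end]
print(result["index"].tolist())
```

Because both shifts are computed per id, a 0 at the end of one id and a 1 at the start of the next are never paired, which is the case raised in the comments below.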

edited Sep 11, 2019 at 8:24

answered Sep 5, 2019 at 8:40

Parth


13 Comments

Shashank Singh Yadav Over a year ago

With a larger data set, if the values for one id end with 0 and the next id starts with 1, your solution combines them and takes both rows even though they belong to different ids.

Parth Over a year ago

@ShashankSinghYadav Updated answer. Please check now.

Shashank Singh Yadav Over a year ago

Still having a problem: for example, it is taking a 0 from id A and a 1 from id B. According to our requirement, the code should remove them, as they do not form a pair.

Parth Over a year ago

@ShashankSinghYadav If you can give sample data for the case where you are facing the problem, it would be faster to reproduce and solve, and the scope for misunderstanding would be reduced. (Give all 3 columns for the case you are trying to describe.)

Parth Over a year ago

@ShashankSinghYadav In the expected output, indexes 7 and 8 belong to different groups: 7 has value 1 and 8 has value 2, but this transition is invalid because they belong to different groups. Please check the validity of the output row indexes you have shared; they do not seem right as per the requirement.


Mayeul sgc · Accepted Answer · 2019-09-05 08:45:03Z

1

My version uses shift() and diff() to drop all rows whose diff value is 0, 2 or -2:

import pandas as pd

df = pd.DataFrame({'index':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],'UniqId':['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c'],'Value':[1,0,1,0,1,2,0,2,1,2,0,1,0,1,2,2]})
# Difference to the previous row within each UniqId (NaN on each group's first row)
df['diff'] = df.groupby('UniqId')['Value'].diff()
# Shift up so that each row holds (next value - current value)
df['diff'] = df['diff'].shift(-1)
# Drop rows whose step to the next row is 0, 2 or -2
df = df.loc[(df['diff'] != -2) & (df['diff'] != 2) & (df['diff'] != 0)]
print(df)

I am still waiting for an update on whether 2-1 should be treated like 1-2.

answered Sep 5, 2019 at 8:45

Mayeul sgc
