2

I have a DataFrame that has a ID column and Value column that only consist (0,1,2). I want to capture only those rows, if there is a transition from (0-1) or (1-2) in value column. This process has to be done for each ID separately.

I tried to do the groupby for ID and using a difference aggregation function. So that i can take those rows for which difference of values is 1. But it is failing in certain condition.

df=df.loc[df['values'].isin([0,1,2])]
df = df.sort_values(by=['Id'])
df.value.diff()

Given DataFrame:

Index   UniqID   Value

1    a    1

2    a    0

3    a    1

4    a    0

5    a    1

6    a    2

7    b    0

8    b    2

9    b    1

10    b    2

11    b    0

12    b    1

13    c    0

14    c    1

15    c    2

16    c    2

Expected Output:

2    a    0

3    a    1

4    a    0

5    a    1

6    a    2

9    b    1

10    b    2

11    b    0

12    b    1

13    c    0

14    c    1

15    c    2

Only expecting those rows when there is a transition from either 0-1 or 1-2.

Thank you in advance.

11
  • So we can expect rows are sorted by index which makes transition you cited meaningful? Commented Sep 5, 2019 at 7:28
  • @jlandercy actually yes, i did sorting of id as well of additional timestamp column which gives me the sequence of events, i modified the data and code to remove the complexity of the problem and just concentrate on asking question on a particular part where i am being stuck. Commented Sep 5, 2019 at 7:37
  • 1
    When you say transition is it in both ways 1-2 and 2-1 or only increasing ? Commented Sep 5, 2019 at 8:10
  • @Mayeulsgc strictly either (0-1) or (1-2) Commented Sep 5, 2019 at 8:43
  • @ShashankSinghYadav - is correct 8    b    2 ? Because there is no pattern 1-2 Commented Sep 5, 2019 at 8:45

3 Answers 3

2

Use this my solution working for groups with tuples of patterns:

np.random.seed(123)

N = 100
d = {
    'UniqID': np.random.choice(list('abcde'), N),
    'Value': np.random.choice([0,1,2], N),
}
df = pd.DataFrame(d).sort_values('UniqID')
#print (df)

pat = [(0, 1), (1, 2)]

a = np.array(pat)

s = (df.groupby('UniqID')['Value']
       .rolling(2, min_periods=1)
       .apply(lambda x: np.all(x[None :] == a, axis=1).any(), raw=True))

mask = (s.mask(s == 0)
         .groupby(level=0)
         .bfill(limit=1)
         .fillna(0)
         .astype(bool)
         .reset_index(level=0, drop=True))

df = df[mask]

print (df)
   UniqID  Value
99      a      1
98      a      2
12      a      1
63      a      2
38      a      0
41      a      1
9       a      1
72      a      2
64      b      1
67      b      2
33      b      0
68      b      1
57      b      1
71      b      2
10      b      0
8       b      1
61      c      1
66      c      2
46      c      0
0       c      1
40      c      2
21      d      0
74      d      1
15      d      1
85      d      2
6       d      1
88      d      2
91      d      0
83      d      1
4       d      1
34      d      2
96      d      0
48      d      1
29      d      0
84      d      1
32      e      0
62      e      1
37      e      1
55      e      2
16      e      0
23      e      1
Sign up to request clarification or add additional context in comments.

10 Comments

you're missing line 8
@jezrael it is failing for large data (when i applied your solution , i got value sequence 1,2,1,1,0,1) ..here 1,1 was not required.
@jezrael To be more precise, a part of original data was (0,1,2,1,2) and output received by apply your method was (0,1,2,2)......similarly for (1,0,1,0,1,2) ....i gave (1,1,0,1,2), this is places i found fault while checking. That means it is excluding some rows which it shouldnt.
@jezrael yes, they are in same group.
@jezrael i mean to say that at some places when the value column had 0,1,0,1 ...the code returned 0,1,1 (skipping one 0), similarly, if somewhere 1,2,1,2 ....the code returned (1,2,2).
|
1

Assuming, transition is strictly from 1 -> 2 and 0 -> 1. (This assumption is valid as well.)

Similar Sample data:

index,id,value
1,a,1
2,a,0
3,a,1
4,a,0
5,a,1
6,a,2
7,b,0
8,b,2
9,b,1
10,b,2
11,b,0
12,b,1
13,c,0
14,c,1
15,c,2
16,c,2

Load this in pandas dataframe. Then, Using below code:

def grp_trns(x):
    x['dif']=x.value.diff().fillna(0)
    return pd.DataFrame(list(x[x.dif==1]['index']-1)+list(x[x.dif==1]['index']))
target_index=df.groupby('id').apply(lambda x:grp_trns(x)).values.squeeze()
print(df[df['index'].isin(target_index)][['index', 'id','value']])

It gives desired dataframe based on assumption:

     index id  value
1       2  a      0
2       3  a      1
3       4  a      0
4       5  a      1
5       6  a      2
8       9  b      1
9      10  b      2
10     11  b      0
11     12  b      1
12     13  c      0
13     14  c      1
14     15  c      2

Edit: To include transition 1->0, below is updated function:

def grp_trns(x):
    x['dif']=x.value.diff().fillna(0)
    index1=list(x[x.dif==1]['index']-1)+list(x[x.dif==1]['index'])
    index2=list(x[(x.dif==-1)&(x.value==0)]['index']-1)+list(x[(x.dif==-1)&(x.value==0)]['index'])
    return pd.DataFrame(index1+index2)

13 Comments

when taken a larger data set, if value in one id ends with 0 and another id start with 1, than your solution combine them both and take both the rows even if they are from different id.
@ShashankSinghYadav Updated answer. Please check now.
still having problem, for example ...it is taking 0 for id A and 1 from id B. But according to our requirement, the code should remove them as they are not forming a pair.
@ShashankSinghYadav If you can give example/sample data for case where you are facing problem, it would be faster to reproduce for solving and scope of misunderstanding could be reduced. (Give all 3 columns for case you are trying to describe here.)
@ShashankSinghYadav Expected output Index -> 7 and 8 belongs to different group 7 has Value 1 and 8 has value 2, but this transition is invalid as they belong to different group. Please check validity of output row index you have shared, does not seem right as per requirement.
|
1

My version is using shift and diff() to delete all lines with diff value equal to 0,2 or -2

df = pandas.DataFrame({'index':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],'UniqId':['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c'],'Value':[1,0,1,0,1,2,0,2,1,2,0,1,0,1,2,2]})
df['diff']=np.NaN
for element in df['UniqId'].unique():
    df['diff'].loc[df['UniqId']==element]=df.loc[df['UniqId']==element]['Value'].diff()
df['diff']=df['diff'].shift(-1)
df=df.loc[(df['diff']!=-2) & (df['diff']!=2) & (df['diff']!=0)]
print(df)

Actually waiting for updates about the 2-1 and 1-2 relationship

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.