
I have a data frame like the following:

import numpy as np
import pandas as pd

test = pd.DataFrame({
    'ID':   [4, 5, 6, 6, 6, 7, 7, 7],
    'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
    'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
    'val3': [3, 3, 4, np.nan, 4, 5, 5, 6],
})

test
   ID   val1     val2  val3
0   4    one       hi   3.0
1   5    one      bye   3.0
2   6    two     hola   4.0
3   6    two     hola   NaN
4   6  three     hola   4.0
5   7    NaN     ciao   5.0
6   7  seven     ciao   5.0
7   7  seven  namaste   6.0

Each ID has some measured values, and some IDs were measured in triplicate.

If the replicates for an ID disagree in a given column, I want the new data frame to contain NaN for that value.

If one of the replicate values is already NaN (consider it not measured) but the other two match, then I want that agreed value in the final data frame. If the two present values disagree, then NaN.

I was thinking of using pandas groupby then aggregate for this, but I wasn't sure of how to do the logic for the aggregate function.
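To make the rule concrete, here is the per-group logic I'm after in plain Python (just a sketch; `consensus` is an illustrative name, not existing code):

```python
import numpy as np
import pandas as pd

def consensus(values):
    """Per ID/column rule: ignore NaN; if the remaining values
    all agree, return that value, otherwise return NaN."""
    present = pd.Series(values).dropna().unique()
    return present[0] if len(present) == 1 else np.nan

consensus(['two', 'two', 'three'])  # disagreement -> NaN
consensus([4, np.nan, 4])           # NaN ignored, agreement -> 4.0
```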

Essentially the output I am looking for is like:

pd.DataFrame({
    'ID':   [4, 5, 6, 7],
    'val1': ['one', 'one', np.nan, 'seven'],
    'val2': ['hi', 'bye', 'hola', np.nan],
    'val3': [3, 3, 4, np.nan],
})

   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN

Could you suggest how to do this?

Thanks!

Jack

2 Answers


Use groupby with a custom aggregation. Within each group, nunique (which skips NaN) equals 1 exactly when the non-NaN values agree, and mode()[0] then recovers that value:

test.groupby('ID', as_index=False).agg(
    lambda x: x.mode()[0] if x.nunique() == 1 else np.nan
)
Out[372]: 
   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN

7 Comments

One final reset_index call needed!
Why not .agg(lambda s: s.unique().item() if s.nunique()==1 else np.nan)
@RafaelC That's okay, but you're calling unique() twice which seems redundant. Maybe use a function?
Plus this is using a lambda, which is ehh,... to begin with.
@coldspeed hm, the idea was to avoid calling .mode, which seems like overkill here
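Putting the thread's suggestions together, a lambda-free variant that also avoids `mode` might look like this (a sketch; `agree` is just an illustrative name):

```python
import numpy as np
import pandas as pd

def agree(s):
    # Keep the value only if all non-NaN entries in the group match.
    u = s.dropna().unique()
    return u[0] if len(u) == 1 else np.nan

test = pd.DataFrame({
    'ID':   [4, 5, 6, 6, 6, 7, 7, 7],
    'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
    'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
    'val3': [3, 3, 4, np.nan, 4, 5, 5, 6],
})
out = test.groupby('ID', as_index=False).agg(agree)
```

Unlike `mode()[0]`, this also handles an all-NaN group without raising, since `len(u)` is then 0.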

This works because of how you've defined your problem.

First, take the first non-NaN value in each column for each ID. Next, keep only the entries whose column has a single distinct value within that ID, and mask everything else with NaN.

v = test.groupby('ID').first()  # first non-NaN value per ID and column
v.where(test.groupby('ID').nunique().eq(1)).reset_index()

   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN
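The masking step relies on `nunique` skipping NaN by default: a column with one NaN plus two matching values still counts as a single distinct value within its ID. A quick check, rebuilding the question's frame, makes the mask concrete:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'ID':   [4, 5, 6, 6, 6, 7, 7, 7],
    'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
    'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
    'val3': [3, 3, 4, np.nan, 4, 5, 5, 6],
})

# Per-ID distinct counts; entries greater than 1 mark the
# disagreements that get masked to NaN.
counts = test.groupby('ID').nunique()
print(counts)
```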

