
I have a data frame like the following:

import numpy as np
import pandas as pd

test = pd.DataFrame({
    'ID':   [4, 5, 6, 6, 6, 7, 7, 7],
    'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
    'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
    'val3': [3, 3, 4, np.nan, 4, 5, 5, 6],
})

test
   ID   val1     val2  val3
0   4    one       hi   3.0
1   5    one      bye   3.0
2   6    two     hola   4.0
3   6    two     hola   NaN
4   6  three     hola   4.0
5   7    NaN     ciao   5.0
6   7  seven     ciao   5.0
7   7  seven  namaste   6.0

Each ID has some measured values, and some IDs were measured in triplicate.

If the replicates for an ID disagree in a given column, I want the new data frame to contain NaN for that value.

If one of the replicate values is already NaN (consider it not measured) but the other two match, then I want that agreed value in the final data frame. If the two present values disagree, then NaN.

I was thinking of using pandas groupby then aggregate for this, but I wasn't sure of how to do the logic for the aggregate function.
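To make the rule concrete, here is the per-group logic I'm after in plain Python (just a sketch; `consensus` is an illustrative name, not existing code):

```python
import numpy as np
import pandas as pd

def consensus(values):
    """Per ID/column rule: ignore NaN; if the remaining values
    all agree, return that value, otherwise return NaN."""
    present = pd.Series(values).dropna().unique()
    return present[0] if len(present) == 1 else np.nan

consensus(['two', 'two', 'three'])  # disagreement -> NaN
consensus([4, np.nan, 4])           # NaN ignored, agreement -> 4.0
```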

Essentially the output I am looking for is like:

pd.DataFrame({
    'ID':   [4, 5, 6, 7],
    'val1': ['one', 'one', np.nan, 'seven'],
    'val2': ['hi', 'bye', 'hola', np.nan],
    'val3': [3, 3, 4, np.nan],
})

   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN

Could you suggest how to do this?

Thanks!

Jack

2 Answers


Use groupby with a custom aggregation. Within each group, nunique (which skips NaN) equals 1 exactly when the non-NaN values agree, and mode()[0] then recovers that value:

test.groupby('ID', as_index=False).agg(
    lambda x: x.mode()[0] if x.nunique() == 1 else np.nan
)
Out[372]: 
   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN

7 Comments

One final reset_index call needed!
Why not .agg(lambda s: s.unique().item() if s.nunique()==1 else np.nan)
@RafaelC That's okay, but you're calling unique() twice which seems redundant. Maybe use a function?
Plus this is using a lambda, which is ehh,... to begin with.
@coldspeed hm, the idea was to avoid calling .mode, which seems like overkill here
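Putting the thread's suggestions together, a lambda-free variant that also avoids `mode` might look like this (a sketch; `agree` is just an illustrative name):

```python
import numpy as np
import pandas as pd

def agree(s):
    # Keep the value only if all non-NaN entries in the group match.
    u = s.dropna().unique()
    return u[0] if len(u) == 1 else np.nan

test = pd.DataFrame({
    'ID':   [4, 5, 6, 6, 6, 7, 7, 7],
    'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
    'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
    'val3': [3, 3, 4, np.nan, 4, 5, 5, 6],
})
out = test.groupby('ID', as_index=False).agg(agree)
```

Unlike `mode()[0]`, this also handles an all-NaN group without raising, since `len(u)` is then 0.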

This works because of how you've defined your problem.

First, take the first non-NaN value in each column for each ID. Next, keep only the entries whose column has a single distinct value within that ID, and mask everything else with NaN.

v = test.groupby('ID').first()  # first non-NaN value per ID and column
v.where(test.groupby('ID').nunique().eq(1)).reset_index()

   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN
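The masking step relies on `nunique` skipping NaN by default: a column with one NaN plus two matching values still counts as a single distinct value within its ID. A quick check, rebuilding the question's frame, makes the mask concrete:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'ID':   [4, 5, 6, 6, 6, 7, 7, 7],
    'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
    'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
    'val3': [3, 3, 4, np.nan, 4, 5, 5, 6],
})

# Per-ID distinct counts; entries greater than 1 mark the
# disagreements that get masked to NaN.
counts = test.groupby('ID').nunique()
print(counts)
```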

