Get the mean of two DataFrames with missing values

Question

I have two DataFrames, which I am trying to make a single containing the mean of the two. Each has missing values.

If there were not missing values I could use (df1 + df2)/2.

I would like to take the mean of those that have both data points, while returning 'NaN' as the 'mean' for the points that are missing data.

For example:

df1 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [15,'NaN', 17, 18], 'Var2': [1.0, 1.5, 2.0, 1.5], 'Var3': [1, 0, 0, 1]})   

df1 = df1.set_index('ID')

        Var1  Var2  Var3
ID                      
Alpha     15   1.0     1
Bravo    NaN   1.5     0
Charlie   17   2.0     0
Delta     18   1.5     1

_

df2 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [20, 15, 17, 20], 'Var2': [1.2,'NaN', 3.0, 1.0], 'Var3': [0, 0, 1, 1]})

df2=df2.set_index('ID')

         Var1 Var2  Var3
ID                      
Alpha      20  1.2     0
Bravo      15  NaN     0
Charlie    17    3     1
Delta      20    1     1

_

The resulting DataFrame should be:

         Var1   Var2  Var3
ID                      
Alpha      17.5  1.10  0.5
Bravo      NaN   NaN   0.0
Charlie    17.0  2.50  0.5
Delta      19.0  1.25  1.0

So, in simple terms, my question is how can I do (df1 + df2)/2 while ignoring the points that have NaN's?

miradulo · Accepted Answer · 2018-08-03 16:37:39Z

5

You can do exactly that, (df1 + df2) / 2.

Your real problem here is that the NaN-containing columns in your DataFrames are of object dtype, not floating datatypes. Fix that, and the above method works. Ideally fix that by having np.nan in your inputs, or alternatively force the column dtypes to numeric later.

df1 = df1.apply(pd.to_numeric, errors='coerce')
df2 = df2.apply(pd.to_numeric, errors='coerce')

answered Aug 3, 2018 at 16:37

miradulo

29.8k7 gold badges86 silver badges97 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

BENY · Accepted Answer · 2018-08-03 16:39:18Z

3

Your columns type for var1 in both df1 and df2 is object , that is because of the original data is [1.2,'NaN', 3.0, 1.0] which is mixed type also 'NaN' is not np.NaN .

df1 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [15,np.NaN, 17, 18], 'Var2': [1.0, 1.5, 2.0, 1.5], 'Var3': [1, 0, 0, 1]})   
df1 = df1.set_index('ID')
df2 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [20, 15, 17, 20], 'Var2': [1.2,np.NaN, 3.0, 1.0], 'Var3': [0, 0, 1, 1]})
df2=df2.set_index('ID')
df1.add(df2)/2
Out[109]: 
         Var1  Var2  Var3
ID                       
Alpha    17.5  1.10   0.5
Bravo     NaN   NaN   0.0
Charlie  17.0  2.50   0.5
Delta    19.0  1.25   1.0

answered Aug 3, 2018 at 16:39

BENY

324k22 gold badges176 silver badges250 bronze badges

Comments

jpp · Accepted Answer · 2018-08-03 16:37:10Z

2

You can use NumPy to calculate the mean of an array of arrays.

arr = np.array([df1.replace('NaN', np.nan).values,
                df2.replace('NaN', np.nan).values])

res = pd.DataFrame(arr.mean(0), index=df1.index, columns=df1.columns)

print(res)

         Var1  Var2  Var3
ID                       
Alpha    17.5  1.10   0.5
Bravo     NaN   NaN   0.0
Charlie  17.0  2.50   0.5
Delta    19.0  1.25   1.0

Explanation

This solution requires the following steps:

Convert strings 'NaN' to np.nan.
Combine 2 arrays into an array of arrays.
Compute mean over axis=0.
Construct dataframe, feeding index and columns from inputs.

answered Aug 3, 2018 at 16:37

jpp

166k37 gold badges301 silver badges363 bronze badges

Collectives™ on Stack Overflow

Get the mean of two DataFrames with missing values

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related