I have two DataFrames that I want to combine into a single DataFrame containing the element-wise mean of the two. Each has missing values.

If there were no missing values, I could use (df1 + df2)/2.

I would like to take the mean wherever both DataFrames have a data point, and return NaN as the 'mean' wherever either one is missing data.

For example:

df1 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [15,'NaN', 17, 18], 'Var2': [1.0, 1.5, 2.0, 1.5], 'Var3': [1, 0, 0, 1]})   

df1 = df1.set_index('ID')

        Var1  Var2  Var3
ID                      
Alpha     15   1.0     1
Bravo    NaN   1.5     0
Charlie   17   2.0     0
Delta     18   1.5     1

df2 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [20, 15, 17, 20], 'Var2': [1.2,'NaN', 3.0, 1.0], 'Var3': [0, 0, 1, 1]})

df2=df2.set_index('ID')

         Var1 Var2  Var3
ID                      
Alpha      20  1.2     0
Bravo      15  NaN     0
Charlie    17    3     1
Delta      20    1     1

The resulting DataFrame should be:

         Var1   Var2  Var3
ID                      
Alpha      17.5  1.10  0.5
Bravo      NaN   NaN   0.0
Charlie    17.0  2.50  0.5
Delta      19.0  1.25  1.0

So, in simple terms, my question is: how can I compute (df1 + df2)/2 while leaving NaN at the points where either DataFrame is missing data?

3 Answers

You can do exactly that, (df1 + df2) / 2.

Your real problem here is that the NaN-containing columns in your DataFrames are of object dtype, not a floating-point dtype, because the string 'NaN' is not a missing value. Fix that, and the above method works. Ideally, put np.nan in your inputs; alternatively, force the column dtypes to numeric afterwards:

df1 = df1.apply(pd.to_numeric, errors='coerce')
df2 = df2.apply(pd.to_numeric, errors='coerce')
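As a rough end-to-end sketch of this approach, the snippet below rebuilds the two frames from the question (string 'NaN' included), coerces them to numeric dtypes, and then averages them. The names `df1_num`, `df2_num` and `mean_df` are illustrative only and do not appear in the original posts.

```python
import numpy as np
import pandas as pd

# Frames from the question, with the string 'NaN' standing in for missing data.
df1 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'],
                    'Var1': [15, 'NaN', 17, 18],
                    'Var2': [1.0, 1.5, 2.0, 1.5],
                    'Var3': [1, 0, 0, 1]}).set_index('ID')
df2 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'],
                    'Var1': [20, 15, 17, 20],
                    'Var2': [1.2, 'NaN', 3.0, 1.0],
                    'Var3': [0, 0, 1, 1]}).set_index('ID')

# Convert every column to a numeric dtype; the 'NaN' strings become real NaN.
df1_num = df1.apply(pd.to_numeric, errors='coerce')
df2_num = df2.apply(pd.to_numeric, errors='coerce')

# Arithmetic on numeric columns propagates NaN, so a cell missing in
# either frame stays NaN in the mean.
mean_df = (df1_num + df2_num) / 2
print(mean_df)
```

Because `errors='coerce'` turns any unparseable value into NaN, it may be worth checking the inputs first if other stray strings could be present.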

The dtype of Var1 in df1 and of Var2 in df2 is object. That is because the original data, e.g. [1.2, 'NaN', 3.0, 1.0], is of mixed type, and the string 'NaN' is not np.nan. Build the frames with np.nan instead:

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [15, np.nan, 17, 18], 'Var2': [1.0, 1.5, 2.0, 1.5], 'Var3': [1, 0, 0, 1]})
df1 = df1.set_index('ID')
df2 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [20, 15, 17, 20], 'Var2': [1.2, np.nan, 3.0, 1.0], 'Var3': [0, 0, 1, 1]})
df2 = df2.set_index('ID')
df1.add(df2)/2  # np.nan (lowercase) is used because the np.NaN alias was removed in NumPy 2.0
Out[109]: 
         Var1  Var2  Var3
ID                       
Alpha    17.5  1.10   0.5
Bravo     NaN   NaN   0.0
Charlie  17.0  2.50   0.5
Delta    19.0  1.25   1.0
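One way to confirm the diagnosis is to compare the dtype of a column built with the string 'NaN' against one built with np.nan. This is only a small sketch; the Series names `with_string` and `with_nan` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same values as Var1 in df1, once with the string and once with a real NaN.
with_string = pd.Series([15, 'NaN', 17, 18])
with_nan = pd.Series([15, np.nan, 17, 18])

print(with_string.dtype)  # object: ints mixed with a str
print(with_nan.dtype)     # float64: NaN is itself a float

# pandas does not treat the string 'NaN' as a missing value.
print(with_string.isna().sum())  # 0
print(with_nan.isna().sum())     # 1
```

Running `df1.dtypes` on the question's original frame should show the same object dtype for Var1.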


You can use NumPy to calculate the mean of an array of arrays.

arr = np.array([df1.replace('NaN', np.nan).astype(float).values,
                df2.replace('NaN', np.nan).astype(float).values])

res = pd.DataFrame(arr.mean(0), index=df1.index, columns=df1.columns)

print(res)

         Var1  Var2  Var3
ID                       
Alpha    17.5  1.10   0.5
Bravo     NaN   NaN   0.0
Charlie  17.0  2.50   0.5
Delta    19.0  1.25   1.0 

Explanation

This solution requires the following steps:

  1. Convert strings 'NaN' to np.nan.
  2. Combine 2 arrays into an array of arrays.
  3. Compute mean over axis=0.
  4. Construct dataframe, feeding index and columns from inputs.
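
The reason step 3 gives NaN where data is missing is that a plain mean propagates NaN. The sketch below contrasts it with `np.nanmean`, which skips NaN and would therefore not match the output requested in the question. The small arrays `a` and `b` are made-up stand-ins for the two DataFrames' values.

```python
import numpy as np

# Two small float arrays; `a` has one missing value.
a = np.array([[15.0, np.nan],
              [17.0, 2.0]])
b = np.array([[20.0, 1.2],
              [17.0, 3.0]])

stacked = np.array([a, b])  # shape (2, 2, 2): one layer per input

# Plain mean: NaN wherever either input is NaN.
strict = stacked.mean(axis=0)

# nanmean: ignores NaN, so the lone value 1.2 is returned as the "mean".
lenient = np.nanmean(stacked, axis=0)

print(strict)
print(lenient)
```

If the goal were instead to average whatever values are available, `np.nanmean` would be the function to reach for.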
