
I have two pandas dataframes: one assembled manually in Python, the other imported from a dashboard's .csv output.

All columns in both dataframes have dtype object, and the data looks like this:

2020 2021 2022 2023
0.441 0.554 0.113 0.445
0.233 0.215 0.225 0.115
Fifty (50/99) One (1/99) Ten (10/99) Eleven (11/99)
0.554 0.111 0.545 0.577
Africa Europe Africa Asia
Here's a reproducible setup:

import pandas as pd

y_2020 = [0.441, 0.233, 'Fifty (50/99)', 0.554, 'Africa']
y_2021 = [0.554, 0.215, 'One (1/99)', 0.111, 'Europe']
y_2022 = [0.113, 0.225, 'Ten (10/99)', 0.545, 'Africa']
y_2023 = [0.445, 0.115, 'Eleven (11/99)', 0.577, 'Asia']

df1 = pd.DataFrame(
    data=list(zip(y_2020, y_2021, y_2022, y_2023)),
    columns=['2020', '2021', '2022', '2023'])

df2 = df1.copy()

I want to check that the dashboard is producing accurate figures by comparing the contents of both outputs. Numeric values should be within a 0.1 tolerance of each other; strings must match exactly.

I'm struggling with .equals() and .compare() because of the string/float mix and the 0.1 tolerance requirement.

Thanks.

  • What did you try? Show your code. You could at least use a for-loop to check every value in each row, and if rows hold different kinds of data, use different checks for different rows. Commented Aug 10 at 15:26
  • If you want tolerance then you could subtract the values, take abs(), and compare the result with <= 0.1. Commented Aug 10 at 15:29

4 Answers


I suggest you first transpose your dataframes so that you can have numeric/string columns, rather than numeric/string rows. This will allow you to perform arithmetic with the numeric columns. After transposing, convert all columns with numbers to numeric datatypes, and split the dataframes into numeric and non-numeric dataframes. You can compare the two numeric dataframes up to your tolerance, and compare the non-numeric columns with .equals().

Here is the setup:

tol = 0.1

# Transpose both dataframes, and convert numeric columns to numeric
df1T = df1.T.apply(to_numeric)
df2T = df2.T.apply(to_numeric)

# Split into dataframes of strings and numbers
df1Numeric = df1T.select_dtypes(include='number')
df2Numeric = df2T.select_dtypes(include='number')

df1String = df1T.select_dtypes(exclude='number')
df2String = df2T.select_dtypes(exclude='number')

# Compute difference of the numeric dataframes
dfNumericDiff = df1Numeric - df2Numeric

# Take the rows where any numeric difference exceeds the tolerance
overThreshold = dfNumericDiff[(abs(dfNumericDiff) > tol).any(axis=1)]

Note that the to_numeric used in apply above is a custom function - it replicates the behaviour of the now-deprecated pd.to_numeric(..., errors='ignore'), see this post. Below is the function:

def to_numeric(s):
    try:
        return pd.to_numeric(s, errors='raise')
    except ValueError:
        return s
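To sanity-check the wrapper on its own (a toy example, not the OP's data): a fully numeric column is converted, while a column containing any text is returned untouched:

```python
import pandas as pd

def to_numeric(s):
    try:
        return pd.to_numeric(s, errors='raise')
    except ValueError:
        return s

# a column of numbers (even number-like strings) converts to float64
num = pd.Series(['0.441', 0.233], dtype=object)
# a column containing non-numeric text is returned unchanged
mixed = pd.Series([0.441, 'Africa'], dtype=object)

print(to_numeric(num).dtype)    # float64
print(to_numeric(mixed).dtype)  # object
```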

Finally you can check that df1String and df2String are equal and that overThreshold has 0 rows - I did it like this below, but note that it can be done in a single line:

if df1String.equals(df2String):
    print("String rows are equal - continue to compare numeric rows.")
else:
    print("String rows have differences, please adjust.")

if len(overThreshold) == 0:
    print(f"All numeric rows match to the specified tolerance of {tol}.")
else:
    print(f"The difference in some rows is over your specified tolerance {tol}, see below:")
    print(overThreshold)

If both conditions are true, we are done.
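For completeness, the single-line variant mentioned above could look like this - a self-contained sketch on a toy two-column frame (not the OP's data), reusing the same to_numeric wrapper and variable naming:

```python
import pandas as pd

def to_numeric(s):
    try:
        return pd.to_numeric(s, errors='raise')
    except ValueError:
        return s

tol = 0.1
df1 = pd.DataFrame({'2020': [0.441, 'Africa'], '2021': [0.554, 'Europe']})
df2 = pd.DataFrame({'2020': [0.512, 'Africa'], '2021': [0.554, 'Europe']})

df1T, df2T = df1.T.apply(to_numeric), df2.T.apply(to_numeric)
num1, num2 = (d.select_dtypes(include='number') for d in (df1T, df2T))
str1, str2 = (d.select_dtypes(exclude='number') for d in (df1T, df2T))

# single-line verdict: strings identical, and every numeric diff within tol
match = str1.equals(str2) and (num1 - num2).abs().le(tol).all().all()
print(match)  # True (0.441 vs 0.512 differs by 0.071 < 0.1)
```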




Here's one approach:

import numpy as np

# adding some changes for `df2`
df2.loc[0, '2020'] = df2.loc[0, '2020'] + 0.12
df2.loc[0, '2021'] = df2.loc[0, '2021'] + 0.08  # within `atol`
df2.loc[2, '2023'] = 'Twelve (12/99)'

m = np.isclose(
    **{k: pd.to_numeric(df.stack(), errors='coerce').unstack()
       for k, df in zip(['a', 'b'], [df1, df2])},
    atol=0.1, rtol=0.0
)

out = df1.mask(m).compare(df2.mask(m))

Result

out

    2020                   2023                
    self  other            self           other
0  0.441  0.561             NaN             NaN
2    NaN    NaN  Eleven (11/99)  Twelve (12/99)

Explanation

  • Use pd.to_numeric on df.stack with errors='coerce' to get NaN values for your strings, and get the original shape back with Series.unstack.
  • Apply to both DataFrames to pass to np.isclose as a and b (arrays to be compared) with the appropriate absolute tolerance. All NaN values will (by default) be treated as unequal:
m

array([[False,  True,  True,  True], # False: df2.loc[0, '2020'] change
       [ True,  True,  True,  True],
       [False, False, False, False], # strings
       [ True,  True,  True,  True],
       [False, False, False, False]]) # strings
  • Now, use df.mask to hide the numeric cells within tolerance before applying df.compare.
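To see the two preprocessing steps in isolation (a toy frame, not the OP's data) - strings coerced to NaN with the shape preserved, then np.isclose treating the NaN cells as unequal:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'2020': [0.441, 'Africa'], '2021': [0.554, 'Europe']})
df2 = pd.DataFrame({'2020': [0.512, 'Africa'], '2021': [0.554, 'Asia']})

# strings become NaN; unstack restores the original shape
a, b = (pd.to_numeric(d.stack(), errors='coerce').unstack()
        for d in (df1, df2))

# numeric cells compared with absolute tolerance; NaN vs NaN -> False
m = np.isclose(a, b, atol=0.1, rtol=0.0)
print(m)  # [[True, True], [False, False]]
```

The string rows come back False here, which is exactly what lets df.mask(m) keep them visible for the exact comparison in df.compare.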



That is not an ideal situation: we usually try to have a consistent type per column, not per row. If your dataframe were transposed, with a correct dtype per column, you could compare the columns (or even batches of columns) depending on their dtypes.

One attempt:

floatEq = np.allclose(df1.T[[0,1,3]].astype(float), df2.T[[0,1,3]].astype(float), atol=0.1)
strEq = df1.T[[2,4]].equals(df2.T[[2,4]])
print(floatEq and strEq)

But it would certainly be better to do the transpose and the type conversion directly when building the dataframes, rather than keeping them as is and transposing/converting at each computation. (The transpose itself costs nothing; it is the conversion that should be done once and for all. But since the conversion has to happen per column, you need the transpose too.)
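A sketch of that suggestion, using the OP's lists (only two years shown): building the frame already transposed lets pandas infer one dtype per column up front, so no later conversion is needed:

```python
import pandas as pd

y_2020 = [0.441, 0.233, 'Fifty (50/99)', 0.554, 'Africa']
y_2021 = [0.554, 0.215, 'One (1/99)', 0.111, 'Europe']

# each year becomes a row, so each position becomes a column
# with a single inferred dtype (float64 or object)
df1T = pd.DataFrame([y_2020, y_2021], index=['2020', '2021'])
print(df1T.dtypes)  # columns 0, 1, 3 -> float64; columns 2, 4 -> object
```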

If you don't know where the numeric rows are, you can compute:

isStr = pd.to_numeric(df1.iloc[:, 0], errors='coerce').isna()
isNum = ~isStr

And then, as before

floatEq = np.allclose(df1.T[df1.index[isNum]].astype(float), df2.T[df1.index[isNum]].astype(float), atol=0.1)
strEq = df1.T[df1.index[isStr]].equals(df2.T[df1.index[isStr]])
print(floatEq and strEq)



A simple pandas approach that explicitly checks (1) whether the values (string or otherwise) are equal (eq), (2) whether the numeric values (to_numeric) are within tolerance (sub+abs+le), and (3) whether both values were originally NaN, since NaN != NaN (isna+&):

tolerance = 0.1

# are the values (string or else) equal?
m1 = df1.eq(df2)
# are the numeric values within tolerance?
m2 = (df1.apply(pd.to_numeric, errors='coerce')
         .sub(df2.apply(pd.to_numeric, errors='coerce'))
         .abs().le(tolerance)
     )
# are both values NaN? (optional)
m3 = df1.isna() & df2.isna()

# combine: is each cell considered a match?
m = m1 | m2 | m3

Output m:

    2020  2021  2022   2023
0  False  True  True   True
1   True  True  True   True
2   True  True  True  False
3   True  True  True   True
4   True  True  True   True

Intermediate masks:

       2020   2021   2022   2023
m1 0  False  False   True   True
   1   True   True   True   True
   2   True   True   True  False
   3   True   True   True   True
   4   True   True   True   True
m2 0  False   True   True   True
   1   True   True   True   True
   2  False  False  False  False
   3   True   True   True   True
   4  False  False  False  False
m3 0  False  False  False  False
   1  False  False  False  False
   2  False  False  False  False
   3  False  False  False  False
   4  False  False  False  False
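If a single pass/fail verdict is enough, the cellwise mask can be reduced to one boolean - a self-contained sketch on a toy frame, using the same mask construction as above:

```python
import pandas as pd

tolerance = 0.1
df1 = pd.DataFrame({'2020': [0.441, 'Africa'], '2021': [0.554, 'Europe']})
df2 = pd.DataFrame({'2020': [0.512, 'Africa'], '2021': [0.554, 'Europe']})

m1 = df1.eq(df2)
m2 = (df1.apply(pd.to_numeric, errors='coerce')
         .sub(df2.apply(pd.to_numeric, errors='coerce'))
         .abs().le(tolerance))
m3 = df1.isna() & df2.isna()
m = m1 | m2 | m3

# collapse the cellwise mask to a single verdict
print(m.all().all())  # True: 0.441 vs 0.512 is within tolerance
```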

As already demonstrated by @ouroboros1, if you want to use compare, first mask the matching values (using their example here):

df1.mask(m).compare(df2.mask(m))

Output:

    2020                   2023                
    self  other            self           other
0  0.441  0.561             NaN             NaN
2    NaN    NaN  Eleven (11/99)  Twelve (12/99)
