
I am working on a binary classification problem using random forest and neural network models, and I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot shown below.

With the help of Sergey Bushmanaov's SO post here, I managed to export the waterfall plot to a dataframe. But this doesn't copy the feature values of the columns; it only copies the SHAP values, expected_value, and feature names. I want the feature values as well, so I tried the below:

shap.waterfall_plot(shap.Explanation(values=shap_values[1])[4],
                    base_values=explainer.expected_value[1],
                    data=ord_test_t.iloc[4],
                    feature_names=ord_test_t.columns.tolist())

but this threw an error

TypeError: waterfall() got an unexpected keyword argument 'base_values'

I expect my output to look like the below. I have used a background of 1 point to compute the base value, but you are free to use a background of 1, 10, or 100 points as well. In the output below, I have stored the feature value and feature name together in one column called Feature. This is similar to LIME, but I am not sure whether SHAP has the flexibility to do this.
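For reference, that single LIME-style "Feature" column can be assembled by hand from any SHAP output by fusing each feature name with its raw value. A minimal sketch with made-up feature names and values (these names are hypothetical, not from my data):

```python
import pandas as pd

# Hypothetical stand-ins for real SHAP output (names and values are made up)
feature_names = ["od", "money_back", "credit_score"]
feature_values = [1.0, 0.0, 642.0]
shap_values = [0.21, -0.05, 0.13]

df = pd.DataFrame({
    # LIME-style label: feature name and its raw value fused into one column
    "Feature": [f"{n} = {v}" for n, v in zip(feature_names, feature_values)],
    "shap_value": shap_values,
})
print(df)
```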


update - plot (image not shown)

update code - kernel explainer waterfall to dataframe

import numpy as np
import pandas as pd
from shap import KernelExplainer
from shap.maskers import Independent

masker = Independent(X_train, max_samples=100)   # note: unused below; KernelExplainer takes the background data directly
explainer = KernelExplainer(rf_boruta.predict, X_train)
bv = explainer.expected_value
sv = explainer.shap_values(X_train)   # for a single-output .predict this is a plain 2D array (n_rows, n_features)

sdf_train = pd.DataFrame({
    'row_id': X_train.index.values.repeat(X_train.shape[1]),
    'feature': X_train.columns.to_list() * X_train.shape[0],
    'feature_value': X_train.values.flatten(),
    'base_value': bv,
    'shap_values': np.asarray(sv).flatten()   # sv has no .values attribute here, so flatten the array itself
})

1 Answer


Try the following:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall

import pandas as pd
import shap
print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X)
exp = Explanation(sv.values[:,:,1],      # SHAP values for class 1
                  sv.base_values[:,1],   # base value for class 1
                  data=X.values, 
                  feature_names=X.columns)
idx = 0
waterfall(exp[idx])

0.39.0


Then:

pd.DataFrame({
    'row_id': idx,
    'feature': X.columns,
    'feature_value': exp[idx].data,   # .data holds the raw feature values; .values holds the SHAP values
    'base_value': exp[idx].base_values,
    'shap_values': exp[idx].values
})

#expected output
row_id  feature feature_value   base_value  shap_values
0   0   mean radius 17.990000   0.628998    -0.035453
1   0   mean texture    10.380000   0.628998    0.047571
2   0   mean perimeter  122.800000  0.628998    -0.036218
3   0   mean area   1001.000000 0.628998    -0.041276
4   0   mean smoothness 0.118400    0.628998    -0.006842
5   0   mean compactness    0.277600    0.628998    -0.009275
6   0   mean concavity  0.300100    0.628998    -0.035188
7   0   mean concave points 0.147100    0.628998    -0.051165
8   0   mean symmetry   0.241900    0.628998    -0.002192
9   0   mean fractal dimension  0.078710    0.628998    0.001521
10  0   radius error    1.095000    0.628998    -0.021223
11  0   texture error   0.905300    0.628998    -0.000470
12  0   perimeter error 8.589000    0.628998    -0.021423
13  0   area error  153.400000  0.628998    -0.035313
14  0   smoothness error    0.006399    0.628998    -0.000060
15  0   compactness error   0.049040    0.628998    0.001053
16  0   concavity error 0.053730    0.628998    -0.002988
17  0   concave points error    0.015870    0.628998    0.000140
18  0   symmetry error  0.030030    0.628998    0.001238
19  0   fractal dimension error 0.006193    0.628998    -0.001097
20  0   worst radius    25.380000   0.628998    -0.050027
21  0   worst texture   17.330000   0.628998    0.038056
22  0   worst perimeter 184.600000  0.628998    -0.079717
23  0   worst area  2019.000000 0.628998    -0.072312
24  0   worst smoothness    0.162200    0.628998    -0.006917
25  0   worst compactness   0.665600    0.628998    -0.016184
26  0   worst concavity 0.711900    0.628998    -0.022500
27  0   worst concave points    0.265400    0.628998    -0.088697
28  0   worst symmetry  0.460100    0.628998    -0.026166
29  0   worst fractal dimension 0.118900    0.628998    -0.007683

RandomForest is a bit special: its SHAP output carries an extra class dimension (hence the [:,:,1] slices above), which is why the plain call fails. When something fails with the new plots API, try feeding it an Explanation object.

UPDATE

To explain a single datapoint exp_id against a single background datapoint back_id (i.e. to answer the question "why does the prediction for exp_id differ from the prediction for back_id"):

back_id = 10
exp_id = 20
explainer = TreeExplainer(model, data=X.loc[[back_id]])
sv = explainer(X.loc[[exp_id]])
exp = Explanation(sv.values[:,:,1], sv.base_values[:,1], data=X.loc[[exp_id]].values, feature_names=X.columns)
waterfall(exp[0])


Finally, as you asked for everything in the suggested format:

from shap.maskers import Independent
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)

masker = Independent(X_train, max_samples=100)
explainer = TreeExplainer(model, data=masker)
bv = explainer.expected_value[1]
sv = explainer(X_test, check_additivity=False)

pd.DataFrame({
    'row_id': X_test.index.values.repeat(X_test.shape[1]),
    'feature': X_test.columns.to_list() * X_test.shape[0],
    'feature_value': X_test.values.flatten(),
    'base_value': bv,
    'shap_values': sv.values[:,:,1].flatten()
})

but I'd definitely not show this to my mom.


24 Comments

Thanks a lot for your help. Upvoted. Accepted. Instead of passing a row index 0, is it possible to pass the full dataframe?
Like in the expected output, you can see we have two different rows. Since I have big data, I was wondering whether there would be any way. It would really be helpful if you could edit your code to show that. I am not sure whether looping is the only option; then it may take time, isn't it?
And I see that you don't really use background information, so the base value is the same for all observations. Is it possible to consider only 1 instance as background, so the SHAP computation is faster if I have to iterate over all rows?
Give me some time to do it like in your example (for the whole df)
I think your code still provides/stores output for a single row at a time. Am I right? Do we have to use a for loop to iterate through the full dataframe of 10000 rows in the test data? Moreover, we want the explanations of both classes (0 and 1) for the full dataframe, like shown in my expected output, which has both +ve and -ve values along with the feature values as well.