
I am working on a binary classification problem using random forest and neural network models, and I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot shown below.

With the help of Sergey Bushmanaov's SO post here, I managed to export the waterfall plot to a dataframe. But this doesn't copy the feature values of the columns; it only copies the SHAP values, expected_value, and feature names. I want the feature values as well, so I tried the below:

shap.waterfall_plot(shap.Explanation(values=shap_values[1])[4],
                    base_values=explainer.expected_value[1],
                    data=ord_test_t.iloc[4],
                    feature_names=ord_test_t.columns.tolist())

but this threw an error

TypeError: waterfall() got an unexpected keyword argument 'base_values'

I expect my output to look like the below. I have used a background of 1 point to compute the base value, but you are free to use a background of 1, 10, or 100 points as well. In the output below, I have stored the feature value and feature name together in one column called Feature. This is similar to LIME, but I am not sure whether SHAP has the flexibility to do this.
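For reference, that single LIME-style "Feature" column can be assembled by hand from any SHAP output by fusing each feature name with its raw value. A minimal sketch with made-up feature names and values (these names are hypothetical, not from my data):

```python
import pandas as pd

# Hypothetical stand-ins for real SHAP output (names and values are made up)
feature_names = ["od", "money_back", "credit_score"]
feature_values = [1.0, 0.0, 642.0]
shap_values = [0.21, -0.05, 0.13]

df = pd.DataFrame({
    # LIME-style label: feature name and its raw value fused into one column
    "Feature": [f"{n} = {v}" for n, v in zip(feature_names, feature_values)],
    "shap_value": shap_values,
})
print(df)
```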


update - plot (image not shown)

update code - kernel explainer waterfall to dataframe

import numpy as np
import pandas as pd
from shap import KernelExplainer
from shap.maskers import Independent

masker = Independent(X_train, max_samples=100)   # note: unused below; KernelExplainer takes the background data directly
explainer = KernelExplainer(rf_boruta.predict, X_train)
bv = explainer.expected_value
sv = explainer.shap_values(X_train)   # for a single-output .predict this is a plain 2D array (n_rows, n_features)

sdf_train = pd.DataFrame({
    'row_id': X_train.index.values.repeat(X_train.shape[1]),
    'feature': X_train.columns.to_list() * X_train.shape[0],
    'feature_value': X_train.values.flatten(),
    'base_value': bv,
    'shap_values': np.asarray(sv).flatten()   # sv has no .values attribute here, so flatten the array itself
})

1 Answer


Try the following:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall

import pandas as pd
import shap
print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X)
exp = Explanation(sv.values[:,:,1],      # SHAP values for class 1
                  sv.base_values[:,1],   # base value for class 1
                  data=X.values, 
                  feature_names=X.columns)
idx = 0
waterfall(exp[idx])

0.39.0


Then:

pd.DataFrame({
    'row_id': idx,
    'feature': X.columns,
    'feature_value': exp[idx].data,   # .data holds the raw feature values; .values holds the SHAP values
    'base_value': exp[idx].base_values,
    'shap_values': exp[idx].values
})

#expected output
row_id  feature feature_value   base_value  shap_values
0   0   mean radius 17.990000   0.628998    -0.035453
1   0   mean texture    10.380000   0.628998    0.047571
2   0   mean perimeter  122.800000  0.628998    -0.036218
3   0   mean area   1001.000000 0.628998    -0.041276
4   0   mean smoothness 0.118400    0.628998    -0.006842
5   0   mean compactness    0.277600    0.628998    -0.009275
6   0   mean concavity  0.300100    0.628998    -0.035188
7   0   mean concave points 0.147100    0.628998    -0.051165
8   0   mean symmetry   0.241900    0.628998    -0.002192
9   0   mean fractal dimension  0.078710    0.628998    0.001521
10  0   radius error    1.095000    0.628998    -0.021223
11  0   texture error   0.905300    0.628998    -0.000470
12  0   perimeter error 8.589000    0.628998    -0.021423
13  0   area error  153.400000  0.628998    -0.035313
14  0   smoothness error    0.006399    0.628998    -0.000060
15  0   compactness error   0.049040    0.628998    0.001053
16  0   concavity error 0.053730    0.628998    -0.002988
17  0   concave points error    0.015870    0.628998    0.000140
18  0   symmetry error  0.030030    0.628998    0.001238
19  0   fractal dimension error 0.006193    0.628998    -0.001097
20  0   worst radius    25.380000   0.628998    -0.050027
21  0   worst texture   17.330000   0.628998    0.038056
22  0   worst perimeter 184.600000  0.628998    -0.079717
23  0   worst area  2019.000000 0.628998    -0.072312
24  0   worst smoothness    0.162200    0.628998    -0.006917
25  0   worst compactness   0.665600    0.628998    -0.016184
26  0   worst concavity 0.711900    0.628998    -0.022500
27  0   worst concave points    0.265400    0.628998    -0.088697
28  0   worst symmetry  0.460100    0.628998    -0.026166
29  0   worst fractal dimension 0.118900    0.628998    -0.007683

RandomForest is a bit special: its SHAP output carries an extra class dimension (hence the [:,:,1] slices above), which is why the plain call fails. When something fails with the new plots API, try feeding it an Explanation object.

UPDATE

To explain a single datapoint exp_id against a single background datapoint back_id (i.e. to answer the question "why does the prediction for exp_id differ from the prediction for back_id"):

back_id = 10
exp_id = 20
explainer = TreeExplainer(model, data=X.loc[[back_id]])
sv = explainer(X.loc[[exp_id]])
exp = Explanation(sv.values[:,:,1], sv.base_values[:,1], data=X.loc[[exp_id]].values, feature_names=X.columns)
waterfall(exp[0])


Finally, as you asked for everything in the suggested format:

from shap.maskers import Independent
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)

masker = Independent(X_train, max_samples=100)
explainer = TreeExplainer(model, data=masker)
bv = explainer.expected_value[1]
sv = explainer(X_test, check_additivity=False)

pd.DataFrame({
    'row_id': X_test.index.values.repeat(X_test.shape[1]),
    'feature': X_test.columns.to_list() * X_test.shape[0],
    'feature_value': X_test.values.flatten(),
    'base_value': bv,
    'shap_values': sv.values[:,:,1].flatten()
})

but I'd definitely not show this to my mom.


24 Comments

Thanks a lot for your help. Upvoted. Accepted. Instead of passing a row index 0, is it possible to pass the full dataframe?
Like in the expected output, you can see we have two different rows. Since I have big data, I was wondering whether there would be any way. It would really be helpful if you could edit your code to show that. I am not sure whether looping is the only option; then it may take time, isn't it?
And I see that you don't really use background information, so the base value is the same for all observations. Is it possible to consider only 1 instance as background, so the SHAP computation is faster if I have to iterate over all rows?
Give me some time to do it like in your example (for the whole df)
I think your code still provides/stores output for a single row at a time. Am I right? Do we have to use a for loop to iterate through the full dataframe of 10000 rows in the test data? Moreover, we want the explanations of both classes (0 and 1) for the full dataframe, like shown in my expected output, which has both +ve and -ve values along with the feature values as well.