
I am working on a binary classification problem with random forest and neural network models, and I am using SHAP to explain the model's predictions. I followed the tutorial and wrote the code below to get the waterfall plot:

row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)
explainer = shap.TreeExplainer(rf_boruta)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0], ord_test_t.iloc[row_to_show])

This generated the plot as shown below

[waterfall plot image]

However, I want to export this to a dataframe. How can I do it?

I expect my output to look like the example below, and I want to export this for the full dataframe. Can you help me, please?

[expected output image]

3 Answers


Let's do a small experiment:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer

# as_frame=True returns pandas objects, so X keeps its column names and index
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)

What is explainer here? If you run dir(explainer) you'll find that it has a number of methods and attributes, among which is:

explainer.expected_value

which is of interest to you, because this is the base value on top of which the SHAP values add up.

Furthermore:

sv = explainer.shap_values(X)
len(sv)

will show that sv is a list of 2 arrays: the SHAP values for class 0 and class 1. These must be symmetric, because whatever moves the prediction towards 1 moves it by exactly the same amount, but with the opposite sign, towards 0.

Hence:

sv1 = sv[1]

Now you have everything to pack it into the desired format:

df = pd.DataFrame(sv1, columns=X.columns)
df.insert(0, 'bv', explainer.expected_value[1])

11 Comments

When you do sv[1], does it get the SHAP values for class 1 or class 0? Or is it just the index position? Because we want the SHAP values of class 1.
These are the SHAP values for class 1.
One thing I realized is that when we put the SHAP values into a dataframe (df), it loses the indices. How do we then know which row belongs to which record in X?
Should we assume that the SHAP dataframe is in the same order as our input dataframe X? If my X dataframe's first row starts with an index of 14, am I right to understand that index 0 in the SHAP dataframe belongs to index 14 of the input dataframe?
I believe it's not an assumption but the way it works: if you put your data into a pandas dataframe, the order is preserved.
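The alignment can also be made explicit rather than relied on implicitly: pass the original frame's `index` (and `columns`) when building the SHAP dataframe. A small sketch with a stand-in array in place of real SHAP values:

```python
import numpy as np
import pandas as pd

# a frame whose index does not start at 0, like the asker's scenario
X = pd.DataFrame(np.random.rand(3, 2), columns=['a', 'b'], index=[14, 7, 21])
sv1 = np.zeros((3, 2))  # stand-in for the SHAP array; rows match X's order

# reusing X's index makes the row-to-record mapping visible
shap_df = pd.DataFrame(sv1, columns=X.columns, index=X.index)
print(shap_df.index.tolist())  # [14, 7, 21]
```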

If I recall correctly, you can do something like this with pandas:

import pandas as pd

shap_values = explainer.shap_values(data_for_prediction)
# for a classifier, shap_values is a list (one array per class); take class 1
shap_values_df = pd.DataFrame(shap_values[1])

To get the feature names, you can do something like this (if data_for_prediction is a dataframe):

feature_names = data_for_prediction.columns.tolist()
# shap_values is a list/array, not a DataFrame, so index it directly
shap_df = pd.DataFrame(shap_values[1], columns=feature_names)

8 Comments

Thanks for the help. Upvoted. Can you show me how we can get the base value, SHAP value and feature name along with the row index?
Currently, your code gives only the SHAP values and features. How can I get the base value and row IDs?
@TheGreat I don't have an example to run right now, but I edited the post. Let me know if the update improves things for what you want.
But your code doesn't get the base value, right? It only gets the SHAP values?
@TheGreat If you mean the base feature values, you should concat the dataframe of SHAP values with your original data, doing a reshape if necessary: pd.concat([base_values, shap_values_df], axis=1), or do a join on the instance id.
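The concat mentioned in the comment could look like this (a sketch with made-up frames; real code would use the actual feature frame and SHAP dataframe):

```python
import numpy as np
import pandas as pd

features = pd.DataFrame(np.random.rand(3, 2), columns=['f1', 'f2'])
shap_values_df = pd.DataFrame(np.random.rand(3, 2),
                              columns=['shap_f1', 'shap_f2'])

# axis=1 aligns on the index, so both frames must share the same index
combined = pd.concat([features, shap_values_df], axis=1)
print(combined.columns.tolist())  # ['f1', 'f2', 'shap_f1', 'shap_f2']
```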

I'm currently using this:

import shap
import pandas as pd

def getShapReport(classifier, X_test):
    shap_values = shap.TreeExplainer(classifier).shap_values(X_test)
    shap.summary_plot(shap_values, X_test)
    shap.summary_plot(shap_values[1], X_test)
    return pd.DataFrame(shap_values[1])

It first displays the summary plot of SHAP values for the whole model, then the one for the positive class, and finally returns the dataframe for the positive class (I'm in an imbalanced context).

It uses a TreeExplainer rather than a waterfall plot, but it is basically the same idea.

7 Comments

Thanks for the help. Upvoted. Yes, I am working on imbalanced data as well and need the values for the positive class (label 1). Does your code do the same? Can you also show in your code how we can get the base value for each instance/row along with the Shapley values?
pandas keeps the index, so you can basically concat (pandas.concat) your Shapley values and your prediction set to match feature values with feature importances.
Thanks, but my comment is about the Shapley base/expected value. That is different from a feature's SHAP value.
I don't quite get it. The per-row Shapley values are based on predict; if you want the base row values, you need to predict on the test set just to compute them. Or, if I'm not wrong, on some models like LightGBM you can ask the model to compute it; I'm not sure if it's available at train time.
If you look at my expected output, I would like to get the feature importances and the base value for the test dataframe.
