
I need to plot how each feature impacts the predicted probability for each sample from my LightGBM binary classifier. I therefore need SHAP values expressed in probability space rather than the default raw (log-odds) output, but the package does not appear to offer an option for this.

The example code below is what I use to generate a dataframe of SHAP values and draw a force_plot for the first data sample. Does anyone know how I should modify the code to change the output? I'm new to SHAP values and the shap package. Thanks a lot in advance.

import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.2)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)


explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)

# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer.expected_value[class_idx]
shap_value = shap_values[:,:,class_idx].values[row_idx]

shap.force_plot(
    base_value=expected_value,
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
    matplotlib=True,
)

# dataframe of shap values for class 1
shap_df = pd.DataFrame(shap_values[:, :, 1].values, columns=shap_values.feature_names)

3 Answers


TL;DR:

You can achieve plotting results in probability space with link="logit" in the force_plot method:

import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit

shap.initjs()

data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)

explainer_raw = shap.TreeExplainer(model)
shap_values = explainer_raw(X_train)

# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer_raw.expected_value[class_idx]
shap_value = shap_values[:, :, class_idx].values[row_idx]

shap.force_plot(
    base_value=expected_value,
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
    link="logit",
)

Expected output:

[screenshot: force plot of the first row in probability space]

Alternatively, you may achieve the same result by explicitly specifying model_output="probability" as the model output you want explained:

explainer = shap.TreeExplainer(
    model,
    data=X_train,
    feature_perturbation="interventional",
    model_output="probability",
)
shap_values = explainer(X_train)

# force plot of first row (probability of class 1)
row_idx = 0

# with model_output="probability" the explainer returns a single output
# (class 1), so expected_value is a scalar base probability, not a list
expected_value = explainer.expected_value
shap_value = shap_values.values[row_idx]

shap.force_plot(
    base_value=expected_value,
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
)

Expected output:

[screenshot: force plot of the first row in probability space]

However, to understand what is happening here, it's more instructive to trace where these figures come from:

  1. Our target probability for the point of interest:

model_proba = model.predict_proba(X_train.iloc[[row_idx]])
model_proba
# array([[0.00275887, 0.99724113]])

  2. Base case raw score from the model, given X_train as background (note: LightGBM outputs the raw score for class 1):

model.predict(X_train, raw_score=True).mean()
# 2.4839751932445577

  3. Base case raw scores from SHAP (note: they are symmetric across the two classes):

bv = explainer_raw(X_train).base_values[0]
bv
# array([-2.48397519,  2.48397519])

  4. Raw SHAP values summed over features for the point of interest:

sv_0 = explainer_raw(X_train).values[row_idx].sum(0)
sv_0
# array([-3.40619584,  3.40619584])

  5. Probability inferred from SHAP values (via the sigmoid):

shap_proba = expit(bv + sv_0)
shap_proba
# array([0.00275887, 0.99724113])

  6. Check:

assert np.allclose(model_proba, shap_proba)

Please ask questions if something is not clear.

Side notes

Probabilities might be misleading if you're analyzing the raw effect size of different features, because the sigmoid is non-linear and saturates once the raw score passes a certain threshold.
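To illustrate the saturation, here is a quick sketch with plain scipy (no model needed): the same +1.0 shift in raw score moves the probability far less when the raw score is already large:

```python
from scipy.special import expit  # logistic sigmoid

# identical +1.0 raw-score contributions at different starting points
low, high = 0.0, 4.0
delta_low = expit(low + 1.0) - expit(low)     # near the middle of the sigmoid
delta_high = expit(high + 1.0) - expit(high)  # in the saturated tail

print(round(float(delta_low), 4))   # 0.2311
print(round(float(delta_high), 4))  # 0.0113
```

So two features with identical raw SHAP values can have very different apparent effects in probability space, depending on where the prediction sits on the curve.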

Some people expect to see SHAP values in probability space as well, but this is not feasible, because:

  • SHAP values are additive by construction (to be precise, SHapley Additive exPlanations are average marginal contributions over all possible feature coalitions);
  • the sigmoid link is non-linear, so expit(a + b) != expit(a) + expit(b).
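A two-number sketch of the point (with a hypothetical base value and contributions): raw scores add exactly, but passing each one through the sigmoid first and then summing gives a meaningless result:

```python
from scipy.special import expit  # logistic sigmoid

base, a, b = 0.5, 1.2, -0.3  # hypothetical base value and two raw SHAP contributions

proba_of_sum = expit(base + a + b)                 # correct: sum in raw space, then sigmoid
sum_of_probas = expit(base) + expit(a) + expit(b)  # wrong: not even a probability

print(round(float(proba_of_sum), 4))   # 0.8022
print(round(float(sum_of_probas), 4))  # 1.8165 (greater than 1)
```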

You may find useful:

  1. Feature importance in a binary classification and extracting SHAP values for one of the classes only answer

  2. How to interpret base_value of GBT classifier when using SHAP? answer


2 Comments

Can I ask why you summed the base_value with the SHAP values? I understand that the base value is the model's output without any features, i.e. the mean prediction. But why sum them? Shouldn't we take the values and pass them through the logistic function to get the probability?
My expected_value is a scalar instead of a list. Is that correct? I cannot reproduce it.

You can consider running your output values through a softmax() function. For reference, it can be defined as:

def get_softmax_probabilities(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / np.sum(e, axis=-1, keepdims=True)

and there is a scipy implementation as well:

from scipy.special import softmax

The output of softmax() will be probabilities proportional to the exponentials of the (relative) values in the vector x, which here are your SHAP values (raw scores).
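One caveat worth noting: for a binary LightGBM classifier the raw score is a single log-odds value for class 1, and softmax over the pair [0, x] reduces to the sigmoid of x. A small sketch with a hypothetical raw score:

```python
import numpy as np
from scipy.special import softmax, expit

raw = 2.484  # hypothetical raw (log-odds) score for class 1

pair = np.array([0.0, raw])  # raw scores for class 0 (reference) and class 1
probs = softmax(pair)        # probabilities for both classes

print(np.isclose(probs[1], expit(raw)))  # True: softmax([0, x])[1] == sigmoid(x)
```

So in the binary case, softmax and the sigmoid (expit) approach from the accepted answer agree.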


import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.2)
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)

model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# plot
# shap.summary_plot(shap_values[class_idx], X_train, plot_type='bar')
# shap.summary_plot(shap_values[class_idx], X_train)

# shap_value = shap_values[:,:,class_idx].values[row_idx]
# shap.force_plot (base_value = expected_value,  shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# # dataframe of shap values for class 1
# shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)

# verification
def verification(index_number, class_idx):
    print('-----------------------------------')
    print('index_number: ', index_number)
    print('class_idx: ', class_idx)
    print('')

    y_base = explainer.expected_value[class_idx]
    print('y_base: ', y_base)

    player_explainer = pd.DataFrame()
    player_explainer['feature_value'] = X_train.iloc[index_number].values
    player_explainer['shap_value'] = shap_values[class_idx][index_number]
    print('verification: ')
    print('y_base + sum_of_shap_values: %.2f' % (y_base + player_explainer['shap_value'].sum()))
    print('y_true: %.2f' % (y_train[index_number]))

j = 10  # index
verification(j, 0)
verification(j, 1)

# show:
# X_train:  (455, 30)
# X_test:  (114, 30)
# -----------------------------------
# index_number:  10
# class_idx:  0

# y_base:  -2.391423081639827
# verification:
# y_base + sum_of_shap_values: -9.40
# y_true: 1.00
# -----------------------------------
# index_number:  10
# class_idx:  1

# y_base:  2.391423081639827
# verification:
# y_base + sum_of_shap_values: 9.40
# y_true: 1.00
# the raw scores are -9.40 and 9.40; class_idx 1 takes the maximum, which
# matches the true label, so the result is evidently correct.

Hopefully this achieves what you asked for and verifies that the results are consistent (the base value plus the sum of SHAP values reproduces the model's raw score for each class).
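To tie this back to the original question, the printed raw scores can be pushed through the logistic sigmoid to get probabilities. A minimal sketch using just the numbers shown above (expit is scipy's sigmoid):

```python
from scipy.special import expit

# raw scores printed by the verification above (class 0 and class 1)
raw_class_0, raw_class_1 = -9.40, 9.40

p0, p1 = expit(raw_class_0), expit(raw_class_1)
print(round(float(p1), 5))       # 0.99992, the probability of class 1
print(round(float(p0 + p1), 5))  # 1.0, the two class probabilities sum to one
```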

