Note

This tutorial is intended to be run in an IPython notebook. It is also available as a notebook file here.

Explaining XGBoost predictions on the Titanic dataset

This tutorial will show you how to analyze predictions of an XGBoost classifier (regression for XGBoost and most scikit-learn tree ensembles are also supported by eli5). We will use Titanic dataset, which is small and has not too many features, but is still interesting enough.

We are using XGBoost 0.6a2 and data downloaded from https://www.kaggle.com/c/titanic/data (it is also bundled in the eli5 repo: https://github.com/TeamHG-Memex/eli5/blob/master/notebooks/titanic-train.csv).

1. Training data

Let’s start by loading the data:

import csv
import numpy as np

with open('titanic-train.csv', 'rt') as f:
    data = list(csv.DictReader(f))
data[:1]
[OrderedDict([('PassengerId', '1'),
              ('Survived', '0'),
              ('Pclass', '3'),
              ('Name', 'Braund, Mr. Owen Harris'),
              ('Sex', 'male'),
              ('Age', '22'),
              ('SibSp', '1'),
              ('Parch', '0'),
              ('Ticket', 'A/5 21171'),
              ('Fare', '7.25'),
              ('Cabin', ''),
              ('Embarked', 'S')])]

Variable descriptions:

  • Age: Age
  • Cabin: Cabin
  • Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • Fare: Passenger Fare
  • Name: Name
  • Parch: Number of Parents/Children Aboard
  • Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • Sex: Sex
  • Sibsp: Number of Siblings/Spouses Aboard
  • Survived: Survival (0 = No; 1 = Yes)
  • Ticket: Ticket Number

Next, shuffle data and separate features from what we are trying to predict: survival.

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

_all_xs = [{k: v for k, v in row.items() if k != 'Survived'} for row in data]
_all_ys = np.array([int(row['Survived']) for row in data])

all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
    all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))
891 items total, 38.4% true

We do just minimal preprocessing: convert obviously contiuous Age and Fare variables to floats, and SibSp, Parch to integers. Missing Age values are removed.

for x in all_xs:
    if x['Age']:
        x['Age'] = float(x['Age'])
    else:
        x.pop('Age')
    x['Fare'] = float(x['Fare'])
    x['SibSp'] = int(x['SibSp'])
    x['Parch'] = int(x['Parch'])

2. Simple XGBoost classifier

Let’s first build a very simple classifier with xbgoost.XGBClassifier and sklearn.feature_extraction.DictVectorizer, and check its accuracy with 10-fold cross-validation:

import warnings
# xgboost <= 0.6a2 shows a warning when used with scikit-learn 0.18+
warnings.filterwarnings('ignore', category=DeprecationWarning)
from xgboost import XGBClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

class CSCTransformer:
    def transform(self, xs):
        # work around https://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543
        return xs.tocsc()
    def fit(self, *args):
        return self

clf = XGBClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, CSCTransformer(), clf)

def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted

evaluate(pipeline)
Accuracy: 0.823 ± 0.071

There is one tricky bit about the code above: XGBClassifier in xgboost 0.6a2 has some issues with sparse data. One way to solve them is to convert a sparse matrix to CSC format, so we add a CSCTransformer to the pipelne. One may be templed to just pass dense=True to DictVectorizer: after all, in this case the matrixes are small. But this is not a great solution, because we will loose the ability to distinguish features that are missing and features that have zero value.

3. Explaining weights

In order to calculate a prediction, XGBoost sums predictions of all its trees. The number of trees is controlled by n_estimators argument and is 100 by default. Each tree is not a great predictor on it’s own, but by summing across all trees, XGBoost is able to provide a robust estimate in many cases. Here is one of the trees:

booster = clf.booster()
original_feature_names = booster.feature_names
booster.feature_names = vec.get_feature_names()
print(booster.get_dump()[0])
# recover original feature names
booster.feature_names = original_feature_names
0:[Sex=female<-9.53674e-07] yes=1,no=2,missing=1
    1:[Age<13] yes=3,no=4,missing=4
            3:[SibSp<2] yes=7,no=8,missing=7
                    7:leaf=0.145455
                    8:leaf=-0.125
            4:[Fare<26.2687] yes=9,no=10,missing=9
                    9:leaf=-0.151515
                    10:leaf=-0.0727273
    2:[Pclass=3<-9.53674e-07] yes=5,no=6,missing=5
            5:[Fare<12.175] yes=11,no=12,missing=12
                    11:leaf=0.05
                    12:leaf=0.175194
            6:[Fare<24.8083] yes=13,no=14,missing=14
                    13:leaf=0.0365591
                    14:leaf=-0.152

We see that this tree checks Sex, Age, Pclass, Fare and SibSp features. leaf gives the decision of a single tree, and they are summed over all trees in the ensemble.

Let’s check feature importances with eli5.show_weights():

from eli5 import show_weights
show_weights(clf, vec=vec)
Weight Feature
0.4278 Sex=female
0.1949 Pclass=3
0.0665 Embarked=S
0.0510 Pclass=2
0.0420 SibSp
0.0417 Cabin=
0.0385 Embarked=C
0.0358 Ticket=1601
0.0331 Age
0.0323 Fare
0.0220 Pclass=1
0.0143 Parch

There are several different ways to calculate feature importances. By default, “gain” is used, that is the average gain of the feature when it is used in trees. Other types are “weight” - the number of times a feature is used to split the data, and “cover” - the average coverage of the feature. You can pass it with importance_type argument.

Now we know that two most important features are Sex=female and Pclass=3, but we still don’t know how XGBoost decides what prediction to make based on their values.

4. Explaining predictions

To get a better idea of how our classifier works, let’s examine individual predictions with eli5.show_prediction():

from eli5 import show_prediction
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True)

y=1 (probability 0.566, score 0.264) top features

Contribution? Feature Value
+1.673 Sex=female 1.000
+0.479 Embarked=S Missing
+0.070 Fare 7.879
-0.004 Cabin= 1.000
-0.006 Parch 0.000
-0.009 Pclass=2 Missing
-0.009 Ticket=1601 Missing
-0.012 Embarked=C Missing
-0.071 SibSp 0.000
-0.073 Pclass=1 Missing
-0.147 Age 19.000
-0.528 <BIAS> 1.000
-1.100 Pclass=3 1.000

Weight means how much each feature contributed to the final prediction across all trees. The idea for weight calculation is described in http://blog.datadive.net/interpreting-random-forests/; eli5 provides an independent implementation of this algorithm for XGBoost and most scikit-learn tree ensembles.

Here we see that classifier thinks it’s good to be a female, but bad to travel third class. Some features have “Missing” as value (we are passing show_feature_values=True to view the values): that means that the feature was missing, so in this case it’s good to not have embarked in Southampton. This is where our decision to go with sparse matrices comes handy - we still see that Parch is zero, not missing.

It’s possible to show only features that are present using feature_filter argument: it’s a function that accepts feature name and value, and returns True value for features that should be shown:

no_missing = lambda feature_name, feature_value: not np.isnan(feature_value)
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True, feature_filter=no_missing)

y=1 (probability 0.566, score 0.264) top features

Contribution? Feature Value
+1.673 Sex=female 1.000
+0.070 Fare 7.879
-0.004 Cabin= 1.000
-0.006 Parch 0.000
-0.071 SibSp 0.000
-0.147 Age 19.000
-0.528 <BIAS> 1.000
-1.100 Pclass=3 1.000

5. Adding text features

Right now we treat Name field as categorical, like other text features. But in this dataset each name is unique, so XGBoost does not use this feature at all, because it’s such a poor discriminator: it’s absent from the weights table in section 3.

But Name still might contain some useful information. We don’t want to guess how to best pre-process it and what features to extract, so let’s use the most general character ngram vectorizer:

from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

vec2 = FeatureUnion([
    ('Name', CountVectorizer(
        analyzer='char_wb',
        ngram_range=(3, 4),
        preprocessor=lambda x: x['Name'],
        max_features=100,
    )),
    ('All', DictVectorizer()),
])
clf2 = XGBClassifier()
pipeline2 = make_pipeline(vec2, CSCTransformer(), clf2)
evaluate(pipeline2)
Accuracy: 0.839 ± 0.081

In this case the pipeline is more complex, we slightly improved our result, but the improvement is not significant. Let’s look at feature importances:

show_weights(clf2, vec=vec2)
Weight Feature
0.3138 Name__Mr.
0.0821 All__Pclass=3
0.0443 Name__sso
0.0294 All__Sex=female
0.0212 Name__lia
0.0205 All__Fare
0.0203 All__Ticket=1601
0.0197 All__Embarked=S
0.0187 Name__Ma
0.0177 All__Cabin=
0.0172 Name__Mar
0.0168 Name__s,
0.0160 Name__Mr
0.0157 Name__son
0.0138 Name__ne
0.0137 Name__ber
0.0136 All__SibSp
0.0136 Name__e,
0.0134 All__Pclass=1
0.0125 All__Embarked=C
… 34 more …

We see that now there is a lot of features that come from the Name field (in fact, a classifier based on Name alone gives about 0.79 accuracy). Name features listed in this way are not very informative, they make more sense when we check out predictions. We hide missing features here because there is a lot of missing features in text, but they are not very interesting:

from IPython.display import display

for idx in [4, 5, 7, 37, 81]:
    display(show_prediction(clf2, valid_xs[idx], vec=vec2,
                            show_feature_values=True, feature_filter=no_missing))

y=1 (probability 0.771, score 1.215) top features

Contribution? Feature Value
+0.995 Name: Highlighted in text (sum)
+0.347 All__Fare 17.800
+0.236 All__Sex=female 1.000
+0.109 All__Age 18.000
-0.029 All__Cabin= 1.000
-0.069 All__Parch 0.000
-0.150 All__Embarked=S 1.000
-0.215 All__SibSp 1.000
-0.539 <BIAS> 1.000
-0.932 All__Pclass=3 1.000

Name: Arnold-Franchi, Mrs. Josef (Josefine Franchi)

y=0 (probability 0.905, score -2.248) top features

Contribution? Feature Value
+0.948 Name: Highlighted in text (sum)
+0.539 <BIAS> 1.000
+0.387 All__Parch 0.000
+0.221 All__Age 45.000
+0.071 All__Cabin= 1.000
+0.037 All__SibSp 0.000
-0.067 All__Pclass=1 1.000
-0.492 All__Fare 26.550

Name: Romaine, Mr. Charles Hallace ("Mr C Rolmane")

y=0 (probability 0.941, score -2.762) top features

Contribution? Feature Value
+1.946 All__SibSp 8.000
+0.942 All__Fare 69.550
+0.678 All__Pclass=3 1.000
+0.539 <BIAS> 1.000
+0.160 All__Parch 2.000
+0.074 All__Embarked=S 1.000
+0.029 All__Cabin= 1.000
-0.669 Name: Highlighted in text (sum)

Name: Sage, Master. Thomas Henry

y=1 (probability 0.679, score 0.750) top features

Contribution? Feature Value
+0.236 All__Sex=female 1.000
+0.226 All__Fare 7.879
+0.141 Name: Highlighted in text (sum)
+0.010 All__SibSp 0.000
-0.029 All__Cabin= 1.000
-0.041 All__Parch 0.000
-0.539 <BIAS> 1.000
-0.932 All__Pclass=3 1.000

Name: Mockler, Miss. Helen Mary "Ellie"

y=1 (probability 0.660, score 0.663) top features

Contribution? Feature Value
+0.236 All__Sex=female 1.000
+0.161 All__Fare 23.250
+0.158 Name: Highlighted in text (sum)
+0.152 All__Embarked=Q 1.000
+0.010 All__SibSp 2.000
-0.029 All__Cabin= 1.000
-0.069 All__Parch 0.000
-0.539 <BIAS> 1.000
-0.932 All__Pclass=3 1.000

Name: McCoy, Miss. Agnes

Text features from the Name field are highlighted directly in text, and the sum of weights is shown in the weights table as “Name: Highlighted in text (sum)”.

Looks like name classifier tried to infer both gender and status from the title: “Mr.” is bad because women are saved first, and it’s better to be “Mrs.” (married) than “Miss.”. Also name classifier is trying to pick some parts of names and surnames, especially endings, perhaps as a proxy for social status. It’s especially bad to be “Mary” if you are from the third class.