eli5.lime
eli5.lime.lime
An implementation of LIME (http://arxiv.org/abs/1602.04938), an algorithm to explain predictions of black-box models.
class TextExplainer(n_samples=5000, char_based=None, clf=None, vec=None, sampler=None, position_dependent=False, rbf_sigma=None, random_state=None, expand_factor=10, token_pattern=None)
TextExplainer explains predictions of black-box text classifiers using the LIME algorithm.
Parameters: n_samples (int) – Number of samples to generate and train on. Default is 5000.
A larger n_samples takes more CPU time and RAM to explain a prediction, but can give better results. A larger n_samples may also be required to get good results if you don’t want to make strong assumptions about the black-box classifier (e.g. char_based=True and position_dependent=True).
char_based (bool) – True if explanation should be char-based, False if it should be token-based. Default is False.
clf (object, optional) – White-box probabilistic classifier. It should be supported by eli5, follow the scikit-learn interface and provide a predict_proba method. When not set, a default classifier is used (logistic regression with elasticnet regularization trained with SGD).
vec (object, optional) – Vectorizer which converts generated texts to feature vectors for the white-box classifier. When not set, a default vectorizer is used; which one depends on the char_based and position_dependent arguments.
sampler (MaskingTextSampler or MaskingTextSamplers, optional) – Sampler used to generate modified versions of the text.
position_dependent (bool) – When True, a special vectorizer is used which takes each token or character (depending on the char_based value) into account separately. When False (default), the vectorizer passed in vec or a default vectorizer is used.
The default vectorizer converts text to a vector using a bag-of-ngrams or bag-of-char-ngrams approach (depending on the char_based argument). This means it may not be powerful enough to approximate a black-box classifier which, for example, takes the word FOO into account at the beginning of the document but not at the end.
When position_dependent is True the model becomes powerful enough to account for that, but it can become noisier and require a larger n_samples to get an acceptable explanation.
When char_based=False the default vectorizer uses word bigrams in addition to unigrams; this is less powerful than position_dependent=True, but can give similar results in practice.
rbf_sigma (float, optional) – Sigma parameter of the RBF kernel used to post-process cosine similarity values. Default is None, meaning no post-processing (cosine similarity is used as the sample weight as-is). Small rbf_sigma values (e.g. 0.1) tell the classifier to pay more attention to generated texts which are close to the original text. Large rbf_sigma values (e.g. 1.0) make the distance between texts irrelevant.
Note that if you’re using large rbf_sigma it could be more efficient to use custom samplers instead, in order to generate text samples which are closer to the original text in the first place. Use e.g. the max_replace parameter of MaskingTextSampler.
random_state (integer or numpy.random.RandomState, optional) – Random state.
expand_factor (int or None) – To approximate the output of the probabilistic classifier, the generated dataset is expanded by expand_factor (10 by default) according to the predicted label probabilities. This is a workaround for a scikit-learn limitation (no cross-entropy loss for non 1/0 labels). With larger values training takes longer, but the probability output can be approximated better.
expand_factor=None turns this feature off; pass None when you know that the black-box classifier returns only 1.0 or 0.0 probabilities.
token_pattern (str, optional) – Regex which matches a token. Use it to customize tokenization. The default value depends on the char_based parameter.
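To make the effect of rbf_sigma more concrete, here is a minimal sketch of how cosine similarities could be turned into sample weights. It assumes a Gaussian kernel of the form exp(-d^2 / (2 * sigma^2)) applied to the distance d = 1 - similarity; the helper name rbf_weight and that exact formula are illustrative assumptions, not taken from the eli5 source.

```python
import math

def rbf_weight(similarity, sigma=None):
    """Hypothetical helper: turn a cosine similarity into a sample weight."""
    if sigma is None:
        # No post-processing: the similarity itself is used as the weight.
        return similarity
    distance = 1.0 - similarity
    # Assumed Gaussian (RBF) kernel on the distance.
    return math.exp(-distance ** 2 / (2 * sigma ** 2))

similarities = [1.0, 0.9, 0.5]
weights_as_is = [rbf_weight(s) for s in similarities]
weights_small = [rbf_weight(s, sigma=0.1) for s in similarities]
weights_large = [rbf_weight(s, sigma=1.0) for s in similarities]
```

Under these assumptions, a small sigma drives the weight of a distant sample (similarity 0.5) close to zero, while a large sigma leaves all three weights near 1.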
rng_
Random state.
Type: numpy.random.RandomState
samples_
A list of samples the local model is trained on. Only available after fit().
Type: list[str]
X_
A matrix with vectorized samples_. Only available after fit().
Type: ndarray or scipy.sparse matrix
y_proba_
Probabilities predicted by the black-box classifier (the result of predict_proba(self.samples_)). Only available after fit().
Type: ndarray
metrics_
A dictionary with metrics of how well the local classification pipeline approximates the black-box pipeline. Only available after fit().
Type: dict
explain_prediction(**kwargs)
Call eli5.explain_prediction() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.explain_prediction().
fit() must be called before using this method.
explain_weights(**kwargs)
Call eli5.explain_weights() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.explain_weights().
fit() must be called before using this method.
fit(doc, predict_proba)
Explain the predict_proba probabilistic classification function for the doc example. This method fits a local classification pipeline following the LIME approach.
To get the explanation use show_prediction(), show_weights(), explain_prediction() or explain_weights().
Parameters: - doc (str) – Text to explain.
- predict_proba (callable) – Black-box classification pipeline. predict_proba should be a function which takes a list of strings (documents) and returns a matrix of shape (n_samples, n_classes) with probability values: a row per document and a column per output label.
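As an illustration of the predict_proba contract, the following sketch defines a toy black-box function with the expected input and output shape. The keyword-counting rule and the two-class setup are invented for the example; only the list-of-strings input and the (n_samples, n_classes) output follow the description above.

```python
import numpy as np

def predict_proba(docs):
    """Toy black-box classifier: P(class 1) grows with the count of 'good'."""
    rows = []
    for doc in docs:
        n_good = doc.lower().split().count("good")
        p_pos = n_good / (n_good + 1.0)  # 0 hits -> 0.0, 1 hit -> 0.5, ...
        rows.append([1.0 - p_pos, p_pos])
    # One row per document, one column per output label.
    return np.array(rows)

proba = predict_proba(["good good movie", "bad movie"])
```

A function like this could then be passed as the second argument of fit(), for example TextExplainer().fit(doc, predict_proba). A scikit-learn pipeline that starts with a text vectorizer typically exposes a predict_proba method of the same shape.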
show_prediction(**kwargs)
Call eli5.show_prediction() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.show_prediction().
fit() must be called before using this method.
show_weights(**kwargs)
Call eli5.show_weights() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.show_weights().
fit() must be called before using this method.
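The expand_factor parameter of TextExplainer is described as expanding the generated dataset according to the predicted label probabilities. One way to picture this is the sketch below, which repeats every sample expand_factor times and draws a hard label for each copy from the black-box probabilities. The helper expand_dataset is hypothetical and the sampling strategy is an assumption; the actual eli5 expansion may differ in detail.

```python
import random

def expand_dataset(samples, y_proba, expand_factor=10, seed=0):
    """Hypothetical helper: repeat each sample and draw hard labels
    according to its predicted class probabilities."""
    rng = random.Random(seed)
    expanded_samples, expanded_labels = [], []
    for sample, proba in zip(samples, y_proba):
        # Draw expand_factor class indices, weighted by the probabilities.
        labels = rng.choices(range(len(proba)), weights=proba, k=expand_factor)
        expanded_samples.extend([sample] * expand_factor)
        expanded_labels.extend(labels)
    return expanded_samples, expanded_labels

samples = ["good movie", "bad movie"]
y_proba = [[0.3, 0.7], [1.0, 0.0]]
X_big, y_big = expand_dataset(samples, y_proba, expand_factor=10)
```

A classifier trained on hard labels from such an expanded dataset sees each label roughly in proportion to its probability, which is why larger factors approximate the probabilities better at the cost of training time.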
eli5.lime.samplers
class BaseSampler
Base sampler class. A sampler is an object which generates examples similar to a given example.
class MaskingTextSampler(token_pattern=None, bow=True, random_state=None, replacement='', min_replace=1, max_replace=1.0, group_size=1)
Sampler for text data. It randomly removes or replaces tokens from text.
Parameters: - token_pattern (str, optional) – Regexp for token matching.
- bow (bool, optional) – The sampler can either replace all instances of a given token (bow=True, bag-of-words sampling) or replace just a single token (bow=False).
- random_state (integer or numpy.random.RandomState, optional) – Random state.
- replacement (str) – Default value is ‘’, i.e. by default tokens are removed. If you want to preserve the total token count, set replacement to a non-empty string, e.g. ‘UNKN’.
- min_replace (int or float) – Minimum number of tokens to replace. Default is 1, meaning 1 token. If this value is a float in the range [0.0, 1.0], it is used as a ratio. More than min_replace tokens can be replaced if group_size > 1.
- max_replace (int or float) – Maximum number of tokens to replace. Default is 1.0, meaning all tokens can be replaced. If this value is a float in the range [0.0, 1.0], it is used as a ratio.
- group_size (int) – When group_size > 1, groups of nearby tokens are replaced all at once (each token is still replaced with a replacement). Default is 1, meaning individual tokens are replaced.
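The difference between bow=True and bow=False, and the role of replacement, can be sketched with a small standalone function. This is not the MaskingTextSampler implementation: mask_tokens, its n_replace argument and the token pattern used here are assumptions made for illustration.

```python
import random
import re

def mask_tokens(text, n_replace, bow=True, replacement="",
                token_pattern=r"\b\w+\b", seed=0):
    """Hypothetical helper: mask n_replace word types (bow=True)
    or n_replace individual token positions (bow=False)."""
    rng = random.Random(seed)
    matches = list(re.finditer(token_pattern, text))
    if bow:
        # Bag-of-words: pick distinct words, mask every occurrence of each.
        vocab = sorted({m.group() for m in matches})
        chosen = set(rng.sample(vocab, min(n_replace, len(vocab))))
        to_mask = [m for m in matches if m.group() in chosen]
    else:
        # Position-aware: pick individual token occurrences.
        to_mask = rng.sample(matches, min(n_replace, len(matches)))
    # Replace from the end so earlier match offsets stay valid.
    for m in sorted(to_mask, key=lambda m: m.start(), reverse=True):
        text = text[:m.start()] + replacement + text[m.end():]
    return text

doc = "the cat saw the dog"
removed_bow = mask_tokens(doc, 1, bow=True)
masked_single = mask_tokens(doc, 1, bow=False, replacement="UNKN")
```

With bow=True, choosing "the" removes both of its occurrences at once; with bow=False and a non-empty replacement, exactly one position is replaced and the token count is preserved.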
class MaskingTextSamplers(sampler_params, token_pattern=None, random_state=None, weights=None)
Union of MaskingTextSampler objects, with weights. sample_near() or sample_near_with_mask() generate a requested number of samples using all samplers; the probability of using a sampler is proportional to its weight.
All samplers must use the same token_pattern in order for sample_near_with_mask() to work.
Create it with a list of {param: value} dicts with MaskingTextSampler parameters.
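The sketch below shows what such a list of parameter dicts might look like and how weight-proportional selection among samplers could work. The concrete parameter values, the weights and the helper assign_samplers are illustrative assumptions; the selection logic is a simplified stand-in, not the library's code.

```python
import random
from collections import Counter

# Each dict holds MaskingTextSampler parameters (illustrative values).
sampler_params = [
    {"bow": True},
    {"bow": False, "max_replace": 0.5},
]
weights = [0.25, 0.75]

def assign_samplers(n_samples, weights, seed=0):
    """Hypothetical helper: choose a sampler index for each requested sample,
    with probability proportional to the sampler's weight."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=n_samples)

counts = Counter(assign_samplers(1000, weights))
```

Going by the signature above, the same list would be passed as MaskingTextSamplers(sampler_params, weights=weights).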
class MultivariateKernelDensitySampler(kde=None, metric='euclidean', fit_bandwidth=True, bandwidths=array([1.00000000e-06, 1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02, 1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00, 1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02, 1.00000000e+03, 3.16227766e+03, 1.00000000e+04]), sigma='bandwidth', n_jobs=1, random_state=None)
General-purpose sampler for dense continuous data, based on multivariate kernel density estimation.
The limitation is that a single bandwidth value is used for all dimensions, i.e. the bandwidth matrix is a positive scalar times the identity matrix. This is a problem, for example, when features have different variances (e.g. some of them are one-hot encoded and others are continuous).
class UnivariateKernelDensitySampler(kde=None, metric='euclidean', fit_bandwidth=True, bandwidths=array([1.00000000e-06, 1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02, 1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00, 1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02, 1.00000000e+03, 3.16227766e+03, 1.00000000e+04]), sigma='bandwidth', n_jobs=1, random_state=None)
General-purpose sampler for dense continuous data, based on univariate kernel density estimation. It estimates a separate probability distribution for each input dimension.
The limitation is that variable interactions are not taken into account.
Unlike MultivariateKernelDensitySampler it uses different bandwidths for different dimensions; because of that it can handle one-hot encoded features to some extent (make sure to at least tune the default sigma parameter). Also, at sampling time it replaces only random subsets of the features instead of generating totally new examples.
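The trade-off between the two kernel density samplers can be illustrated with a simplified NumPy sketch of Gaussian KDE sampling. Both functions are hypothetical stand-ins written for this example: they assume Gaussian kernels with fixed, hand-picked bandwidths and do not model bandwidth fitting or the partial feature replacement mentioned above.

```python
import numpy as np

def sample_shared_bandwidth(X, bandwidth, n, rng):
    """Gaussian KDE sampling with one bandwidth for all dimensions:
    pick a training row, then add isotropic Gaussian noise."""
    rows = X[rng.integers(0, len(X), size=n)]
    return rows + rng.normal(scale=bandwidth, size=rows.shape)

def sample_per_dimension(X, bandwidths, n, rng):
    """Independent 1-D Gaussian KDEs: each column has its own bandwidth."""
    n_features = X.shape[1]
    out = np.empty((n, n_features))
    for j in range(n_features):
        # Each dimension draws its own source rows, so interactions are lost.
        values = X[rng.integers(0, len(X), size=n), j]
        out[:, j] = values + rng.normal(scale=bandwidths[j], size=n)
    return out

rng = np.random.default_rng(0)
# Column 0 is binary-like, column 1 is continuous with a much larger scale.
X = np.array([[0.0, 100.0], [1.0, 300.0], [0.0, 200.0]])
shared = sample_shared_bandwidth(X, bandwidth=0.1, n=5, rng=rng)
per_dim = sample_per_dimension(X, bandwidths=[0.01, 20.0], n=5, rng=rng)
```

With a shared bandwidth, noise that suits the binary-like column is negligible for the continuous one; per-dimension bandwidths avoid this, but the generated rows may combine values that never occurred together in the data.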
eli5.lime.textutils
Utilities for text generation.
cosine_similarity_vec(num_tokens, num_removed_vec)
Return cosine similarity between a binary vector with all ones of length num_tokens and vectors of the same length with num_removed_vec elements set to zero.
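The quantity described here has a simple closed form: if k of n ones are set to zero, the dot product is n - k and the norms are sqrt(n) and sqrt(n - k), so the similarity is sqrt((n - k) / n). The scalar helper below is a sketch of that derivation, not the library function (which works on a vector of removal counts), and returning 0.0 for the all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity_after_removal(num_tokens, num_removed):
    """Cosine similarity between an all-ones vector of length num_tokens
    and the same vector with num_removed entries set to zero."""
    remaining = num_tokens - num_removed
    if remaining == 0:
        return 0.0  # convention for the all-zero vector
    # dot product = remaining; norms = sqrt(num_tokens) and sqrt(remaining)
    return remaining / (math.sqrt(num_tokens) * math.sqrt(remaining))

sims = [cosine_similarity_after_removal(4, k) for k in range(5)]
```

The similarity decreases monotonically as more tokens are removed, which is what makes it usable as a sample weight.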
generate_samples(text, n_samples=500, bow=True, random_state=None, replacement='', min_replace=1, max_replace=1.0, group_size=1)
Return n_samples changed versions of text (with some words removed), along with distances between the original text and the generated examples. If bow=False, all tokens are considered unique (i.e. token position matters).