eli5.lime

eli5.lime.lime

An implementation of LIME (http://arxiv.org/abs/1602.04938), an algorithm to explain predictions of black-box models.

class TextExplainer(n_samples: int = 5000, char_based: bool | None = None, clf=None, vec=None, sampler: BaseSampler | None = None, position_dependent: bool = False, rbf_sigma: float | None = None, random_state=None, expand_factor: int | None = 10, token_pattern: str | None = None)[source]

TextExplainer explains predictions of black-box text classifiers using the LIME algorithm.

Parameters:

n_samples (int) – A number of samples to generate and train on. Default is 5000.

With larger n_samples it takes more CPU time and RAM to explain a prediction, but the results can be better. A larger n_samples may also be required to get good results if you don’t want to make strong assumptions about the black-box classifier (e.g. char_based=True and position_dependent=True).
char_based (bool) – True if explanation should be char-based, False if it should be token-based. Default is False.
clf (object, optional) – White-box probabilistic classifier. It should be supported by eli5, follow scikit-learn interface and provide predict_proba method. When not set, a default classifier is used (logistic regression with elasticnet regularization trained with SGD).
vec (object, optional) – Vectorizer which converts generated texts to feature vectors for the white-box classifier. When not set, a default vectorizer is used; which one depends on char_based and position_dependent arguments.
sampler (MaskingTextSampler or MaskingTextSamplers, optional) – Sampler used to generate modified versions of the text.
position_dependent (bool) – When True, a special vectorizer is used which takes each token or character (depending on char_based value) into account separately. When False (default), the vectorizer passed in vec or a default vectorizer is used.

The default vectorizer converts text to a vector using a bag-of-ngrams or bag-of-char-ngrams approach (depending on the char_based argument). This means it may not be powerful enough to approximate a black-box classifier which, for example, takes the word FOO into account at the beginning of the document but not at the end.

When position_dependent is True the model becomes powerful enough to account for that, but it can become noisier and require a larger n_samples to get a reasonable explanation.

When char_based=False the default vectorizer uses word bigrams in addition to unigrams; this is less powerful than position_dependent=True, but can give similar results in practice.
rbf_sigma (float, optional) – Sigma parameter of the RBF kernel used to post-process cosine similarity values. Default is None, meaning no post-processing (cosine similarity is used as the sample weight as-is). Small rbf_sigma values (e.g. 0.1) tell the classifier to pay more attention to generated texts which are close to the original text. Large rbf_sigma values (e.g. 1.0) make the distance between texts irrelevant.

Note that if you’re using large rbf_sigma it could be more efficient to use custom samplers instead, in order to generate text samples which are closer to the original text in the first place. Use e.g. max_replace parameter of MaskingTextSampler.
random_state (integer or numpy.random.RandomState, optional) – random state
expand_factor (int or None) – To approximate the output of the probabilistic classifier, the generated dataset is expanded by expand_factor (10 by default) according to the predicted label probabilities. This is a workaround for a scikit-learn limitation (no cross-entropy loss for non 1/0 labels). With larger values training takes longer, but the probability output can be approximated better.

expand_factor=None turns this feature off; pass None when you know that the black-box classifier returns only 1.0 or 0.0 probabilities.
token_pattern (str, optional) – Regex which matches a token. Use it to customize tokenization. Default value depends on char_based parameter.
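
The effect of rbf_sigma and expand_factor can be illustrated with a minimal sketch that uses only the standard library. It is not the library's code. The helper names rbf_weight and expand_dataset are hypothetical, and the kernel form (a Gaussian over the distance 1 - cosine similarity) is an assumption made for illustration.

```python
import math
import random


def rbf_weight(cosine_similarity, sigma):
    """Map a cosine similarity in [0, 1] to a sample weight.

    Assumption: distance is taken as 1 - similarity and passed through
    a Gaussian (RBF) kernel; the library's exact formula may differ.
    """
    distance = 1.0 - cosine_similarity
    return math.exp(-distance ** 2 / (2 * sigma ** 2))


def expand_dataset(samples, y_proba, factor, rng):
    """Duplicate each sample `factor` times, giving every copy a hard
    label drawn from that sample's predicted class probabilities."""
    classes = list(range(len(y_proba[0])))
    expanded = []
    for sample, probs in zip(samples, y_proba):
        for label in rng.choices(classes, weights=probs, k=factor):
            expanded.append((sample, label))
    return expanded


# A small sigma down-weights distant samples sharply; a large sigma barely does.
near, far = 0.9, 0.5
weights_small_sigma = (rbf_weight(near, 0.1), rbf_weight(far, 0.1))
weights_large_sigma = (rbf_weight(near, 1.0), rbf_weight(far, 1.0))

# Soft labels become repeated hard labels that a 0/1 loss can be trained on.
rng = random.Random(0)
expanded = expand_dataset(
    ["good film", "bad film"], [[0.2, 0.8], [1.0, 0.0]], 10, rng
)
```

With factor=10 each document appears ten times. A document with probabilities [1.0, 0.0] always gets label 0, which is why expansion adds nothing when the black-box classifier only returns 0.0 or 1.0.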

rng_

Random state.

Type: numpy.random.RandomState

samples_

A list of samples the local model is trained on. Only available after fit().

Type: list[str]

X_

A matrix with vectorized samples_. Only available after fit().

Type: ndarray or scipy.sparse matrix

similarity_

Similarity vector. Only available after fit().

Type: ndarray

y_proba_

Probabilities predicted by the black-box classifier (predict_proba(self.samples_) result). Only available after fit().

Type: ndarray

clf_

Trained white-box classifier. Only available after fit().

Type: object

vec_

Fitted white-box vectorizer. Only available after fit().

Type: object

metrics_

A dictionary with metrics of how well the local classification pipeline approximates the black-box pipeline. Only available after fit().

Type: dict
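
This reference does not list which metrics are stored in metrics_. As an assumption, one natural way to measure how closely a local model follows the black-box probabilities is the mean KL divergence between the two sets of predictions. The sketch below uses the hypothetical helper name mean_kl_divergence and made-up probability values.

```python
import math


def mean_kl_divergence(p_black_box, q_local, eps=1e-12):
    """Average KL(p || q) over rows; each row is a probability vector."""
    total = 0.0
    for p_row, q_row in zip(p_black_box, q_local):
        total += sum(
            p * math.log(p / max(q, eps))
            for p, q in zip(p_row, q_row)
            if p > 0  # terms with p == 0 contribute nothing
        )
    return total / len(p_black_box)


y_proba = [[0.9, 0.1], [0.3, 0.7]]        # black-box probabilities (made up)
local_proba = [[0.8, 0.2], [0.35, 0.65]]  # local model probabilities (made up)
divergence = mean_kl_divergence(y_proba, local_proba)
```

A value near zero suggests the local pipeline reproduces the black-box probabilities on the generated samples; larger values suggest the explanation should be treated with caution.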

explain_prediction(**kwargs)[source]

Call eli5.explain_prediction() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.explain_prediction().

fit() must be called before using this method.

explain_weights(**kwargs)[source]

Call eli5.explain_weights() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.explain_weights().

fit() must be called before using this method.

fit(doc: str, predict_proba: Callable[[Any], Any]) → TextExplainer[source]

Explain the predict_proba probabilistic classification function for the document doc. This method fits a local classification pipeline following the LIME approach.

To get the explanation use show_prediction(), show_weights(), explain_prediction() or explain_weights().

Parameters:

doc (str) – Text to explain
predict_proba (callable) – Black-box classification pipeline. predict_proba should be a function which takes a list of strings (documents) and returns a matrix of shape (n_samples, n_classes) with probability values: a row per document and a column per output label.
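
The input and output contract of predict_proba can be sketched with a toy stand-in. The keyword lists and the scoring rule below are hypothetical and serve only to show the expected shape: a list of strings goes in, one row of class probabilities per document comes out. Nested lists stand in for a matrix here; a real pipeline would typically return a NumPy array.

```python
import math


POSITIVE_WORDS = {"good", "great", "excellent"}  # hypothetical vocabulary
NEGATIVE_WORDS = {"bad", "poor", "terrible"}     # hypothetical vocabulary


def predict_proba(docs):
    """Toy black-box classifier: list of strings in, one row of
    [p_negative, p_positive] per document out."""
    rows = []
    for doc in docs:
        tokens = doc.lower().split()
        score = sum(t in POSITIVE_WORDS for t in tokens) - sum(
            t in NEGATIVE_WORDS for t in tokens
        )
        p_positive = 1.0 / (1.0 + math.exp(-score))
        rows.append([1.0 - p_positive, p_positive])
    return rows


probs = predict_proba(["a good film", "a terrible film", ""])
```

Any callable with this shape can be passed as predict_proba, so the black box can be a scikit-learn pipeline's predict_proba method or a wrapper around another model.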

set_fit_request(*, doc: bool | None | str = '$UNCHANGED$', predict_proba: bool | None | str = '$UNCHANGED$') → TextExplainer

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

doc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for doc parameter in fit.
predict_proba (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predict_proba parameter in fit.

Returns:

self (object) – The updated object.

show_prediction(**kwargs)[source]

Call eli5.show_prediction() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.show_prediction().

fit() must be called before using this method.

show_weights(**kwargs)[source]

Call eli5.show_weights() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.show_weights().

fit() must be called before using this method.

eli5.lime.samplers

class BaseSampler[source]

Base sampler class. A sampler is an object that generates examples similar to a given example.

fit(X=None, y=None)[source]

abstractmethod sample_near(doc, n_samples=1)[source]: Return (examples, similarity) tuple with generated documents similar to a given document and a vector of similarity values.

class MaskingTextSampler(token_pattern: str | None = None, bow: bool = True, random_state=None, replacement: str = '', min_replace: int | float = 1, max_replace: int | float = 1.0, group_size: int = 1)[source]

Sampler for text data. It randomly removes or replaces tokens from text.

Parameters:

token_pattern (str, optional) – Regexp for token matching
bow (bool, optional) – The sampler can either replace all instances of a given token (bow=True, bag of words sampling) or replace just a single token (bow=False).
random_state (integer or numpy.random.RandomState, optional) – random state
replacement (str) – Default value is '', i.e. tokens are removed by default. If you want to preserve the total token count, set replacement to a non-empty string, e.g. 'UNKN'.
min_replace (int or float) – A minimum number of tokens to replace. Default is 1, meaning 1 token. If this value is float in range [0.0, 1.0], it is used as a ratio. More than min_replace tokens could be replaced if group_size > 1.
max_replace (int or float) – A maximum number of tokens to replace. Default is 1.0, meaning all tokens can be replaced. If this value is float in range [0.0, 1.0], it is used as a ratio.
group_size (int) – When group_size > 1, groups of nearby tokens are replaced all at once (each token is still replaced with a replacement). Default is 1, meaning individual tokens are replaced.
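
How min_replace, max_replace and replacement interact can be shown with a simplified sketch. It is an illustration under stated assumptions, not the library's implementation: resolve_count and mask_tokens are hypothetical names, every token occurrence is treated separately (roughly the bow=False behaviour), and group_size is not modelled.

```python
import random
import re


def resolve_count(value, n_tokens):
    """Interpret a float as a ratio of n_tokens and an int as a count."""
    if isinstance(value, float):
        return int(round(value * n_tokens))
    return value


def mask_tokens(doc, rng, replacement="", min_replace=1, max_replace=1.0,
                token_pattern=r"\w+"):
    """Return one perturbed copy of doc and the number of tokens replaced."""
    matches = list(re.finditer(token_pattern, doc))
    n_tokens = len(matches)
    # Clamp both bounds to the number of tokens actually present.
    low = max(0, min(resolve_count(min_replace, n_tokens), n_tokens))
    high = max(low, min(resolve_count(max_replace, n_tokens), n_tokens))
    n_replace = rng.randint(low, high)
    chosen = set(rng.sample(range(n_tokens), n_replace))
    # Rebuild the text, keeping everything between tokens untouched.
    pieces, last_end = [], 0
    for i, match in enumerate(matches):
        pieces.append(doc[last_end:match.start()])
        pieces.append(replacement if i in chosen else match.group())
        last_end = match.end()
    pieces.append(doc[last_end:])
    return "".join(pieces), n_replace


rng = random.Random(42)
sample, n_replaced = mask_tokens("the quick brown fox", rng,
                                 replacement="UNKN", max_replace=0.5)
```

With four tokens, min_replace=1 and max_replace=0.5, between one and two tokens are replaced with 'UNKN', so the token count of the generated text stays the same.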

sample_near(doc: str, n_samples: int = 1) → tuple[list[str], ndarray][source]: Return (examples, similarity) tuple with generated documents similar to a given document and a vector of similarity values.

sample_near_with_mask(doc: TokenizedText | str, n_samples: int = 1) → tuple[list[str], ndarray, ndarray, TokenizedText][source]

class MaskingTextSamplers(sampler_params: list[dict[str, Any]], token_pattern: str | None = None, random_state=None, weights: ndarray | list[float] | None = None)[source]

Union of MaskingTextSampler objects, with weights. sample_near() or sample_near_with_mask() generate the requested number of samples using all samplers; the probability of using a sampler is proportional to its weight.

All samplers must use the same token_pattern in order for sample_near_with_mask() to work.

Create it with a list of {param: value} dicts with MaskingTextSampler parameters.
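
The weighting described above can be sketched as follows: each requested sample is assigned to one sampler with probability proportional to that sampler's weight. The helper name allocate_samples and the parameter values are hypothetical; the library's own allocation logic may differ in detail.

```python
import random
from collections import Counter


def allocate_samples(n_samples, weights, rng):
    """Pick a sampler index for every requested sample, with probability
    proportional to the weights, and count the picks per sampler."""
    indices = rng.choices(range(len(weights)), weights=weights, k=n_samples)
    counts = Counter(indices)
    return [counts.get(i, 0) for i in range(len(weights))]


# Hypothetical parameter dicts in the {param: value} style described above.
sampler_params = [{"bow": True}, {"bow": False, "max_replace": 0.5}]
weights = [0.7, 0.3]

per_sampler = allocate_samples(1000, weights, random.Random(0))
```

Here per_sampler holds the number of samples each configuration would be asked to generate; the counts always sum to the requested n_samples, and the first configuration receives roughly 70% of them.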

sample_near(doc: str, n_samples: int = 1) → tuple[list[str], ndarray][source]: Return (examples, similarity) tuple with generated documents similar to a given document and a vector of similarity values.

sample_near_with_mask(doc: str, n_samples: int = 1) → tuple[list[str], ndarray, ndarray, TokenizedText][source]

class MultivariateKernelDensitySampler(kde=None, metric='euclidean', fit_bandwidth=True, bandwidths=array([1.00000000e-06, 1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02, 1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00, 1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02, 1.00000000e+03, 3.16227766e+03, 1.00000000e+04]), sigma='bandwidth', n_jobs=1, random_state=None)[source]

General-purpose sampler for dense continuous data, based on multivariate kernel density estimation.

The limitation is that a single bandwidth value is used for all dimensions, i.e. the bandwidth matrix is a positive scalar times the identity matrix. This is a problem e.g. when features have different variances (e.g. some of them are one-hot encoded and others are continuous).

fit(X=None, y=None)[source]

sample_near(doc, n_samples=1)[source]: Return (examples, similarity) tuple with generated documents similar to a given document and a vector of similarity values.

class UnivariateKernelDensitySampler(kde=None, metric='euclidean', fit_bandwidth=True, bandwidths=array([1.00000000e-06, 1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02, 1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00, 1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02, 1.00000000e+03, 3.16227766e+03, 1.00000000e+04]), sigma='bandwidth', n_jobs=1, random_state=None)[source]

General-purpose sampler for dense continuous data, based on univariate kernel density estimation. It estimates a separate probability distribution for each input dimension.

The limitation is that variable interactions are not taken into account.

Unlike MultivariateKernelDensitySampler it uses different bandwidths for different dimensions; because of that it can handle one-hot encoded features to some extent (make sure to at least tune the default sigma parameter). Also, at sampling time it replaces only random subsets of the features instead of generating totally new examples.
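
The sampling behaviour described above can be sketched with the standard library alone. This is an illustration, not the library's code: the function below is a hypothetical stand-in, the bandwidth values are assumed rather than fitted, and a Gaussian kernel is assumed. Drawing from a Gaussian KDE over one feature amounts to picking a training value at random and adding Gaussian noise whose standard deviation is that feature's bandwidth.

```python
import random


def sample_near(doc, training_rows, bandwidths, rng):
    """Return a copy of doc with a random subset of features resampled,
    plus the sorted indices of the features that were changed."""
    n_features = len(doc)
    n_change = rng.randint(1, n_features)
    to_change = rng.sample(range(n_features), n_change)
    new_doc = list(doc)
    for i in to_change:
        # Per-dimension KDE draw: random training value plus kernel noise.
        center = rng.choice(training_rows)[i]
        new_doc[i] = rng.gauss(center, bandwidths[i])
    return new_doc, sorted(to_change)


training_rows = [[0.0, 10.0, 1.0], [1.0, 12.0, 0.0], [0.0, 11.0, 1.0]]
bandwidths = [0.05, 0.5, 0.05]  # one bandwidth per dimension (assumed values)
original = [1.0, 11.5, 0.0]
rng = random.Random(1)
new_doc, changed = sample_near(original, training_rows, bandwidths, rng)
```

Features that are not selected keep their original values, so generated examples stay close to the input. Because each dimension is resampled independently, combinations of values that never occur together in the training data can appear, which is the interaction limitation noted above.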

fit(X=None, y=None)[source]

sample_near(doc, n_samples=1)[source]: Sample near the document by replacing some of its features with values sampled from distribution found by KDE.

eli5.lime.textutils

Utilities for text generation.

cosine_similarity_vec(num_tokens, num_removed_vec)[source]: Return cosine similarity between a binary vector with all ones of length num_tokens and vectors of the same length with num_removed_vec elements set to zero.
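
The quantity described here has a closed form. For an all-ones vector of length n and a copy with k entries zeroed, the dot product is n - k and the norms are sqrt(n) and sqrt(n - k), so the cosine similarity is sqrt((n - k) / n). The scalar sketch below uses a hypothetical helper name and returns 0.0 when no tokens remain, which is a convention chosen for this example.

```python
import math


def cosine_similarity_removed(num_tokens, num_removed):
    """Cosine similarity between an all-ones vector of length num_tokens
    and the same vector with num_removed entries set to zero."""
    remaining = num_tokens - num_removed
    if num_tokens == 0 or remaining == 0:
        return 0.0  # convention chosen for this sketch
    # dot product = remaining; norms = sqrt(num_tokens) and sqrt(remaining)
    return remaining / (math.sqrt(num_tokens) * math.sqrt(remaining))


# Similarity for a 4-token text with 0, 1, 2, 3 and 4 tokens removed.
similarities = [cosine_similarity_removed(4, k) for k in range(5)]
```

Similarity falls from 1.0 with nothing removed to 0.5 with three of four tokens removed, so heavily masked samples receive less weight when the local model is trained.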

generate_samples(text: TokenizedText, n_samples=500, bow=True, random_state=None, replacement='', min_replace=1.0, max_replace=1.0, group_size=1) → Tuple[List[str], ndarray, ndarray][source]: Return n_samples changed versions of text (with some words removed), along with distances between the original text and the generated examples. If bow=False, all tokens are considered unique (i.e. token position matters).