eli5.lime
eli5.lime.lime
An implementation of LIME (http://arxiv.org/abs/1602.04938), an algorithm to explain predictions of black-box models.
- class TextExplainer(n_samples: int = 5000, char_based: bool | None = None, clf=None, vec=None, sampler: BaseSampler | None = None, position_dependent: bool = False, rbf_sigma: float | None = None, random_state=None, expand_factor: int | None = 10, token_pattern: str | None = None)
TextExplainer explains predictions of black-box text classifiers using the LIME algorithm.
- Parameters:
n_samples (int) – Number of samples to generate and train on. Default is 5000.
A larger n_samples needs more CPU time and RAM to explain a prediction, but can give better results. A larger n_samples may also be required to get good results if you don’t want to make strong assumptions about the black-box classifier (e.g. char_based=True and position_dependent=True).
char_based (bool) – True if explanation should be char-based, False if it should be token-based. Default is False.
clf (object, optional) – White-box probabilistic classifier. It should be supported by eli5, follow the scikit-learn interface and provide a predict_proba method. When not set, a default classifier is used (logistic regression with elasticnet regularization trained with SGD).
vec (object, optional) – Vectorizer which converts generated texts to feature vectors for the white-box classifier. When not set, a default vectorizer is used; which one depends on the char_based and position_dependent arguments.
sampler (MaskingTextSampler or MaskingTextSamplers, optional) – Sampler used to generate modified versions of the text.
position_dependent (bool) – When True, a special vectorizer is used which takes each token or character (depending on the char_based value) into account separately. When False (default), the vectorizer passed in vec or a default vectorizer is used.
The default vectorizer converts text to a vector using a bag-of-ngrams or bag-of-char-ngrams approach (depending on the char_based argument). It may therefore not be powerful enough to approximate a black-box classifier which, e.g., takes the word FOO into account at the beginning of the document but not at the end.
When position_dependent is True the model becomes powerful enough to account for that, but it can become noisier and require a larger n_samples to get an OK explanation.
When char_based=False the default vectorizer uses word bigrams in addition to unigrams; this is less powerful than position_dependent=True, but can give similar results in practice.
rbf_sigma (float, optional) – Sigma parameter of the RBF kernel used to post-process cosine similarity values. Default is None, meaning no post-processing (cosine similarity is used as the sample weight as-is). Small rbf_sigma values (e.g. 0.1) tell the classifier to pay more attention to generated texts which are close to the original text. Large rbf_sigma values (e.g. 1.0) make the distance between texts irrelevant.
Note that if you’re using large rbf_sigma it could be more efficient to use custom samplers instead, in order to generate text samples which are closer to the original text in the first place. Use e.g. the max_replace parameter of MaskingTextSampler.
random_state (integer or numpy.random.RandomState, optional) – random state
expand_factor (int or None) – To approximate the output of the probabilistic classifier, the generated dataset is expanded by expand_factor (10 by default) according to the predicted label probabilities. This is a workaround for a scikit-learn limitation (no cross-entropy loss for non 1/0 labels). With larger values training takes longer, but the probability output can be approximated better.
expand_factor=None turns this feature off; pass None when you know that the black-box classifier returns only 1.0 or 0.0 probabilities.
token_pattern (str, optional) – Regex which matches a token. Use it to customize tokenization. Default value depends on the char_based parameter.
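As a rough illustration of rbf_sigma and expand_factor, the sketch below uses only the Python standard library. It assumes that a similarity s is turned into a distance 1 - s before a Gaussian kernel exp(-d^2 / (2 * sigma^2)) is applied, and that expansion draws hard labels at random from each row of predicted probabilities. Both are assumptions made for illustration; rbf_weights and expand_by_sampling are hypothetical helper names, not eli5 functions.

```python
import math
import random


def rbf_weights(similarities, sigma=None):
    """Turn cosine similarities into sample weights.

    Assumption: distance = 1 - similarity, then exp(-d^2 / (2 * sigma^2)).
    With sigma=None the similarity is used as the weight unchanged.
    """
    if sigma is None:
        return list(similarities)
    return [math.exp(-((1.0 - s) ** 2) / (2.0 * sigma ** 2)) for s in similarities]


def expand_by_sampling(samples, probas, expand_factor=10, seed=0):
    """Repeat each sample expand_factor times, pairing every copy with a
    hard label drawn from that sample's predicted class probabilities."""
    rng = random.Random(seed)
    classes = list(range(len(probas[0])))
    expanded = []
    for sample, row in zip(samples, probas):
        labels = rng.choices(classes, weights=row, k=expand_factor)
        expanded.extend((sample, label) for label in labels)
    return expanded


sims = [1.0, 0.9, 0.5]
narrow = rbf_weights(sims, sigma=0.1)  # distant samples get weights near 0
wide = rbf_weights(sims, sigma=1.0)    # weights stay close to each other
print(narrow, wide)

# Two generated texts with soft labels become 20 rows with hard 0/1 labels.
data = expand_by_sampling(["good film", "film"], [[0.0, 1.0], [0.5, 0.5]])
print(len(data))
```

With a small sigma the weight of the sample at similarity 0.5 is practically zero, while with sigma=1.0 all three weights are of similar size, which matches the description of small and large rbf_sigma values above.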
- rng_
random state
- Type:
numpy.random.RandomState
- samples_
A list of samples the local model is trained on. Only available after fit().
- Type:
list[str]
- X_
A matrix with vectorized samples_. Only available after fit().
- Type:
ndarray or scipy.sparse matrix
- y_proba_
Probabilities predicted by the black-box classifier (predict_proba(self.samples_) result). Only available after fit().
- Type:
ndarray
- metrics_
A dictionary with metrics of how well the local classification pipeline approximates the black-box pipeline. Only available after fit().
- Type:
dict
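The keys of metrics_ are not listed here. One metric of this kind is the mean KL divergence between the black-box probabilities (y_proba_) and the probabilities predicted by the local pipeline. The sketch below is an assumption about what such a metric could look like, not a description of what eli5 computes; mean_kl_divergence is a hypothetical helper and the two probability tables are made-up stand-ins.

```python
import math


def mean_kl_divergence(p_black_box, p_local, eps=1e-12):
    """Average KL(p_black_box || p_local) over samples; 0 means identical."""
    total = 0.0
    for p_row, q_row in zip(p_black_box, p_local):
        total += sum(
            p * math.log(p / max(q, eps))
            for p, q in zip(p_row, q_row)
            if p > 0  # terms with p == 0 contribute nothing
        )
    return total / len(p_black_box)


y_proba = [[0.9, 0.1], [0.2, 0.8]]  # stand-in for y_proba_
local = [[0.8, 0.2], [0.3, 0.7]]    # stand-in for local pipeline predictions

perfect = mean_kl_divergence(y_proba, y_proba)
approx = mean_kl_divergence(y_proba, local)
print(perfect, approx)
```

A value close to zero would suggest that the local pipeline reproduces the black-box probabilities well on the generated samples, so its explanation is more likely to be trustworthy.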
- explain_prediction(**kwargs)
Call eli5.explain_prediction() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.explain_prediction().
fit() must be called before using this method.
- explain_weights(**kwargs)
Call eli5.explain_weights() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.explain_weights().
fit() must be called before using this method.
- fit(doc: str, predict_proba: Callable[[Any], Any]) → TextExplainer
Explain the predict_proba probabilistic classification function for the doc example. This method fits a local classification pipeline following the LIME approach.
To get the explanation use show_prediction(), show_weights(), explain_prediction() or explain_weights().
- Parameters:
doc (str) – Text to explain.
predict_proba (callable) – Black-box classification pipeline. predict_proba should be a function which takes a list of strings (documents) and returns a matrix of shape (n_samples, n_classes) with probability values: a row per document and a column per output label.
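The contract for predict_proba can be illustrated with a toy function. The sketch below is a made-up keyword classifier, not a real model. It returns a nested list to stay dependency-free, whereas a real black-box function would normally return a NumPy array, for example the predict_proba method of a fitted scikit-learn pipeline.

```python
def toy_predict_proba(docs):
    """Hypothetical black-box: one [p_negative, p_positive] row per document."""
    rows = []
    for doc in docs:
        p_pos = 0.9 if "good" in doc.lower().split() else 0.2
        rows.append([1.0 - p_pos, p_pos])
    return rows


docs = ["A good movie", "A movie", ""]
probs = toy_predict_proba(docs)
print(probs)  # 3 rows (documents) x 2 columns (output labels)

# With eli5 installed, a function following this contract (returning an array)
# would be passed as the second argument of fit:
#   TextExplainer(random_state=42).fit("A good movie", predict_proba)
```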
- set_fit_request(*, doc: bool | None | str = '$UNCHANGED$', predict_proba: bool | None | str = '$UNCHANGED$') → TextExplainer
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the scikit-learn User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
doc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the doc parameter in fit.
predict_proba (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the predict_proba parameter in fit.
- Returns:
self (object) – The updated object.
- show_prediction(**kwargs)
Call eli5.show_prediction() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.show_prediction().
fit() must be called before using this method.
- show_weights(**kwargs)
Call eli5.show_weights() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.show_weights().
fit() must be called before using this method.
eli5.lime.samplers
- class BaseSampler
Base sampler class. A sampler is an object which generates examples similar to a given example.
- class MaskingTextSampler(token_pattern: str | None = None, bow: bool = True, random_state=None, replacement: str = '', min_replace: int | float = 1, max_replace: int | float = 1.0, group_size: int = 1)
Sampler for text data. It randomly removes or replaces tokens in the text.
- Parameters:
token_pattern (str, optional) – Regexp for token matching.
bow (bool, optional) – The sampler can either replace all instances of a given token (bow=True, bag of words sampling) or replace just a single token (bow=False).
random_state (integer or numpy.random.RandomState, optional) – random state
replacement (str) – Default value is '', so by default tokens are removed. If you want to preserve the total token count, set replacement to a non-empty string, e.g. 'UNKN'.
min_replace (int or float) – Minimum number of tokens to replace. Default is 1, meaning 1 token. If this value is a float in range [0.0, 1.0], it is used as a ratio. More than min_replace tokens could be replaced if group_size > 1.
max_replace (int or float) – Maximum number of tokens to replace. Default is 1.0, meaning all tokens can be replaced. If this value is a float in range [0.0, 1.0], it is used as a ratio.
group_size (int) – When group_size > 1, groups of nearby tokens are replaced all at once (each token is still replaced with a replacement). Default is 1, meaning individual tokens are replaced.
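The difference between bow=True and bow=False, and the effect of replacement, can be sketched in a few lines of standard-library Python. This is a simplified stand-in written for illustration, not the eli5 implementation: mask_tokens is a hypothetical function, the default regex is an assumption, and min_replace, max_replace and group_size are reduced to "replace between one and all units".

```python
import random
import re


def mask_tokens(text, n_samples=3, bow=True, replacement="",
                token_pattern=r"\b\w+\b", seed=0):
    """Generate masked variants of text (illustrative sketch only)."""
    rng = random.Random(seed)
    matches = list(re.finditer(token_pattern, text))
    if not matches:
        return [text] * n_samples
    # bow=True: a unit is a distinct token string, so all its occurrences
    # are replaced together; bow=False: a unit is a single token position.
    if bow:
        units = sorted({m.group() for m in matches})
    else:
        units = list(range(len(matches)))
    samples = []
    for _ in range(n_samples):
        chosen = set(rng.sample(units, rng.randint(1, len(units))))
        parts, last = [], 0
        for i, m in enumerate(matches):
            key = m.group() if bow else i
            parts.append(text[last:m.start()])  # keep text between tokens
            parts.append(replacement if key in chosen else m.group())
            last = m.end()
        parts.append(text[last:])
        samples.append("".join(parts))
    return samples


print(mask_tokens("the cat saw the cat", replacement="UNK"))
print(mask_tokens("the cat saw the cat", bow=False, replacement="UNK"))
```

With bow=True both occurrences of a repeated word always disappear together, while bow=False can mask one occurrence and keep the other. A non-empty replacement keeps the token count constant.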
- class MaskingTextSamplers(sampler_params: list[dict[str, Any]], token_pattern: str | None = None, random_state=None, weights: ndarray | list[float] | None = None)
Union of MaskingTextSampler samplers, with weights. sample_near() or sample_near_with_mask() generate a requested number of samples using all samplers; the probability of using a sampler is proportional to its weight.
All samplers must use the same token_pattern in order for sample_near_with_mask() to work.
Create it with a list of {param: value} dicts with MaskingTextSampler parameters.
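The sketch below shows the shape of the sampler_params argument and one way "probability proportional to weight" could work. The parameter names in the dicts come from the MaskingTextSampler signature above; the specific values, the weights, and the pick_sampler_indices helper are hypothetical and only illustrate the idea.

```python
import random
from collections import Counter

# One dict of MaskingTextSampler parameters per sampler (example values).
sampler_params = [
    {"bow": True, "max_replace": 0.5},
    {"bow": False, "group_size": 2},
]
weights = [3.0, 1.0]  # the first sampler should be used about 3x as often


def pick_sampler_indices(weights, n_samples, seed=0):
    """Choose which sampler produces each sample, proportionally to weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=n_samples)


counts = Counter(pick_sampler_indices(weights, 4000))
print(counts)

# With eli5 installed, the same values would be passed according to the
# signature above:
#   MaskingTextSamplers(sampler_params, weights=weights)
```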
- class MultivariateKernelDensitySampler(kde=None, metric='euclidean', fit_bandwidth=True, bandwidths=array([1.00000000e-06, 1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02, 1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00, 1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02, 1.00000000e+03, 3.16227766e+03, 1.00000000e+04]), sigma='bandwidth', n_jobs=1, random_state=None)
General-purpose sampler for dense continuous data, based on multivariate kernel density estimation.
The limitation is that a single bandwidth value is used for all dimensions, i.e. the bandwidth matrix is a positive scalar times the identity matrix. This is a problem e.g. when features have different variances (e.g. some of them are one-hot encoded and others are continuous).
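To see why a single bandwidth is limiting, recall that sampling from a Gaussian kernel density estimate amounts to picking a training row at random and adding Gaussian noise whose standard deviation is the bandwidth. The sketch below is a generic illustration of that idea, not the eli5 implementation; sample_isotropic_kde and the toy data are hypothetical.

```python
import random


def sample_isotropic_kde(data, bandwidth, n_samples, seed=0):
    """Sample from a Gaussian KDE that uses one bandwidth for all dimensions."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        row = rng.choice(data)  # pick a kernel centre
        out.append([x + rng.gauss(0.0, bandwidth) for x in row])
    return out


# First column: one-hot style flag; second column: continuous, much larger scale.
data = [[0.0, 1500.0], [1.0, 2300.0], [0.0, 1800.0]]
samples = sample_isotropic_kde(data, bandwidth=1.0, n_samples=5)
for s in samples:
    print(s)
```

A bandwidth of 1.0 barely moves the second feature but completely blurs the 0/1 flag in the first one; no single value suits both columns.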
- class UnivariateKernelDensitySampler(kde=None, metric='euclidean', fit_bandwidth=True, bandwidths=array([1.00000000e-06, 1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02, 1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00, 1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02, 1.00000000e+03, 3.16227766e+03, 1.00000000e+04]), sigma='bandwidth', n_jobs=1, random_state=None)
General-purpose sampler for dense continuous data, based on univariate kernel density estimation. It estimates a separate probability distribution for each input dimension.
The limitation is that variable interactions are not taken into account.
Unlike KernelDensitySampler it uses different bandwidths for different dimensions; because of that it can handle one-hot encoded features to some extent (make sure to at least tune the default sigma parameter). Also, at sampling time it replaces only random subsets of the features instead of generating totally new examples.
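The per-dimension behaviour described above can be sketched as follows: each feature has its own bandwidth, and a new example is built by resampling only a random subset of the features of the example being explained. This is an illustrative assumption about the general idea rather than the eli5 code; sample_univariate_kde, the rule for choosing how many features to replace, and the toy data are all hypothetical.

```python
import random


def sample_univariate_kde(data, bandwidths, doc, n_samples, seed=0):
    """Resample a random subset of doc's features, one dimension at a time."""
    rng = random.Random(seed)
    n_features = len(doc)
    out = []
    for _ in range(n_samples):
        new = list(doc)
        k = rng.randint(1, n_features)  # how many features to replace
        for j in rng.sample(range(n_features), k):
            centre = rng.choice(data)[j]  # kernel centre for dimension j only
            new[j] = centre + rng.gauss(0.0, bandwidths[j])
        out.append(new)
    return out


data = [[0.0, 1500.0], [1.0, 2300.0], [0.0, 1800.0]]
doc = [1.0, 2000.0]
# Small bandwidth for the 0/1 flag, large bandwidth for the continuous feature.
for row in sample_univariate_kde(data, [0.01, 100.0], doc, n_samples=5):
    print(row)
```

Because every dimension is resampled independently, a generated row can combine values that never occur together in the data, which is the interaction limitation mentioned above.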
eli5.lime.textutils
Utilities for text generation.
- cosine_similarity_vec(num_tokens, num_removed_vec)
Return cosine similarity between a binary vector with all ones of length num_tokens and vectors of the same length with num_removed_vec elements set to zero.
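For a single value this similarity has a closed form. With n tokens and k of them removed, the dot product is n - k and the two norms are sqrt(n) and sqrt(n - k), so the result is sqrt((n - k) / n). The scalar sketch below only illustrates the formula; cosine_similarity_after_removal is a hypothetical name, and returning 0.0 when every token is removed is a convention chosen here.

```python
import math


def cosine_similarity_after_removal(num_tokens, num_removed):
    """Cosine similarity between an all-ones vector of length num_tokens
    and the same vector with num_removed entries set to zero."""
    remaining = num_tokens - num_removed
    if remaining == 0:
        return 0.0  # convention for the zero vector
    return remaining / (math.sqrt(num_tokens) * math.sqrt(remaining))


for removed in range(5):
    print(removed, cosine_similarity_after_removal(4, removed))
```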
- generate_samples(text: TokenizedText, n_samples=500, bow=True, random_state=None, replacement='', min_replace=1.0, max_replace=1.0, group_size=1) → Tuple[List[str], ndarray, ndarray]
Return n_samples changed versions of text (with some words removed), along with distances between the original text and the generated examples. If bow=False, all tokens are considered unique (i.e. token position matters).