Note
This tutorial is intended to be run in an IPython notebook. It is also available as a notebook file here.
Debugging scikit-learn text classification pipeline
scikit-learn docs provide a nice text classification tutorial. Make sure to read it first. We'll be doing something similar, while taking a more detailed look at classifier weights and predictions.
1. Baseline model
First, we need some data. Let’s load 20 Newsgroups data, keeping only 4 categories:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42
)
A basic text processing pipeline - bag-of-words features and Logistic Regression as the classifier:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);
We're using LogisticRegressionCV here to adjust the regularization parameter C automatically. This makes it possible to compare different vectorizers: the optimal C value could be different for different input features (e.g. for bigrams or for character-level input). An alternative would be to use GridSearchCV or RandomizedSearchCV.
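For reference, a minimal sketch of the GridSearchCV alternative; the C grid below is illustrative, not tuned:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    make_pipeline(CountVectorizer(), LogisticRegression()),
    param_grid={'logisticregression__C': [0.1, 1.0, 10.0]},  # illustrative values
    cv=5,
)
search.fit(twenty_train.data, twenty_train.target)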
Let's check the quality of this pipeline:
from sklearn import metrics
def print_report(pipe):
    y_test = twenty_test.target
    y_pred = pipe.predict(twenty_test.data)
    report = metrics.classification_report(y_test, y_pred,
        target_names=twenty_test.target_names)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))
print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.80      0.86       319
         comp.graphics       0.87      0.96      0.91       389
               sci.med       0.94      0.81      0.87       396
soc.religion.christian       0.85      0.98      0.91       398

           avg / total       0.90      0.89      0.89      1502
accuracy: 0.891
Not bad. We can try other classifiers and preprocessing methods, but first let's check what the model learned using the eli5.show_weights() function:
import eli5
eli5.show_weights(clf, top=10)
(table: top 10 features and weights for classes y=0, y=1, y=2 and y=3)
The table above doesn’t make any sense; the problem is that eli5 was not able to get feature and class names from the classifier object alone. We can provide feature and target names explicitly:
# eli5.show_weights(clf,
#                   feature_names=vec.get_feature_names(),
#                   target_names=twenty_test.target_names)
The code above works, but a better way is to provide the vectorizer instead and let eli5 figure out the details automatically:
eli5.show_weights(clf, vec=vec, top=10,
                  target_names=twenty_test.target_names)
(table: top 10 features and weights for y=alt.atheism, y=comp.graphics, y=sci.med and y=soc.religion.christian)
This starts to make more sense. Columns are target classes; in each column there are features and their weights. The intercept (bias) is shown as the <BIAS> feature in the same table. We can inspect features and weights because we're using a bag-of-words vectorizer and a linear classifier (so there is a direct mapping between individual words and classifier coefficients). For other classifiers the features can be harder to inspect.
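Because of that direct mapping, we can look up any word's coefficients by hand. A minimal sketch, assuming the word 'kidney' made it into the training vocabulary:

idx = vec.vocabulary_['kidney']  # column index of the 'kidney' feature
print(clf.coef_[:, idx])         # one coefficient per class for this word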
Some features look good, but some don't. It seems the model learned some names specific to the dataset (parts of email addresses, etc.) instead of topic-specific words. Let's check prediction results on an example:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names)
y=alt.atheism (probability 0.000, score -8.709) top features
Contribution? | Feature |
---|---|
+1.743 | Highlighted in text (sum) |
-10.453 | <BIAS> |
from: brian@ucsd.edu (brian kantor) subject: re: help for kidney stones .............. organization: the avant-garde of the now, ltd. lines: 12 nntp-posting-host: ucsd.edu as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less. demerol worked, although i nearly got arrested on my way home when i barfed all over the police car parked just outside the er. - brian
y=comp.graphics (probability 0.010, score -4.592) top features
Contribution? | Feature |
---|---|
-1.379 | <BIAS> |
-3.213 | Highlighted in text (sum) |
from: brian@ucsd.edu (brian kantor) subject: re: help for kidney stones .............. organization: the avant-garde of the now, ltd. lines: 12 nntp-posting-host: ucsd.edu as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less. demerol worked, although i nearly got arrested on my way home when i barfed all over the police car parked just outside the er. - brian
y=sci.med (probability 0.989, score 3.945) top features
Contribution? | Feature |
---|---|
+8.958 | Highlighted in text (sum) |
-5.013 | <BIAS> |
from: brian@ucsd.edu (brian kantor) subject: re: help for kidney stones .............. organization: the avant-garde of the now, ltd. lines: 12 nntp-posting-host: ucsd.edu as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less. demerol worked, although i nearly got arrested on my way home when i barfed all over the police car parked just outside the er. - brian
y=soc.religion.christian (probability 0.001, score -7.157) top features
Contribution? | Feature |
---|---|
-0.258 | <BIAS> |
-6.899 | Highlighted in text (sum) |
from: brian@ucsd.edu (brian kantor) subject: re: help for kidney stones .............. organization: the avant-garde of the now, ltd. lines: 12 nntp-posting-host: ucsd.edu as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less. demerol worked, although i nearly got arrested on my way home when i barfed all over the police car parked just outside the er. - brian
Everything that can be highlighted in the text is highlighted; there is also a separate table for features which can't be highlighted - <BIAS> in this case. If you hover the mouse over a highlighted word, its weight is shown in a tooltip. Words are colored according to their weights.
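The same explanations are also available outside notebooks; a small sketch using eli5's text formatter:

from eli5.formatters import format_as_text

expl = eli5.explain_prediction(clf, twenty_test.data[0], vec=vec,
                               target_names=twenty_test.target_names)
print(format_as_text(expl))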
2. Baseline model, improved data
Aha, from the highlighting above we can see that the classifier indeed learned some uninteresting things, e.g. it memorized parts of email addresses. We should probably clean the data first to make the task more interesting; improving the model (trying different classifiers, etc.) doesn't make sense at this point - it may just learn to leverage these email addresses better.
In practice we'd have to do the cleaning ourselves; in this example the 20 newsgroups dataset provides an option to remove footers and headers from the messages. Nice. Let's clean up the data and re-train the classifier.
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)
vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);
We just made the task harder and more realistic for a classifier.
print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.83      0.78      0.80       319
         comp.graphics       0.82      0.96      0.88       389
               sci.med       0.89      0.80      0.84       396
soc.religion.christian       0.88      0.86      0.87       398

           avg / total       0.85      0.85      0.85      1502
accuracy: 0.852
A great result - we just made quality worse! Does this mean the pipeline is worse now? No, it likely has better quality on unseen messages; it is the evaluation which is fairer now. Inspecting the features used by the classifier allowed us to notice a problem with the data and make a good change, despite the numbers telling us not to.
Instead of removing headers and footers we could have improved the evaluation setup directly, using e.g. GroupKFold from scikit-learn. Then the quality of the old model would have dropped, we could have removed headers/footers and seen increased accuracy, so the numbers would have told us to remove headers and footers. It is not obvious how to split the data though - what groups should be used with GroupKFold? A rough sketch of such a setup is shown below.
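A rough sketch, assuming we could assign a group id to every message; the get_group_id() helper below is hypothetical - e.g. it could group messages by sender or by thread:

from sklearn.model_selection import GroupKFold, cross_val_score

groups = [get_group_id(doc) for doc in twenty_train.data]  # hypothetical helper
scores = cross_val_score(pipe, twenty_train.data, twenty_train.target,
                         cv=GroupKFold(n_splits=5), groups=groups)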
So, what has the updated classifier learned? (The output is less verbose because only a subset of classes is shown - see the "targets" argument):
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
y=sci.med (probability 0.732, score 0.031) top features
Contribution? | Feature |
---|---|
+1.747 | Highlighted in text (sum) |
-1.716 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
Hm, it no longer uses email addresses, but it still doesn't look good: the classifier assigns high weights to seemingly unrelated words like 'do' or 'my'. These words appear in many texts, so maybe the classifier uses them as a proxy for bias. Or maybe some of them are more common in some classes.
3. Pipeline improvements
To help the classifier, we can filter out stop words:
vec = CountVectorizer(stop_words='english')
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.87      0.76      0.81       319
         comp.graphics       0.85      0.95      0.90       389
               sci.med       0.93      0.85      0.89       396
soc.religion.christian       0.85      0.89      0.87       398

           avg / total       0.87      0.87      0.87      1502
accuracy: 0.871
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
y=sci.med (probability 0.714, score 0.510) top features
Contribution? | Feature |
---|---|
+2.184 | Highlighted in text (sum) |
-1.674 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
Looks better, doesn't it?
Alternatively, we can use the TF*IDF scheme; it should give a somewhat similar effect.
Note that we're cross-validating the LogisticRegression regularization parameter here, just like in the other examples (LogisticRegressionCV, not LogisticRegression). TF*IDF values are different from word count values, so the optimal C value can be different too. We could draw a wrong conclusion if a classifier with a fixed regularization strength were used - the chosen C value could have worked better for one kind of data.
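To see why TF*IDF should help, we can peek at the learned IDF values: very common words get a low IDF, so their features are downweighted, much like removing stop words. A quick check, assuming both words below occur in the training data:

from sklearn.feature_extraction.text import TfidfVectorizer

demo_vec = TfidfVectorizer().fit(twenty_train.data)
for word in ['the', 'kidney']:  # assumed to occur in the training data
    print(word, demo_vec.idf_[demo_vec.vocabulary_[word]])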
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.91      0.79      0.85       319
         comp.graphics       0.83      0.97      0.90       389
               sci.med       0.95      0.87      0.91       396
soc.religion.christian       0.90      0.91      0.91       398

           avg / total       0.90      0.89      0.89      1502
accuracy: 0.892
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
y=sci.med (probability 0.987, score 1.585) top features
Contribution? | Feature |
---|---|
+6.788 | Highlighted in text (sum) |
-5.203 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
It helped, but didn’t have quite the same effect. Why not do both?
vec = TfidfVectorizer(stop_words='english')
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.77      0.84       319
         comp.graphics       0.84      0.97      0.90       389
               sci.med       0.95      0.89      0.92       396
soc.religion.christian       0.88      0.92      0.90       398

           avg / total       0.90      0.89      0.89      1502
accuracy: 0.893
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
y=sci.med (probability 0.939, score 1.910) top features
Contribution? | Feature |
---|---|
+5.488 | Highlighted in text (sum) |
-3.578 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
This starts to look good!
4. Char-based pipeline
Maybe we can get somewhat better quality by choosing a different classifier, but let's skip that for now. Let's try other analyzers instead - char n-grams instead of words:
vec = TfidfVectorizer(stop_words='english', analyzer='char',
                      ngram_range=(3,5))
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.79      0.85       319
         comp.graphics       0.81      0.97      0.89       389
               sci.med       0.95      0.86      0.90       396
soc.religion.christian       0.89      0.91      0.90       398

           avg / total       0.89      0.89      0.89      1502
accuracy: 0.888
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names)
y=alt.atheism (probability 0.002, score -7.318) top features
Contribution? | Feature |
---|---|
-0.838 | Highlighted in text (sum) |
-6.480 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
y=comp.graphics (probability 0.017, score -5.118) top features
Contribution? | Feature |
---|---|
+0.934 | <BIAS> |
-6.052 | Highlighted in text (sum) |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
y=sci.med (probability 0.963, score -0.656) top features
Contribution? | Feature |
---|---|
+4.493 | Highlighted in text (sum) |
-5.149 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
y=soc.religion.christian (probability 0.018, score -5.048) top features
Contribution? | Feature |
---|---|
+0.600 | Highlighted in text (sum) |
-5.648 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
It works, but quality is a bit worse. Also, it takes ages to train.
It looks like stop_words has no effect now - in fact, this is documented in the scikit-learn docs, so our stop_words='english' was useless. But at least it is now more obvious what the text looks like to a char ngram-based classifier. Grab a cup of tea and see how char_wb does:
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,5))
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.79      0.85       319
         comp.graphics       0.87      0.96      0.91       389
               sci.med       0.91      0.90      0.90       396
soc.religion.christian       0.89      0.91      0.90       398

           avg / total       0.90      0.89      0.89      1502
accuracy: 0.894
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names)
y=alt.atheism (probability 0.000, score -8.878) top features
Contribution? | Feature |
---|---|
-2.560 | Highlighted in text (sum) |
-6.318 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
y=comp.graphics (probability 0.005, score -6.007) top features
Contribution? | Feature |
---|---|
+0.974 | <BIAS> |
-6.981 | Highlighted in text (sum) |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
y=sci.med (probability 0.834, score -0.440) top features
Contribution? | Feature |
---|---|
+2.134 | Highlighted in text (sum) |
-2.573 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
y=soc.religion.christian (probability 0.160, score -2.510) top features
Contribution? | Feature |
---|---|
+3.263 | Highlighted in text (sum) |
-5.773 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
The result is similar, with some minor changes. Quality is slightly better for an unknown reason; maybe cross-word dependencies are not that important.
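To see the difference between the two analyzers directly, we can inspect the n-grams each one extracts; build_analyzer() returns the tokenization function a vectorizer would use:

char = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).build_analyzer()
char_wb = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3)).build_analyzer()
print(char('kidney stones')[:8])     # n-grams may cross word boundaries
print(char_wb('kidney stones')[:8])  # words are padded; n-grams stay within words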
5. Debugging HashingVectorizer
To check that, we can try fitting word n-grams instead of char n-grams. But let's deal with efficiency first. To handle large vocabularies we can use HashingVectorizer from scikit-learn; to make training faster we can employ SGDClassifier:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vec = HashingVectorizer(stop_words='english', ngram_range=(1,2))
clf = SGDClassifier(max_iter=10, random_state=42)  # was n_iter=10 in older scikit-learn
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.90      0.80      0.85       319
         comp.graphics       0.88      0.96      0.92       389
               sci.med       0.93      0.90      0.92       396
soc.religion.christian       0.89      0.91      0.90       398

           avg / total       0.90      0.90      0.90      1502
accuracy: 0.899
It was super fast! We're not choosing the regularization parameter using cross-validation though. Let's check what the model learned:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])
y=sci.med (score 0.097) top features
Contribution? | Feature |
---|---|
+0.678 | Highlighted in text (sum) |
-0.581 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
The result looks similar to what we got with CountVectorizer. But with HashingVectorizer we don't even have a vocabulary! Why does it work?
eli5.show_weights(clf, vec=vec, top=10,
                  target_names=twenty_test.target_names)
(table: top 10 features and weights for y=alt.atheism, y=comp.graphics, y=sci.med and y=soc.religion.christian; the features are unnamed)
Ok, we don't have a vocabulary, so we don't have feature names. Are we out of luck? Nope, eli5 has an answer for that: InvertableHashingVectorizer. It can be used to get feature names for HashingVectorizer without fitting a huge vocabulary. It still needs some data to learn the words -> hashes mapping though; we can use a random subset of the data to fit it.
from eli5.sklearn import InvertableHashingVectorizer
import numpy as np
ivec = InvertableHashingVectorizer(vec)
sample_size = len(twenty_train.data) // 10
X_sample = np.random.choice(twenty_train.data, size=sample_size)
ivec.fit(X_sample);
eli5.show_weights(clf, vec=ivec, top=20,
                  target_names=twenty_test.target_names)
(table: top 20 features and weights for y=alt.atheism, y=comp.graphics, y=sci.med and y=soc.religion.christian, now with feature names recovered by InvertableHashingVectorizer)
There are collisions (hover the mouse over features with "…"), and there are important features which were not seen in the random sample (FEATURE[…]), but overall it looks fine.
The "rutgers edu" bigram feature is suspicious though; it looks like part of a URL.
rutgers_example = [x for x in twenty_train.data if 'rutgers' in x.lower()][0]
print(rutgers_example)
In article <Apr.8.00.57.41.1993.28246@athos.rutgers.edu> REXLEX@fnal.gov writes:
>In article <Apr.7.01.56.56.1993.22824@athos.rutgers.edu> shrum@hpfcso.fc.hp.com
>Matt. 22:9-14 'Go therefore to the main highways, and as many as you find
>there, invite to the wedding feast.'...
>hmmmmmm. Sounds like your theology and Christ's are at odds. Which one am I
>to believe?
Yep, it looks like the model learned this address instead of learning something useful.
eli5.show_prediction(clf, rutgers_example, vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['soc.religion.christian'])
y=soc.religion.christian (score 2.044) top features
Contribution? | Feature |
---|---|
+2.706 | Highlighted in text (sum) |
-0.662 | <BIAS> |
in article <apr.8.00.57.41.1993.28246@athos.rutgers.edu> rexlex@fnal.gov writes: >in article <apr.7.01.56.56.1993.22824@athos.rutgers.edu> shrum@hpfcso.fc.hp.com >matt. 22:9-14 'go therefore to the main highways, and as many as you find >there, invite to the wedding feast.'... >hmmmmmm. sounds like your theology and christ's are at odds. which one am i >to believe?
Quoted text makes it too easy for the model to classify some of the messages; that won't generalize to new messages. So a next step in improving the model could be to process the data further, e.g. to remove quoted text or to replace email addresses with a special token; a rough sketch of such a cleanup follows.
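A rough sketch of such a cleanup step (the regular expressions are illustrative, not production-quality):

import re

def clean_message(text):
    # drop quoted lines - in these messages they start with '>'
    lines = [line for line in text.splitlines()
             if not line.lstrip().startswith('>')]
    text = '\n'.join(lines)
    # replace anything that looks like an email address with a token
    return re.sub(r'\S+@\S+', 'EMAILTOKEN', text)

cleaned_train = [clean_message(doc) for doc in twenty_train.data]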
You get the idea: looking at features helps to understand how the classifier works. Maybe even more importantly, it helps to notice preprocessing bugs, data leaks and issues with task specification - all the nasty problems you get in the real world.