Note

This tutorial is intended to be run in an IPython notebook. It is also available as a notebook file here.

Debugging scikit-learn text classification pipeline

scikit-learn docs provide a nice text classification tutorial. Make sure to read it first. We’ll be doing something similar to it, while taking more detailed look at classifier weights and predictions.

1. Baseline model

First, we need some data. Let’s load 20 Newsgroups data, keeping only 4 categories:

from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42
)

A basic text processing pipeline - bag of words features and Logistic Regression as a classifier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);

We’re using LogisticRegressionCV here to adjust regularization parameter C automatically. It allows to compare different vectorizers - optimal C value could be different for different input features (e.g. for bigrams or for character-level input). An alternative would be to use GridSearchCV or RandomizedSearchCV.

Let’s check quality of this pipeline:

from sklearn import metrics

def print_report(pipe):
    y_test = twenty_test.target
    y_pred = pipe.predict(twenty_test.data)
    report = metrics.classification_report(y_test, y_pred,
        target_names=twenty_test.target_names)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))

print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.80      0.86       319
         comp.graphics       0.87      0.96      0.91       389
               sci.med       0.94      0.81      0.87       396
soc.religion.christian       0.85      0.98      0.91       398

           avg / total       0.90      0.89      0.89      1502

accuracy: 0.891

Not bad. We can try other classifiers and preprocessing methods, but let’s check first what the model learned using eli5.show_weights() function:

import eli5
eli5.show_weights(clf, top=10)
y=0 top features y=1 top features y=2 top features y=3 top features
Weight? Feature
+1.991 x21167
+1.925 x19218
+1.834 x5714
+1.813 x23677
+1.697 x15511
+1.696 x26415
+1.617 x6440
+1.594 x26412
… 10174 more positive …
… 25605 more negative …
-1.686 x28473
-10.453 <BIAS>
Weight? Feature
+1.702 x15699
+0.825 x17366
+0.798 x14281
+0.786 x30117
+0.779 x14277
+0.773 x17356
+0.729 x24267
+0.724 x7874
+0.702 x2148
… 11710 more positive …
… 24069 more negative …
-1.379 <BIAS>
Weight? Feature
+2.016 x25234
+1.951 x12026
+1.758 x17854
+1.697 x11729
+1.655 x32847
+1.522 x22379
+1.518 x16328
… 15007 more positive …
… 20772 more negative …
-1.764 x15521
-2.171 x15699
-5.013 <BIAS>
Weight? Feature
+1.193 x28473
+1.030 x8609
+1.021 x8559
+0.946 x8798
+0.899 x8544
+0.797 x8553
… 11122 more positive …
… 24657 more negative …
-0.852 x15699
-0.894 x25663
-1.181 x23122
-1.243 x16881

The table above doesn’t make any sense; the problem is that eli5 was not able to get feature and class names from the classifier object alone. We can provide feature and target names explicitly:

# eli5.show_weights(clf,
#                   feature_names=vec.get_feature_names(),
#                   target_names=twenty_test.target_names)

The code above works, but a better way is to provide vectorizer instead and let eli5 figure out the details automatically:

eli5.show_weights(clf, vec=vec, top=10,
                  target_names=twenty_test.target_names)
y=alt.atheism top features y=comp.graphics top features y=sci.med top features y=soc.religion.christian top features
Weight? Feature
+1.991 mathew
+1.925 keith
+1.834 atheism
+1.813 okcforum
+1.697 go
+1.696 psuvm
+1.617 believing
+1.594 psu
… 10174 more positive …
… 25605 more negative …
-1.686 rutgers
-10.453 <BIAS>
Weight? Feature
+1.702 graphics
+0.825 images
+0.798 files
+0.786 software
+0.779 file
+0.773 image
+0.729 package
+0.724 card
+0.702 3d
… 11710 more positive …
… 24069 more negative …
-1.379 <BIAS>
Weight? Feature
+2.016 pitt
+1.951 doctor
+1.758 information
+1.697 disease
+1.655 treatment
+1.522 msg
+1.518 health
… 15007 more positive …
… 20772 more negative …
-1.764 god
-2.171 graphics
-5.013 <BIAS>
Weight? Feature
+1.193 rutgers
+1.030 church
+1.021 christians
+0.946 clh
+0.899 christ
+0.797 christian
… 11122 more positive …
… 24657 more negative …
-0.852 graphics
-0.894 posting
-1.181 nntp
-1.243 host

This starts to make more sense. Columns are target classes. In each column there are features and their weights. Intercept (bias) feature is shown as <BIAS> in the same table. We can inspect features and weights because we’re using a bag-of-words vectorizer and a linear classifier (so there is a direct mapping between individual words and classifier coefficients). For other classifiers features can be harder to inspect.

Some features look good, but some don’t. It seems model learned some names specific to a dataset (email parts, etc.) though, instead of learning topic-specific words. Let’s check prediction results on an example:

eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names)

y=alt.atheism (probability 0.000, score -8.709) top features

Contribution? Feature
+1.743 Highlighted in text (sum)
-10.453 <BIAS>

from: brian@ucsd.edu (brian kantor) subject: re: help for kidney stones .............. organization: the avant-garde of the now, ltd. lines: 12 nntp-posting-host: ucsd.edu as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less. demerol worked, although i nearly got arrested on my way home when i barfed all over the police car parked just outside the er. - brian

y=comp.graphics (probability 0.010, score -4.592) top features

Contribution? Feature
-1.379 <BIAS>
-3.213 Highlighted in text (sum)

from: brian@ucsd.edu (brian kantor) subject: re: help for kidney stones .............. organization: the avant-garde of the now, ltd. lines: 12 nntp-posting-host: ucsd.edu as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less. demerol worked, although i nearly got arrested on my way home when i barfed all over the police car parked just outside the er. - brian

y=sci.med (probability 0.989, score 3.945) top features

Contribution? Feature
+8.958 Highlighted in text (sum)
-5.013 <BIAS>

from: brian@ucsd.edu (brian kantor) subject: re: help for kidney stones .............. organization: the avant-garde of the now, ltd. lines: 12 nntp-posting-host: ucsd.edu as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less. demerol worked, although i nearly got arrested on my way home when i barfed all over the police car parked just outside the er. - brian

y=soc.religion.christian (probability 0.001, score -7.157) top features

Contribution? Feature
-0.258 <BIAS>
-6.899 Highlighted in text (sum)

from: brian@ucsd.edu (brian kantor) subject: re: help for kidney stones .............. organization: the avant-garde of the now, ltd. lines: 12 nntp-posting-host: ucsd.edu as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less. demerol worked, although i nearly got arrested on my way home when i barfed all over the police car parked just outside the er. - brian

What can be highlighted in text is highlighted in text. There is also a separate table for features which can’t be highlighted in text - <BIAS> in this case. If you hover mouse on a highlighted word it shows you a weight of this word in a title. Words are colored according to their weights.

2. Baseline model, improved data

Aha, from the highlighting above it can be seen that a classifier learned some non-interesting stuff indeed, e.g. it remembered parts of email addresses. We should probably clean the data first to make it more interesting; improving model (trying different classifiers, etc.) doesn’t make sense at this point - it may just learn to leverage these email addresses better.

In practice we’d have to do cleaning yourselves; in this example 20 newsgroups dataset provides an option to remove footers and headers from the messages. Nice. Let’s clean up the data and re-train a classifier.

twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)

vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);

We just made the task harder and more realistic for a classifier.

print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.83      0.78      0.80       319
         comp.graphics       0.82      0.96      0.88       389
               sci.med       0.89      0.80      0.84       396
soc.religion.christian       0.88      0.86      0.87       398

           avg / total       0.85      0.85      0.85      1502

accuracy: 0.852

A great result - we just made quality worse! Does it mean pipeline is worse now? No, likely it has a better quality on unseen messages. It is evaluation which is more fair now. Inspecting features used by classifier allowed us to notice a problem with the data and made a good change, despite of numbers which told us not to do that.

Instead of removing headers and footers we could have improved evaluation setup directly, using e.g. GroupKFold from scikit-learn. Then quality of old model would have dropped, we could have removed headers/footers and see increased accuracy, so the numbers would have told us to remove headers and footers. It is not obvious how to split data though, what groups to use with GroupKFold.

So, what have the updated classifier learned? (output is less verbose because only a subset of classes is shown - see “targets” argument):

eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])

y=sci.med (probability 0.732, score 0.031) top features

Contribution? Feature
+1.747 Highlighted in text (sum)
-1.716 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

Hm, it no longer uses email addresses, but it still doesn’t look good: classifier assigns high weights to seemingly unrelated words like ‘do’ or ‘my’. These words appear in many texts, so maybe classifier uses them as a proxy for bias. Or maybe some of them are more common in some of classes.

3. Pipeline improvements

To help classifier we may filter out stop words:

vec = CountVectorizer(stop_words='english')
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.87      0.76      0.81       319
         comp.graphics       0.85      0.95      0.90       389
               sci.med       0.93      0.85      0.89       396
soc.religion.christian       0.85      0.89      0.87       398

           avg / total       0.87      0.87      0.87      1502

accuracy: 0.871
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])

y=sci.med (probability 0.714, score 0.510) top features

Contribution? Feature
+2.184 Highlighted in text (sum)
-1.674 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

Looks better, isn’t it?

Alternatively, we can use TF*IDF scheme; it should give a somewhat similar effect.

Note that we’re cross-validating LogisticRegression regularisation parameter here, like in other examples (LogisticRegressionCV, not LogisticRegression). TF*IDF values are different from word count values, so optimal C value can be different. We could draw a wrong conclusion if a classifier with fixed regularization strength is used - the chosen C value could have worked better for one kind of data.

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.91      0.79      0.85       319
         comp.graphics       0.83      0.97      0.90       389
               sci.med       0.95      0.87      0.91       396
soc.religion.christian       0.90      0.91      0.91       398

           avg / total       0.90      0.89      0.89      1502

accuracy: 0.892
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])

y=sci.med (probability 0.987, score 1.585) top features

Contribution? Feature
+6.788 Highlighted in text (sum)
-5.203 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

It helped, but didn’t have quite the same effect. Why not do both?

vec = TfidfVectorizer(stop_words='english')
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.77      0.84       319
         comp.graphics       0.84      0.97      0.90       389
               sci.med       0.95      0.89      0.92       396
soc.religion.christian       0.88      0.92      0.90       398

           avg / total       0.90      0.89      0.89      1502

accuracy: 0.893
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])

y=sci.med (probability 0.939, score 1.910) top features

Contribution? Feature
+5.488 Highlighted in text (sum)
-3.578 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

This starts to look good!

4. Char-based pipeline

Maybe we can get somewhat better quality by choosing a different classifier, but let’s skip it for now. Let’s try other analysers instead - use char n-grams instead of words:

vec = TfidfVectorizer(stop_words='english', analyzer='char',
                      ngram_range=(3,5))
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.79      0.85       319
         comp.graphics       0.81      0.97      0.89       389
               sci.med       0.95      0.86      0.90       396
soc.religion.christian       0.89      0.91      0.90       398

           avg / total       0.89      0.89      0.89      1502

accuracy: 0.888
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names)

y=alt.atheism (probability 0.002, score -7.318) top features

Contribution? Feature
-0.838 Highlighted in text (sum)
-6.480 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

y=comp.graphics (probability 0.017, score -5.118) top features

Contribution? Feature
+0.934 <BIAS>
-6.052 Highlighted in text (sum)

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

y=sci.med (probability 0.963, score -0.656) top features

Contribution? Feature
+4.493 Highlighted in text (sum)
-5.149 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

y=soc.religion.christian (probability 0.018, score -5.048) top features

Contribution? Feature
+0.600 Highlighted in text (sum)
-5.648 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

It works, but quality is a bit worse. Also, it takes ages to train.

It looks like stop_words have no effect now - in fact, this is documented in scikit-learn docs, so our stop_words=’english’ was useless. But at least it is now more obvious how the text looks like for a char ngram-based classifier. Grab a cup of tea and see how char_wb looks like:

vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,5))
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.79      0.85       319
         comp.graphics       0.87      0.96      0.91       389
               sci.med       0.91      0.90      0.90       396
soc.religion.christian       0.89      0.91      0.90       398

           avg / total       0.90      0.89      0.89      1502

accuracy: 0.894
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names)

y=alt.atheism (probability 0.000, score -8.878) top features

Contribution? Feature
-2.560 Highlighted in text (sum)
-6.318 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

y=comp.graphics (probability 0.005, score -6.007) top features

Contribution? Feature
+0.974 <BIAS>
-6.981 Highlighted in text (sum)

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

y=sci.med (probability 0.834, score -0.440) top features

Contribution? Feature
+2.134 Highlighted in text (sum)
-2.573 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

y=soc.religion.christian (probability 0.160, score -2.510) top features

Contribution? Feature
+3.263 Highlighted in text (sum)
-5.773 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

The result is similar, with some minor changes. Quality is better for unknown reason; maybe cross-word dependencies are not that important.

5. Debugging HashingVectorizer

To check that we can try fitting word n-grams instead of char n-grams. But let’s deal with efficiency first. To handle large vocabularies we can use HashingVectorizer from scikit-learn; to make training faster we can employ SGDCLassifier:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(stop_words='english', ngram_range=(1,2))
clf = SGDClassifier(n_iter=10, random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)
                        precision    recall  f1-score   support

           alt.atheism       0.90      0.80      0.85       319
         comp.graphics       0.88      0.96      0.92       389
               sci.med       0.93      0.90      0.92       396
soc.religion.christian       0.89      0.91      0.90       398

           avg / total       0.90      0.90      0.90      1502

accuracy: 0.899

It was super-fast! We’re not choosing regularization parameter using cross-validation though. Let’s check what model learned:

eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])

y=sci.med (score 0.097) top features

Contribution? Feature
+0.678 Highlighted in text (sum)
-0.581 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

Result looks similar to CountVectorizer. But with HashingVectorizer we don’t even have a vocabulary! Why does it work?

eli5.show_weights(clf, vec=vec, top=10,
                  target_names=twenty_test.target_names)
y=alt.atheism top features y=comp.graphics top features y=sci.med top features y=soc.religion.christian top features
Weight? Feature
+2.836 x199378
+2.378 x938889
+1.776 x718537
+1.625 x349126
+1.554 x242643
+1.509 x71928
… 50341 more positive …
… 50567 more negative …
-1.634 x683213
-1.795 x741207
-1.872 x199709
-2.132 x641063
Weight? Feature
+3.737 x580586
+2.056 x342790
+1.956 x771885
+1.787 x363686
+1.717 x111283
… 32081 more positive …
… 31710 more negative …
-1.760 x857427
-1.779 x85557
-1.813 x693269
-2.021 x120354
-2.447 x814572
Weight? Feature
+2.209 x988761
+2.194 x337555
+2.162 x154565
+1.818 x806262
… 44124 more positive …
… 43892 more negative …
-1.704 x790864
-1.750 x580586
-1.851 x34701
-2.085 x85557
-2.147 x365313
-2.150 x494508
Weight? Feature
+3.034 x641063
+3.016 x199709
+2.977 x741207
+2.092 x396081
+1.901 x274863
… 51475 more positive …
… 51717 more negative …
-1.963 x672777
-2.096 x199378
-2.143 x443433
-2.963 x718537
-3.245 x970058

Ok, we don’t have a vocabulary, so we don’t have feature names. Are we out of luck? Nope, eli5 has an answer for that: InvertableHashingVectorizer It can be used to get feature names for HahshingVectorizer without fitiing a huge vocabulary. It still needs some data to learn words -> hashes mapping though; we can use a random subset of data to fit it.

from eli5.sklearn import InvertableHashingVectorizer
import numpy as np
ivec = InvertableHashingVectorizer(vec)
sample_size = len(twenty_train.data) // 10
X_sample = np.random.choice(twenty_train.data, size=sample_size)
ivec.fit(X_sample);
eli5.show_weights(clf, vec=ivec, top=20,
                  target_names=twenty_test.target_names)
y=alt.atheism top features y=comp.graphics top features y=sci.med top features y=soc.religion.christian top features
Weight? Feature
+2.836 atheism
+2.378 writes
+1.634 morality
+1.625 motto
+1.554 religion
+1.509 islam
+1.489 keith
+1.476 religious
+1.439 objective
+1.414 wrote
+1.405 said
+1.361 punishment
+1.335 livesey
+1.332 mathew
+1.324 atheist
+1.320 agree
… 47696 more positive …
… 53202 more negative …
-1.776 rutgers edu
-1.795 rutgers
-1.872 christ
-2.132 christians
Weight? Feature
+3.737 graphics
+2.447 image
+2.056 code
+2.021 files
+1.956 images
+1.813 3d
+1.787 software
+1.717 file
+1.701 ftp
+1.587 video
+1.572 keywords
+1.572 card
+1.509 points
+1.500 line
+1.494 need
+1.483 computer
+1.470 hi
… 30146 more positive …
… 33635 more negative …
-1.654 people
-1.760 keyboard
-1.779 god
Weight? Feature
+2.209 health
+2.194 msg
+2.162 doctor
+2.150 disease
+2.147 treatment
+1.851 medical
+1.818 com
+1.704 pain
+1.663 effects
+1.616 cancer
+1.513 case
+1.453 diet
+1.447 blood
+1.439 information
+1.435 keyboard
+1.407 pitt
… 42291 more positive …
… 45715 more negative …
-1.462 church
-1.697 FEATURE[354651]
-1.750 graphics
-2.085 god
Weight? Feature
+3.245 church
+3.034 christians
+3.016 christ
+2.977 rutgers
+2.963 rutgers edu
+2.143 christian
+2.092 heaven
+1.963 love
+1.901 athos rutgers
+1.901 athos
+1.741 satan
+1.714 authority
+1.653 faith
+1.644 1993
+1.643 article apr
+1.633 understanding
+1.541 sin
+1.509 god
… 49948 more positive …
… 53234 more negative …
-1.525 graphics
-2.096 atheism

There are collisions (hover mouse over features with ”...”), and there are important features which were not seen in the random sample (FEATURE[...]), but overall it looks fine.

“rutgers edu” bigram feature is suspicious though, it looks like a part of URL.

rutgers_example = [x for x in twenty_train.data if 'rutgers' in x.lower()][0]
print(rutgers_example)
In article <Apr.8.00.57.41.1993.28246@athos.rutgers.edu> REXLEX@fnal.gov writes:
>In article <Apr.7.01.56.56.1993.22824@athos.rutgers.edu> shrum@hpfcso.fc.hp.com
>Matt. 22:9-14 'Go therefore to the main highways, and as many as you find
>there, invite to the wedding feast.'...

>hmmmmmm.  Sounds like your theology and Christ's are at odds. Which one am I
>to believe?

Yep, it looks like model learned this address instead of learning something useful.

eli5.show_prediction(clf, rutgers_example, vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['soc.religion.christian'])

y=soc.religion.christian (score 2.044) top features

Contribution? Feature
+2.706 Highlighted in text (sum)
-0.662 <BIAS>

in article <apr.8.00.57.41.1993.28246@athos.rutgers.edu> rexlex@fnal.gov writes: >in article <apr.7.01.56.56.1993.22824@athos.rutgers.edu> shrum@hpfcso.fc.hp.com >matt. 22:9-14 'go therefore to the main highways, and as many as you find >there, invite to the wedding feast.'... >hmmmmmm. sounds like your theology and christ's are at odds. which one am i >to believe?

Quoted text makes it too easy for model to classify some of the messages; that won’t generalize to new messages. So to improve the model next step could be to process the data further, e.g. remove quoted text or replace email addresses with a special token.

You get the idea: looking at features helps to understand how classifier works. Maybe even more importantly, it helps to notice preprocessing bugs, data leaks, issues with task specification - all these nasty problems you get in a real world.