Let's play with the Reuters collection in NLTK
In [2]:
from nltk.corpus import reuters
# List of document ids
documents = reuters.fileids()
print("Documents: {}".format(len(documents)))
# Train documents
train_docs_id = list(filter(lambda doc: doc.startswith("train"), documents))
print("Total train documents: {}".format(len(train_docs_id)))
# Test documents
test_docs_id = list(filter(lambda doc: doc.startswith("test"), documents))
print("Total test documents: {}".format(len(test_docs_id)))
Documents: 10788
Total train documents: 7769
Total test documents: 3019
In [6]:
# Let's get a document with multiple labels
doc = 'training/9865'
print(reuters.raw(doc))
FRENCH FREE MARKET CEREAL EXPORT BIDS DETAILED
French operators have requested licences
to export 675,500 tonnes of maize, 245,000 tonnes of barley,
22,000 tonnes of soft bread wheat and 20,000 tonnes of feed
wheat at today's European Community tender, traders said.
Rebates requested ranged from 127.75 to 132.50 European
Currency Units a tonne for maize, 136.00 to 141.00 Ecus a tonne
for barley and 134.25 to 141.81 Ecus for bread wheat, while
rebates requested for feed wheat were 137.65 Ecus, they said.
In [7]:
print(reuters.categories(doc))
['barley', 'corn', 'grain', 'wheat']
In [10]:
from operator import itemgetter
from pprint import pprint
# List categories
categories = reuters.categories()
print("Number of categories: ", len(categories))
Number of categories: 90
In [15]:
# Number of documents per category
category_dist = [(category, len(reuters.fileids(category))) for category in categories]
category_dist = sorted(category_dist, key=itemgetter(1), reverse=True)
# The list is sorted in descending order, so the last five entries are the rarest categories
print("Least common categories: ")
pprint(category_dist[-5:])
Least common categories:
[('castor-oil', 2),
('groundnut-oil', 2),
('lin-oil', 2),
('rye', 2),
('sun-meal', 2)]
In [17]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
stop_words = stopwords.words("english")
train_docs_id = list(filter(lambda doc: doc.startswith("train"), documents))
test_docs_id = list(filter(lambda doc: doc.startswith("test"), documents))
train_docs = [reuters.raw(doc_id) for doc_id in train_docs_id]
test_docs = [reuters.raw(doc_id) for doc_id in test_docs_id]
# Tokenize and weight terms with TF-IDF, dropping English stop words
vectorizer = TfidfVectorizer(stop_words=stop_words)
# Learn and transform train documents
vectorized_train_docs = vectorizer.fit_transform(train_docs)
vectorized_test_docs = vectorizer.transform(test_docs)
# Binarize the multi-label targets into a 0/1 indicator matrix
multilabelbin = MultiLabelBinarizer()
train_labels = multilabelbin.fit_transform([reuters.categories(doc_id) for doc_id in train_docs_id])
test_labels = multilabelbin.transform([reuters.categories(doc_id) for doc_id in test_docs_id])
# Classification
# random_state fixes the seed so results are reproducible; the value 52 itself is arbitrary
classifier = OneVsRestClassifier(LinearSVC(random_state=52))
classifier.fit(vectorized_train_docs, train_labels)
# Predict
predictions = classifier.predict(vectorized_test_docs)
# Print
print("Number of labels assigned: {}".format(sum([sum(prediction) for prediction in predictions])))
Number of labels assigned: 3126
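To make the label-binarization step above concrete, here is a minimal pure-Python sketch of what a multi-label binarizer does: collect the sorted set of labels, then emit one 0/1 row per document. The function name `binarize_labels` is made up for illustration and this is not scikit-learn's implementation. The first label set is the one shown for `training/9865` above; the second is a hypothetical single-label document.

```python
def binarize_labels(label_sets):
    """Return (classes, rows): sorted class names and one 0/1 indicator row per label set."""
    classes = sorted({label for labels in label_sets for label in labels})
    column = {label: index for index, label in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return classes, rows


label_sets = [
    ["barley", "corn", "grain", "wheat"],  # categories of training/9865
    ["grain"],                             # hypothetical single-label document
]
classes, rows = binarize_labels(label_sets)
print(classes)
for row in rows:
    print(row)
```

Summing every entry of such a matrix gives the total number of labels assigned, which is what the nested `sum(...)` in the cell above computes for the predictions.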
In [25]:
# Let's check out some metrics
from sklearn.metrics import f1_score, precision_score, recall_score
# How's the quality?
precision = precision_score(test_labels, predictions, average='micro')
recall = recall_score(test_labels, predictions, average='micro')
f1 = f1_score(test_labels, predictions, average='micro')
print("Micro average quality metrics")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision,
                                                                     recall,
                                                                     f1))
precision = precision_score(test_labels, predictions, average='macro')
recall = recall_score(test_labels, predictions, average='macro')
f1 = f1_score(test_labels, predictions, average='macro')
print("Macro-average quality numbers")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision,
                                                                     recall,
                                                                     f1))
Micro average quality metrics
Precision: 0.9517, Recall: 0.7946, F1-measure: 0.8661
Macro-average quality numbers
Precision: 0.6305, Recall: 0.3715, F1-measure: 0.4451
/Users/tarrysingh/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/tarrysingh/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
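The gap between the micro and macro numbers, and the warning above, are both consistent with rare labels that the classifier never predicts. The following is a small sketch on made-up indicator matrices (not the Reuters data) with hypothetical helper names. It sets an undefined ratio to 0.0, which is the convention the warning describes.

```python
def label_counts(y_true, y_pred, label):
    """Count true positives, false positives and false negatives for one label column."""
    pairs = [(t[label], p[label]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the ratio is undefined."""
    return numerator / denominator if denominator else 0.0


# Toy data: label 0 is frequent and always predicted;
# labels 1 and 2 are rare and never predicted.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

counts = [label_counts(y_true, y_pred, label) for label in range(3)]

# Macro: average the per-label scores, so each rare label weighs as much as the frequent one.
macro_precision = sum(ratio(tp, tp + fp) for tp, fp, fn in counts) / len(counts)
macro_recall = sum(ratio(tp, tp + fn) for tp, fp, fn in counts) / len(counts)

# Micro: pool the counts over all labels first, then compute a single score.
total_tp = sum(tp for tp, fp, fn in counts)
total_fp = sum(fp for tp, fp, fn in counts)
total_fn = sum(fn for tp, fp, fn in counts)
micro_precision = ratio(total_tp, total_tp + total_fp)
micro_recall = ratio(total_tp, total_tp + total_fn)

print("Micro precision: {:.4f}, recall: {:.4f}".format(micro_precision, micro_recall))
print("Macro precision: {:.4f}, recall: {:.4f}".format(macro_precision, macro_recall))
```

In this toy case the two never-predicted labels each contribute a precision of 0.0 to the macro average, while the pooled micro precision is unaffected because those labels add no predicted positives.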
More fun facts about f1_score etc.
In [21]:
from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
f1_score(y_true, y_pred, average='macro')
f1_score(y_true, y_pred, average='micro')
f1_score(y_true, y_pred, average='weighted')
f1_score(y_true, y_pred, average=None)
Out[21]:
array([ 0.8, 0. , 0. ])
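A notebook displays only the value of the last expression in a cell, which is why the output above shows just the `average=None` result. As a cross-check, the sketch below recomputes all four variants in plain Python for the same `y_true` and `y_pred`. The helper names `class_counts` and `f1_from_counts` are hypothetical, and this illustrates the definitions rather than reproducing scikit-learn's code.

```python
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
classes = sorted(set(y_true) | set(y_pred))


def class_counts(cls):
    """Count true positives, false positives and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*P*R / (P + R) = 2*tp / (2*tp + fp + fn); 0.0 when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


counts = [class_counts(cls) for cls in classes]

# average=None: one F1 score per class
per_class = [f1_from_counts(tp, fp, fn) for tp, fp, fn in counts]

# average='macro': unweighted mean of the per-class scores
macro = sum(per_class) / len(per_class)

# average='micro': pool the counts over all classes, then compute one F1
micro = f1_from_counts(sum(c[0] for c in counts),
                       sum(c[1] for c in counts),
                       sum(c[2] for c in counts))

# average='weighted': mean of the per-class scores weighted by support
support = [y_true.count(cls) for cls in classes]
weighted = sum(f * s for f, s in zip(per_class, support)) / sum(support)

print("per class:", per_class)
print("macro: {:.4f}, micro: {:.4f}, weighted: {:.4f}".format(macro, micro, weighted))
```

The results agree with the example values quoted in the `f1_score` docstring (0.26..., 0.33..., 0.26... and `[0.8, 0, 0]`).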
In [24]:
help(f1_score)
Help on function f1_score in module sklearn.metrics.classification:
f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
Compute the F1 score, also known as balanced F-score or F-measure
The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0.
The relative contribution of precision and recall to the F1 score are
equal. The formula for the F1 score is::
F1 = 2 * (precision * recall) / (precision + recall)
In the multi-class and multi-label case, this is the weighted average of
the F1 score of each class.
Read more in the :ref:`User Guide <precision_recall_f_measure_metrics>`.
Parameters
----------
y_true : 1d array-like, or label indicator array / sparse matrix
Ground truth (correct) target values.
y_pred : 1d array-like, or label indicator array / sparse matrix
Estimated targets as returned by a classifier.
labels : list, optional
The set of labels to include when ``average != 'binary'``, and their
order if ``average is None``. Labels present in the data can be
excluded, for example to calculate a multiclass average ignoring a
majority negative class, while labels not present in the data will
result in 0 components in a macro average. For multilabel targets,
labels are column indices. By default, all labels in ``y_true`` and
``y_pred`` are used in sorted order.
.. versionchanged:: 0.17
parameter *labels* improved for multiclass problem.
pos_label : str or int, 1 by default
The class to report if ``average='binary'`` and the data is binary.
If the data are multiclass or multilabel, this will be ignored;
setting ``labels=[pos_label]`` and ``average != 'binary'`` will report
scores for that label only.
average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']
This parameter is required for multiclass/multilabel targets.
If ``None``, the scores for each class are returned. Otherwise, this
determines the type of averaging performed on the data:
``'binary'``:
Only report results for the class specified by ``pos_label``.
This is applicable only if targets (``y_{true,pred}``) are binary.
``'micro'``:
Calculate metrics globally by counting the total true positives,
false negatives and false positives.
``'macro'``:
Calculate metrics for each label, and find their unweighted
mean. This does not take label imbalance into account.
``'weighted'``:
Calculate metrics for each label, and find their average, weighted
by support (the number of true instances for each label). This
alters 'macro' to account for label imbalance; it can result in an
F-score that is not between precision and recall.
``'samples'``:
Calculate metrics for each instance, and find their average (only
meaningful for multilabel classification where this differs from
:func:`accuracy_score`).
sample_weight : array-like of shape = [n_samples], optional
Sample weights.
Returns
-------
f1_score : float or array of float, shape = [n_unique_labels]
F1 score of the positive class in binary classification or weighted
average of the F1 scores of each class for the multiclass task.
References
----------
.. [1] `Wikipedia entry for the F1-score
<https://en.wikipedia.org/wiki/F1_score>`_
Examples
--------
>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro') # doctest: +ELLIPSIS
0.26...
>>> f1_score(y_true, y_pred, average='micro') # doctest: +ELLIPSIS
0.33...
>>> f1_score(y_true, y_pred, average='weighted') # doctest: +ELLIPSIS
0.26...
>>> f1_score(y_true, y_pred, average=None)
array([ 0.8, 0. , 0. ])