Let's play with the Reuters collection in NLTK

In [2]:

from nltk.corpus import reuters

# List of document ids
documents = reuters.fileids()
print("Documents: {}".format(len(documents)))

# Train documents
train_docs_id = list(filter(lambda doc: doc.startswith("train"), documents))
print("Total train documents: {}".format(len(train_docs_id)))

# Test documents
test_docs_id = list(filter(lambda doc: doc.startswith("test"), documents))
print("Total test documents: {}".format(len(test_docs_id)))

Documents: 10788
Total train documents: 7769
Total test documents: 3019

In [6]:

# Let's get a document with multiple labels
doc = 'training/9865'
print(reuters.raw(doc))

FRENCH FREE MARKET CEREAL EXPORT BIDS DETAILED
  French operators have requested licences
  to export 675,500 tonnes of maize, 245,000 tonnes of barley,
  22,000 tonnes of soft bread wheat and 20,000 tonnes of feed
  wheat at today's European Community tender, traders said.
      Rebates requested ranged from 127.75 to 132.50 European
  Currency Units a tonne for maize, 136.00 to 141.00 Ecus a tonne
  for barley and 134.25 to 141.81 Ecus for bread wheat, while
  rebates requested for feed wheat were 137.65 Ecus, they said.

In [7]:

print(reuters.categories(doc))

['barley', 'corn', 'grain', 'wheat']

In [10]:

from operator import itemgetter
from pprint import pprint

# List categories
categories = reuters.categories()
print("Number of categories: ", len(categories))

Number of categories:  90

In [15]:

# Documents per category, sorted from most to least frequent
category_dist = [(category, len(reuters.fileids(category))) for category in categories]
category_dist = sorted(category_dist, key=itemgetter(1), reverse=True)

# The tail of the descending list holds the rarest categories
print("Least common categories: ")
pprint(category_dist[-5:])

Least common categories: 
[('castor-oil', 2),
 ('groundnut-oil', 2),
 ('lin-oil', 2),
 ('rye', 2),
 ('sun-meal', 2)]

In [17]:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

stop_words = stopwords.words("english")

train_docs_id = list(filter(lambda doc: doc.startswith("train"), documents))
test_docs_id = list(filter(lambda doc: doc.startswith("test"), documents))

train_docs = [reuters.raw(doc_id) for doc_id in train_docs_id]
test_docs = [reuters.raw(doc_id) for doc_id in test_docs_id]

# Tokenize and compute TF-IDF weights, dropping English stop words
vectorizer = TfidfVectorizer(stop_words = stop_words)

# Learn and transform train documents
vectorized_train_docs = vectorizer.fit_transform(train_docs)
vectorized_test_docs = vectorizer.transform(test_docs)

# Binarize the multi-label targets (one 0/1 column per category)
multilabelbin = MultiLabelBinarizer()
train_labels = multilabelbin.fit_transform([reuters.categories(doc_id) for doc_id in train_docs_id])
test_labels = multilabelbin.transform([reuters.categories(doc_id) for doc_id in test_docs_id])

# Classification
classifier = OneVsRestClassifier(LinearSVC(random_state=52))  # random_state only fixes the seed for reproducibility; 52 is an arbitrary value
classifier.fit(vectorized_train_docs, train_labels)

# Predict
predictions = classifier.predict(vectorized_test_docs)

# Print
print("Number of labels assigned: {}".format(sum([sum(prediction) for prediction in predictions])))

Number of labels assigned: 3126
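
The label binarization step above can be pictured with a small pure-Python sketch. It is only an illustration of the idea (one 0/1 column per category, columns in sorted order), not scikit-learn's implementation; the helper `binarize` and the toy label sets are hypothetical, with the first set mirroring the categories of 'training/9865'.

```python
def binarize(label_sets):
    """Turn a list of label collections into (classes, 0/1 indicator rows)."""
    # Hypothetical helper: collect every label seen and fix a sorted column order
    classes = sorted({label for labels in label_sets for label in labels})
    column = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return classes, rows


# Toy label sets (assumed for illustration, not taken from the corpus run)
toy_labels = [
    ["barley", "corn", "grain", "wheat"],
    ["grain", "wheat"],
    ["corn"],
]

classes, indicator = binarize(toy_labels)
print(classes)
for row in indicator:
    print(row)

# Summing the ones counts how many labels were assigned in total,
# which is the same idea as the "Number of labels assigned" line above.
labels_assigned = sum(sum(row) for row in indicator)
print("Number of labels assigned:", labels_assigned)
```

Each row is one document and each column one category, so a classifier wrapped in OneVsRestClassifier can treat every column as its own binary problem.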

In [25]:

# Let's check out some metrics
from sklearn.metrics import f1_score, precision_score, recall_score

# How's the quality?
precision = precision_score(test_labels, predictions, average='micro')
recall = recall_score(test_labels, predictions, average='micro') 
f1 = f1_score(test_labels, predictions, average='micro')
print("Micro average quality metrics")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, 
                                                                     recall, 
                                                                     f1))

precision = precision_score(test_labels, predictions, average='macro')
recall = recall_score(test_labels, predictions, average='macro')
f1 = f1_score(test_labels, predictions, average='macro')
print("Macro-average quality numbers")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, 
                                                                     recall, 
                                                                     f1))

Micro average quality metrics
Precision: 0.9517, Recall: 0.7946, F1-measure: 0.8661
Macro-average quality numbers
Precision: 0.6305, Recall: 0.3715, F1-measure: 0.4451

/Users/tarrysingh/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/tarrysingh/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
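
The gap between the micro and macro numbers, and the warnings about labels with no predicted samples, can be reproduced on a toy problem. The sketch below uses plain Python and made-up indicator rows (one frequent label, one rare label that never gets predicted); the helper names are hypothetical and the data is not from the Reuters run.

```python
def f1_from_counts(tp, fp, fn):
    # F1 = 2*TP / (2*TP + FP + FN), equivalent to 2*P*R / (P + R);
    # defined as 0.0 here when there is nothing to count
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true, y_pred, col):
    # True positives, false positives and false negatives for one label column
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 0 and p[col] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 0)
    return tp, fp, fn


# Toy data: column 0 is a frequent label, column 1 a rare one
toy_true = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [0, 1]]
toy_pred = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [0, 0]]

per_label = [label_counts(toy_true, toy_pred, col) for col in range(2)]

# Macro: average the per-label F1 scores, every label counts equally
macro_f1 = sum(f1_from_counts(*c) for c in per_label) / len(per_label)

# Micro: pool the counts over all labels first, then compute one F1
pooled = [sum(values) for values in zip(*per_label)]
micro_f1 = f1_from_counts(*pooled)

print("Per-label (tp, fp, fn):", per_label)
print("Macro F1: {:.4f}, Micro F1: {:.4f}".format(macro_f1, micro_f1))
```

The rare label is never predicted, so its F1 is 0 and it pulls the macro average down to 0.5, while the micro average stays high because the frequent label dominates the pooled counts. With 90 categories, several of which have only a couple of documents, the same effect plausibly explains much of the difference seen above.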

More fun facts about f1_score etc.

In [21]:

from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
f1_score(y_true, y_pred, average='macro')  

f1_score(y_true, y_pred, average='micro')  

f1_score(y_true, y_pred, average='weighted')  

f1_score(y_true, y_pred, average=None)

Out[21]:

array([ 0.8,  0. ,  0. ])
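
To see where these values come from, the same example can be worked by hand. The following is a plain-Python sketch rather than scikit-learn's code; the helpers `counts` and `f1` are hypothetical names introduced for illustration.

```python
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
labels = sorted(set(y_true) | set(y_pred))


def counts(label):
    # One-vs-rest counts for a single class
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


def f1(tp, fp, fn):
    # 2*TP / (2*TP + FP + FN) is the same as 2*P*R / (P + R)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_class = [f1(*counts(label)) for label in labels]
support = [y_true.count(label) for label in labels]

# average=None -> one score per class
print("per class:", per_class)

# average='macro' -> unweighted mean of the per-class scores
macro = sum(per_class) / len(per_class)

# average='weighted' -> mean weighted by the number of true instances
weighted = sum(f * s for f, s in zip(per_class, support)) / sum(support)

# average='micro' -> pool tp/fp/fn over all classes, then compute one F1
totals = [sum(values) for values in zip(*(counts(label) for label in labels))]
micro = f1(*totals)

print("macro: {:.2f}, micro: {:.2f}, weighted: {:.2f}".format(macro, micro, weighted))
```

Only class 0 is ever predicted correctly (2 true positives, 1 false positive), which gives 0.8; classes 1 and 2 have no true positives, so they score 0. Macro and weighted coincide here because every class has the same support of 2.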

In [24]:

help(f1_score)

Help on function f1_score in module sklearn.metrics.classification:

f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
    Compute the F1 score, also known as balanced F-score or F-measure
    
    The F1 score can be interpreted as a weighted average of the precision and
    recall, where an F1 score reaches its best value at 1 and worst score at 0.
    The relative contribution of precision and recall to the F1 score are
    equal. The formula for the F1 score is::
    
        F1 = 2 * (precision * recall) / (precision + recall)
    
    In the multi-class and multi-label case, this is the weighted average of
    the F1 score of each class.
    
    Read more in the :ref:`User Guide <precision_recall_f_measure_metrics>`.
    
    Parameters
    ----------
    y_true : 1d array-like, or label indicator array / sparse matrix
        Ground truth (correct) target values.
    
    y_pred : 1d array-like, or label indicator array / sparse matrix
        Estimated targets as returned by a classifier.
    
    labels : list, optional
        The set of labels to include when ``average != 'binary'``, and their
        order if ``average is None``. Labels present in the data can be
        excluded, for example to calculate a multiclass average ignoring a
        majority negative class, while labels not present in the data will
        result in 0 components in a macro average. For multilabel targets,
        labels are column indices. By default, all labels in ``y_true`` and
        ``y_pred`` are used in sorted order.
    
        .. versionchanged:: 0.17
           parameter *labels* improved for multiclass problem.
    
    pos_label : str or int, 1 by default
        The class to report if ``average='binary'`` and the data is binary.
        If the data are multiclass or multilabel, this will be ignored;
        setting ``labels=[pos_label]`` and ``average != 'binary'`` will report
        scores for that label only.
    
    average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']
        This parameter is required for multiclass/multilabel targets.
        If ``None``, the scores for each class are returned. Otherwise, this
        determines the type of averaging performed on the data:
    
        ``'binary'``:
            Only report results for the class specified by ``pos_label``.
            This is applicable only if targets (``y_{true,pred}``) are binary.
        ``'micro'``:
            Calculate metrics globally by counting the total true positives,
            false negatives and false positives.
        ``'macro'``:
            Calculate metrics for each label, and find their unweighted
            mean.  This does not take label imbalance into account.
        ``'weighted'``:
            Calculate metrics for each label, and find their average, weighted
            by support (the number of true instances for each label). This
            alters 'macro' to account for label imbalance; it can result in an
            F-score that is not between precision and recall.
        ``'samples'``:
            Calculate metrics for each instance, and find their average (only
            meaningful for multilabel classification where this differs from
            :func:`accuracy_score`).
    
    sample_weight : array-like of shape = [n_samples], optional
        Sample weights.
    
    Returns
    -------
    f1_score : float or array of float, shape = [n_unique_labels]
        F1 score of the positive class in binary classification or weighted
        average of the F1 scores of each class for the multiclass task.
    
    References
    ----------
    .. [1] `Wikipedia entry for the F1-score
           <https://en.wikipedia.org/wiki/F1_score>`_
    
    Examples
    --------
    >>> from sklearn.metrics import f1_score
    >>> y_true = [0, 1, 2, 0, 1, 2]
    >>> y_pred = [0, 2, 1, 0, 0, 1]
    >>> f1_score(y_true, y_pred, average='macro')  # doctest: +ELLIPSIS
    0.26...
    >>> f1_score(y_true, y_pred, average='micro')  # doctest: +ELLIPSIS
    0.33...
    >>> f1_score(y_true, y_pred, average='weighted')  # doctest: +ELLIPSIS
    0.26...
    >>> f1_score(y_true, y_pred, average=None)
    array([ 0.8,  0. ,  0. ])
