Text Classification¶
There are several types of classification:
- Binary: 2 mutually exclusive categories (e.g. detecting spam)
- Multiclass: more than 2 mutually exclusive categories (e.g. language detection)
- Multilabel: non-mutually-exclusive categories (e.g. movie genres, TV shows)
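As a quick illustration, here is what the target labels typically look like in each setting (the label names below are made up for the example):

```python
# Hypothetical labels for each classification setting
binary_labels = ["spam", "not spam", "spam"]                # one of 2 exclusive classes per doc
multiclass_labels = ["EN", "FR", "SP"]                      # one of >2 exclusive classes per doc
multilabel_labels = [["action"], ["action", "comedy"], []]  # zero or more classes per doc

print(binary_labels, multiclass_labels, multilabel_labels)
```

Note that in the multilabel case a document can belong to several categories at once, or to none at all.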
Binary text classification problem¶
In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
In [2]:
# Train and test data set
train_data = ['Football: a great sport',
'The referee has been very bad this season',
'Our team scored 5 goals', 'I love tenis',
'Politics is in decline in the UK',
'Brexit means Brexit',
'The parlament wants to create new legislation',
'I so want to travel the world']
train_labels = ["Sports","Sports","Sports","Sports",
"Non Sports", "Non Sports", "Non Sports", "Non Sports"]
test_data = ['Swimming is a great sport',
'A lot of policy changes will happen after Brexit',
'The table tenis team will travel to the UK soon for the European Championship']
test_labels = ["Sports", "Non Sports", "Sports"]
In [3]:
# Represent the data using TF-IDF
vectorizer = TfidfVectorizer()
vectorized_train_data = vectorizer.fit_transform(train_data)
vectorized_test_data = vectorizer.transform(test_data)
In [4]:
# Train the classifier given the training data
classifier = LinearSVC()
classifier.fit(vectorized_train_data, train_labels)
Out[4]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)
In [5]:
# Predict the labels for the test documents
print(classifier.predict(vectorized_test_data))
['Sports' 'Non Sports' 'Non Sports']
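Since we also have gold labels for the test documents, we can score these predictions. A minimal sketch using sklearn's `accuracy_score`, rebuilding the same pipeline (exact predictions may vary slightly across sklearn versions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

train_data = ['Football: a great sport',
              'The referee has been very bad this season',
              'Our team scored 5 goals', 'I love tenis',
              'Politics is in decline in the UK',
              'Brexit means Brexit',
              'The parlament wants to create new legislation',
              'I so want to travel the world']
train_labels = ["Sports", "Sports", "Sports", "Sports",
                "Non Sports", "Non Sports", "Non Sports", "Non Sports"]
test_data = ['Swimming is a great sport',
             'A lot of policy changes will happen after Brexit',
             'The table tenis team will travel to the UK soon for the European Championship']
test_labels = ["Sports", "Non Sports", "Sports"]

vectorizer = TfidfVectorizer()
classifier = LinearSVC()
classifier.fit(vectorizer.fit_transform(train_data), train_labels)
predictions = classifier.predict(vectorizer.transform(test_data))

# Fraction of test documents labelled correctly
print(accuracy_score(test_labels, predictions))
```

With the predictions shown earlier, two of the three test documents are correct, i.e. an accuracy of about 0.67: the third document gets misclassified.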
Nice. We built our text classifier :)¶
But there are some caveats:
- Matching problems
- Cases never seen before
- "Spurious" correlations and bias (e.g. "car" appears only in the positive category)
In [10]:
from pprint import pprint # This way we print pretty :)
def feature_values(doc, representer):
    doc_rep = representer.transform([doc])
    # get_feature_names_out() in sklearn >= 1.0; older versions used get_feature_names()
    features = representer.get_feature_names_out()
    return [(features[index], doc_rep[0, index]) for index in doc_rep.nonzero()[1]]
pprint([feature_values(doc, vectorizer) for doc in test_data])
[[('sport', 0.57735026918962584),
('is', 0.57735026918962584),
('great', 0.57735026918962584)],
[('brexit', 1.0)],
[('uk', 0.34666892278432909),
('travel', 0.34666892278432909),
('to', 0.29053561299308733),
('the', 0.6594480187891556),
('tenis', 0.34666892278432909),
('team', 0.34666892278432909)]]
Let's try removing stop words¶
In [14]:
from nltk.corpus import stopwords
# Load the list of English stop words from nltk
stop_words = stopwords.words("english")
# Represent, train, predict and print it out
vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorized_train_data = vectorizer.fit_transform(train_data)
vectorized_test_data = vectorizer.transform(test_data)
# Create the classifier
classifier = LinearSVC()
# Fit the classifier on the vectorized training data and its labels
classifier.fit(vectorized_train_data, train_labels)
# Let's see what comes out; it should now be Sports, Non Sports, Sports
print(classifier.predict(vectorized_test_data))
['Sports' 'Non Sports' 'Sports']
Ok, cool.¶
Multi-Class Classification Challenge¶
Here let's address the multi-class problem of detecting the language of a sentence, with three mutually exclusive languages: English, Spanish, and French. We assume every document is written in exactly one of these three languages.
So, let's go ahead and create a small artificial data set...
In [15]:
train_data = ['PyCon es una gran conferencia',
'Aprendizaje automatico esta listo para dominar el mundo dentro de poco',
'This is a great conference with a lot of amazing talks',
'AI will dominate the world in the near future',
'Dix chiffres por resumer le feuilleton de la loi travail']
train_labels = ["SP", "SP", "EN", "EN", "FR"]
test_data = ['Estoy preparandome para dominar las olimpiadas',
'Me gustaria mucho aprender el lenguage de programacion Scala',
'Machine Learning is amazing',
'Hola a todos']
test_labels = ["SP", "SP", "EN", "SP"]
# Representation
vectorizer = TfidfVectorizer()
vectorized_train_data = vectorizer.fit_transform(train_data)
vectorized_test_data = vectorizer.transform(test_data)
# Training
classifier = LinearSVC()
classifier.fit(vectorized_train_data, train_labels)
# Predicting
predictions = classifier.predict(vectorized_test_data)
pprint(predictions)
array(['SP', 'SP', 'EN', 'EN'],
dtype='<U2')
So, what happened above?¶
Why did the classifier predict EN for the last test document instead of SP, as its test label says?
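A likely explanation: none of the words in 'Hola a todos' occur in the training data, so its TF-IDF vector is all zeros, and with no features to go on the classifier can only fall back on its learned intercepts, which here happened to favour EN. We can verify the empty representation directly (a sketch rebuilding the vectorizer from the training data above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = ['PyCon es una gran conferencia',
              'Aprendizaje automatico esta listo para dominar el mundo dentro de poco',
              'This is a great conference with a lot of amazing talks',
              'AI will dominate the world in the near future',
              'Dix chiffres por resumer le feuilleton de la loi travail']

vectorizer = TfidfVectorizer()
vectorizer.fit(train_data)
unseen = vectorizer.transform(['Hola a todos'])

# Number of non-zero features: 0, since none of these words were seen in training
print(unseen.nnz)
```

This is the "cases never seen before" problem from earlier: a bag-of-words model cannot say anything useful about a document whose vocabulary it has never encountered.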
Multi-Label Problem¶
Here we address the multi-label problem of tagging documents with their relevance to sports, politics, etc. As before, we create a small collection.
This time we will do two things differently:
- Change the representation of the labels: treat every document's labels as a list of bits, each bit indicating whether the document belongs to the corresponding category. For this we'll need a MultiLabelBinarizer from sklearn.
- Run the classifier N times, once per category, where the negative cases are the documents in all the other categories. For this we'll need a OneVsRestClassifier from sklearn. [Note: there is also a OneVsOneClassifier, but we'll discuss that another time.]
So, let's get started...
In [21]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
train_data = ['Soccer: a great sport',
'The referee has been very bad this season',
'Our team scored 5 goals', 'I love tenis',
'Politics is in decline in the UK', 'Brexit means Brexit',
'The parlament wants to create new legislation',
'I so want to travel the world',
'The government will increase the budget for sports in the NL after great sport medal tally!',
"O'Reilly has a great conference this year"]
train_labels = [["Sports"], ["Sports"], ["Sports"], ["Sports"],
["Politics"],["Politics"],["Politics"],[],["Politics", "Sports"],[]]
test_data = ['Swimming is a great sport',
'A lot of policy changes will happen after Brexit',
'The table tenis team will travel to the UK soon for the European Championship',
'The government will increase the budget for sports in the NL after great sport medal tally!',
'PyCon is my favourite conference']
test_labels = [["Sports"], ["Politics"], ["Sports"], ["Politics","Sports"],[]]
# We change the representation of the data as a list of bit lists
multilabelbin = MultiLabelBinarizer()
binary_train_labels = multilabelbin.fit_transform(train_labels)
binary_test_labels = multilabelbin.transform(test_labels)
print("These are Binary Train Labels: ", binary_train_labels)
print("These are Binary Test Labels: ", binary_test_labels)
These are Binary Train Labels:  [[0 1]
 [0 1]
 [0 1]
 [0 1]
 [1 0]
 [1 0]
 [1 0]
 [0 0]
 [1 1]
 [0 0]]
These are Binary Test Labels:  [[0 1]
 [1 0]
 [0 1]
 [1 1]
 [0 0]]
In [25]:
# Doing same with OneVsRest
# Represent first
vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorized_train_data = vectorizer.fit_transform(train_data)
vectorized_test_data = vectorizer.transform(test_data)
# Build one classifier per category
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(vectorized_train_data, binary_train_labels)
# Predict
predictions = classifier.predict(vectorized_test_data)
print(predictions)
print()
[[0 1] [1 0] [0 1] [1 1] [0 0]]
In [26]:
print(multilabelbin.inverse_transform(predictions))
[('Sports',), ('Politics',), ('Sports',), ('Politics', 'Sports'), ()]
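As in the binary case, we can score the multi-label predictions against the gold labels. With bit-vector labels, `accuracy_score` computes the strict subset accuracy: a test document only counts as correct if all of its labels match. A sketch of the full pipeline, here using sklearn's built-in English stop list instead of nltk's:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

train_data = ['Soccer: a great sport',
              'The referee has been very bad this season',
              'Our team scored 5 goals', 'I love tenis',
              'Politics is in decline in the UK', 'Brexit means Brexit',
              'The parlament wants to create new legislation',
              'I so want to travel the world',
              'The government will increase the budget for sports in the NL after great sport medal tally!',
              "O'Reilly has a great conference this year"]
train_labels = [["Sports"], ["Sports"], ["Sports"], ["Sports"],
                ["Politics"], ["Politics"], ["Politics"], [],
                ["Politics", "Sports"], []]
test_data = ['Swimming is a great sport',
             'A lot of policy changes will happen after Brexit',
             'The table tenis team will travel to the UK soon for the European Championship',
             'The government will increase the budget for sports in the NL after great sport medal tally!',
             'PyCon is my favourite conference']
test_labels = [["Sports"], ["Politics"], ["Sports"], ["Politics", "Sports"], []]

# Bit-vector label representation
mlb = MultiLabelBinarizer()
binary_train_labels = mlb.fit_transform(train_labels)
binary_test_labels = mlb.transform(test_labels)

# One binary classifier per category
vectorizer = TfidfVectorizer(stop_words='english')
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(vectorizer.fit_transform(train_data), binary_train_labels)
predictions = classifier.predict(vectorizer.transform(test_data))

# Subset accuracy: a document is correct only if every label bit matches
print(accuracy_score(binary_test_labels, predictions))
```

Subset accuracy is a harsh metric for multi-label problems; on larger label sets, per-label measures such as `hamming_loss` or a per-class `classification_report` are usually more informative.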