Step 1: Get and load the data¶

Go to Gutenberg press and https://www.gutenberg.org/ebooks/24407, get all the data and put it innto your /data/recipes folder.

In [4]:

import os

data_folder = os.path.join('data/recipes')
all_recipe_files = [os.path.join(data_folder, fname)
                    for fname in os.listdir(data_folder)]

documents = {}
for recipe_fname in all_recipe_files:
    bname = os.path.basename(recipe_fname)
    recipe_number = os.path.splitext(bname)[0]
    with open(recipe_fname, 'r') as f:
        documents[recipe_number] = f.read()

corpus_all_in_one = ' '.join([doc for doc in documents.values()])

print("Number of docs: {}".format(len(documents)))
print("Corpus size (char): {}".format(len(corpus_all_in_one)))

Number of docs: 220
Corpus size (char): 161146

Step 2: Let's tokenize¶

What this actually means is that we will be splitting raw string into a list of tokens. Where A "token" seentially is meaningful units of text such as words, phrases, punctuation, numbers, dates,...

In [5]:

from nltk.tokenize import word_tokenize
all_tokens = [token for token in word_tokenize(corpus_all_in_one)]
print("Total number of tokens: {}".format(len(all_tokens)))

Total number of tokens: 33719

Step 3: Let's do a word count¶

We start with a simple word count using collections.Counter function.

Why we're doing this?

We want to know the number times a word occurs in the whole corpus and in home many docs it occurs.

In [8]:

from collections import Counter

total_word_freq = Counter(all_tokens)

for word, freq in total_word_freq.most_common(20):
    # Let's try the top 25 words in descending order
    print("{}\t{}".format(word, freq))

the	1933
,	1726
.	1568
and	1435
a	1076
of	988
in	811
with	726
it	537
to	452
or	389
is	337
(	295
)	295
be	266
them	248
butter	231
on	220
water	205
little	198

Step 4: Stop words¶

Obviously you can see that a lot of words above were expected. Actually also quite boring as (, ) or full stop is something one would expect. If it were a scary novel a lot of ! would appear.

Wwe call these type of words stop words and they are pretty meaningless in themselves, right?

Also you will see that there is no universal list of stop words and removing them can have a certain desirable or undesirable effect, right?

So lets's import stop words from the big and mighty nltk library

In [10]:

from nltk.corpus import stopwords
import string

print(stopwords.words('english'))
print(len(stopwords.words('english')))
print(string.punctuation)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
153
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Tip: A little bit about strings and digits btw¶

There is a pythonic way to do stuff as well but that's for another time and you can play a little game by creating a password generator and checking out all kinds og modules in string as well as crypt (there is cryptography as well)

In [11]:

string.ascii_letters

Out[11]:

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [12]:

string.ascii_lowercase

Out[12]:

'abcdefghijklmnopqrstuvwxyz'

In [13]:

string.ascii_uppercase

Out[13]:

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [26]:

# How to get them all including symbols and make a cool password
import random
char_set = string.ascii_letters + string.digits + string.punctuation
print("".join(random.sample(char_set*9, 9)))

,o8r_xqAR

In [25]:

import crypt
passwd = input("Enter your email: ")
value = '$1$' + ''.join([random.choice(string.ascii_letters + string.digits) for _ in range(16)])
# print("%s" % value)
print(crypt.crypt(passwd, value))

Enter your email: me@me.com
$1ocjj.wZDJpw

OK, we got distracted a bit, so we're back 😅

So, back to where we were...

In [27]:

stop_list = stopwords.words('english') + list(string.punctuation)
tokens_no_stop = [token for token in all_tokens if token not in stop_list]
total_term_freq_no_stop = Counter(tokens_no_stop)

for word, freq in total_term_freq_no_stop.most_common(25):
    print("{}\t{}".format(word, freq))

butter	231
water	205
little	198
put	197
one	186
salt	185
fire	169
half	169
two	157
When	132
sauce	128
pepper	128
add	125
cut	125
flour	116
piece	116
The	111
sugar	100
saucepan	100
oil	99
pieces	95
well	94
meat	90
brown	88
small	87

Do you see capitalized When and The?

In [29]:

print(total_term_freq_no_stop['olive'])
print(total_term_freq_no_stop['olives'])
print(total_term_freq_no_stop['Olive'])
print(total_term_freq_no_stop['Olives'])
print(total_term_freq_no_stop['OLIVE'])
print(total_term_freq_no_stop['OLIVES'])

Step 5: Text Normalization¶

Replacing tokens with a canonical form, so we can group them together different spelling / variations of the same word:

lowercases
stemming
UStoGB mapping
synonym mapping

Stemming, btw - is a process of reducing the words -- nenerally modifed of derived, to their word stem or root form. The ain goal of stemmnig is to reduce related words to the same stem even when the stem isn't a dictionary word.

As a simple example, lets take this for instance:

handsome and handsomly would be stemmed as "handsom" - so it does not end up being a word you know!
Nice, cool, awesome would be stemmed as nice, cool and awesome

You must also be careful with one-way transformations as well such as lowercasing (these you should be able to imporve post your training/epochs and loading the computation graph when done)

Lets take a deeper look at this...

In [32]:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
all_tokens_lowercase = [token.lower() for token in all_tokens]
tokens_normalized = [stemmer.stem(token) for token in all_tokens_lowercase if token not in stop_list]

total_term_freq_normalized = Counter(tokens_normalized)

for word, freq in total_term_freq_normalized.most_common(25):
    print("{}\t{}".format(word, freq))

put	286
butter	245
salt	215
piec	211
one	210
water	209
cook	208
littl	198
cut	175
half	170
brown	169
fire	169
egg	163
two	162
add	160
boil	154
sauc	152
pepper	130
serv	128
remov	127
flour	123
season	123
sugar	116
slice	102
saucepan	101

Clearly you see the effect we just discussed aboce such as "littl" and so on...

n-grams -- What are they?¶

n-gram is a sequence of n-items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs.

n-grams of texts are used quite heavily in text mining and NLP tasks. They basically are a set of words that co-occur within a given sentence and typically move one word forward. for instance the dog jumps over the car, and say if N=2(a bi-gram), then n-grams would be as such:

th dog
dog jumps
jumps over
over the
the car

So, we have 5-ngrams in this case.

And if N = 3 (tri-gram), then you have four n-grams and so on...

the dog jumps
dog jumps over
jumps over the
over the car

So, how many N-grams can be in a sentence?

If X= number of words in a sentence K then the number of n-grams for sentence K would be: $$N_{gramsK} = X - (N - 1)$$

Two popular uses of N-grams:

For buildin g language models (unigram, bigram, trigram). Google Yahoo Microsoft Amazon Netflix etc. use web scale n-gram models to do stuff like fix spellings, word breaking and text summarization
For developing features for supervised Deep Learningh models such as SVM, MaxEnt model, Naive Bayes etc

OK, enough lecture, we move on to the next...

In [33]:

from nltk import ngrams

phrases = Counter(ngrams(all_tokens_lowercase, 2)) # N = 2
for phrase, frew in phrases.most_common(25):
    print(phrase, freq)
    # Sorry, I know its elegant to write like this => print()"{}\t{}".format(phrase, freq)), but too non-intuitive!

('in', 'the') 101
('in', 'a') 101
('of', 'the') 101
('with', 'a') 101
('.', 'when') 101
('the', 'fire') 101
('on', 'the') 101
(',', 'and') 101
('with', 'the') 101
('salt', 'and') 101
('it', 'is') 101
('a', 'little') 101
('piece', 'of') 101
('and', 'a') 101
('of', 'butter') 101
('and', 'pepper') 101
('.', 'the') 101
('and', 'the') 101
('when', 'the') 101
('with', 'salt') 101
('and', 'put') 101
('to', 'be') 101
('from', 'the') 101
('butter', ',') 101
(',', 'a') 101

In [34]:

phrases = Counter(ngrams(tokens_no_stop, 3)) # N = 3
for phrase, freq in phrases.most_common(25):
    print(phrase, freq)

('season', 'salt', 'pepper') 28
('Season', 'salt', 'pepper') 16
('pinch', 'grated', 'cheese') 11
('bread', 'crumbs', 'ground') 11
('cut', 'thin', 'slices') 11
('good', 'olive', 'oil') 10
('saucepan', 'piece', 'butter') 9
('another', 'piece', 'butter') 9
('cut', 'small', 'pieces') 9
('salt', 'pepper', 'When') 9
('half', 'inch', 'thick') 9
('greased', 'butter', 'sprinkled') 9
('small', 'piece', 'butter') 9
('tomato', 'sauce', 'No') 8
('sauce', 'No', '12') 8
('medium', 'sized', 'onion') 8
('ounces', 'Sweet', 'almonds') 8
('three', 'half', 'ounces') 8
('piece', 'butter', 'When') 7
('seasoning', 'salt', 'pepper') 7
('put', 'back', 'fire') 7
('oil', 'salt', 'pepper') 7
('butter', 'salt', 'pepper') 7
('tomato', 'paste', 'diluted') 7
('crumbs', 'ground', 'fine') 7

In [ ]: