Step 1: Get and load the data¶
Go to Gutenberg press and https://www.gutenberg.org/ebooks/24407, get all the data and put it innto your /data/recipes folder.
import os
data_folder = os.path.join('data/recipes')
all_recipe_files = [os.path.join(data_folder, fname)
for fname in os.listdir(data_folder)]
documents = {}
for recipe_fname in all_recipe_files:
bname = os.path.basename(recipe_fname)
recipe_number = os.path.splitext(bname)[0]
with open(recipe_fname, 'r') as f:
documents[recipe_number] = f.read()
corpus_all_in_one = ' '.join([doc for doc in documents.values()])
print("Number of docs: {}".format(len(documents)))
print("Corpus size (char): {}".format(len(corpus_all_in_one)))
Number of docs: 220 Corpus size (char): 161146
Step 2: Let's tokenize¶
What this actually means is that we will be splitting raw string into a list of tokens. Where A "token" seentially is meaningful units of text such as words, phrases, punctuation, numbers, dates,...
from nltk.tokenize import word_tokenize
all_tokens = [token for token in word_tokenize(corpus_all_in_one)]
print("Total number of tokens: {}".format(len(all_tokens)))
Total number of tokens: 33719
Step 3: Let's do a word count¶
We start with a simple word count using collections.Counter function.
Why we're doing this?
We want to know the number times a word occurs in the whole corpus and in home many docs it occurs.
from collections import Counter
total_word_freq = Counter(all_tokens)
for word, freq in total_word_freq.most_common(20):
# Let's try the top 25 words in descending order
print("{}\t{}".format(word, freq))
the 1933 , 1726 . 1568 and 1435 a 1076 of 988 in 811 with 726 it 537 to 452 or 389 is 337 ( 295 ) 295 be 266 them 248 butter 231 on 220 water 205 little 198
Step 4: Stop words¶
Obviously you can see that a lot of words above were expected. Actually also quite boring as (, ) or full stop is something one would expect. If it were a scary novel a lot of ! would appear.
Wwe call these type of words stop words and they are pretty meaningless in themselves, right?
Also you will see that there is no universal list of stop words and removing them can have a certain desirable or undesirable effect, right?
So lets's import stop words from the big and mighty nltk library
from nltk.corpus import stopwords
import string
print(stopwords.words('english'))
print(len(stopwords.words('english')))
print(string.punctuation)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
153
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Tip: A little bit about strings and digits btw¶
There is a pythonic way to do stuff as well but that's for another time and you can play a little game by creating a password generator and checking out all kinds og modules in string as well as crypt (there is cryptography as well)
string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
# How to get them all including symbols and make a cool password
import random
char_set = string.ascii_letters + string.digits + string.punctuation
print("".join(random.sample(char_set*9, 9)))
,o8r_xqAR
import crypt
passwd = input("Enter your email: ")
value = '$1$' + ''.join([random.choice(string.ascii_letters + string.digits) for _ in range(16)])
# print("%s" % value)
print(crypt.crypt(passwd, value))
Enter your email: me@me.com $1ocjj.wZDJpw
OK, we got distracted a bit, so we're back 😅
So, back to where we were...
stop_list = stopwords.words('english') + list(string.punctuation)
tokens_no_stop = [token for token in all_tokens if token not in stop_list]
total_term_freq_no_stop = Counter(tokens_no_stop)
for word, freq in total_term_freq_no_stop.most_common(25):
print("{}\t{}".format(word, freq))
butter 231 water 205 little 198 put 197 one 186 salt 185 fire 169 half 169 two 157 When 132 sauce 128 pepper 128 add 125 cut 125 flour 116 piece 116 The 111 sugar 100 saucepan 100 oil 99 pieces 95 well 94 meat 90 brown 88 small 87
Do you see capitalized When and The?
print(total_term_freq_no_stop['olive'])
print(total_term_freq_no_stop['olives'])
print(total_term_freq_no_stop['Olive'])
print(total_term_freq_no_stop['Olives'])
print(total_term_freq_no_stop['OLIVE'])
print(total_term_freq_no_stop['OLIVES'])
27 3 1 0 0 1
Step 5: Text Normalization¶
Replacing tokens with a canonical form, so we can group them together different spelling / variations of the same word:
- lowercases
- stemming
- UStoGB mapping
- synonym mapping
Stemming, btw - is a process of reducing the words -- nenerally modifed of derived, to their word stem or root form. The ain goal of stemmnig is to reduce related words to the same stem even when the stem isn't a dictionary word.
As a simple example, lets take this for instance:
- handsome and handsomly would be stemmed as "handsom" - so it does not end up being a word you know!
- Nice, cool, awesome would be stemmed as nice, cool and awesome
- You must also be careful with one-way transformations as well such as lowercasing (these you should be able to imporve post your training/epochs and loading the computation graph when done)
Lets take a deeper look at this...
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
all_tokens_lowercase = [token.lower() for token in all_tokens]
tokens_normalized = [stemmer.stem(token) for token in all_tokens_lowercase if token not in stop_list]
total_term_freq_normalized = Counter(tokens_normalized)
for word, freq in total_term_freq_normalized.most_common(25):
print("{}\t{}".format(word, freq))
put 286 butter 245 salt 215 piec 211 one 210 water 209 cook 208 littl 198 cut 175 half 170 brown 169 fire 169 egg 163 two 162 add 160 boil 154 sauc 152 pepper 130 serv 128 remov 127 flour 123 season 123 sugar 116 slice 102 saucepan 101
Clearly you see the effect we just discussed aboce such as "littl" and so on...
n-grams -- What are they?¶
n-gram is a sequence of n-items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs.
n-grams of texts are used quite heavily in text mining and NLP tasks. They basically are a set of words that co-occur within a given sentence and typically move one word forward. for instance the dog jumps over the car, and say if N=2(a bi-gram), then n-grams would be as such:
- th dog
- dog jumps
- jumps over
- over the
- the car
So, we have 5-ngrams in this case.
And if N = 3 (tri-gram), then you have four n-grams and so on...
- the dog jumps
- dog jumps over
- jumps over the
- over the car
So, how many N-grams can be in a sentence?
If X= number of words in a sentence K then the number of n-grams for sentence K would be:
$$N_{gramsK} = X - (N - 1)$$
Two popular uses of N-grams:
- For buildin g language models (unigram, bigram, trigram). Google Yahoo Microsoft Amazon Netflix etc. use web scale n-gram models to do stuff like fix spellings, word breaking and text summarization
- For developing features for supervised Deep Learningh models such as SVM, MaxEnt model, Naive Bayes etc
OK, enough lecture, we move on to the next...
from nltk import ngrams
phrases = Counter(ngrams(all_tokens_lowercase, 2)) # N = 2
for phrase, frew in phrases.most_common(25):
print(phrase, freq)
# Sorry, I know its elegant to write like this => print()"{}\t{}".format(phrase, freq)), but too non-intuitive!
('in', 'the') 101
('in', 'a') 101
('of', 'the') 101
('with', 'a') 101
('.', 'when') 101
('the', 'fire') 101
('on', 'the') 101
(',', 'and') 101
('with', 'the') 101
('salt', 'and') 101
('it', 'is') 101
('a', 'little') 101
('piece', 'of') 101
('and', 'a') 101
('of', 'butter') 101
('and', 'pepper') 101
('.', 'the') 101
('and', 'the') 101
('when', 'the') 101
('with', 'salt') 101
('and', 'put') 101
('to', 'be') 101
('from', 'the') 101
('butter', ',') 101
(',', 'a') 101
phrases = Counter(ngrams(tokens_no_stop, 3)) # N = 3
for phrase, freq in phrases.most_common(25):
print(phrase, freq)
('season', 'salt', 'pepper') 28
('Season', 'salt', 'pepper') 16
('pinch', 'grated', 'cheese') 11
('bread', 'crumbs', 'ground') 11
('cut', 'thin', 'slices') 11
('good', 'olive', 'oil') 10
('saucepan', 'piece', 'butter') 9
('another', 'piece', 'butter') 9
('cut', 'small', 'pieces') 9
('salt', 'pepper', 'When') 9
('half', 'inch', 'thick') 9
('greased', 'butter', 'sprinkled') 9
('small', 'piece', 'butter') 9
('tomato', 'sauce', 'No') 8
('sauce', 'No', '12') 8
('medium', 'sized', 'onion') 8
('ounces', 'Sweet', 'almonds') 8
('three', 'half', 'ounces') 8
('piece', 'butter', 'When') 7
('seasoning', 'salt', 'pepper') 7
('put', 'back', 'fire') 7
('oil', 'salt', 'pepper') 7
('butter', 'salt', 'pepper') 7
('tomato', 'paste', 'diluted') 7
('crumbs', 'ground', 'fine') 7