I need to count how many times each word occurs in a sentence. I do it with
word_matrix[i][j] = sentences[i].count([*words_dict][j])
But this also counts a word when it is contained in another word; for example, 'in' is counted inside 'interactive'. How can I avoid this?
You could use collections.Counter for this:
from collections import Counter
s = 'This is a sentence'
Counter(s.lower().split())
# Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})
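Applied to the matrix from the question, this could look like the sketch below. The sample `sentences` and `words_dict` values and the `build_word_matrix` helper are assumptions for illustration, and a plain list of lists stands in for the NumPy array so the snippet needs only the standard library.

```python
from collections import Counter

# Hypothetical sample data; the question does not show the real values.
sentences = ["Log in to the interactive demo", "The demo is in beta"]
words_dict = {"in": 0, "demo": 1, "interactive": 2}

def build_word_matrix(sentences, words):
    """Return one row per sentence with whole-word counts for each word."""
    words = list(words)  # fix the column order once
    matrix = []
    for sentence in sentences:
        counts = Counter(sentence.lower().split())  # tokenise on whitespace
        # A Counter returns 0 for missing keys, so no membership test is needed.
        matrix.append([counts[w] for w in words])
    return matrix

word_matrix = build_word_matrix(sentences, words_dict)
print(word_matrix)  # [[1, 1, 1], [1, 1, 0]]
```

Building one Counter per sentence also avoids rescanning the sentence for every word in the dictionary.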
The goal is not to count how many words there are, but how many times each word occurs. Use split to tokenise the sentence; then, if a word is already in the dict, increment its value by one, otherwise add the word with a count of one:
paragraph='Nory was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been'
words=paragraph.split()
word_count={}
for i in words:
    if i in word_count:
        # word seen before: increment its count
        word_count[i]+=1
    else:
        # first occurrence: start the count at one
        word_count[i]=1
print(word_count)
Depending on the situation, the most efficient solution would be collections.Counter, but a word with punctuation attached is counted as a separate word:
i.e. 'in' will be different from 'interactive' (as you want), but it will also be different from 'in:'.
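To make the limitation concrete, the sketch below shows the token 'in:' being counted separately from 'in', together with one possible workaround: stripping leading and trailing punctuation from each token with `str.strip(string.punctuation)`. The sample sentence and the `count_words` helper are made up for illustration.

```python
import string
from collections import Counter

s = "Log in: the interactive demo is in beta."  # hypothetical example sentence

# A plain whitespace split keeps punctuation attached to the tokens.
raw = Counter(s.lower().split())
print(raw["in"], raw["in:"])  # 1 1

def count_words(sentence):
    """Count whole words, ignoring punctuation at either end of a token."""
    tokens = (t.strip(string.punctuation) for t in sentence.lower().split())
    return Counter(t for t in tokens if t)  # drop tokens that were only punctuation

clean = count_words(s)
print(clean["in"], clean["interactive"])  # 2 1
```

Note that this only removes punctuation at the edges of a token, so a hyphenated word such as 'built-in' stays a single token.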
An alternative that handles this case is to count the matches of a regular expression:
import re
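my_count = re.findall(r"(?:\s|^)({0})(?=[\s.,;:]|$)".format(re.escape([*words_dict][j])), sentences[i])
print(len(my_count))
What is the RegEx doing?
For a given word, you match:
the same word preceded by whitespace or the start of the string: (?:\s|^)
and followed by whitespace, a dot, comma, semicolon, colon, or the end of the string: (?=[\s.,;:]|$). Inside square brackets $ is a literal dollar sign, so the end-of-string anchor has to sit outside them. The lookahead (?=...) leaves the delimiter unconsumed, so a single space can end one match and start the next. re.escape keeps any special characters in the word from being read as part of the pattern.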
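Instead of listing the delimiters by hand, a common alternative is the word-boundary anchor `\b`. The sketch below uses that swapped-in technique rather than the pattern above; the `count_whole_word` helper and the sample sentence are hypothetical.

```python
import re

def count_whole_word(word, sentence):
    """Count occurrences of `word` that are not part of a longer word."""
    # re.escape guards against words containing regex metacharacters.
    pattern = r"\b{0}\b".format(re.escape(word))
    return len(re.findall(pattern, sentence))

s = "Log in: the interactive demo is in beta, in"  # hypothetical example
print(count_whole_word("in", s))  # 3
print(s.count("in"))              # 4, because 'interactive' contains 'in'
```

One caveat: `\b` assumes the word itself starts and ends with a letter, digit, or underscore, so it would not work as intended for a term such as 'c++'.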
word_matrix = np.zeros(shape=(n, d))
for i in range(n):
    for j in range(d):
        word_matrix[i][j] = sentences[i].count([*words_dict][j])