I'm using spaCy to process a series of sentences and return the five most common words in each one. My goal is to store the output of that frequency analysis (using Counter) in a column beside each corresponding sentence. I think this is just the lack of coffee and sleep talking, but I'm stuck on why this keeps producing a DataFrame where the first value fills (and repeats) all the way down the column, instead of each row getting the values that match its own sentence.
Code:

    from collections import Counter

    # test_data is a DataFrame with three columns: a unique identifier, a title, and a sentence for each title.
    # nlp is a spaCy pipeline loaded earlier with spacy.load(...).
    for value in test_data['desc']:  # for each sentence in dataset
        desc = nlp(value)  # run spaCy natural language processing on the description
        words = [
            token.text  # for each token, etc
            for token in desc
            if not token.is_stop and not token.is_punct  # essentially, just keywords, no filler
        ]
        keys = list(Counter(words).most_common(5))  # store values from Counter
        key_list = ", ".join(map(str, keys))  # convert list to string
        test_data['key'] = key_list  # carry list over to dataframe

The output I'm getting is something like:
| uniq | title | desc | key |
|---|---|---|---|
| 1 | Title one... | Sentence one.. | ('kword1', 12), ('kword2', 8), ('kword3', 7) |
| 2 | Title two... | Sentence two... | ('kword1', 12), ('kword2', 8), ('kword3', 7) |
| 3 | Title three... | Sentence three... | ('kword1', 12), ('kword2', 8), ('kword3', 7) |
| 4 | Title four ... | Sentence four... | ('kword1', 12), ('kword2', 8), ('kword3', 7) |
Here kword1, kword2, and kword3 are all correct for the first row (i.e., they're the right output for Sentence one), but they're duplicated down every row, so they're not the correct output for Sentences two, three, and four.
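
To rule out spaCy itself, here is a stripped-down sketch that seems to reproduce the same fill-down behavior with plain pandas. The sample sentences and the `top_words` helper are made up for illustration, and a whitespace split stands in for the spaCy token filter, so treat it as an approximation of my real code rather than the real thing. It also builds the column a second way, one string per row via `apply`, purely for comparison.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in data; the real test_data has uniq, title, and desc columns.
test_data = pd.DataFrame({
    "uniq": [1, 2, 3],
    "title": ["Title one", "Title two", "Title three"],
    "desc": ["apple apple pear", "kiwi kiwi kiwi plum", "fig fig grape"],
})


def top_words(text, n=5):
    """Hypothetical helper: a whitespace split stands in for the spaCy token filter."""
    return ", ".join(map(str, Counter(text.split()).most_common(n)))


# Same shape as my loop: one string is assigned to the whole column on every pass.
looped = test_data.copy()
for value in looped["desc"]:
    looped["key"] = top_words(value)

# For comparison: compute one string per row and assign the resulting Series.
per_row = test_data.copy()
per_row["key"] = per_row["desc"].apply(top_words)

print(looped["key"].nunique())   # number of distinct values in the looped column
print(per_row["key"].nunique())  # number of distinct values in the per-row column
```

In this toy version the looped column ends up with a single repeated value while the `apply` column differs per row, which makes me suspect the assignment line rather than the Counter step, though I haven't confirmed that this carries over to the spaCy version.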
I'm not sure if this makes any sense, and I'm a bit of a Python novice without a comp sci background, so I'm all ears for help. Thank you in advance!