
I'm using spaCy to process a series of sentences and return the five most common words in each sentence. My goal is to store the output of that frequency analysis (using Counter) in a column beside each corresponding sentence. I think this is just the lack of coffee and sleep talking, but I'm stuck on why this keeps outputting a DataFrame where one value fills (and repeats) all the way down the column, instead of unique values matching the output for each sentence.

Code:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

# test_data is a DataFrame with three columns: a unique identifier, a title, and a sentence ('desc') for each title.

for value in test_data['desc']: # for each sentence in dataset
    desc = nlp(value) # run spacy natural language processing on the description
    words = [
        token.text # for each token, etc
        for token in desc
        if not token.is_stop and not token.is_punct # essentially, just keywords, no filler
    ]
    keys = list(Counter(words).most_common(5)) # store values from Counter 
    key_list = ", ".join(map(str, keys)) # convert list to string
    test_data['key'] = key_list # carry list over to dataframe

The output I'm getting is something like:

uniq  title            desc               key
1     Title one...     Sentence one...    ('kword1', 12), ('kword2', 8), ('kword3', 7)
2     Title two...     Sentence two...    ('kword1', 12), ('kword2', 8), ('kword3', 7)
3     Title three...   Sentence three...  ('kword1', 12), ('kword2', 8), ('kword3', 7)
4     Title four...    Sentence four...   ('kword1', 12), ('kword2', 8), ('kword3', 7)

Here kword1, 2, and 3 are all perfect for one row (i.e., it's the correct output for that one sentence), but that same result is duplicated down every row, rather than each row getting the correct output for its own sentence.

I'm not sure if this makes any sense and I'm a bit of a Python novice without a comp sci background/foundation so I am all ears for help. Thank you in advance!!

1 Answer

Your mistake is here:

test_data['key'] = key_list

You rewrite the entire column on each iteration. Assigning a single string to test_data['key'] broadcasts that one value to every row, so each pass through the loop overwrites the whole column; when the loop finishes, every row holds the result of one iteration.
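A minimal, self-contained demonstration of that broadcast behavior (toy data, no spaCy needed):

```python
import pandas as pd

df = pd.DataFrame({"desc": ["a b a", "c d", "e"]})

for value in df["desc"]:
    df["key"] = value  # a scalar assignment: pandas broadcasts it to EVERY row

# After the loop, every row holds the value from the final iteration
print(df["key"].tolist())  # → ['e', 'e', 'e']
```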

Instead, you can write a function and let pandas fill the column row by row:

def count5(text):
    # "text" is one value from the "desc" column, not a whole row
    desc = nlp(text)
    words = [token.text for token in desc if not token.is_stop and not token.is_punct]
    keys = Counter(words).most_common(5)
    return ", ".join(map(str, keys))

test_data["key"] = test_data["desc"].map(count5)

Output:

>>> test_data
                                                desc                                                key
0  Recent years have brought a revolution in the ...  ('languages', 2), ('Recent', 1), ('years', 1),...
1  The latest AI models are unlocking these areas...  ('latest', 1), ('AI', 1), ('models', 1), ('unl...
2  The examples of NLP use cases in everyday live...  ('examples', 1), ('NLP', 1), ('use', 1), ('cas...
3  Natural language processing algorithms emphasi...  ('Natural', 1), ('language', 1), ('processing'...
4  The outline of NLP examples in real world for ...  ('translation', 3), ('outline', 1), ('NLP', 1)...
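If you'd rather keep your explicit loop, the other fix is to collect each per-sentence result in a list and assign the whole list to the column once at the end. A runnable sketch, with a plain whitespace split standing in for the spaCy tokenizing/filtering (toy data):

```python
from collections import Counter

import pandas as pd

test_data = pd.DataFrame({"desc": ["red red blue", "green green green red"]})

keys = []
for value in test_data["desc"]:
    words = value.split()  # stand-in for the spaCy token filtering
    keys.append(", ".join(map(str, Counter(words).most_common(5))))

# One assignment: the list's length matches the row count,
# so each row gets its own value instead of a broadcast scalar.
test_data["key"] = keys
```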

1 Comment

Oh, duh duh duh duh duh duh. Thank you @Corralien!!
