I have a corpus of text that needs to be analysed. It is stored in a data frame with the following columns:

print((df.columns.values))
>>>> ['Unique ID' 'Quarter' 'Theme' 'Subtheme' 'Driver' 'Ticker' 'Company'
'Sub-sector' 'Issue weight' 'Quote' 'Executive name' 'Designation'
'Quote_len' 'word_count']

I have written a function to find the top 20 words in the 'Quote' column after removing stop words.

import pandas as pd
import cufflinks as cf  # provides the .iplot() accessor used below
from sklearn.feature_extraction.text import CountVectorizer

cf.go_offline()  # render plotly charts locally (e.g. in a notebook)

def get_top_n_words(corpus, n=None):
    # Build the vocabulary, ignoring English stop words
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each word's count over all documents
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Most frequent words first
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['Quote'].values.astype('U'), 20)
for word, freq in common_words:
    print(word, freq)
df2 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df2.groupby('ReviewText').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 words in review after removing stop words')

Now I wish to add something like a where clause to the code, so that the results are restricted to a particular value of the "Theme" column.

For example, Theme = 'Competitive advantage'.

How can I do that?

Comment: result = df[ df['Theme'] == 'Competitive advantage' ] ? (Commented Jan 21, 2020 at 12:27)

1 Answer

Use DataFrame.loc[...] to filter down your results.

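For example, df = df.loc[df.Theme == 'Competitive advantage']. Note that this overwrites df with the filtered rows; assign the result to a different name if you still need the full data frame later.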

Then continue with common_words = get_top_n_words(df['Quote'].values.astype('U'), 20), but now the dataframe will only include results where Theme == 'Competitive advantage'.
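Putting the two steps together, here is a minimal sketch that filters first and then counts, without overwriting the original data frame. The helper name top_words_for_theme and the sample rows are made up for illustration; get_top_n_words follows the function in the question, with the counts cast to plain ints.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def get_top_n_words(corpus, n=None):
    # Same approach as in the question: count words, skipping English stop words.
    vec = CountVectorizer(stop_words="english").fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, int(sum_words[0, idx])) for word, idx in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)[:n]


def top_words_for_theme(df, theme, n=20):
    # Hypothetical helper: filter on Theme, then count words in the matching quotes.
    subset = df.loc[df["Theme"] == theme]
    if subset.empty:
        # CountVectorizer cannot fit on an empty corpus, so return early.
        return []
    return get_top_n_words(subset["Quote"].values.astype("U"), n)


# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "Theme": ["Competitive advantage", "Pricing", "Competitive advantage"],
    "Quote": ["Our brand is our moat", "Prices rose again", "Brand loyalty protects margins"],
})

top = top_words_for_theme(sample, "Competitive advantage", n=3)
for word, freq in top:
    print(word, freq)
```

Because the filter lives inside the helper, the same call can be repeated for any other theme, and the full data frame stays available for the unfiltered plot.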
