Python: Finding unknown repeated word(s) in a list of strings

Question

I have a list of strings, which are subjects from different email conversations. I would like to see if there are words or word combinations which are being used frequently.

An example list would be:

subjects = [
              'Proposal to cooperate - Company Name',
              'Company Name Introduction',
              'Into Other Firm / Company Name',
              'Request for Proposal'
           ]

The function would have to detect that "Company Name" is used more than once as a combination, and that "Proposal" is used more than once. These words won't be known in advance though, so I guess it would have to start trying all possible combinations.

The actual list is of course a lot longer than this example, so manually trying all combinations doesn't seem like the best way to go. What would be the best way to go about this?

UPDATE

I've used Tim Pietzcker's answer to start developing a function for this, but I get stuck on applying the Counter correctly. It keeps returning the length of the list as count for all phrases.

The phrases function, including punctuation filter and a check if this phrase has already been checked, and a max length per phrase of 3 words:

def phrases(string, phrase_list):
  words = string.split()
  result = []
  punctuation = '\'\"-_,.:;!? '
  for number in range(len(words)):
      for start in range(len(words)-number):
        if number+1 <= 3:
          phrase = " ".join(words[start:start+number+1])
          if phrase in phrase_list:
            pass
          else:
            phrase_list.append(phrase)
            phrase = phrase.strip(punctuation).lower()
            if phrase:
               result.append(phrase)
  return result, phrase_list

And then the loop through the list of subjects:

phrase_list = []
ranking = {}
for s in subjects:
    result, phrase_list = phrases(s, phrase_list)
    all_phrases = collections.Counter(phrase.lower() for s in subjects for phrase in result)

"all_phrases" returns a list with tuples where each count value is 167, which is the length of the subject list I'm using. Not sure what I'm missing here...

This is not a duplicate. At least not of that particular question. This is not about items in a list, it's about common phrases in a list of strings. Please read more than the title before closing.
– André Laszlo, Commented Mar 3, 2016 at 15:07
@InbarRose: That's the whole point of the question. Don't close questions as duplicates if you're not sure beforehand they actually are duplicate. It's not a race.
– Vincent Savard, Commented Mar 3, 2016 at 15:16
It's not strictly counting items in a list... is this of any assistance? stackoverflow.com/questions/18715688/…
– David Zemens, Commented Mar 3, 2016 at 15:18
@InbarRose: If you think the question is unclear, either vote to close it as unclear or ask for clarification in comments. A duplicate means that the question is the same. Just because you also have to count elements to achieve a solution for this problem doesn't mean it's the same question. In case of doubt, don't do anything.
– Vincent Savard, Commented Mar 3, 2016 at 15:20
@InbarRose: No, you just completely misunderstand the point of closing as duplicate. A gold Python badge doesn't give you the right to arbitrarily close Python questions as you see fit. That's not how moderation on Stack Overflow works.
– Vincent Savard, Commented Mar 3, 2016 at 15:23

Tim Pietzcker · Accepted Answer · 2016-03-04 07:20:15Z

2

You also want to find phrases that consist of more than one word. No problem. This should even scale quite well.

import collections

subjects = [
              'Proposal to cooperate - Company Name',
              'Company Name Introduction',
              'Into Other Firm / Company Name',
              'Request for Proposal',
              'Some more Firm / Company Names'
           ]

def phrases(string):
    words = string.split()
    result = []
    for number in range(len(words)):
        for start in range(len(words)-number):
             result.append(" ".join(words[start:start+number+1]))
    return result

The function phrases() splits the input string on whitespace and returns every run of consecutive words, of any length:

In [2]: phrases("A Day in the Life")
Out[2]:
['A',
 'Day',
 'in',
 'the',
 'Life',
 'A Day',
 'Day in',
 'in the',
 'the Life',
 'A Day in',
 'Day in the',
 'in the Life',
 'A Day in the',
 'Day in the Life',
 'A Day in the Life']

Now you can count how many times each of these phrases is found across all your subjects:

all_phrases = collections.Counter(phrase for subject in subjects for phrase in phrases(subject))

Result:

In [3]: print([(phrase, count) for phrase, count in all_phrases.items() if count > 1])
Out[3]:
[('Company', 4), ('Proposal', 2), ('Firm', 2), ('Name', 3), ('Company Name', 3),
 ('Firm /', 2), ('/', 2), ('/ Company', 2), ('Firm / Company', 2)]

Note that you might want to use criteria other than simply splitting on whitespace, for example ignoring punctuation and case.
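As a rough sketch of what such criteria could look like, the variant below lowercases each word, strips surrounding punctuation, and caps phrases at three words (the limit mentioned in the question's update). The names `normalized_phrases` and `MAX_WORDS` are made up for this illustration, and dropping punctuation-only tokens is an assumption that may or may not suit your data.

```python
import collections
import string

MAX_WORDS = 3  # assumed cap on phrase length, taken from the question's update


def normalized_phrases(subject, max_words=MAX_WORDS):
    """Yield every run of 1..max_words consecutive words, lowercased."""
    # Strip punctuation from both ends of each word and lowercase it.
    words = [w.strip(string.punctuation).lower() for w in subject.split()]
    # Drop tokens that were punctuation only, such as "-" or "/".
    # Note: this makes the words on either side of them adjacent.
    words = [w for w in words if w]
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
    'Some more Firm / Company Names',
]

# Build ONE Counter over all subjects, after generating all phrases.
counts = collections.Counter(
    phrase for subject in subjects for phrase in normalized_phrases(subject)
)

# Keep only phrases seen more than once.
repeated = {phrase: n for phrase, n in counts.items() if n > 1}
print(repeated)
```

Building the Counter once, directly from the generator, avoids counting the same `result` list once per subject, which appears to be what produces the constant count in the question's update. No "already seen" list is needed either: filtering duplicates before counting would leave every phrase with a count of one.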


1 Comment

Vincent Over a year ago

Thanks, this was a great start. I've implemented this in the loop but am having some troubles with the counter. I've updated the question with the latest status.

pp_ · Accepted Answer · 2016-03-03 15:39:39Z

0

I would suggest that you use a space as the separator; otherwise there are too many possibilities unless you specify what an allowed 'phrase' should look like.

To count word occurrences you can use Counter from the collections module:

import operator
from collections import Counter

d = Counter(' '.join(subjects).split())

# create a list of tuples, ordered by occurrence frequency
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)

# print all entries that occur more than once
for x in sorted_d:
    if x[1] > 1:
        print(x[1], x[0])

Output:

3 Name
3 Company
2 Proposal
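If you prefer not to sort by hand, `Counter.most_common()` already returns the (word, count) pairs ordered by descending count. A minimal sketch, assuming the same four subjects as in the question (the variable names are only illustrative):

```python
from collections import Counter

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

word_counts = Counter(' '.join(subjects).split())

# most_common() yields (word, count) pairs, highest count first.
repeated_words = [(word, n) for word, n in word_counts.most_common() if n > 1]

for word, n in repeated_words:
    print(n, word)
```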

edited Mar 3, 2016 at 15:39

answered Mar 3, 2016 at 15:20


2 Comments

Vincent Over a year ago

Thanks, this is helpful. Probably by first getting the repeated words, I can then start looking for word combinations, using the words this function finds. I'll play around with this a bit and post my result here.

BrockLee Over a year ago

As a potential alternative to using split() to tokenize a sentence, you could use the word_tokenize() function from nltk. nltk.org/book/ch03.html

Faller · Accepted Answer · 2016-03-03 15:42:18Z

0

Similar to pp_'s answer, using split().

import operator

subjects = [
          'Proposal to cooperate - Company Name',
          'Company Name Introduction',
          'Into Other Firm / Company Name',
          'Request for Proposal'
       ]
flat_list = [item for i in subjects for item in i.split() ]
count_dict = {i:flat_list.count(i) for i in flat_list}
sorted_dict = sorted(count_dict.items(), reverse=True, key=operator.itemgetter(1))

Output:

[('Name', 3),
('Company', 3),
('Proposal', 2),
('Other', 1),
('/', 1),
('for', 1),
('cooperate', 1),
('Request', 1),
('Introduction', 1),
('Into', 1),
('-', 1),
('to', 1),
('Firm', 1)]
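One thing to keep in mind for a longer list: `flat_list.count(i)` rescans the whole list for every word, so the dictionary comprehension does quadratic work. A hedged sketch of the same tally done in a single pass with `collections.Counter`, which should give identical counts (the name `single_pass` is only for illustration):

```python
from collections import Counter

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

flat_list = [word for subject in subjects for word in subject.split()]

# Original approach: one full scan of flat_list per word.
count_dict = {word: flat_list.count(word) for word in flat_list}

# Counter walks the list once and tallies as it goes.
single_pass = Counter(flat_list)

# Both approaches agree on every word.
print(dict(single_pass) == count_dict)
```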

