0

I have a list of strings, which are subjects from different email conversations. I would like to see if there are words or word combinations which are being used frequently.

An example list would be:

subjects = [
              'Proposal to cooperate - Company Name',
              'Company Name Introduction',
              'Into Other Firm / Company Name',
              'Request for Proposal'
           ]

The function would have to detect that "Company Name" is used as a combination more than once, and that "Proposal" is used more than once. These words won't be known in advance though, so I guess it would have to try all possible combinations.

The actual list is of course a lot longer than this example, so manually trying all combinations doesn't seem like the best way to go. What would be the best way to go about this?

UPDATE

I've used Tim Pietzcker's answer to start developing a function for this, but I'm stuck on applying the Counter correctly. It keeps returning the length of the list as the count for all phrases.

The phrases function, including punctuation filter and a check if this phrase has already been checked, and a max length per phrase of 3 words:

def phrases(string, phrase_list):
    words = string.split()
    result = []
    punctuation = '\'\"-_,.:;!? '
    for number in range(len(words)):
        for start in range(len(words)-number):
            if number+1 <= 3:
                phrase = " ".join(words[start:start+number+1])
                if phrase in phrase_list:
                    pass
                else:
                    phrase_list.append(phrase)
                    phrase = phrase.strip(punctuation).lower()
                    if phrase:
                        result.append(phrase)
    return result, phrase_list

And then the loop through the list of subjects:

phrase_list = []
ranking = {}
for s in subjects:
    result, phrase_list = phrases(s, phrase_list)
    all_phrases = collections.Counter(phrase.lower() for s in subjects for phrase in result)

"all_phrases" returns a list with tuples where each count value is 167, which is the length of the subject list I'm using. Not sure what I'm missing here...

18
  • 3
    This is not a duplicate. At least not of that particular question. This is not about items in a list, it's about common phrases in a list of strings. Please read more than the title before closing. Commented Mar 3, 2016 at 15:07
  • 1
    @InbarRose: That's the whole point of the question. Don't close questions as duplicates if you're not sure beforehand they actually are duplicate. It's not a race. Commented Mar 3, 2016 at 15:16
  • 1
    It's not strictly counting items in a list... is this of any assistance? stackoverflow.com/questions/18715688/… Commented Mar 3, 2016 at 15:18
  • 2
    @InbarRose: If you think the question is unclear, either vote to close it as unclear or ask for clarification in comments. A duplicate means that the question is the same. Just because you also have to count elements to achieve a solution for this problem doesn't mean it's the same question. In case of doubt, don't do anything. Commented Mar 3, 2016 at 15:20
  • 1
    @InbarRose: No, you just completely misunderstand the point of closing as duplicate. A gold Python badge doesn't give you the right to arbitrarily close Python questions as you see fit. That's not how moderation on Stack Overflow works. Commented Mar 3, 2016 at 15:23

3 Answers

2

You also want to find phrases that consist of more than a single word. No problem. This should even scale quite well.

import collections

subjects = [
              'Proposal to cooperate - Company Name',
              'Company Name Introduction',
              'Into Other Firm / Company Name',
              'Request for Proposal',
              'Some more Firm / Company Names'
           ]

def phrases(string):
    words = string.split()
    result = []
    # number + 1 is the phrase length in words
    for number in range(len(words)):
        # slide a window of that length across the word list
        for start in range(len(words)-number):
            result.append(" ".join(words[start:start+number+1]))
    return result

The function phrases() splits the input string on whitespace and returns every run of consecutive words, from single words up to the whole string:

In [2]: phrases("A Day in the Life")
Out[2]:
['A',
 'Day',
 'in',
 'the',
 'Life',
 'A Day',
 'Day in',
 'in the',
 'the Life',
 'A Day in',
 'Day in the',
 'in the Life',
 'A Day in the',
 'Day in the Life',
 'A Day in the Life']

Now you can count how many times each of these phrases occurs across all your subjects:

all_phrases = collections.Counter(phrase for subject in subjects for phrase in phrases(subject))

Result:

In [3]: print([(phrase, count) for phrase, count in all_phrases.items() if count > 1])
Out[3]:
[('Company', 4), ('Proposal', 2), ('Firm', 2), ('Name', 3), ('Company Name', 3),
 ('Firm /', 2), ('/', 2), ('/ Company', 2), ('Firm / Company', 2)]

Note that you might want to use criteria other than simply splitting on whitespace, for example ignoring punctuation and case.
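
As a rough illustration of that last note, here is a minimal sketch that lowercases each subject and keeps only runs of word characters before building phrases. The tokenize() helper and the `\w+` pattern are my assumptions, not part of the answer above; a different pattern may suit real subjects better. Also note that "Names" still counts separately from "Name".

```python
import collections
import re

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
    'Some more Firm / Company Names',
]

def tokenize(subject):
    # Hypothetical helper: lowercase, then keep only runs of word
    # characters, so standalone "-" and "/" tokens disappear.
    return re.findall(r"\w+", subject.lower())

def phrases(words):
    # Every run of consecutive words, from length 1 up to the full list.
    return [" ".join(words[start:start + n])
            for n in range(1, len(words) + 1)
            for start in range(len(words) - n + 1)]

counts = collections.Counter(
    phrase for subject in subjects for phrase in phrases(tokenize(subject))
)

# most_common() sorts by descending count; keep only repeated phrases.
repeated = [(phrase, n) for phrase, n in counts.most_common() if n > 1]
print(repeated)
```

Because the punctuation tokens are dropped, "Firm / Company" becomes "firm company", so words on either side of a separator are treated as adjacent. Whether that is desirable depends on the data.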

1 Comment

Thanks, this was a great start. I've implemented this in the loop but am having some troubles with the counter. I've updated the question with the latest status.
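
Regarding the Counter problem in the question's update, two things appear to be going on. First, the generator `for s in subjects for phrase in result` never uses `s`: it walks the same `result` list once per subject, so every phrase is counted len(subjects) times (167 in the asker's case). Second, the `phrase_list` check discards any phrase that has been seen before, which removes exactly the repeats that should be counted. A minimal sketch of one possible fix follows, assuming the goal is one count per occurrence. Stripping punctuation per word instead of per phrase is my change, and the names PUNCTUATION and max_words are illustrative.

```python
import collections

PUNCTUATION = '\'"-_,.:;!? '

def phrases(string, max_words=3):
    # Normalize each word first and drop tokens that were pure punctuation.
    words = [w.strip(PUNCTUATION).lower() for w in string.split()]
    words = [w for w in words if w]
    result = []
    # Phrase lengths 1..max_words, limited by the number of words available.
    for length in range(1, min(max_words, len(words)) + 1):
        for start in range(len(words) - length + 1):
            result.append(" ".join(words[start:start + length]))
    return result

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

# Build the Counter once, calling phrases() for each subject, so every
# occurrence in every subject contributes exactly one count.
all_phrases = collections.Counter(
    phrase for s in subjects for phrase in phrases(s)
)

repeated = {phrase: n for phrase, n in all_phrases.items() if n > 1}
print(repeated)
```

No "already seen" list is needed here, since the Counter itself merges repeated phrases.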
0

I would suggest that you use whitespace as the separator. Otherwise there are too many possibilities, unless you specify what an allowed 'phrase' should look like.

To count word occurrences you can use Counter from the collections module:

import operator
from collections import Counter

d = Counter(' '.join(subjects).split())

# create a list of tuples, ordered by occurrence frequency
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)

# print all entries that occur more than once
for x in sorted_d:
    if x[1] > 1:
        print(x[1], x[0])

Output:

3 Name
3 Company
2 Proposal
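
The same Counter idea can be stretched to adjacent word pairs. This is only a sketch under the assumption that a "combination" means two consecutive words within one subject. Pairs are built per subject rather than from the joined string, because ' '.join(subjects) would also pair the last word of one subject with the first word of the next.

```python
from collections import Counter

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

pair_counts = Counter()
for subject in subjects:
    words = subject.split()
    # zip() pairs each word with its successor inside this subject only.
    pair_counts.update(zip(words, words[1:]))

# Keep only the pairs that occur more than once, joined back into strings.
repeated_pairs = {" ".join(pair): n for pair, n in pair_counts.items() if n > 1}
print(repeated_pairs)
```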

2 Comments

Thanks, this is helpful. Probably by first getting the repeated words, I can then start looking for word combinations, using the words this function finds. I'll play around with this a bit and post my result here.
As an alternative to using split() to tokenize a sentence, you could also use the word_tokenize() function from nltk. nltk.org/book/ch03.html
0

Similar to pp_'s answer, using split().

import operator

subjects = [
          'Proposal to cooperate - Company Name',
          'Company Name Introduction',
          'Into Other Firm / Company Name',
          'Request for Proposal'
       ]
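# flatten all subjects into one list of whitespace-separated words
flat_list = [item for i in subjects for item in i.split()]
# word -> count; list.count() rescans the list for every word, so
# collections.Counter is the faster choice on long lists
count_dict = {i: flat_list.count(i) for i in flat_list}
# list of (word, count) tuples, most frequent first
sorted_dict = sorted(count_dict.items(), reverse=True, key=operator.itemgetter(1))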

Output:

[('Name', 3),
('Company', 3),
('Proposal', 2),
('Other', 1),
('/', 1),
('for', 1),
('cooperate', 1),
('Request', 1),
('Introduction', 1),
('Into', 1),
('-', 1),
('to', 1),
('Firm', 1)]
