I have a CSV (Excel) file, for example:

Receipt  Name  Address      Date      Time   Items
25007    A     ABC pte ltd  4/7/2016  10:40  Cheese, Cookie, Pie
.
.
25008    B     CCC pte ltd  4/7/2016  12:40  Cheese, Cookie

What is a simple way to compare the 'Items' column, find the most common patterns of items that people buy together, and display the top combinations? In this case, the common pattern is Cheese, Cookie.

  • What is the actual format of your file? Commented Oct 9, 2016 at 18:11
  • I think you need a more complete example. What if someone else bought Cheese and Chocolate and another bought just Cheese? It is unclear what you are looking for... Commented Oct 9, 2016 at 18:19
  • Some questions: In Items, are the products comma separated? Do you not know all the products in advance? Can the most common pattern appear in any order? Commented Oct 9, 2016 at 18:31
  • @Darryl Dan, are you looking for just the pairs or what exactly is the criteria? Commented Oct 9, 2016 at 18:39

2 Answers

Suppose that, after processing the CSV file, the list of items is:

>>> items=['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese', 
... 'Cheese,Mousetrap,Pie', 'Cheese,Jam','Cheese','Cookie,Cheese,Mousetrap']
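
One way such a list might be produced is sketched below. It assumes the file really is comma separated, has a header row, and quotes the Items field so that its internal commas survive parsing. The helper name load_items and the sample lines are illustrative, not taken from the question.

```python
import csv

def load_items(lines):
    """Hypothetical helper: return the Items column as 'A,B,C' strings.

    `lines` is any iterable of CSV lines, e.g. an open file object.
    """
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    # Assumption: Items is the last column and is quoted in the file.
    return [",".join(part.strip() for part in row[-1].split(","))
            for row in reader if row]

sample = [
    'Receipt,Name,Address,Date,Time,Items',
    '25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese, Cookie, Pie"',
    '25008,B,CCC pte ltd,4/7/2016,12:40,"Cheese, Cookie"',
]
print(load_items(sample))
```

With a real file, the same helper could be called as load_items(f) inside a with open(path, newline="") as f: block, since csv.reader accepts any iterable of lines.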

First determine all possible pairs:

>>> from itertools import combinations
>>> all_pairs={frozenset(t) for e in items for t in combinations(e.split(','),2)}

Then count how many receipts contain each pair:

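from collections import Counter

pair_counts = Counter()
for s in items:
    # The set of frozensets counts each pair at most once per receipt.
    for pair in {frozenset(t) for t in combinations(s.split(','), 2)}:
        # Sort so the same pair always maps to the same tuple key.
        pair_counts[tuple(sorted(pair))] += 1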

>>> pair_counts
Counter({('Cheese', 'Cookie'): 4, ('Cheese', 'Pie'): 3, ('Cookie', 'Pie'): 2, ('Cheese', 'Mousetrap'): 2, ('Cookie', 'Mousetrap'): 1, ('Cheese', 'Jam'): 1, ('Mousetrap', 'Pie'): 1, ('Cake', 'Cheese'): 1, ('Cake', 'Cookie'): 1})
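
To display only the top combinations, Counter.most_common accepts a limit. The following is a minimal sketch on the same sample list; the helper name top_pairs and the cutoff of 2 are illustrative choices, not part of the answer above.

```python
from collections import Counter
from itertools import combinations

items = ['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese',
         'Cheese,Mousetrap,Pie', 'Cheese,Jam', 'Cheese', 'Cookie,Cheese,Mousetrap']

def top_pairs(items, k):
    """Hypothetical helper: the k most frequent item pairs across all receipts."""
    counts = Counter()
    for s in items:
        # set() so an item repeated on one receipt is not counted twice;
        # sorted() so every pair comes out in one canonical order.
        unique = sorted(set(s.split(',')))
        counts.update(combinations(unique, 2))
    return counts.most_common(k)

for pair, n in top_pairs(items, 2):
    print('{} + {}: bought together {} times'.format(pair[0], pair[1], n))
```

For this sample the two leading pairs are Cheese + Cookie (4 receipts) and Cheese + Pie (3 receipts). Pairs with equal counts may appear in either order, so a larger cutoff can cut through a tie.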

This can be extended to groups of any size:

max_n = max(len(e.split(',')) for e in items)
for n in range(max_n, 1, -1):
    group_counts = Counter()
    for s in items:
        for group in {frozenset(t) for t in combinations(s.split(','), n)}:
            # Sort so each group maps to one canonical tuple key.
            group_counts[tuple(sorted(group))] += 1
    print('group length: {}, most_common: {}'.format(n, group_counts.most_common()))

Prints:

group length: 3, most_common: [(('Cheese', 'Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap', 'Pie'), 1), (('Cheese', 'Cookie', 'Mousetrap'), 1), (('Cake', 'Cheese', 'Cookie'), 1)]
group length: 2, most_common: [(('Cheese', 'Cookie'), 4), (('Cheese', 'Pie'), 3), (('Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap'), 2), (('Cookie', 'Mousetrap'), 1), (('Cheese', 'Jam'), 1), (('Mousetrap', 'Pie'), 1), (('Cake', 'Cheese'), 1), (('Cake', 'Cookie'), 1)]

Presuming you have comma separated values, you can use a frozenset of the pairings and use a Counter dict to get the counts:

from collections import Counter
import csv

with open("test.csv") as f:
    next(f)  # skip the header row
    # Count each whole basket; frozenset makes the item order irrelevant.
    counts = Counter(frozenset(row[-1].split(","))
                     for row in csv.reader(f))
    print(counts.most_common())

If you want all combinations or pairs as per your updated input:

from collections import Counter
from itertools import combinations
import csv

def combs(s):
    # All unordered pairs of items in one comma-separated basket.
    return combinations(s.split(","), 2)

with open("test.csv") as f:
    next(f)  # skip the header row
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                            for t in combs(row[-1]))
    # counts -> Counter({frozenset(['Cheese', 'Cookie']): 2, frozenset(['Cheese', 'Pie']): 1, frozenset(['Cookie', 'Pie']): 1})
    print(counts.most_common())

The order of the pairings is irrelevant as frozenset([1,2]) and frozenset([2,1]) would be considered the same.
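
A small sketch of why this matters for counting: equal frozensets collapse into one Counter key, whereas plain tuples in different orders stay separate. The variable names are illustrative only.

```python
from collections import Counter

a = frozenset(['Cheese', 'Cookie'])
b = frozenset(['Cookie', 'Cheese'])

# Equal sets hash the same, so both orderings land on one Counter key.
counts = Counter([a, b])
print(a == b)       # True
print(counts[a])    # 2

# A plain tuple keeps its order, so the two orderings are separate keys.
tuple_counts = Counter([('Cheese', 'Cookie'), ('Cookie', 'Cheese')])
print(len(tuple_counts))    # 2
```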

If you want to consider all combinations of size 2 up to n:

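from collections import Counter
from itertools import chain, combinations
import csv

def combs(s):
    indiv_items = s.split(",")
    # Every combination of size 2 up to the full basket.
    return chain.from_iterable(combinations(indiv_items, i)
                               for i in range(2, len(indiv_items) + 1))
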
with open("test.csv") as f:
    next(f)
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                         for t in combs(row[-1]))
    print(counts)
    print(counts.most_common())

Which for:

Receipt,Name,Address,Date,Time,Items
25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese,Cookie,Pie"
25008,B,CCC pte ltd,4/7/2016,12:40,"Cheese,Cookie"
25009,B,CCC pte ltd,4/7/2016,12:40,"Cookie,Cheese,pizza"
25010,B,CCC pte ltd,4/7/2016,12:40,"Pie,Cheese,pizza"

would give you:

Counter({frozenset(['Cheese', 'Cookie']): 3, frozenset(['Cheese', 'pizza']): 2, frozenset(['Cheese', 'Pie']): 2, frozenset(['Cookie', 'Pie']): 1, frozenset(['Cheese', 'Cookie', 'Pie']): 1, frozenset(['Cookie', 'pizza']): 1, frozenset(['Pie', 'pizza']): 1, frozenset(['Cheese', 'Cookie', 'pizza']): 1, frozenset(['Cheese', 'Pie', 'pizza']): 1})
[(frozenset(['Cheese', 'Cookie']), 3), (frozenset(['Cheese', 'pizza']), 2), (frozenset(['Cheese', 'Pie']), 2), (frozenset(['Cookie', 'Pie']), 1), (frozenset(['Cheese', 'Cookie', 'Pie']), 1), (frozenset(['Cookie', 'pizza']), 1), (frozenset(['Pie', 'pizza']), 1), (frozenset(['Cheese', 'Cookie', 'pizza']), 1), (frozenset(['Cheese', 'Pie', 'pizza']), 1)]
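
The frozenset keys are awkward to read when printed directly. A possible way to show just the top entries is sketched below; format_top is a hypothetical helper, and the small Counter is a hand-copied subset of the result above so that the snippet runs without the CSV file.

```python
from collections import Counter

# Assumed input: a Counter keyed by frozensets, as built above.
counts = Counter({
    frozenset(['Cheese', 'Cookie']): 3,
    frozenset(['Cheese', 'pizza']): 2,
    frozenset(['Cheese', 'Pie']): 2,
    frozenset(['Cookie', 'Pie']): 1,
})

def format_top(counts, k):
    """Hypothetical helper: the k most common groups as 'A, B: n' strings."""
    # Sorting the members gives a stable, readable label for each group.
    return ['{}: {}'.format(', '.join(sorted(group)), n)
            for group, n in counts.most_common(k)]

for line in format_top(counts, 1):
    print(line)
```

For this subset the single top line is "Cheese, Cookie: 3".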

2 Comments

Apparently it only works if there are exactly 2 items; if there are 3 items where 2 of them are the same, it doesn't count them as part of the pattern.
@DarrylDan, of course not, but you only have pairs in your sample input, so the answer is based on that fact.
