Finding pattern within CSV file

Question

I have a CSV (Excel) file, for example:

Receipt Name    Address      Date       Time    Items
25007   A      ABC pte ltd   4/7/2016   10:40   Cheese, Cookie, Pie
.
.
25008   B      CCC pte ltd   4/7/2016   12:40   Cheese, Cookie

What is a simple way to compare the 'Items' column, find the most common combinations of items that people buy together, and display the top combinations? In this case the common pattern is Cheese, Cookie.

I think you need a more complete example. What if someone else bought Cheese and Chocolate and another bought just Cheese? It is unclear what you are looking for...
– dawg, Commented Oct 9, 2016 at 18:19
Some questions: In Items, do you have comma-separated products? Do you not know all the products? Could the most common pattern be in any order?
– Jose Raul Barreras, Commented Oct 9, 2016 at 18:31
@Darryl Dan, are you looking for just the pairs, or what exactly are the criteria?
– Padraic Cunningham, Commented Oct 9, 2016 at 18:39

dawg · Accepted Answer · 2016-10-09 21:03:09Z

Suppose after processing the CSV file you find the list of items from the CSV file to be:

>>> items=['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese', 
... 'Cheese,Mousetrap,Pie', 'Cheese,Jam','Cheese','Cookie,Cheese,Mousetrap']
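
For reference, here is a minimal sketch of how such a list might be built from the file. It assumes the `Items` field is the last column and is quoted in the CSV; the `SAMPLE` text and the `read_items` helper are hypothetical names used only for illustration. It also strips the spaces after the commas, since the question's data has `Cheese, Cookie` and a plain `split(',')` would otherwise treat `' Cookie'` and `'Cookie'` as different items.

```python
import csv
import io

# Hypothetical sample standing in for the real file; the Items field is
# quoted because it contains commas.
SAMPLE = '''Receipt,Name,Address,Date,Time,Items
25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese, Cookie, Pie"
25008,B,CCC pte ltd,4/7/2016,12:40,"Cheese, Cookie"
'''

def read_items(f):
    """Return one normalized 'a,b,c' string per data row of an open CSV file."""
    reader = csv.reader(f)
    next(reader)  # skip the header row
    items = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Strip stray spaces so ' Cookie' and 'Cookie' count as the same item.
        names = [name.strip() for name in row[-1].split(',') if name.strip()]
        items.append(','.join(names))
    return items

items = read_items(io.StringIO(SAMPLE))
print(items)  # ['Cheese,Cookie,Pie', 'Cheese,Cookie']
```

With a real file, the same function could be called as `read_items(open("test.csv", newline=""))`, ideally inside a `with` block.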

First determine all possible pairs:

>>> from itertools import combinations
>>> all_pairs={frozenset(t) for e in items for t in combinations(e.split(','),2)}

Then you can do:

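from collections import Counter
pair_counts=Counter()
for s in items:
    # A set of frozensets counts each pair at most once per receipt.
    for pair in {frozenset(t) for t in combinations(s.split(','), 2)}:
        # Sort so the key is the same tuple regardless of set iteration order.
        pair_counts.update({tuple(sorted(pair)):1})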

>>> pair_counts
Counter({('Cheese', 'Cookie'): 4, ('Cheese', 'Pie'): 3, ('Cookie', 'Pie'): 2, ('Cheese', 'Mousetrap'): 2, ('Cookie', 'Mousetrap'): 1, ('Cheese', 'Jam'): 1, ('Mousetrap', 'Pie'): 1, ('Cake', 'Cheese'): 1, ('Cake', 'Cookie'): 1})

Which can be extended to a more general case:

max_n=max(len(e.split(',')) for e in items)
for n in range(max_n, 1, -1):
    group_counts=Counter()
    for s in items:
        for group in {frozenset(t) for t in combinations(s.split(','), n)}:
            # Sorted tuple gives one stable key per group.
            group_counts.update({tuple(sorted(group)):1})
    print('group length: {}, most_common: {}'.format(n, group_counts.most_common()))

Prints:

group length: 3, most_common: [(('Cheese', 'Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap', 'Pie'), 1), (('Cheese', 'Cookie', 'Mousetrap'), 1), (('Cake', 'Cheese', 'Cookie'), 1)]
group length: 2, most_common: [(('Cheese', 'Cookie'), 4), (('Cheese', 'Pie'), 3), (('Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap'), 2), (('Cookie', 'Mousetrap'), 1), (('Cheese', 'Jam'), 1), (('Mousetrap', 'Pie'), 1), (('Cake', 'Cheese'), 1), (('Cake', 'Cookie'), 1)]
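
Since the question asks only for the top combinations, the same counting could be wrapped in a small helper that returns the `k` most frequent groups of a given size. This is a sketch under the same assumptions as above (items already split cleanly on commas); `top_combinations`, `size` and `k` are illustrative names, not part of any library.

```python
from collections import Counter
from itertools import combinations

def top_combinations(items, size=2, k=3):
    """Return the k most common item groups of the given size."""
    counts = Counter()
    for basket in items:
        # set() drops duplicate items within one receipt;
        # sorted() gives every group a canonical order.
        unique = sorted(set(basket.split(',')))
        counts.update(combinations(unique, size))
    return counts.most_common(k)

items = ['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese',
         'Cheese,Mousetrap,Pie', 'Cheese,Jam', 'Cheese', 'Cookie,Cheese,Mousetrap']

print(top_combinations(items, size=2, k=2))
# [(('Cheese', 'Cookie'), 4), (('Cheese', 'Pie'), 3)]
```

Sorting each receipt before calling `combinations` means the keys are already ordered tuples, so no `frozenset` round trip is needed.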

Padraic Cunningham · Accepted Answer · 2016-10-09 18:50:01Z

Presuming you have comma separated values, you can use a frozenset of the pairings and use a Counter dict to get the counts:

from collections import Counter
import csv

with open("test.csv") as f:
    next(f)
    counts = Counter(frozenset(tuple(row[-1].split(",")))
                     for row in csv.reader(f))
    print(counts.most_common())

If you want all combinations or pairs as per your updated input:

from collections import Counter
from itertools import combinations
import csv

def combs(s):
    return combinations(s.split(","), 2)

with open("test.csv") as f:
    next(f)
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                     for t in combs(row[-1]))
    # counts -> Counter({frozenset(['Cheese', 'Cookie']): 2, frozenset(['Cheese', 'Pie']): 1, frozenset(['Cookie', 'Pie']): 1})
    print(counts.most_common())

The order of the pairings is irrelevant as frozenset([1,2]) and frozenset([2,1]) would be considered the same.
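
A quick check of that claim, using item names in place of the numbers. Two frozensets with the same members compare equal and hash the same, so a `Counter` treats them as one key, whereas tuples in different orders stay separate:

```python
from collections import Counter

a = frozenset(['Cheese', 'Cookie'])
b = frozenset(['Cookie', 'Cheese'])

print(a == b)              # True
print(hash(a) == hash(b))  # True

counts = Counter([a, b])
print(counts[a])           # 2

# A plain tuple keeps order, so the same two items are split across two keys.
tuple_counts = Counter([('Cheese', 'Cookie'), ('Cookie', 'Cheese')])
print(len(tuple_counts))   # 2
```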

If you want to consider all combinations from 2-n:

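from collections import Counter
from itertools import chain, combinations
import csv

def combs(s):
    indiv_items = s.split(",")
    # Chain together the combinations of every size from 2 up to the full row.
    return chain.from_iterable(combinations(indiv_items, i) for i in range(2, len(indiv_items) + 1))

with open("test.csv") as f:
    next(f)  # skip the header row
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                     for t in combs(row[-1]))
    print(counts)
    print(counts.most_common())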

Which for:

Receipt,Name,Address,Date,Time,Items
25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese,Cookie,Pie"
25008,B,CCC pte ltd,4/7/2016,12:40,"Cheese,Cookie"
25009,B,CCC pte ltd,4/7/2016,12:40,"Cookie,Cheese,pizza"
25010,B,CCC pte ltd,4/7/2016,12:40,"Pie,Cheese,pizza"

would give you:

Counter({frozenset(['Cheese', 'Cookie']): 3, frozenset(['Cheese', 'pizza']): 2, frozenset(['Cheese', 'Pie']): 2, frozenset(['Cookie', 'Pie']): 1, frozenset(['Cheese', 'Cookie', 'Pie']): 1, frozenset(['Cookie', 'pizza']): 1, frozenset(['Pie', 'pizza']): 1, frozenset(['Cheese', 'Cookie', 'pizza']): 1, frozenset(['Cheese', 'Pie', 'pizza']): 1})
[(frozenset(['Cheese', 'Cookie']), 3), (frozenset(['Cheese', 'pizza']), 2), (frozenset(['Cheese', 'Pie']), 2), (frozenset(['Cookie', 'Pie']), 1), (frozenset(['Cheese', 'Cookie', 'Pie']), 1), (frozenset(['Cookie', 'pizza']), 1), (frozenset(['Pie', 'pizza']), 1), (frozenset(['Cheese', 'Cookie', 'pizza']), 1), (frozenset(['Cheese', 'Pie', 'pizza']), 1)]
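
The `frozenset` keys can be hard to read when printed. One possible way to display only the combinations that occur more than once is sketched below; `readable` and `min_count` are hypothetical names, and the `counts` literal simply repeats a few entries from the output above.

```python
from collections import Counter

def readable(counts, min_count=2):
    """Return (sorted item tuple, count) pairs with count >= min_count, largest first."""
    rows = [(tuple(sorted(group)), n) for group, n in counts.items() if n >= min_count]
    # Sort by descending count, then by item names so ties print in a stable order.
    return sorted(rows, key=lambda row: (-row[1], row[0]))

counts = Counter({
    frozenset(['Cheese', 'Cookie']): 3,
    frozenset(['Cheese', 'pizza']): 2,
    frozenset(['Cheese', 'Pie']): 2,
    frozenset(['Cookie', 'Pie']): 1,
})

for group, n in readable(counts):
    print(', '.join(group), n)
# Cheese, Cookie 3
# Cheese, Pie 2
# Cheese, pizza 2
```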

Apparently it only works if there are only 2 items; if there are 3 items and 2 of them are the same, they aren't counted in the pattern.
@DarrylDan, of course not, but you only have pairs in your sample input, so the answer is based on that fact.
