
I've read loads of examples but haven't quite found what I'm looking for. I've tried several ways of doing this but am looking for the best one.

So the idea is that given:

s1 = ['a','b','c']
s2 = ['a','potato','d']
s3 = ['a','b','h']
strings=[s1,s2,s3]

the results should be:

['c']
['potato','d']
['h']

because these items are unique across the whole list of lists.

Thank you for any suggestions :)

  • stackoverflow.com/questions/2213923/… Already answered, please check this link. Commented Apr 2, 2020 at 11:30
  • @LakshmiRam Not the same. Commented Apr 2, 2020 at 11:32
  • What should happen if s1 = ['a', 'b', 'c', 'c']? Commented Apr 2, 2020 at 13:57

5 Answers


As a general approach you can keep a counter of all items and then keep those that have appeared only once.

In [21]: from collections import Counter

In [23]: counts = Counter(s1 + s2 + s3)

In [24]: [i for i in s1 if counts[i] == 1]
Out[24]: ['c']

In [25]: [i for i in s2 if counts[i] == 1]
Out[25]: ['potato', 'd']

In [26]: [i for i in s3 if counts[i] == 1]
Out[26]: ['h']

And if you have a nested list you can do the following:

In [28]: s = [s1, s2, s3]

In [30]: from itertools import chain

In [31]: counts = Counter(chain.from_iterable(s))

In [32]: [[i for i in lst if counts[i] == 1] for lst in s]
Out[32]: [['c'], ['potato', 'd'], ['h']]


What a beautiful and elegant solution. I'm going to replace my own function for removing duplicates with this. Thank you.

How about:

[i for i in s1 if i not in s2+s3] #gives ['c']
[j for j in s2 if j not in s1+s3] #gives ['potato', 'd']
[k for k in s3 if k not in s1+s2] #gives ['h']

If you want all of them in a list:

uniq = [[i for i in s1 if i not in s2+s3],
[j for j in s2 if j not in s1+s3],
[k for k in s3 if k not in s1+s2]]

#output
[['c'], ['potato', 'd'], ['h']]
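The same idea scales better with sets. A small sketch (not from the answer above) that builds the union of the "other" lists once per list, so each membership test is O(1) instead of rescanning a concatenated list for every element:

```python
s1 = ['a', 'b', 'c']
s2 = ['a', 'potato', 'd']
s3 = ['a', 'b', 'h']
lists = [s1, s2, s3]

uniq = []
for i, cur in enumerate(lists):
    # Union of every list except the current one.
    others = set().union(*(lst for j, lst in enumerate(lists) if j != i))
    uniq.append([x for x in cur if x not in others])

print(uniq)  # [['c'], ['potato', 'd'], ['h']]
```

This also generalizes to any number of lists without spelling out each `s1+s3`-style concatenation by hand.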



To find the unique elements across the 3 lists you can use the set symmetric difference (^) operation along with the union (|) operation, since you have 3 lists.

>>> s1 = ['a','b','c']
>>> s2 = ['a','potato','d']
>>> s3 = ['a','b','h']

>>> (set(s1) | set(s2)) ^ set(s3)


This doesn't work because symmetric_difference will return values that are present an odd number of times (e.g. 'a')
If we use the union along with the symmetric difference, it's possible.
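To illustrate the first comment's objection, consider a hypothetical element 'x' that appears in exactly two of the lists; the union/symmetric-difference expression wrongly reports it as unique:

```python
t1 = ['a', 'x']
t2 = ['x', 'b']
t3 = ['c']

# 'x' occurs in t1 and t2, so it should NOT be in the result,
# but the union hides the duplicate before the symmetric difference runs.
result = (set(t1) | set(t2)) ^ set(t3)
print(sorted(result))  # ['a', 'b', 'c', 'x']
```

Set operations only track membership, not multiplicity, so any counting-based definition of "unique" needs something like Counter instead.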

Counter (from collections) is the way to go for this:

from collections import Counter

s1 = ['a','b','c']
s2 = ['a','potato','d']
s3 = ['a','b','h']
strings=[s1,s2,s3]

counts  = Counter(s for sList in strings for s in sList)
uniques = [ [s for s in sList if counts[s]==1] for sList in strings ]

print(uniques) # [['c'], ['potato', 'd'], ['h']]

If you're not allowed to use an imported module, you could do it with the list's count() method but it would be much less efficient:

allStrings = [ s for sList in strings for s in sList ]
unique     = [[ s for s in sList if allStrings.count(s)==1] for sList in strings]

This can be made more efficient using a set to identify repeated values:

allStrings = ( s for sList in strings for s in sList )
seen       = set()
repeated   = set( s for s in allStrings if s in seen or seen.add(s))
unique     = [ [ s for s in sList if s not in repeated] for sList in strings ]
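As a quick check, running the set-based version above on the question's sample data (reusing the same variable names):

```python
strings = [['a', 'b', 'c'], ['a', 'potato', 'd'], ['a', 'b', 'h']]

allStrings = (s for sList in strings for s in sList)
seen = set()
# seen.add() returns None (falsy): a first sighting is recorded but not
# kept; a second sighting finds s in seen and lands in `repeated`.
repeated = set(s for s in allStrings if s in seen or seen.add(s))
unique = [[s for s in sList if s not in repeated] for sList in strings]

print(unique)  # [['c'], ['potato', 'd'], ['h']]
```

Unlike the plain `count()` version, this makes a single pass over the data, at the cost of one extra set.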



Assuming that you want this to work for an arbitrary number of sequences, a direct (but likely not the most efficient, since the others set could probably be updated incrementally between iterations instead of rebuilt) way to solve this would be:

def deep_unique_set(*seqs):
    for i, seq in enumerate(seqs):
        others = set(x for seq_ in (seqs[:i] + seqs[i + 1:]) for x in seq_)
        yield [x for x in seq if x not in others]

or the slightly faster but less memory efficient and otherwise identical:

def deep_unique_preset(*seqs):
    pile = list(x for seq in seqs for x in seq)
    k = 0
    for seq in seqs:
        num = len(seq)
        others = set(pile[:k] + pile[k + num:])
        yield [x for x in seq if x not in others]
        k += num

Testing it with the provided input:

s1 = ['a', 'b', 'c']
s2 = ['a', 'potato', 'd']
s3 = ['a', 'b', 'h']

print(list(deep_unique_set(s1, s2, s3)))
# [['c'], ['potato', 'd'], ['h']]
print(list(deep_unique_preset(s1, s2, s3)))
# [['c'], ['potato', 'd'], ['h']]

Note that if the input contains duplicates within one of the lists, they are not removed, i.e.:

s1 = ['a', 'b', 'c', 'c']
s2 = ['a', 'potato', 'd']
s3 = ['a', 'b', 'h']

print(list(deep_unique_set(s1, s2, s3)))
# [['c', 'c'], ['potato', 'd'], ['h']]
print(list(deep_unique_preset(s1, s2, s3)))
# [['c', 'c'], ['potato', 'd'], ['h']]

If all duplicates should be removed, a better approach is to count the values. The method of choice for this is collections.Counter, as proposed in @Kasramvd's answer:

import collections
import itertools

def deep_unique_counter(*seqs):
    counts = collections.Counter(itertools.chain.from_iterable(seqs))
    for seq in seqs:
        yield [x for x in seq if counts[x] == 1]

s1 = ['a', 'b', 'c', 'c']
s2 = ['a', 'potato', 'd']
s3 = ['a', 'b', 'h']
print(list(deep_unique_counter(s1, s2, s3)))
# [[], ['potato', 'd'], ['h']]

Alternatively, one could keep track of repeats, e.g.:

def deep_unique_repeat(*seqs):
    seen = set()
    repeated = set(x for seq in seqs for x in seq if x in seen or seen.add(x))
    for seq in seqs:
        yield [x for x in seq if x not in repeated]

which will have the same behavior as the collections.Counter-based approach:

s1 = ['a', 'b', 'c', 'c']
s2 = ['a', 'potato', 'd']
s3 = ['a', 'b', 'h']
print(list(deep_unique_repeat(s1, s2, s3)))
# [[], ['potato', 'd'], ['h']]

but is slightly faster, since it does not need to keep track of unused counts.

Another, highly inefficient, approach makes use of list.count() for counting instead of a global counter:

def deep_unique_count(*seqs):
    pile = list(x for seq in seqs for x in seq)
    for seq in seqs:
        yield [x for x in seq if pile.count(x) == 1]

These last two approaches are also proposed in @AlainT.'s answer.


Some timings for these are provided below:

import random

funcs = [deep_unique_set, deep_unique_preset, deep_unique_count,
         deep_unique_repeat, deep_unique_counter]

n = 100
m = 100
s = tuple([random.randint(0, 10 * n * m) for _ in range(n)] for _ in range(m))
for func in funcs:
    print(func.__name__)
    %timeit list(func(*s))
    print()

# deep_unique_set
# 10 loops, best of 3: 86.2 ms per loop

# deep_unique_preset
# 10 loops, best of 3: 57.3 ms per loop

# deep_unique_count
# 1 loop, best of 3: 1.76 s per loop

# deep_unique_repeat
# 1000 loops, best of 3: 1.87 ms per loop

# deep_unique_counter
# 100 loops, best of 3: 2.32 ms per loop

