Remove duplicates while merging lists of lists into single list in python

Question

I have a list of lists like the following:

data = [
['foo', 'bar'],
['one', 'two']
]

I want to flatten these lists by alternating between them, so the output looks like this:

flattened = ['foo', 'one', 'bar', 'two']

I am using list(chain.from_iterable(zip_longest(*data))), which works fine.
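For reference, a self-contained version of that approach might look like the sketch below. The imports are assumed (the question does not show them), and the second, uneven input is made up only to illustrate that zip_longest pads shorter lists with None by default.

```python
from itertools import chain, zip_longest

data = [
    ['foo', 'bar'],
    ['one', 'two'],
]

# zip_longest(*data) groups the items at each position: ('foo', 'one'), ('bar', 'two').
# chain.from_iterable then flattens those tuples into one sequence.
flattened = list(chain.from_iterable(zip_longest(*data)))
print(flattened)  # ['foo', 'one', 'bar', 'two']

# Illustrative uneven input (not from the question): shorter lists are padded with None.
uneven = [['a', 'b', 'c'], ['x']]
padded = list(chain.from_iterable(zip_longest(*uneven)))
print(padded)  # ['a', 'x', 'b', None, 'c', None]
```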

But I am trying to figure out how to handle cases where the lists contain duplicates that I want to get rid of.

data = [
['foo', 'bar'],
['foo', 'two']
]

I want something like

flattened = ['foo', 'two', 'bar']

rather than ['foo', 'foo', 'bar', 'two']

How do I do this?

@Marat, well if they want to alternate between lists, then inherently yes.
– Brian, Commented Sep 23, 2019 at 19:21
I don't have time at the moment, but you can build an OrderedDict from your output list and then convert back (or use a plain dict in 3.7+).
– MyNameIsCaleb, Commented Sep 23, 2019 at 19:33
Is the correct output ['foo', 'bar', 'bar'] or ['foo', 'two', 'bar']?
– Massifox, Commented Sep 23, 2019 at 19:46

Alexander · Accepted Answer · 2019-09-23 20:07:51Z

4

Use a set to keep track of what you've already seen; a set membership test is O(1) on average.

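from itertools import chain, zip_longest

result = []
seen = set()  # items already added to result
# Note: zip_longest pads shorter lists with None when the lists differ in length.
for item in chain.from_iterable(zip_longest(*data)):
    if item not in seen:
        seen.add(item)
        result.append(item)

>>> result
['foo', 'bar', 'two']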

Note that this question talks about removing duplicates from a list: Removing duplicates in lists

TL;DR

For Python 3.7+ (or CPython 3.6+):

>>> list(dict.fromkeys(chain.from_iterable(zip_longest(*data))))
['foo', 'bar', 'two']
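One caveat with the one-liner: zip_longest fills shorter lists with None, so inputs of uneven length leave a None in the result. Below is a minimal sketch of one way around that, assuming a private sentinel object as the fill value. The names _MISSING and merge_unique are illustrative and not part of the answer.

```python
from itertools import chain, zip_longest

_MISSING = object()  # hypothetical sentinel; cannot collide with real data


def merge_unique(lists):
    """Interleave lists position by position, keeping the first occurrence of each item."""
    interleaved = chain.from_iterable(zip_longest(*lists, fillvalue=_MISSING))
    # dict.fromkeys keeps insertion order (guaranteed in Python 3.7+), so it dedupes in order.
    unique = dict.fromkeys(interleaved)
    unique.pop(_MISSING, None)  # drop the padding marker if any list was shorter
    return list(unique)


print(merge_unique([['foo', 'bar'], ['foo', 'two']]))  # ['foo', 'bar', 'two']
print(merge_unique([['a', 'b', 'c'], ['a']]))          # ['a', 'b', 'c']
```

Like the original, this requires the items to be hashable.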

edited Sep 23, 2019 at 20:07

answered Sep 23, 2019 at 19:39

Alexander


3 Comments

Brian Over a year ago

OP is requesting ['foo', 'two', 'bar'] , not ['foo', 'bar', 'two']

Alexander Over a year ago

"something like" ['foo', 'two', 'bar']

Brian Over a year ago

yes I thought that initially as well but technically only 1 of those lists satisfies the values "alternating between two lists" as OP specified earlier. It's ambiguous at the end of the day so who knows.

Brian · Accepted Answer · 2019-09-23 19:36:50Z

0

Hmm, this might be a bit more overhead than you're looking for but this should work and guarantees list-wise order unlike sets:

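from collections import deque

output = []
seen = set()
# Work on deque copies so `data` is not mutated and popleft() is O(1).
queues = deque(deque(lst) for lst in data if lst)

while queues:
    queue = queues.popleft()
    # Skip items already emitted; take the first new one from this list.
    while queue:
        item = queue.popleft()
        if item not in seen:
            seen.add(item)
            output.append(item)
            break
    # Put the list back at the end of the rotation if it still has items.
    if queue:
        queues.append(queue)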

output:

['foo', 'two', 'bar']
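The two orderings debated in this thread seem to come from two readings of "alternating". As a rough sketch (the function names by_position and by_turns are made up for illustration), the first reading interleaves by position and then drops repeats, while the second lets each list in turn contribute its next unseen item:

```python
from itertools import chain, zip_longest


def by_position(lists):
    """Interleave position by position, then drop repeats.

    Assumes None never appears in the data, since zip_longest pads with None.
    """
    merged = dict.fromkeys(chain.from_iterable(zip_longest(*lists)))
    merged.pop(None, None)
    return list(merged)


def by_turns(lists):
    """Each list in turn contributes its next not-yet-seen item."""
    iterators = [iter(lst) for lst in lists]  # iterators leave the input lists untouched
    seen = set()
    result = []
    while iterators:
        still_active = []
        for it in iterators:
            for item in it:
                if item not in seen:
                    seen.add(item)
                    result.append(item)
                    still_active.append(it)
                    break
            # An iterator that ran out without yielding anything new is dropped.
        iterators = still_active
    return result


data = [['foo', 'bar'], ['foo', 'two']]
print(by_position(data))  # ['foo', 'bar', 'two']
print(by_turns(data))     # ['foo', 'two', 'bar']
```

Which one is "correct" depends on what the asker meant; the question's expected output matches by_turns.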

answered Sep 23, 2019 at 19:36

Brian


Massifox · Accepted Answer · 2019-09-23 22:21:00Z

0

Try this code:

list(dict.fromkeys(sum(data, [])))

EDIT: As pointed out in the comments, sum is not the most efficient way to flatten a list. You can use itertools.chain.from_iterable to get the flattened list instead:

from itertools import chain
list(dict.fromkeys(chain.from_iterable(data)))

In both cases the output is the following:

['foo', 'bar', 'two']
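Note that chain.from_iterable(data) concatenates the sublists rather than alternating between them; the two approaches just happen to give the same deduplicated result for the question's data. A small sketch with a made-up input where the orders differ:

```python
from itertools import chain, zip_longest

# Illustrative input (not from the question) with no duplicates across lists.
sample = [['a', 'b'], ['c', 'd']]

# Concatenate the sublists, then dedupe in first-seen order.
concatenated = list(dict.fromkeys(chain.from_iterable(sample)))
# Alternate between the sublists position by position, then dedupe.
interleaved = list(dict.fromkeys(chain.from_iterable(zip_longest(*sample))))

print(concatenated)  # ['a', 'b', 'c', 'd']
print(interleaved)   # ['a', 'c', 'b', 'd']
```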

Comparison of execution times

Below is a comparison of the execution times of the main proposed solutions:

Benchmark data1:

data1 = [['foo', 'bar'],['foo', 'two']] * 1000000

@Massifox's solution with `itertools.chain.from_iterable`

%%timeit
list(chain.from_iterable(data1))
# 128 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Marat's solution with `dedup`

%%timeit
list(dedup(chain.from_iterable(zip_longest(*data1))))
# 579 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Alexander's solution with `zip_longest`

%%timeit
list(dict.fromkeys(chain.from_iterable(zip_longest(*data1))))
# 456 ms ± 149 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark data2:

import numpy as np

x = 10
y = 500000
n_max = 1000
data2 = [[np.random.randint(1, n_max) for _ in range(0, x)] for _ in range(0, y)]

@Massifox's solution with `itertools.chain.from_iterable`

%%timeit
list(chain.from_iterable(data2))
# 241 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Marat's solution with `dedup`

%%timeit
list(dedup(chain.from_iterable(zip_longest(*data2))))
# 706 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Alexander's solution with `zip_longest`

%%timeit
list(dict.fromkeys(chain.from_iterable(zip_longest(*data2))))
# 674 ms ± 56.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The implementations based on sum and on Counter are far less efficient: they already take tens of seconds on a smaller instance of the benchmark, [['foo', 'bar'],['foo', 'two']] * 100k.

With benchmark data1, the solution based on itertools.chain.from_iterable proposed by me seems to be about 4-5 times faster than the others.

With benchmark data2, the solution based on itertools.chain.from_iterable proposed by me seems to be about 2-3 times faster than the others.
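Timings like these depend heavily on the input (the first benchmark contains only three distinct values). As a rough sketch of how the two order-preserving variants could be compared on data with more distinct values, using only the standard library: the data shape, sizes, and function names below are assumptions chosen for illustration, and no timings are claimed.

```python
import random
import timeit
from itertools import chain, zip_longest

random.seed(0)
# Hypothetical benchmark input: 1,000 lists of 10 integers drawn from 5,000 possible values.
data3 = [[random.randrange(5000) for _ in range(10)] for _ in range(1000)]


def with_seen_set(lists):
    """Explicit loop with a set of already-seen items."""
    result, seen = [], set()
    for item in chain.from_iterable(zip_longest(*lists)):
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def with_dict_fromkeys(lists):
    """One-liner relying on insertion-ordered dicts (Python 3.7+)."""
    return list(dict.fromkeys(chain.from_iterable(zip_longest(*lists))))


# Both variants should agree before their timings are worth comparing.
assert with_seen_set(data3) == with_dict_fromkeys(data3)

for func in (with_seen_set, with_dict_fromkeys):
    seconds = timeit.timeit(lambda: func(data3), number=20)
    print(f"{func.__name__}: {seconds / 20 * 1000:.2f} ms per call")
```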

edited Sep 23, 2019 at 22:21

answered Sep 23, 2019 at 19:35

Massifox


9 Comments

moltarze Over a year ago

This is pretty good but sum is a terribly slow way to flatten a list.

Brian Over a year ago

This produces "bar" before it produces "two", OP is requesting that the opposite happens.

Massifox Over a year ago

@BrianJoseph Probably the author of the post was wrong to expect that output, I'm asking for confirmation in the comments

Alexander Over a year ago

These timing tests are so wrong... You have four million data points, only three of which are unique. Furthermore, your timed solution gives the wrong result.

Massifox Over a year ago

They are correct on my machine on that benchmark instance. If you have another instance to offer, post it in the comments and I will add it to the comparison. Thanks


juancarlos · Accepted Answer · 2019-09-23 19:33:37Z

-3

You can create a set and then convert it back to a list, something like this:

l1 = ['foo', 'foo', 'bar', 'two']
l2 = list(set(l1))

This creates a second list without repeated items.

If you want to keep the order, you can do this:

ordered = list(dict.fromkeys(l1))
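To illustrate the difference between the two snippets: a set removes duplicates but makes no promise about order, while dict.fromkeys keeps the first occurrence of each item in its original position (guaranteed from Python 3.7). A minimal sketch:

```python
l1 = ['foo', 'foo', 'bar', 'two']

unordered = list(set(l1))          # duplicates removed, order not guaranteed
ordered = list(dict.fromkeys(l1))  # duplicates removed, first-seen order kept

print(sorted(unordered))  # ['bar', 'foo', 'two']
print(ordered)            # ['foo', 'bar', 'two']
```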

edited Sep 23, 2019 at 19:33

answered Sep 23, 2019 at 19:25

juancarlos


4 Comments

MyNameIsCaleb Over a year ago

Will this always maintain order?

Joshua Nixon Over a year ago

@MyNameIsCaleb Sets are unordered but you can probs find a OrderedSet datatype somewhere.

MyNameIsCaleb Over a year ago

Agree @JoshuaNixon which was my point. Dict does in 3.7+ based on insertion order.

juancarlos Over a year ago

So you can use the same example with an OrderedDict instead of a dict.
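For Python versions where plain dicts do not guarantee insertion order, the OrderedDict variant mentioned here could look like this minimal sketch:

```python
from collections import OrderedDict

l1 = ['foo', 'foo', 'bar', 'two']

# OrderedDict.fromkeys keeps the first occurrence of each key in order.
ordered = list(OrderedDict.fromkeys(l1))
print(ordered)  # ['foo', 'bar', 'two']
```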
