Scala: Remove duplicates in list of objects

Question

I've got a list of objects, List[Object], which are all instantiated from the same class. This class has a field, Object.property, which must be unique. What is the cleanest way to iterate over the list and remove all objects (except the first) that share the same property?

What about using a Set instead of a List? Also, why are you dealing with Object, i.e. nearly the top of the class hierarchy?
– Kevin Meredith, Commented Jan 19, 2018 at 13:58

samthebest · Accepted Answer · 2014-12-29 08:32:45Z

146

list.groupBy(_.property).map(_._2.head)

Explanation: The groupBy method accepts a function that converts an element to a key for grouping. _.property is shorthand for (elem: Object) => elem.property (the compiler generates a unique parameter name, something like x$1). So now we have a Map[Property, List[Object]]. A Map[K, V] extends Traversable[(K, V)], so it can be traversed like a list, but its elements are tuples. This is similar to Java's Map#entrySet(). The map method creates a new collection by applying a function to each element. In this case the function is _._2.head, which is shorthand for (elem: (Property, List[Object])) => elem._2.head. _2 is a method of Tuple2 that returns the second element. The second element is a List[Object], and head returns its first element. Note that a Map does not guarantee any iteration order, so the result is not necessarily in the order of the original list.
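To make the two steps concrete, here is a small sketch that can be pasted into the REPL. The case class Item and the sample data are hypothetical stand-ins for the question's Object, not part of the original answer.

```scala
// Hypothetical stand-in for the question's Object; `property` is the key.
final case class Item(property: Int, label: String)

val items = List(Item(1, "a"), Item(2, "b"), Item(1, "c"))

// Step 1: group by key. Key 1 maps to the two items with property 1,
// key 2 maps to the single item with property 2.
val grouped: Map[Int, List[Item]] = items.groupBy(_.property)

// Step 2: each Map entry is a (key, group) tuple; keep the head of each group.
// A Map has no guaranteed iteration order, so treat the result as unordered.
val onePerKey: Iterable[Item] = grouped.map(_._2.head)
```

The result contains exactly one Item per distinct property, which is why the answer goes on to discuss how to control the result type and the ordering.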

To get the result as the collection type you want (this works in Scala 2.12 and earlier; breakOut was removed in 2.13):

import collection.breakOut
val l2: List[Object] = list.groupBy(_.property).map(_._2.head)(breakOut)

To explain briefly: map actually takes two argument lists, a function and an object that is used to construct the result. In the first snippet you don't see the second argument because it is marked implicit, so the compiler supplies it from the implicit values in scope. The result type is usually derived from the collection being mapped, which is usually a good thing: map on a List returns a List, map on an Array returns an Array, and so on. In this case, however, we want to choose the result container ourselves, and this is where breakOut comes in. It constructs a builder (the thing that builds results) by looking only at the desired result type. It is a generic method, and the compiler can infer its type parameters because we explicitly typed l2 as List[Object].

Or, to preserve order (assuming Object#property is of type Property):

list.foldRight((List[Object](), Set[Property]())) {
  case (o, cum @ (objects, props)) =>
    // Property already seen: keep the accumulator. Otherwise prepend o and record its property.
    if (props(o.property)) cum else (o :: objects, props + o.property)
}._1

foldRight is a method that accepts an initial result and a function that takes an element and the current result and returns an updated result. It visits each element, applies the function to update the result, and returns the final result. We go from right to left (rather than left to right with foldLeft) because we are prepending to objects, which is O(1), whereas appending is O(N). Because the fold starts from the right, the object kept for each property is the last one in the list with that property, not the first. Also note the style here: a pattern match extracts the elements of the pair.

In this case, the initial result is a pair (tuple) of an empty list and an empty set. The list is the result we're interested in, and the set keeps track of the properties we have already encountered. In each iteration we check whether the set props already contains the property. (In Scala, obj(x) is translated to obj.apply(x). In Set, apply is def apply(a: A): Boolean; that is, it accepts an element and returns true if the set contains it and false otherwise.) If the property has already been encountered, the result is returned as-is. Otherwise the result is updated to contain the object (o :: objects) and the property is recorded (props + o.property).
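Since the question asks to keep the first object for each property, a left-to-right variant may be closer to what is wanted. The following is only a sketch: Item and firstPerProperty are hypothetical names, and the list is built by prepending and reversed once at the end so that each step stays O(1).

```scala
// Hypothetical stand-in for the question's Object.
final case class Item(property: Int, label: String)

def firstPerProperty(items: List[Item]): List[Item] =
  items.foldLeft((List.empty[Item], Set.empty[Int])) {
    case (acc @ (kept, seen), item) =>
      if (seen(item.property)) acc                  // key already seen: skip this item
      else (item :: kept, seen + item.property)     // prepend and record the key
  }._1.reverse                                      // restore the original order

val items = List(Item(1, "a"), Item(2, "b"), Item(1, "c"))
// firstPerProperty(items) == List(Item(1, "a"), Item(2, "b"))
```

Folding from the left means the first occurrence reaches the set before any later duplicate does, so later duplicates are the ones dropped.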

Update: @andreypopp wanted a generic method:

import scala.collection.IterableLike
import scala.collection.generic.CanBuildFrom

class RichCollection[A, Repr](xs: IterableLike[A, Repr]){
  def distinctBy[B, That](f: A => B)(implicit cbf: CanBuildFrom[Repr, A, That]) = {
    val builder = cbf(xs.repr)
    val i = xs.iterator
    var set = Set[B]()
    while (i.hasNext) {
      val o = i.next
      val b = f(o)
      if (!set(b)) {
        set += b
        builder += o
      }
    }
    builder.result
  }
}

implicit def toRich[A, Repr](xs: IterableLike[A, Repr]) = new RichCollection(xs)

To use it:

scala> list.distinctBy(_.property)
res7: List[Obj] = List(Obj(1), Obj(2), Obj(3))

Also note that this is pretty efficient as we are using a builder. If you have really large lists, you may want to use a mutable HashSet instead of a regular set and benchmark the performance.
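As a rough illustration of the mutable HashSet suggestion, here is a List-only sketch. The name distinctByFast is made up for this example, and whether it is actually faster for a given workload would need to be measured.

```scala
import scala.collection.mutable

// Hypothetical List-only helper; `distinctByFast` is an illustrative name.
def distinctByFast[A, B](xs: List[A])(f: A => B): List[A] = {
  val seen = mutable.HashSet.empty[B]   // keys encountered so far
  val out  = List.newBuilder[A]         // builder for the result list
  xs.foreach { x =>
    if (seen.add(f(x))) out += x        // add returns true only for a key not yet present
  }
  out.result()
}

val words = List("apple", "avocado", "banana", "blueberry", "cherry")
// distinctByFast(words)(_.head) == List("apple", "banana", "cherry")
```

The mutable set never escapes the method, so callers still see a pure function.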

edited Dec 29, 2014 at 8:32 by samthebest

answered Oct 12, 2010 at 8:37 by IttayD

10 Comments

Sudhir Jonathan Over a year ago

Would be awesome if you can provide a quick explanation. I think Scala is sufficiently new that not everyone will understand this immediately.

Sudhir Jonathan Over a year ago

Specifically, what does _2 do in this context?

Landei Over a year ago

@Sudhir: _1 and _2 are methods that return the first and second element of a tuple.

andreypopp Over a year ago

Maybe scala collection needs distinct(A => B), that do distinct by key?

missingfaktor Over a year ago

+1, This method - distinctBy - should be added to the stdlib, methinks.


Xavier Guihot · Accepted Answer · 2018-10-02 00:18:09Z

39

Starting with Scala 2.13, most collections provide a distinctBy method, which applies a given key function to each element, keeps the first element seen for each key, and drops later duplicates:

list.distinctBy(_.property)

For instance:

List(("a", 2), ("b", 2), ("a", 5)).distinctBy(_._1) // List((a,2), (b,2))
List(("a", 2.7), ("b", 2.1), ("a", 5.4)).distinctBy(_._2.floor) // List((a,2.7), (a,5.4))
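Applied to objects like those in the question, this could look as follows. It is a sketch that assumes Scala 2.13 or later; Item is a hypothetical case class standing in for Object.

```scala
// Requires Scala 2.13 or later. `Item` is a hypothetical stand-in for Object.
final case class Item(property: Int, label: String)

val items = Vector(Item(1, "a"), Item(2, "b"), Item(1, "c"))

// Keeps the first element for each key and preserves the original order.
val unique = items.distinctBy(_.property)
// unique == Vector(Item(1, "a"), Item(2, "b"))
```

The result has the same collection type as the input, so no extra conversion is needed.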

answered Oct 2, 2018 at 0:18 by Xavier Guihot

1 Comment

Stanislav Sobolev Over a year ago

The answer to every one

Garrett Hall · Accepted Answer · 2013-11-21 19:11:20Z

14

Here is a slightly sneaky but fast solution that preserves order:

list.filterNot {
  var set = Set[Property]() // properties seen so far; created once, before the predicate
  obj => { val b = set(obj.property); set += obj.property; b }
}

Although it uses a var internally, I think it is easier to read and understand than the fold-based solution.
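The same trick can be wrapped in a named helper, which may make the role of the captured var easier to see. This is a sketch with made-up names (firstPerKey, seen); it relies on List.filter being strict and visiting elements in order, so it should not be assumed to work with lazy views or parallel collections.

```scala
// Hypothetical helper; `firstPerKey` is an illustrative name.
def firstPerKey[A, K](xs: List[A])(key: A => K): List[A] = {
  var seen = Set.empty[K]        // created once, captured by the predicate below
  xs.filter { x =>
    val k = key(x)
    val isNew = !seen(k)         // decide before recording the key
    seen += k
    isNew
  }
}

val words = List("apple", "avocado", "banana", "blueberry", "cherry")
// firstPerKey(words)(_.head) == List("apple", "banana", "cherry")
```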

edited Nov 21, 2013 at 19:11 by Garrett Hall

answered Oct 12, 2010 at 9:00 by Landei

2 Comments

parsa Over a year ago

I'm clearly missing something here. What is Property exactly?

Landei Over a year ago

@parsa28: Property is the type of obj.property

Abel Terefe · Accepted Answer · 2019-01-11 14:07:43Z

10

There are a lot of good answers above. However, distinctBy is already in Scala, just in a not-so-obvious place: an internal utility of the scala-reflect library. Perhaps you can use it like this:

def distinctBy[A, B](xs: List[A])(f: A => B): List[A] =
  scala.reflect.internal.util.Collections.distinctBy(xs)(f)

edited Jan 11, 2019 at 14:07

answered Mar 20, 2018 at 13:10 by Abel Terefe

1 Comment

Pedro Correia Luis Over a year ago

I came here just to upvote and say that those functions being in the reflect package makes 0 to no sense.

Timothy Klim · Accepted Answer · 2015-12-24 18:20:01Z

7

This version preserves order:

def distinctBy[L, E](list: List[L])(f: L => E): List[L] =
  list.foldLeft((Vector.empty[L], Set.empty[E])) {
    case ((acc, set), item) =>
      val key = f(item)
      if (set.contains(key)) (acc, set)
      else (acc :+ item, set + key)
  }._1.toList

distinctBy(list)(_.property)

answered Dec 24, 2015 at 18:20 by Timothy Klim

1 Comment

Dushan Gajik Over a year ago

You can use Seq[L] for a more generic solution.
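One way to read that suggestion is sketched below. The name distinctBySeq is hypothetical, and the result is simply the accumulated Vector returned as a Seq rather than the caller's original collection type.

```scala
// Hypothetical Seq-based variant of the fold above; `distinctBySeq` is an illustrative name.
def distinctBySeq[A, K](xs: Seq[A])(f: A => K): Seq[A] =
  xs.foldLeft((Vector.empty[A], Set.empty[K])) {
    case ((acc, seen), item) =>
      val key = f(item)
      if (seen(key)) (acc, seen)        // duplicate key: keep the accumulator unchanged
      else (acc :+ item, seen + key)    // Vector append is effectively constant time
  }._1

val words = Seq("one", "two", "three", "four")
// distinctBySeq(words)(_.length) == Seq("one", "three", "four")
```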

samthebest · Accepted Answer · 2014-12-29 08:36:08Z

6

One more solution:

import scala.annotation.tailrec

@tailrec
def collectUnique(l: List[Object], s: Set[Property], u: List[Object]): List[Object] = l match {
  case Nil => u.reverse
  case h :: t =>
    if (s(h.property)) collectUnique(t, s, u) else collectUnique(t, s + h.property, h :: u)
}
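The recursive function needs an empty set and an empty list as its starting arguments. A possible way to hide that from callers is a generic wrapper with a local loop, sketched here with hypothetical names (collectUniqueBy, loop).

```scala
import scala.annotation.tailrec

// Hypothetical generic wrapper; the accumulators are hidden inside a local loop.
def collectUniqueBy[A, K](items: List[A])(key: A => K): List[A] = {
  @tailrec
  def loop(rest: List[A], seen: Set[K], acc: List[A]): List[A] = rest match {
    case Nil => acc.reverse                       // acc was built by prepending
    case h :: t =>
      val k = key(h)
      if (seen(k)) loop(t, seen, acc)             // duplicate key: drop h
      else loop(t, seen + k, h :: acc)            // first time this key is seen: keep h
  }
  loop(items, Set.empty, Nil)
}

// collectUniqueBy(List(1, 2, 3, 4, 5, 6))(_ % 2) == List(1, 2)
```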

edited Dec 29, 2014 at 8:36 by samthebest

answered Oct 12, 2010 at 9:54 by walla

samthebest · Accepted Answer · 2014-12-29 08:41:23Z

I found a way to make it work with groupBy, with one intermediary step:

import scala.collection.TraversableLike
import scala.collection.breakOut

def distinctBy[T, P, From[X] <: TraversableLike[X, From[X]]](collection: From[T])(property: T => P): From[T] = {
  val uniqueValues: Set[T] = collection.groupBy(property).map(_._2.head)(breakOut)
  collection.filter(uniqueValues)
}

Use it like this:

scala> distinctBy(List(redVolvo, bluePrius, redLeon))(_.color)
res0: List[Car] = List(redVolvo, bluePrius)

Similar to IttayD's first solution, but it filters the original collection based on the set of unique values. If my expectations are correct, this does three traversals: one for groupBy, one for map and one for filter. It maintains the ordering of the original collection, but does not necessarily take the first value for each property. For example, it could have returned List(bluePrius, redLeon) instead.

Of course, IttayD's solution is still faster since it does only one traversal.

My solution also has the disadvantage that, if the collection has Cars that are actually the same, both will be in the output list. This could be fixed by removing filter and returning uniqueValues directly, with type From[T]. However, it seems like CanBuildFrom[Map[P, From[T]], T, From[T]] does not exist... suggestions are welcome!
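The disadvantage with equal elements can be reproduced with a simplified, List-only sketch that avoids breakOut. The Car case class and the name distinctByViaGroupBy are assumptions made for this example.

```scala
// Hypothetical data type; case-class equality makes two red Volvos equal.
final case class Car(color: String, model: String)

def distinctByViaGroupBy[A, K](xs: List[A])(key: A => K): List[A] = {
  // One representative per key; a Set[A] can also be used as a predicate A => Boolean.
  val representatives: Set[A] = xs.groupBy(key).values.map(_.head).toSet
  xs.filter(representatives)
}

val redVolvo  = Car("red", "Volvo")
val bluePrius = Car("blue", "Prius")

// The second red Volvo equals the first, so the filter lets both through.
val result = distinctByViaGroupBy(List(redVolvo, bluePrius, Car("red", "Volvo")))(_.color)
// result has three elements, two of them red
```

Filtering by membership in the set of representatives cannot tell equal elements apart, which is exactly the case described above.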

F. P. Freely · Accepted Answer · 2020-01-26 17:59:36Z

0

With a collection and a function from a record to a key this yields a list of records distinct by key. It's not clear whether groupBy will preserve the order in the original collection. It may even depend on the type of collection. I'm guessing either head or last will consistently yield the earliest element.

collection.groupBy(keyFunction).values.map(_.head)
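If relying on groupBy's ordering feels uncertain, one option is to carry each element's index along and use it explicitly. This is a sketch under that assumption, with a made-up name (distinctByKeepingOrder); it trades some extra work for not depending on how groupBy orders its groups or their contents.

```scala
// Hypothetical helper that makes no assumption about the order groupBy produces.
def distinctByKeepingOrder[A, K](xs: Seq[A])(key: A => K): Seq[A] =
  xs.zipWithIndex
    .groupBy { case (x, _) => key(x) }
    .values
    .map(_.minBy(_._2))    // earliest element of each group, by original index
    .toSeq
    .sortBy(_._2)          // restore the original relative order
    .map(_._1)

val words = Seq("one", "two", "three", "four")
// distinctByKeepingOrder(words)(_.length) == Seq("one", "three", "four")
```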

When will Scala get a nubBy? It's been in Haskell for decades.

answered Jan 26, 2020 at 17:59 by F. P. Freely

swdev · Accepted Answer · 2020-04-02 02:28:18Z

0

If you want to remove duplicates and preserve the order of the list, you can try this two-liner:

val tmpUniqueList = scala.collection.mutable.Set[String]()
val myUniqueObjects = for(o <- myObjects if tmpUniqueList.add(o.property)) yield o

answered Apr 2, 2020 at 2:28 by swdev

AdamAbrahams · Accepted Answer · 2022-09-30 14:06:10Z

0

This is entirely a rip of @IttayD's answer, but unfortunately I don't have enough reputation to comment. Rather than creating an implicit function to convert your iterable, you can simply create an implicit class:

import scala.collection.IterableLike
import scala.collection.generic.CanBuildFrom

implicit class RichCollection[A, Repr](xs: IterableLike[A, Repr]){
  def distinctBy[B, That](f: A => B)(implicit cbf: CanBuildFrom[Repr, A, That]) = {
    val builder = cbf(xs.repr)
    val i = xs.iterator
    var set = Set[B]()
    while (i.hasNext) {
      val o = i.next
      val b = f(o)
      if (!set(b)) {
        set += b
        builder += o
      }
    }
    builder.result
  }
}
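The code above targets the pre-2.13 collections (IterableLike and CanBuildFrom). For comparison, here is a List-only sketch of the same implicit-class idea that does not depend on them. The names DistinctBySyntax, DistinctByOps and distinctByKey are hypothetical; the method is deliberately not called distinctBy so that it cannot clash with the built-in one in Scala 2.13 and later.

```scala
object DistinctBySyntax {
  // List-only sketch of the implicit-class approach; all names here are illustrative.
  implicit class DistinctByOps[A](private val xs: List[A]) {
    def distinctByKey[B](f: A => B): List[A] = {
      val builder = List.newBuilder[A]
      var seen = Set.empty[B]
      xs.foreach { x =>
        val key = f(x)
        if (!seen(key)) { seen += key; builder += x }   // keep only the first element per key
      }
      builder.result()
    }
  }
}

import DistinctBySyntax._

val firstByInitial = List("apple", "avocado", "banana").distinctByKey(_.head)
// firstByInitial == List("apple", "banana")
```

Wrapping the implicit class in an object keeps the extension method opt-in: it is only available where DistinctBySyntax._ is imported.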

answered Sep 30, 2022 at 14:06 by AdamAbrahams
