1

This should be simple, I thought, but I can't get it to work. I have a case class with a property name. I want to check to see if their are duplicate names in it...

employees.groupBy(_.name).mapValues(_.size).filter(_._2 == 1).toSeq.isEmpty

That doesn't work... Should it?

0

3 Answers 3

4

The other answers will work, but groupBy is an expensive operation and inefficient if you have 10k employees and the first two have the same name.

1) Short method: (distinct internally builds a HashSet of all the values)

val names = employees.map(_.name)
names.distinct != names

2) Efficient method: (only traverses as much of the seq as it needs to before finding dupe)

def hasDupes(employees: Seq[Employee]): Boolean = {
  val names = collection.mutable.Set.empty[String]
  employees.foreach { e => 
    if (names(e.name)) return true
    names += e.name
  }
  false
}

You can write this as a one-liner using a tail-recursive method, but the fact is that immutable sets are much slower than mutable ones so not much good if we're going for efficiency.

3) Off-the-wall method: (doesn't build a set at all, but starts a quicksort and quits (via a cheap exception) as soon as it finds two names the same. Should be efficient. )

class Dupe extends Throwable with util.control.NoStackTrace
val dupeOrd = new Ordering[Employee] {
  def compare(x: Employee, y: Employee) =
    if (x.name == y.name) throw new Dupe else x.name compare y.name
}
def hasDupes(employees: Seq[Employee]) =
  try { employees.sorted(dupeOrd); false } catch { case e: Dupe => true }

Benchmark of results, with method 0 being om-nom-nom's groupBy one, (time in ms for 1000 runs on 1000 employee Vector, no duplicates):

method             time  Early return if duplicate found
0  (groupBy)       1560  No
1  (distinct)      329   No
2a (mutable.Set)   255   Yes
2b (immutable.Set) 1414  Yes
3  (sorting)       666   Yes        //242 if employees already in name order
Sign up to request clarification or add additional context in comments.

Comments

2

Perhaps you wanted:

case class Employee(name: String)
val bob = Employee("Bob")
val joe = Employee("Joe")

def haveDupes(emps: Seq[Employee]) =  emps.groupBy(_.name).exists {
 case (name, group) => group.size > 1
}

scala> haveDupes(Seq(bob))
res11: Boolean = false

scala> haveDupes(Seq(bob, bob))
res12: Boolean = true

scala> haveDupes(Seq(bob, joe))
res13: Boolean = false

scala> haveDupes(Seq(bob, joe, bob))
res14: Boolean = true

1 Comment

haveDupes worked great and does exactly what I was trying to do. Thank you!
0

Because here

employees.groupBy(_.name).mapValues(_.size).filter(_._2 == 1).toSeq.isEmpty

You don't want a == but a > sign

employees.groupBy(_.name).mapValues(_.size).filter(_._2 > 1).toSeq.isEmpty

First you groupBy which will return a set of tuples (K, V), then you map each of those to see how many of each there are with mapValues(_.size) and then you want to check if there are any records with size greater than 1.

2 Comments

Thanks for the answer. I'm actually wanting to determine if there are duplicates, so to do that I filter out where the count is 1 and see if it's empty right?
Actually that's not how filter works, it will leave the elements that pass the condition you give to the filter function. So in this case it will return a list that passes (._2 > 1). It does not filter *out* (remove) the elements that pass the condition you give to the function so saying filter(._2 == 1) returns all elements that have ._2 equal to 1.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.