11

Can anybody optimize following statement in Scala:

// maybe large
val someArray = Array(9, 1, 6, 2, 1, 9, 4, 5, 1, 6, 5, 0, 6) 

// output a sorted list which contains unique element from the array without 0
val newList=(someArray filter (_>0)).toList.distinct.sort((e1, e2) => (e1 > e2))

Since the performance is critical, is there a better way?

Thank you.

7 Answers 7

22

This simple line is one of the fastest codes so far:

someArray.toList.filter (_ > 0).sortWith (_ > _).distinct

but the clear winner so far is - due to my measurement - Jed Wesley-Smith. Maybe if Rex' code is fixed, it looks different.

bench diagram

Typical disclaimer 1 + 2:

  1. I modified the codes to accept an Array and return an List.
  2. Typical benchmark considerations:
    • This was random data, equally distributed. For 1 Million elements, I created an Array of 1 Million ints between 0 and 1 Million. So with more or less zeros, and more or less duplicates, it might vary.
    • It might depend on the machine etc.. I used a single core CPU, Intel-Linux-32bit, jdk-1.6, scala 2.9.0.1

Here is the underlying benchcoat-code and the concrete code to produce the graph (gnuplot). Y-axis: time in seconds. X-axis: 100 000 to 1 000 000 elements in Array.

update:

After finding the problem with Rex' code, his code is as fast as Jed's code, but the last operation is a transformation of his Array to a List (to fullfill my benchmark-interface). Using a var result = List [Int], and result = someArray (i) :: result speeds his code up, so that it is about twice as fast as the Jed-Code.

Another, maybe interesting, finding is: If I rearrange my code in the order of filter/sort/distinct (fsd) => (dsf, dfs, fsd, ...), all 6 possibilities don't differ significantly.

Sign up to request clarification or add additional context in comments.

4 Comments

+1 - Benchmarks are great when answering performance questions!
Thank you for your work. That explains everything. Does it suggest that Scala is more willing to deal with functional programming style?
If you want to see a difference, you need to have a significant number of negative or zero values. If they are, say, half of the total numbers, then filtering or distincting first will be a win over sorting. (I think filter-first will be best, but I don't precisely recall the implementation of distinct, so I'm not positive.)
@RexKerr: You're right. If my generating function creates N values random.nextInt (N/10), but with about 15% of them being 0, the performance of ufs (unique-filter-sort) and usf is much better than suf/sfu. Oh, I have much too much diagrams now, I can not schow all of them here! :)
7

I haven't measured, but I'm with Duncan, sort in place then use something like:

util.Sorting.quickSort(array)
array.foldRight(List.empty[Int]){ 
  case (a, b) => 
    if (!b.isEmpty && b(0) == a) 
      b 
    else 
      a :: b 
}

In theory this should be pretty efficient.

1 Comment

Thank you, Jed. It seems the best version using Scala style.
4

Without benchmarking I can't be sure, but I imagine the following is pretty efficient:

val list = collection.SortedSet(someArray.filter(_>0) :_*).toList

Also try adding .par after someArray in your version. It's not guaranteed to be quicker, bit it might be. You should run a benchmark and experiment.

sort is deprecated. Use .sortWith(_ > _) instead.

3 Comments

+1 I like the conciseness, but you need to add a filter in there as well.
@LuigiPlinge How to do it in descendent ordering?
@TianyiLiang sortWith(_ < _)
3

Boxing primitives is going to give you a 10-30x performance penalty. Therefore if you really are performance limited, you're going to want to work off of raw primitive arrays:

def arrayDistinctInts(someArray: Array[Int]) = {    
  java.util.Arrays.sort(someArray)
  var overzero = 0
  var ndiff = 0
  var last = 0
  var i = 0
  while (i < someArray.length) {
    if (someArray(i)<=0) overzero = i+1
    else if (someArray(i)>last) {
      last = someArray(i)
      ndiff += 1
    }
    i += 1
  }
  val result = new Array[Int](ndiff)
  var j = 0
  i = overzero
  last = 0
  while (i < someArray.length) {
    if (someArray(i) > last) {
      result(j) = someArray(i)
      last = someArray(i)
      j += 1
    }
    i += 1
  }
  result
}

You can get slightly better than this if you're careful (and be warned, I typed this off the top of my head; I might have typoed something, but this is the style to use), but if you find the existing version too slow, this should be at least 5x faster and possibly a lot more.


Edit (in addition to fixing up the previous code so it actually works):

If you insist on ending with a list, then you can build the list as you go. You could do this recursively, but I don't think in this case it's any clearer than the iterative version, so:

def listDistinctInts(someArray: Array[Int]): List[Int] = {
  if (someArray.length == 0 || someArray(someArray.length-1) <= 0) List[Int]()
  else {
    java.util.Arrays.sort(someArray)
    var last = someArray(someArray.length-1)
    var list = last :: Nil
    var i = someArray.length-2
    while (i >= 0) {
      if (someArray(i) < last) {
        last = someArray(i)
        if (last <= 0) return list;
        list = last :: list
      }
      i -= 1
    }
    list
  }
}

Also, if you may not destroy the original array by sorting, you are by far best off if you duplicate the array and destroy the copy (array copies of primitives are really fast).

And keep in mind that there are special-case solutions that are far faster yet depending on the nature of the data. For example, if you know that you have a long array, but the numbers will be in a small range (e.g. -100 to 100), then you can use a bitset to track which ones you've encountered.

5 Comments

ArrayIndexOOB in second while: result(j) = someArray(i)
I guess I found 2 errors: if (someArray (i) <= 0) overzero = i has to end in ` += 1` in the 1st while, while in the second while a last = someArray (i) is missing. In my personal benchmark against Jed's solution with up to 6 million elements, you win in 9 of 10 cases, but use more memory, but to take part in my competition, I had to transform the result with toList to List in the end. Using a list from the beginning helps here much (5 vs. 10 seconds for 8M elements in the array).
@userunknown - Thanks! This is why I should never try to type more than a couple of lines of code without the REPL handy.
But you know SimplyScala and IDEONE! :)
@userunknown - True enough. I usually write things using my personal libraries and then strip them out for examples, but not in this case, so that would have worked.
2

For efficiency, depending on your value of large:

val a = someArray.toSet.filter(_>0).toArray
java.util.Arrays.sort(a) // quicksort, mutable data structures bad :-)
res15: Array[Int] = Array(1, 2, 4, 5, 6, 9)

Note that this does the sort using qsort on an unboxed array.

2 Comments

List#sortWith uses this very same method under the covers (check out SeqLike source)
I agree. I'm fairly sure the sort would be done on object references, not primitives like in my version though.
1

I'm not in a position to measure, but some more suggestions...

Sorting the array in place before converting to a list might well be more efficient, and you might look at removing dups from the sorted list manually, as they will be grouped together. The cost of removing 0's before or after the sort will also depend on their ratio to the other entries.

6 Comments

I can't sort before converting because the array will be used for another iteration in a big program unless it generates a new array.
@TianyiLiang In that case just use Array#copy to make a copy of the array first. It's a very fast operation.
@LuigiPlinge What about Array.clone? Does it have the same performance as Array copy. By the way, do you have any suggestion for problem 8173329.
@TianyiLiang I would be surprised if it didn't, but I don't know
One of the things that amazed me about video processing is just how many times you can copy multi- megabyte buffers before things slow down appreciably.
|
1

How about adding everything to a sorted set?

val a = scala.collection.immutable.SortedSet(someArray filter (0 !=): _*)

Of course, you should benchmark the code to check what is faster, and, more importantly, that this is truly a hot spot.

4 Comments

What does this mean: (0 !=): _*?
@nedim In Scala, : is almost always the separator between a value and a type (exceptions being :, <: and >: on type parameters), so someArray filter (0 !=) is a value, and _* is a type. The type _* is used to tell Scala to pass a sequence as multiple parameters on a method with variable number of arguments. So Set(Seq(1, 2, 3)) is a set with one element of type Seq[Int], and Set(Seq(1, 2, 3): _*) is a set with three elements of type Int. (0 !=) is the method != applied on the object 0.
@daniel-c-sobral Wow, I know that Scala likes to brag that it's concise, but this snippet is really something! I understand how _? works now (what is the name of this type/construct?). It looks similar to what I've seen in Python with parameter (un)packing, only that it's performed by a type cast. Cool, I missed that from Python. However, it still puzzles me how (0 !=) works without the underscore, i.e., (0 != _). I'm still learning Scala and am really excited by it.
@nedim It works pretty much as in Java 8. If you write x::m on Java 8, you are referring to the method x on the object m. In Scala, because it doesn't have static methods to mess everything, you only use .. If I write somevar.equals, I'm referring to the method equals on somevar. This is pretty much the same thing, except 0 is a literal instead of a variable, and != is a symbolic method name.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.