
I am trying to solve a problem on Codeforces and I get a Time Limit Exceeded verdict. The only time-consuming operation is computing the sum of a big array, so I tried to optimize it, but with no result.

What I want: optimize the following function:

// the array could be up to Integer.MAX_VALUE elements long
private long canonicalSum(int[] array) {
    long sum = 0;   // accumulate in a long so the result cannot overflow
    for (int i = 0; i < array.length; i++)
        sum += array[i];
    return sum;
}

Question1 [main]: Is it possible to optimize canonicalSum?

What I've tried: I wanted to avoid operations on very big numbers, so I decided to use auxiliary data. For instance, I convert array1[100] into array2[10], where array2[i] = array1[10*i] + array1[10*i+1] + … + array1[10*i+9].

private long optimizedSum(int[] array, int step) {
    do {
        array = sumItr(array, step);
    } while (array.length != 1);
    return array[0];
}

private int[] sumItr(int[] array, int step) {
    int length = array.length / step + 1;
    boolean needCompensation = array.length % step != 0;
    int[] aux = new int[length];
    for (int i = 0, auxSum = 0, auxPointer = 0; i < array.length; i++) {
        auxSum += array[i];
        if ((i + 1) % step == 0) {
            aux[auxPointer++] = auxSum;
            auxSum = 0;
        }
        if (i == array.length - 1 && needCompensation) {
            aux[auxPointer++] = auxSum;
        }
    }
    return aux;
}

Problem: It turns out that canonicalSum is ten times faster than optimizedSum. Here is my test:

@Test
public void sum_comparison() {
    final int ARRAY_SIZE = 100000000;
    final int STEP = 1000;
    int[] array = genRandomArray(ARRAY_SIZE);

    System.out.println("Start canonical Sum");
    long beg1 = System.nanoTime();
    long sum1 = canonicalSum(array);
    long end1 = System.nanoTime();
    long time1 = end1 - beg1;
    System.out.println("canon:" + TimeUnit.MILLISECONDS.convert(time1, TimeUnit.NANOSECONDS) + "milliseconds");

    System.out.println("Start optimizedSum");
    long beg2 = System.nanoTime();
    long sum2 = optimizedSum(array, STEP);
    long end2 = System.nanoTime();
    long time2 = end2 - beg2;
    System.out.println("custom:" + TimeUnit.MILLISECONDS.convert(time2, TimeUnit.NANOSECONDS) + "milliseconds");

    assertEquals(sum1, sum2);
    assertTrue(time2 <= time1);
}

private int[] genRandomArray(int size) {
    int[] array = new int[size];
    Random random = new Random();
    for (int i = 0; i < array.length; i++) {
        array[i] = random.nextInt();
    }
    return array;
}

Question2: Why does optimizedSum run slower than canonicalSum?

Comments

  • I don't understand why you expect your "optimized" sum to perform better. It's far more complex, branches all over the place, contains heap allocations, and doesn't do any less work.
  • At some point it's going to add large numbers together too. And that's completely irrelevant as long as you stay within the int range: adding 0 and 1 is as expensive as adding 9999999 and 12329834.
  • It is possible, using parallel pipelines (I think they are available in Java too), multithreading, and unrolling the loop. Also switching to a while instead of a for.
  • A 32-bit int is always 32 bits, no matter how many 0s or 1s it holds. The time will be spent in the memory accesses, not the additions; an add can be 10+ times faster than a memory access.
  • @VolodymyrBakhmatiuk (i & 1) == 0 is at least 10x faster. With loop unrolling it can often be eliminated.

3 Answers


As of Java 9, vectorisation of this operation has been implemented but disabled, based on benchmarks measuring the all-in cost of the code plus its compilation. Depending on your processor, this leads to the relatively entertaining result that if you introduce artificial complications into your reduction loop, you can trigger autovectorisation and get a quicker result! So the fastest code, for now, assuming numbers small enough not to overflow, is:

public int sum(int[] data) {
    int value = 0;
    for (int i = 0; i < data.length; ++i) {
        value += 2 * data[i];
    }
    return value / 2;
}

This isn't intended as a recommendation! This is more to illustrate that the speed of your code in Java is dependent on the JIT, its trade-offs, and its bugs/features in any given release. Writing cute code to optimise problems like this is at best vain and will put a shelf life on the code you write. For instance, had you manually unrolled a loop to optimise for an older version of Java, your code would be much slower in Java 8 or 9 because this decision would completely disable autovectorisation. You'd better really need that performance to do it.


Question1 [main]: Is it possible to optimize canonicalSum?

Yes, it is, but I have no idea by what factor.

Some things you can do are:

  • use the parallel pipelines introduced in Java 8 (see the stream-based sketch further below). The processor has instructions for summing two arrays (and more) in parallel. This can be observed in Octave, where summing two vectors with ".+" (element-wise addition) or "+" is way faster than using a loop.

  • use multithreading. You could use a divide-and-conquer algorithm, maybe like this:

    • divide the array into 2 or more parts
    • keep dividing recursively until you get arrays of a size manageable for a single thread
    • compute the sums of the sub-arrays (the divided arrays) in separate threads
    • finally, add up the sums produced by all the threads to get the final result
  • maybe unrolling the loop would help a bit, too (see the Java sketch after the Wikipedia example). By loop unrolling I mean reducing the number of iterations the loop has to make by doing more operations per iteration manually.

An example from http://en.wikipedia.org/wiki/Loop_unwinding :

for (int x = 0; x < 100; x++)
{
    delete(x);
}

becomes

for (int x = 0; x < 100; x+=5)
{
    delete(x);
    delete(x+1);
    delete(x+2);
    delete(x+3);
    delete(x+4);
}
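
Applied to the sum from the question, a hand-unrolled version might look roughly like this (just a sketch with a made-up method name; as noted right below, the JIT may already do this for you):

// hypothetical sketch: the sum unrolled by a factor of 4, with a tail loop for the remainder
private long unrolledSum(int[] array) {
    long sum = 0;
    int i = 0;
    int limit = array.length - (array.length % 4);
    for (; i < limit; i += 4) {
        sum += array[i] + array[i + 1] + array[i + 2] + array[i + 3];
    }
    for (; i < array.length; i++) {   // pick up the last 0-3 elements
        sum += array[i];
    }
    return sum;
}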

but, as mentioned, this must be done with caution and with profiling, since the JIT can probably do this kind of optimization itself.
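
For the first bullet, Java 8 streams let you try the parallel approach with almost no code. A minimal sketch (the method name is made up; whether it actually beats the plain loop depends on array size and core count):

private long streamSum(int[] array) {
    // widen to long before summing to avoid int overflow; parallel() runs on the common ForkJoinPool
    return java.util.Arrays.stream(array).asLongStream().parallel().sum();
}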

An implementation of mathematical operations using the multithreaded approach can be seen here.

An example implementation with the Fork/Join framework introduced in Java 7, which basically does what the divide-and-conquer algorithm above describes, would be:

public class ForkJoinCalculator extends RecursiveTask<Double> {

   public static final long THRESHOLD = 1_000_000;

   private final SequentialCalculator sequentialCalculator;
   private final double[] numbers;
   private final int start;
   private final int end;

   public ForkJoinCalculator(double[] numbers, SequentialCalculator sequentialCalculator) {
     this(numbers, 0, numbers.length, sequentialCalculator);
   }

   private ForkJoinCalculator(double[] numbers, int start, int end, SequentialCalculator sequentialCalculator) {
     this.numbers = numbers;
     this.start = start;
     this.end = end;
     this.sequentialCalculator = sequentialCalculator;
   }

   @Override
   protected Double compute() {
     int length = end - start;
     if (length <= THRESHOLD) {
         return sequentialCalculator.computeSequentially(numbers, start, end);
     }
     ForkJoinCalculator leftTask = new ForkJoinCalculator(numbers, start, start + length/2, sequentialCalculator);
     leftTask.fork();
     ForkJoinCalculator rightTask = new ForkJoinCalculator(numbers, start + length/2, end, sequentialCalculator);
     Double rightResult = rightTask.compute();
     Double leftResult = leftTask.join();
     return leftResult + rightResult;
  }
}

Here we develop a RecursiveTask that splits an array of doubles until the length of a subarray drops below a given threshold. At that point the subarray is processed sequentially, applying to it the operation defined by the following interface:

public interface SequentialCalculator {
  double computeSequentially(double[] numbers, int start, int end);
}

And the usage example:

public static double varianceForkJoin(double[] population){
   final ForkJoinPool forkJoinPool = new ForkJoinPool();
   double total = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
     @Override
     public double computeSequentially(double[] numbers, int start, int end) {
       double total = 0;
       for (int i = start; i < end; i++) {
         total += numbers[i];
       }
       return total;
     }
  }));
  final double average = total / population.length;
  double variance = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
    @Override
    public double computeSequentially(double[] numbers, int start, int end) {
      double variance = 0;
      for (int i = start; i < end; i++) {
        variance += (numbers[i] - average) * (numbers[i] - average);
      }
      return variance;
    }
 }));
 return variance / population.length;
}
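
Applied to the int[] sum from the question, the same idea collapses to a small RecursiveTask (a sketch only, not part of the code above; the threshold value is an arbitrary assumption):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class ArraySumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000_000;
    private final int[] data;
    private final int start;
    private final int end;

    ArraySumTask(int[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start <= THRESHOLD) {
            long sum = 0;                          // accumulate in a long to avoid overflow
            for (int i = start; i < end; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = start + (end - start) / 2;       // split the range in half
        ArraySumTask left = new ArraySumTask(data, start, mid);
        left.fork();                               // run the left half asynchronously
        long right = new ArraySumTask(data, mid, end).compute();
        return right + left.join();
    }
}

// usage: long total = new ForkJoinPool().invoke(new ArraySumTask(array, 0, array.length));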

4 Comments

  • What do you mean by unrolling a loop?
  • The divide and conquer approach is supported by the Fork/Join framework.
  • Note: the JIT can do loop unrolling already, so be careful that you don't harm performance when trying to do it manually.
  • Updated the answer to add an example of the multithreaded approach and an explanation of the unrolling.

If you want to add N numbers, the runtime is O(N), so in that respect your canonicalSum cannot be "optimized".
What you can do to reduce the wall-clock time is to make the summation parallel: break the array into parts, hand them to separate threads, and at the end add up the results returned by each thread (a sketch follows below).
Update: This implies a multicore system, but there is a Java API to get the number of cores.
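
A sketch of that approach (method and variable names are made up; it uses Runtime.getRuntime().availableProcessors() to size the thread pool and plain Callables for the partial sums):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

static long parallelSum(int[] array) throws Exception {
    int cores = Runtime.getRuntime().availableProcessors();   // the core-count API mentioned above
    ExecutorService pool = Executors.newFixedThreadPool(cores);
    try {
        int chunk = (array.length + cores - 1) / cores;        // one part per core
        List<Future<Long>> parts = new ArrayList<>();
        for (int start = 0; start < array.length; start += chunk) {
            final int from = start;
            final int to = Math.min(start + chunk, array.length);
            parts.add(pool.submit(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += array[i];
                }
                return sum;
            }));
        }
        long total = 0;
        for (Future<Long> part : parts) {
            total += part.get();                               // combine the per-thread partial sums
        }
        return total;
    } finally {
        pool.shutdown();
    }
}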

Comments

  • Due to context switching and the inherent overhead of threads, and given the simplicity of the loop, multithreading has a very good chance of making the code much slower. And if the machine has just 1 core, it's guaranteed to make it much slower.
  • @Cratylus According to Doug Lea, multithreading gives benefits only with 16+ cores.
  • @JBNizet: I updated to clarify that this makes sense for multicore, of course. Concerning the overhead of context switching, one can always benchmark. What you say is fair, but I prefer benchmarks and tests over theory.
  • @VolodymyrBakhmatiuk: I always prefer to test and measure rather than theorize. But that is my opinion.
  • @Cratylus: I didn't downvote. I just wanted to point out that doing the sum with 2 threads won't magically make the code twice as fast, or even faster at all. As you said, it has to be measured.
