How to benchmark host multi-threaded CPU performance in Java?

Question

I need to create a simple Java app that returns just one number: estimated CPU performance. For example, when I run it on a machine with 4 cores, I should get roughly twice the number I would get on a machine with 2 cores. The app should use 100% CPU for several seconds to measure that. I'm really not worried about accuracy.

I was really surprised that I couldn't find any Java library that already does that. Of course there are tools in other languages, but in my environment only Java is approved.

My current idea is to use classes from SciMark 2.0 in my code and run it from multiple threads, however this tool looks very messy (e.g. class names beginning with lowercase letters) and I need to write custom code to run these threads and combine the results.

Can I do any better to solve that problem?

CPU performance while doing what? It might matter what you are actually trying to measure. The normal way to do this is to measure the total time to complete a task.
– markspace, Commented Apr 10, 2019 at 15:00
If you're on Linux, just read the bogomips value from /proc/cpuinfo
– rkosegi, Commented Apr 10, 2019 at 15:28
@markspace I don't care. As I said, accuracy doesn't matter at all for me, just rough numbers. Ideally I'm looking for a ready-made solution with whatever assumptions. There will be various tasks to perform, as these are Jenkins agents
– Michal Kordas, Commented Apr 10, 2019 at 15:35
@rkosegi I cannot use /proc/cpuinfo for that; this benchmark must run on demand (VM performance may change without a restart)
– Michal Kordas, Commented Apr 10, 2019 at 15:38
Then I would just benchmark the task at hand, and record its performance. If that performance changes over time, you can investigate the change then. This is better because it measures the time for your actual task, not some arbitrary benchmark.
– markspace, Commented Apr 10, 2019 at 15:47

Michal Kordas · Accepted Answer · 2019-04-15 14:50:48Z

3

This is the simplest piece of code that does what I wanted. It estimates multi-threaded CPU performance by summing the square roots of consecutive integers. The iterations variable can be adjusted to lengthen or shorten the benchmark. On my machine, with the default values, it takes about 7 seconds.

import static java.util.stream.IntStream.rangeClosed;

class Benchmark {
    public static void main(String[] args) {
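        // Work per task; raise or lower this to lengthen or shorten the run.
        final int iterations = 100_000_000;
        long start = System.currentTimeMillis();
        // 50 independent tasks spread over the common fork-join pool,
        // each summing the square roots of 1..iterations.
        rangeClosed(1, 50).parallel()
                .forEach(i -> rangeClosed(1, iterations).mapToDouble(Math::sqrt).sum());
        // Elapsed wall-clock time in milliseconds (lower is faster).
        System.out.println(System.currentTimeMillis() - start);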
    }
}

answered Apr 15, 2019 at 14:50

Michal Kordas


6 Comments

tsquared Over a year ago

Nice job providing a useful answer to your own question sans rhetoric

Peter Cordes Over a year ago

So you're measuring FP sqrt throughput. That's oddly specific and not highly correlated with most FP workloads. For example, you'll see a very big speedup (about a factor of 2) going from Haswell to Broadwell or Skylake at the same clock speed, while mul/add/FMA throughput hasn't changed. BDW (Broadwell) introduced a higher-radix divide/sqrt unit. See the agner.org/optimize instruction tables and look at sqrtsd throughput (8 to 14 vs. 4 to 8), assuming it doesn't vectorize with SIMD. If it does vectorize, Skylake would provide another big speedup for 128- and 256-bit wide vectors.

Michal Kordas Over a year ago

@PeterCordes I'm not using it to measure which virtual machine is better or faster. I just need a sanity check that my machine performs roughly the same as it did some time ago and still gets a similar amount of CPU cycles from the bare metal.

Peter Cordes Over a year ago

Ok, yes, it should run the same way every time on the same hardware. (Or faster if a JVM ever figures out how to auto-vectorize, or even optimize away a sum that you don't assign anywhere.) Scaling with number of threads probably won't be helped by hyperthreading; a single hardware thread running this (if it JITs anywhere near efficiently) can probably saturate the sqrt unit.

Michal Kordas Over a year ago

@PeterCordes you are right. In that case, it looks like a Docker container with a hardcoded Java version would be a solution for keeping the result stable
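Following up on the point above that a JVM could optimize away a sum that is never assigned, here is a minimal sketch of the same sqrt benchmark that keeps and prints the result. The names SqrtBenchmark, sqrtSum and runParallel are made up for illustration, and the switch to System.nanoTime() is a choice made here, not part of the original answer.

```java
import java.util.stream.IntStream;

class SqrtBenchmark {
    // Sums sqrt(1..n); returning the value keeps the work observable.
    static double sqrtSum(int n) {
        return IntStream.rangeClosed(1, n).mapToDouble(Math::sqrt).sum();
    }

    // Runs `tasks` copies of sqrtSum on the common fork-join pool
    // and returns the combined total.
    static double runParallel(int tasks, int n) {
        return IntStream.rangeClosed(1, tasks).parallel()
                .mapToDouble(i -> sqrtSum(n))
                .sum();
    }

    public static void main(String[] args) {
        final int iterations = 100_000_000;
        long start = System.nanoTime();
        double total = runParallel(50, iterations);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Printing the total makes it a live value, so the summation
        // cannot be discarded as dead code.
        System.out.println(elapsedMs + " ms (checksum " + total + ")");
    }
}
```

The checksum itself carries no meaning; it only has to be used somewhere so the loop stays in the compiled code.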


Stephen C · Accepted Answer · 2019-04-10 15:24:36Z

2

If I understand you correctly, your goal is to measure system performance rather than application performance.

Here's the problem. System performance cannot be reduced to a single meaningful number. In reality, system performance ... even CPU performance is multi-dimensional.

For example, an application that is memory intensive will perform differently on different machines depending on the CPU chip's memory cache size and design ... and the memory speed. But if the application is compute intensive, then the performance will depend more on the clock rate and core count.

Then there are issues like the effects of NUMA cells and thread pinning when the core count is high and/or you have multiple CPU chips.

These and similar issues are why benchmarks that attempt to measure raw CPU performance independent of the application have largely fallen out of favor. (MIPS originally meant million (hardware) instructions per second. It is now often referred to as mythical instructions per second ... alluding to the bogosity of the measure as a predictor of real application performance.)

edited Apr 10, 2019 at 15:24

answered Apr 10, 2019 at 15:16

Stephen C


3 Comments

Michal Kordas Over a year ago

Fully agreed. As I highlighted, I only need a very rough number. I don't care about details. I just need to detect that, for some reason, the performance of this particular virtual machine has dropped over time (e.g. because the physical server was over-allocated). And I care only about order-of-magnitude changes, e.g. this VM was able to calculate 1M digits of Pi yesterday in 1 minute but today it took 10 minutes, so something is definitely wrong.

Stephen C Over a year ago

Well ... if you are looking for a random meaningless indicator, calculate the first D digits of Pi N times in N parallel threads. And measure clock time or cpu time using one of these: stackoverflow.com/a/7467299/139985

Stephen C Over a year ago

But if your goal is to measure the performance of a VM whose performance you suspect is dropping due to over-commit, then a benchmark that measures CPU performance of single or multiple threads is not enough. Why? Because you also need to consider RAM over-commit and I/O or device saturation. For a typical Java application, these things can have a much more severe impact on performance than simple CPU <-> VCPU over-commit.
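The suggestion above (run a fixed task N times in N parallel threads and measure clock time or CPU time) could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the thread: the class WallVsCpuTime and the methods work and measure are hypothetical names, a sqrt loop stands in for the Pi calculation, and per-thread CPU time is read through the standard ThreadMXBean, which may be unsupported or disabled on some JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WallVsCpuTime {
    // Fixed, arbitrary unit of CPU work (assumption: stands in for "digits of Pi").
    static double work(int n) {
        double sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += Math.sqrt(i);
        }
        return sum;
    }

    // Runs work(n) once on each of `threads` threads.
    // Returns {wall-clock nanoseconds, summed per-thread CPU nanoseconds}.
    static long[] measure(int threads, int n) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            long wallStart = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    long cpuStart = mx.getCurrentThreadCpuTime();
                    double result = work(n);
                    long cpu = mx.getCurrentThreadCpuTime() - cpuStart;
                    // Use the result so the loop is not dead code.
                    return result > 0 ? cpu : -1L;
                }));
            }
            long cpuTotal = 0;
            for (Future<Long> f : futures) {
                cpuTotal += f.get();
            }
            return new long[] {System.nanoTime() - wallStart, cpuTotal};
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        long[] r = measure(threads, 50_000_000);
        System.out.printf("wall: %d ms, cpu (all threads): %d ms%n",
                r[0] / 1_000_000, r[1] / 1_000_000);
    }
}
```

Recording both numbers over time gives two rough indicators instead of one. How hypervisor over-commit shows up in each of them depends on the OS and virtualization layer, so treat any interpretation as an assumption to verify on the actual VM.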

Gonzalo Matheu · Accepted Answer · 2019-04-10 15:15:49Z

0

Java Microbenchmark Harness (JMH) is a toolkit for implementing benchmarks of Java code.

It measures Throughput or Average Time; you could use that to estimate CPU cycles.

Basically, you need to annotate with @Benchmark the method you want to benchmark. This method

There are a few JMH usage samples in their repository.

It is always recommended to leave the computer alone while it runs the benchmarks, and you should close all other applications (if possible). If your computer is running other applications, they may take time from the CPU and give incorrect (lower) performance numbers.

If you want to dig further into CPU performance (cycles, cache usage, instructions, etc.), you will probably need to use Linux perf.

edited Apr 10, 2019 at 15:15

answered Apr 10, 2019 at 15:02

Gonzalo Matheu


3 Comments

Michal Kordas Over a year ago

I don't need to measure the performance of the code. I'm looking for a Java library (or an idea for how to write such a library) that will trigger some CPU-heavy tasks that exercise all available threads for a configurable amount of time, and as a result I will get a number that roughly says something about the current CPU performance of this VM.

Gonzalo Matheu Over a year ago

JMH's Blackhole class has a consumeCPU method that just burns CPU time while avoiding JIT optimizations

Michal Kordas Over a year ago

OK, but this is still single-threaded. I'm looking for a consumeAllCpus(long tokens) method instead; otherwise it's perfect.
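No consumeAllCpus method is mentioned as existing anywhere in this thread, but the idea (keep every thread busy for a configurable amount of time and return one rough number) can be sketched with plain java.util.concurrent. Everything named here (CpuScore, score, BATCH) is hypothetical, and the sqrt loop is just an arbitrary CPU-bound workload, not JMH's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CpuScore {
    // Number of sqrt operations between deadline checks (arbitrary choice).
    static final int BATCH = 100_000;

    // Keeps `threads` threads busy for about `durationMillis` and returns
    // the number of completed batches per second, summed over all threads.
    static long score(int threads, long durationMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long deadline = System.nanoTime() + durationMillis * 1_000_000L;
            List<Future<Long>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    long batches = 0;
                    double sink = 0;
                    while (System.nanoTime() - deadline < 0) {
                        for (int i = 1; i <= BATCH; i++) {
                            sink += Math.sqrt(i);
                        }
                        batches++;
                    }
                    // Use `sink` so the inner loop is not dead code.
                    return sink > 0 ? batches : 0L;
                }));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();
            }
            return total * 1000 / Math.max(1, durationMillis);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        // Higher is faster; only compare scores taken on the same JVM version.
        System.out.println(score(threads, 3_000));
    }
}
```

Because the run is bounded by time rather than by iteration count, it takes about the same time on a slow machine as on a fast one, and the score should grow with the number of cores actually available to the VM.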

tsquared · Accepted Answer · 2020-01-24 23:04:15Z

Michal, thanks for your answer. I borrowed it and added some threading to help me diagnose a virtual CPU performance issue on a client's AIX machine.

import static java.util.stream.IntStream.rangeClosed;

public class Main {

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: benchmark [million iterations] [maxThreads]");
            return;
        }

        final int MILLION = 1_000_000;
        final int iterations = Integer.parseInt(args[0]);
        final int maxThreads = Integer.parseInt(args[1]);

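        // Try every thread count from 1 up to and including maxThreads.
        for (int threads = 1; threads <= maxThreads; threads++) {
            long start = System.currentTimeMillis();
            // Split the total work evenly; long arithmetic plus toIntExact fails
            // loudly instead of silently overflowing int for large inputs.
            int count = Math.toIntExact((long) iterations * MILLION / threads);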
            rangeClosed(1, threads).parallel()
                .forEach(i -> rangeClosed(1, count).mapToDouble(Math::sqrt).sum());

            System.out.println(String.format("Benchmark of %d M iterations on %d thread(s): %d ms", iterations, threads, System.currentTimeMillis() - start));
        }

    }

}
