
I came across answers here on SO saying that Java flushes the working copies of variables back to main memory on exit from a synchronized block, and similarly reads all variables fresh from main memory once on entry into the synchronized section.

However, I have some fundamental questions around this:

  1. What if I access mostly non-volatile instance variables inside my synchronized section? Will the JVM automatically cache those variables in CPU registers on entering the block, carry out all the necessary computations, and only flush them back at the end?

  2. I have a synchronized block as below. The underscore-prefixed variables, e.g. _callStartsInLastSecondTracker, are all instance variables that I access heavily in this critical section.

public CallCompletion startCall()
{
  long currentTime;
  Pending pending;
  synchronized (_lock)
  {
    currentTime = _clock.currentTimeMillis();
    _tracker.getStatsWithCurrentTime(currentTime);
    _callStartCountTotal++;
    _tracker._callStartCount++;
    if (_callStartsInLastSecondTracker != null)
      _callStartsInLastSecondTracker.addCall();
    _concurrency++;
    if (_concurrency > _tracker._concurrentMax) 
    { 
      _tracker._concurrentMax = _concurrency;
    }
    _lastStartTime = currentTime;
    _sumOfOutstandingStartTimes += currentTime;
    pending = checkForPending();
  }
  if (pending != null) 
  {
    pending.deliver();
  }
  return new CallCompletionImpl(currentTime);
}

Does this mean that all these operations (+=, ++, >, etc.) require the JVM to interact with main memory repeatedly? If so, can I cache the instance variables in local variables (preferably stack-allocated primitives), perform the operations on those, and assign them back to the instance variables at the end? Will that help optimise the performance of this block?
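To illustrate, here's a hypothetical sketch of the rewrite I mean. The field names mirror my code above, but the class itself is invented for illustration, and whether this actually beats what the JIT already does is precisely my question:

class CallTrackerSketch
{
  private final Object _lock = new Object();
  private int _concurrency;
  private int _concurrentMax;
  private long _sumOfOutstandingStartTimes;

  void startCall(long currentTime)
  {
    synchronized (_lock)
    {
      // read the fields into locals once...
      int concurrency = _concurrency + 1;
      long sum = _sumOfOutstandingStartTimes + currentTime;
      if (concurrency > _concurrentMax)
      {
        _concurrentMax = concurrency;
      }
      // ...and write them back once before releasing the lock
      _concurrency = concurrency;
      _sumOfOutstandingStartTimes = sum;
    }
  }
}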

I have such blocks in other places as well. Running JProfiler shows that threads are in the WAITING state most of the time and throughput is also very low, hence the need for optimisation.

Appreciate any help here.

  • I'd assume that the JVM can and will optimize them into registers if you access them repeatedly inside the same synchronized block. i.e. the opening { and closing } are memory barriers (acquiring and releasing the lock), but within that block the normal rules apply. The normal rules for non-volatile vars are like in C++: compiler can keep private copies / temporaries and do full optimization. I don't really know Java so not posting this as an answer in case my assumptions are wrong. Commented May 29, 2020 at 6:41
  • As I said before, this is an issue of bigger design, not micro-level optimization. It's not about writing to main memory, it's about having a huge mutex with synchronized(lock). I think you're going to need some outside help. Based on the code it looks like your tracker is the bottleneck, but it's hard to know why without seeing the Tracker code (and whether there might be a way to avoid or lessen the bottleneck). Commented May 29, 2020 at 6:47
  • How are you benchmarking this? You have a set of concurrent threads calling the same method in a tight loop? Is this what happens in production? If so, then I would really reconsider the design and try to get rid of the lock. Probably a good idea to set up some micro-benchmarks using JMH (see the sketch after these comments) and some macro-benchmarks to make sure you are not optimizing something that doesn't matter in a production environment. Commented May 29, 2020 at 7:26
  • @deGee the code is not looking great. There seems to be a custom implementation of CopyOnWriteArrayList, for example. There are also so many sync blocks on lock that it's no surprise the throughput is bad. Finally, as a side note, the code style suggests it wasn't written by someone who writes primarily Java. It's also hard to see what exactly is the thread-unsafe part being guarded by the sync blocks, or whether they're just being used "for safety" in too many places. Commented May 29, 2020 at 7:38
  • Are you not introducing an artificial problem? Have you confirmed it is actually a problem in production? Many JDK classes are synchronized and will give awful performance when called concurrently in a tight loop. But in most cases it isn't an issue, since the rate of calls is low and so is the number of concurrent threads calling. Commented May 29, 2020 at 7:39

2 Answers


(I don't know Java that well, just the underlying locking and memory-ordering concepts that Java is exposing. Some of this is based on assumptions about how Java works, so corrections welcome.)

I'd assume that the JVM can and will optimize them into registers if you access them repeatedly inside the same synchronized block.

i.e. the opening { and closing } are memory barriers (acquiring and releasing the lock), but within that block the normal rules apply.

The normal rules for non-volatile vars are like in C++: the JIT-compiler can keep private copies / temporaries and do full optimization. The closing } makes any assignments visible before marking the lock as released, so any other thread that runs the same synchronized block will see those changes.

But if you read/write those variables outside a synchronized(_lock) block while this synchronized block is executing, there's no ordering guarantee and only whatever atomicity guarantee Java has. Only volatile would force a JVM to re-read a variable on every access.
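To make that last point concrete, here's a minimal sketch (a hypothetical class, not from the question's code): without volatile the JIT is allowed to hoist the read out of the loop and spin forever on a stale register copy; with volatile, every check re-reads the field.

class SpinFlag
{
  private volatile boolean running = true; // drop volatile and worker() may never see stop()

  void worker()
  {
    while (running)
    {
      // do work; each iteration re-reads `running` because it's volatile
    }
  }

  void stop()
  {
    running = false; // guaranteed to become visible to worker()
  }
}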


most of the time threads are in WAITING state and throughput is also very low. Hence the optimisation necessity.

The things you're worried about wouldn't really explain this. Inefficient code-gen inside the critical section would make it take somewhat longer, and that could lead to extra contention.

But there wouldn't be a big enough effect to make most threads be blocked waiting for locks (or I/O?) most of the time, compared to having most threads actively running most of the time.

@Kayaman's comment is most likely correct: this is a design issue, doing too much work inside one big mutex. I don't see loops inside your critical section, but presumably some of those methods you call contain loops or are otherwise expensive, and no other thread can enter this synchronized(_lock) block while one thread is in it.


The theoretical worst case slowdown for store/reload from memory (like compiling C in anti-optimized debug mode) vs. keeping a variable in a register would be for something like while (--shared_var >= 0) {}, giving maybe a 6x slowdown on current x86 hardware. (1 cycle latency for dec eax vs. that plus 5 cycle store-forwarding latency for a memory-destination dec). But that's only if you're looping on a shared var, or otherwise creating a dependency chain through repeated modification of it.

Note that a store buffer with store-forwarding still keeps it local to the CPU core without even having to commit to L1d cache.

In the much more likely case of code that just reads a var multiple times, anti-optimized code that really loads every time can have all those loads hit in L1d cache very efficiently. On x86 you'd probably barely notice the difference, with modern CPUs having 2/clock load throughput, and efficient handling of ALU instructions with memory source operands, like cmp eax, [rdi] being basically as efficient as cmp eax, edx.

(CPUs have coherent caches so there's no need for flushing or going all the way to DRAM to ensure you "see" data from other cores; a JVM or C compiler only has to make sure the load or store actually happens in asm, not optimized into a register. Registers are thread-private.)

But as I said, there's no reason to expect that your JVM is doing this anti-optimization inside synchronized blocks; even if it were, it might amount to something like a 25% slowdown.


2 Comments

  • Thanks. I have shared the code here - codeshare.io/adQBBK. Not updating it in the question itself since the link is ephemeral. Please see my reply to Kayaman's comment above for a few more details.
  • @deGee: I was only really interested in answering the compiler-optimization / how-it-compiles part of the question to rule out that guess, not in redesigning whatever your code does that holds onto the lock for too much of the time. That's basically a separate question. Good luck with optimizing / redesigning your code.

You are accessing members of a single object. When the CPU reads the _lock field, it first needs to load the cache line containing it, so quite a few of the other member variables will probably be on that same cache line, which is then already in your cache.

I would be more worried about the synchronized block itself, IF you have determined it is actually a problem; it might not be a problem at all. For example, Java uses quite a few lock optimization techniques like biased locking and adaptive spinning to reduce the cost of locks.

But if it is a contended lock, you might want to shorten the duration of the lock by moving as much work out of it as possible, and perhaps even get rid of the lock entirely and switch to a lock-free approach.
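As a rough sketch of what lock-free could mean for the simple counters in your block (the names are hypothetical; the CAS loop is the standard idiom for a lock-free running maximum):

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

class LockFreeCallStats
{
  private final LongAdder _callStartCountTotal = new LongAdder();
  private final AtomicLong _concurrency = new AtomicLong();
  private final AtomicLong _concurrentMax = new AtomicLong();

  void startCall()
  {
    _callStartCountTotal.increment(); // LongAdder: cheap under heavy contention
    long current = _concurrency.incrementAndGet();
    long max = _concurrentMax.get();
    while (current > max // CAS loop: publish a new maximum if we raised it
        && !_concurrentMax.compareAndSet(max, current))
    {
      max = _concurrentMax.get();
    }
  }

  void endCall()
  {
    _concurrency.decrementAndGet();
  }
}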

I would not trust JProfiler for a second: http://psy-lob-saw.blogspot.com/2016/02/why-most-sampling-java-profilers-are.html. It might be that JProfiler is pointing you in the wrong direction.
