
In Java, declaring a variable volatile guarantees memory visibility: a write made by one thread is guaranteed to be observed by subsequent reads of that variable in other threads.
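
As a minimal Java illustration of that visibility guarantee (the class and method names below are my own, not from any library), a worker thread spinning on a volatile flag is guaranteed by the Java Memory Model to eventually see the writer's update and terminate:

```java
// Hypothetical demo: a worker spins on a volatile flag until the main
// thread publishes a write to it.
public class StopFlagDemo {
    static volatile boolean stop = false;

    // Returns true if the worker observed the volatile write and exited.
    static boolean run() throws InterruptedException {
        stop = false;
        Thread worker = new Thread(() -> {
            while (!stop) { }   // spins until the volatile write becomes visible
        });
        worker.start();
        Thread.sleep(50);
        stop = true;            // volatile write: must become visible to the worker
        worker.join(2000);      // the spin loop is guaranteed to exit under the JMM
        return !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker stopped: " + run());
    }
}
```

Without volatile, the JIT may hoist the read of stop out of the loop and the worker can spin forever; with it, the loop must re-read the shared value.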

On x86, the HotSpot virtual machine implements the memory-ordering guarantee of a volatile write with a lock-prefixed instruction, like this:

lock addl $0x0,(%esp)

From the Intel® 64 and IA-32 Architectures Software Developer’s Manual:

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor.

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.

I think "if the area of memory being locked during a LOCK operation is cached in the processor" means the cache line state is S, E, or M (or maybe only E or M?).

I have read a lot of information, and they say that cache locking is implemented through the MESI protocol.

I have read the MESI protocol, but I still have many doubts that I have not been able to solve.

First, can MESI implement cache locking? Second, how does a lock-prefixed instruction achieve memory consistency?

// volatile write (HotSpot, bytecodeInterpreter.cpp)
if (cache->is_volatile()) {
    // store the top-of-stack value with release semantics, by field type
    if (tos_type == itos) {
        obj->release_int_field_put(field_offset, STACK_INT(-1));
    } else if (tos_type == atos) {
        VERIFY_OOP(STACK_OBJECT(-1));
        obj->release_obj_field_put(field_offset, STACK_OBJECT(-1));
    } else if (tos_type == btos) {
        obj->release_byte_field_put(field_offset, STACK_INT(-1));
    } else if (tos_type == ztos) {
        int bool_field = STACK_INT(-1);  // only store LSB
        obj->release_byte_field_put(field_offset, (bool_field & 1));
    } else if (tos_type == ltos) {
        obj->release_long_field_put(field_offset, STACK_LONG(-1));
    } else if (tos_type == ctos) {
        obj->release_char_field_put(field_offset, STACK_INT(-1));
    } else if (tos_type == stos) {
        obj->release_short_field_put(field_offset, STACK_INT(-1));
    } else if (tos_type == ftos) {
        obj->release_float_field_put(field_offset, STACK_FLOAT(-1));
    } else {
        obj->release_double_field_put(field_offset, STACK_DOUBLE(-1));
    }
    // after the volatile write, insert a StoreLoad memory barrier
    OrderAccess::storeload();
} 
================================
inline void OrderAccess::storeload()  { 
    fence(); 
}
================================
inline void OrderAccess::fence() {
  if (os::is_MP()) {
    // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
    __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
  }
}

(One thing that confuses me: this lock-prefixed instruction is executed after the volatile write has already happened, so what is there left to lock?)

I think MESI alone cannot guarantee memory consistency once the store buffer and invalidate queue are introduced.

I think release_int_field_put means the value is written into the store buffer?

So, what does __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory") do to ensure memory consistency?
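
The effect of that StoreLoad barrier is visible at the Java level. Under the JMM, if two threads each write one volatile variable and then read the other, the outcome where both threads read 0 is forbidden; on x86 it is precisely the lock addl after the volatile store that rules it out. A sketch (the class name DekkerDemo is mine, not from any library):

```java
// Dekker-style litmus test: with volatile x and y, (r1, r2) == (0, 0)
// is forbidden by the Java Memory Model.
public class DekkerDemo {
    static volatile int x, y;
    static int r1, r2;   // read safely after join()

    // One iteration: each thread writes one volatile, then reads the other.
    static boolean bothReadZero() throws InterruptedException {
        x = 0; y = 0;
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        return r1 == 0 && r2 == 0;   // must never be true for volatile fields
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 20_000; i++) {
            if (bothReadZero()) {
                System.out.println("StoreLoad reordering observed at iteration " + i);
                return;
            }
        }
        System.out.println("r1 == r2 == 0 never observed");
    }
}
```

If x and y were plain (non-volatile) ints, each CPU's store could still sit in its store buffer while the subsequent load reads the other variable's old value from cache, making (0, 0) possible; draining the store buffer before later loads is exactly what the barrier is for.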

Suppose a volatile variable has been loaded into the caches of three CPUs (CPU1, CPU2, CPU3), so all their cache lines are in state S.

According to the MESI protocol, CPU3 starts to write the volatile variable. Because CPU3's cache line is in state S, while writing the modification to the store buffer it also sends an Invalidate message on the address bus to invalidate the other CPUs' cache lines (RFO). After receiving the Invalidate message, CPU1 and CPU2 send Invalidate ACKs and change their cache line states to I. Normally, once CPU3 has received ACKs from all other CPUs, it changes its cache line state to E, flushes the store buffer into the cache line, and finally changes the cache line state to M. If CPU1 then initiates a local read, its cache line is invalid, so it issues a read request on the bus, which is broadcast. When CPU3 receives the read request and finds its cache line in state M, MESI requires the line to be written back to memory first, after which CPU1 reads from memory. This ensures that CPU1 reads the latest data.
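
The transitions described above can be sketched as a toy state model. This is an illustrative simplification in Java (all names are mine; real hardware is of course not implemented this way): a write performs an RFO that invalidates every other copy, and a later read by another CPU forces the M owner to write back, leaving both lines Shared.

```java
// Toy MESI model for one cache line shared by three CPUs.
public class MesiToy {
    enum State { M, E, S, I }

    static State[] caches = new State[3];   // CPU0..CPU2

    // A write performs an RFO: invalidate all other copies, then go to M.
    static void write(int cpu) {
        for (int i = 0; i < caches.length; i++)
            if (i != cpu) caches[i] = State.I;   // Invalidate messages + ACKs
        caches[cpu] = State.M;                   // exclusive ownership, then modified
    }

    // A read by a CPU whose line is I: the M owner writes back; both end in S.
    static void read(int cpu) {
        for (int i = 0; i < caches.length; i++)
            if (i != cpu && caches[i] == State.M) caches[i] = State.S; // write-back
        if (caches[cpu] == State.I) caches[cpu] = State.S;
    }

    public static void main(String[] args) {
        caches[0] = caches[1] = caches[2] = State.S;  // all three loaded the line
        write(2);   // CPU3's write: others -> I, writer -> M
        System.out.println(java.util.Arrays.toString(caches)); // [I, I, M]
        read(0);    // CPU1's read: owner writes back, both -> S
        System.out.println(java.util.Arrays.toString(caches)); // [S, I, S]
    }
}
```

Note what this toy model deliberately leaves out: the store buffer. In the model every write commits to the cache line instantly, which is exactly the assumption the next paragraph questions.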

Will the following situation happen? CPU1 has already sent its ACK, but CPU2 is busy and has not. CPU3 will not flush the store buffer into the cache line until it receives ACKs from all other processors, so at this moment the store buffer holds the new value while CPU3's cache line is still in state S with the old value, and memory also still holds the old value. If CPU1 now starts a local read, finds its cache line in state I, and issues a read request on the bus, then, since CPU3's line is still S, CPU1 ends up getting the old value from memory. This is why MESI cannot guarantee memory consistency once a store buffer is added.

The above process is shown below:

[diagrams omitted] CPU1 reads A; CPU2 reads A; CPU3 reads A; CPU3 modifies the contents of A in a register and writes it to the store buffer.

There must be a mechanism that guarantees that while CPU3 is writing to a cache line, other processors can neither read nor write that cache line, so that the CPU can complete a full write: write the value into the cache and then change the cache line state to M. So who provides this mechanism? The MESI protocol? From my understanding, MESI doesn't seem to provide it. The lock-prefixed instruction? But according to Intel's documentation, if the line is already cached, the processor locks the cache rather than the bus, and this cache lock, according to some people, is implemented by MESI. Therefore, I am confused.

I know that preventing reordering requires a memory barrier, but I'm wondering: for cache coherency, isn't the lock prefix effectively a no-op when there's no need to assert the LOCK# signal on the bus? Is the lock implemented through bus arbitration?

I'm even starting to suspect that the claim "volatile guarantees reading the latest value every time" is wrong, and that always reading the latest value would require every volatile write to reach memory (or a cache line in state M) as an atomic operation, which, as I understand it now, cannot be done. My current understanding is only that each read must come from main memory.

Here's what I think a cache lock looks like, rather than relying solely on MESI: the lock control unit usually consists of a cache controller and a lock status word.

  • When a CPU needs to access a shared resource, it sends a request to the cache and sets a lock flag in the lock status word.

  • The cache controller checks whether another CPU has already acquired the lock, and prevents the other CPUs from accessing the shared resource until the current CPU releases it.

  • A terminology mistake in the title: lock is not an instruction or a "prefix instruction"; it is an instruction prefix (it doesn't make sense alone without being followed by one of the instructions that support interlocked operation). Commented Mar 15, 2024 at 13:31
  • lock add $0, (%rsp) is a full memory barrier like any locked instruction: it waits for all earlier loads to complete and the store buffer to drain before doing the RMW, and no later loads can take a value from L1d cache until after it runs. (Later stores could potentially be added to the store buffer; x86 commits stores in order, and any ISA only does so after they retire from the ROB = reorder buffer.) Related: Does lock xchg have the same behavior as mfence? / Which is a better write barrier on x86: lock+addl or xchgl? Commented Mar 15, 2024 at 18:03
  • Also related: Is incrementing an int effectively atomic in specific cases? - my answer mentions how a CPU running lock add will keep the cache line pinned in Modified or Exclusive state from the load to the store, so it's Invalid in all other caches. i.e. this core keeps exclusive ownership of the line the whole time. (Not just Shared state, that would allow other reads to happen, or worse other atomic RMWs to start!) Commented Mar 15, 2024 at 18:03
  • In MESI, a core wanting to store needs to do a read-for-ownership to get a current copy of the line as well as invalidating other copies. When it gets a reply to that, it has Exclusive ownership of that cache line, and committing the store flips the line from E to M state while no other caches have a valid copy. x86's strong memory ordering rules already give acquire/release semantics (and no IRIW reordering), so recovering sequential consistency on top of that just requires blocking StoreLoad reordering between volatile accesses, e.g. use (implicit lock) xchg for pure stores. Commented Mar 15, 2024 at 19:20
  • See C++ How is release-and-acquire achieved on x86 only using MOV? re: the other x86 memory-ordering rules that give that part. Commented Mar 15, 2024 at 19:20
