I have code that has worked well when compiled with gcc but fails to work correctly when compiled with clang. I tracked down the issue to a read of volatile memory (on a microcontroller). I found by stepping in assembly that the memory associated with pma_address[i] is read into a register but the values set in the register are never used. So the handle_setup_packet is never called presumably because the optimizer determined that buffer will never change. By making buffer_out into a volatile uint8_t* the code does function correctly. But I don't see why it would need to be volatile in this case. I would think one volatile would be enough to convince the compiler that buffer is worth looking at. I'm wondering what my logic error is in the following code:
void read_pma(uint8_t byte_count, volatile uint16_t * pma_address, uint8_t *buffer_out) {
int count_received_16 = (byte_count + 1) >> 1;
for(int i=0; i<count_received_16; i++) {
((uint16_t *) buffer_out)[i] = pma_address[i];
}
}
void fun() {
uint8_t buffer[64];
read_pma(byte_count, USBPMA->buffer[0].EP_RX, buffer);
handle_setup_packet(reinterpret_cast<usb_control_request *>(buffer));
}
Updates
I'm adding additional information to address the comments. From the suggestions in the comments I have learned more about strict aliasing and have found what I think is the main problem. The compiler is clang 20.1.0 provided by ARM: https://github.com/arm/arm-toolchain with the options -target=armv7m-none-eabi -O3 -flto. The assembly generated is quite large due to the loop unrolling it is doing, but generally there a number of sections like below, where it is loading multiple times out the pma_address memory location and writing repeatedly to a scratch register.
; ((uint16_t *) buffer_out)[i] = pma_address[i];
10000776: f835 3c1e ldrh r3, [r5, #-30]
1000077a: f835 3c1c ldrh r3, [r5, #-28]
1000077e: f835 3c1a ldrh r3, [r5, #-26]
10000782: f835 3c18 ldrh r3, [r5, #-24]
10000786: f835 3c16 ldrh r3, [r5, #-22]
1000078a: f835 3c14 ldrh r3, [r5, #-20]
I found several several ways that I can get it to generate the intended assembly below:
; ((uint16_t *) buffer_out)[i] = pma_address[i];
8004370: 885f ldrh r7, [r3, #0x2]
8004372: eb00 0144 add.w r1, r0, r4, lsl #1
8004376: f8a1 704d strh.w r7, [r1, #0x4d]
800437a: eb08 0744 add.w r7, r8, r4, lsl #1
800437e: f8b7 20a2 ldrh.w r2, [r7, #0xa2]
; for(int i=0; i<count_received_16; i++) {
8004382: 3404 adds r4, #0x4
; ((uint16_t *) buffer_out)[i] = pma_address[i];
8004384: f8a1 204f strh.w r2, [r1, #0x4f]
8004388: f8b7 20a4 ldrh.w r2, [r7, #0xa4]
Where the assembly is both loading and storing the information obtained from pma_address. The solutions I found (without adding volatile) that result in a correctly operating program are 1) compile without -flto, 2) compile with -fno-strict-aliasing, and 3) change the code to address the aliasing violations.
void read_pma(uint8_t byte_count, volatile uint16_t * pma_address, uint8_t *buffer_out) {
int count_received_16 = (byte_count + 1) >> 1;
for(int i=0; i<count_received_16; i++) {
uint16_t word = pma_address[i];
buffer_out[2*i] = word & 0xFF;
buffer_out[2*i+1] = (word >> 8) & 0xFF;
}
}
fun() {
read_pma(byte_count, USBPMA->buffer[0].EP_RX, buffer);
usb_control_request ctrl_req {};
std::memcpy(&ctrl_req, buffer, sizeof(usb_control_request));
handle_setup_packet(&ctrl_req);
}
I tried but could not create a version of this code in Godbolt Compiler Explorer and could only obtain the unintended behavior in the context of the full program. The main thing that surprised me in the debugging is that with LTO on the compiler can fully ignore the handle_setup_packet function due to the aliasing violation. I tested this by placing invalid code in that function asm(" invalid_op;"); in that code which generated no errors when compiled in LTO mode.
reinterpret_cast<…>, so that's the appropriate tag to use.uint16_t, depending how compiler trat cstdint types. You may accidenty causing UB by going out of bounds also, some combination of compiler\ISA for some reason have agressive alterations of code (e.g. make loop infinite or throw it out, can see that only in assembly). I would also make whatever serves as buffer volatile to prevent optimization infun()body.