Assembly memory math and looping

Question

I'm struggling to figure out how a certain block would function. With the following address on the heap

004B0000 73 6D 67 66 74 smgft

and the following assembly:

77A701B8 xor eax, eax
77A701BA mov ecx, 4
77A701BF lea edi, DWORD PTR DS:[ecx+4B0000]
77A701C5 xor DWORD PTR DS:[edi], ecx
77A701C7 loopd short ntdll.77A701BF

The problem is to provide the value of the five bytes on the heap in ASCII after the instructions have executed. What I can understand from it is as follows

xor eax, eax ; 0 out eax

mov ecx, 4 ; set ecx 4

lea edi, dword ptr ds:[ecx+4b0000] ; this loads into EDI whatever is stored at ecx+4b0000, so 4b0004. I'm not sure what this would grab. I'm not even sure what 4b0000 would get, since it's 5 bytes. mgft, or smgf? I think smgf? And how does the +4h affect this? Makes it 736D676678?

xor dword ptr ds:[edi], ecx ; So this will xor 4h with the newly loaded dword at edi, but what does it do with it in the loopd?

loopd short ntdll.77A701BF ; So this is a "loop while equal" but I'm not sure what that translates to with a xor above it. And does it decrement ecx? But then it jumps back to the lea line.

It's a loop, not loope (loop while equal), so it doesn't care about EFLAGS (which xor sets). See felixcloutier.com/x86/LOOP:LOOPcc.html
– Peter Cordes, Commented Aug 6, 2016 at 22:56

Brendan · Accepted Answer · 2016-08-06 23:01:16Z

The instruction lea edi, dword ptr ds:[ecx+4b0000] computes the address ecx+0x004b0000 and stores that number in EDI; it doesn't access memory at all. The loop instruction behaves like "ecx = ecx - 1; if (ecx != 0) goto ntdll.77A701BF".
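
To make those two points concrete, here is a minimal Python sketch that emulates the loop on the five heap bytes from the question. The names BASE, heap and run_loop are invented for this illustration, and the three zero bytes after the string are an assumption, since the question doesn't show what lives at 004B0005 and beyond.

```python
BASE = 0x004B0000  # heap address from the question

def run_loop(heap: bytearray) -> bytearray:
    """Emulate: mov ecx,4 / lea edi,[ecx+BASE] / xor dword [edi],ecx / loop."""
    ecx = 4
    while True:
        edi = ecx + BASE                 # lea: address arithmetic only, no memory access
        off = edi - BASE                 # index into our stand-in for the heap
        dword = int.from_bytes(heap[off:off + 4], "little")
        dword ^= ecx                     # xor dword ptr [edi], ecx
        heap[off:off + 4] = dword.to_bytes(4, "little")
        ecx -= 1                         # loop: decrement ecx ...
        if ecx == 0:                     # ... and fall through once it hits zero
            break
    return heap

# Five bytes from the question plus ASSUMED zero padding (real contents unknown).
heap = bytearray(b"smgft") + bytearray(3)
result = run_loop(heap)
print(result[:5].decode("ascii"))  # prints "sleep"
```

The byte at offset 0 is never touched, because the loop exits when ecx reaches zero instead of running a fifth time.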

Note that this code can be unrolled, so that it becomes:

    xor eax, eax

    lea edi, DWORD PTR DS:[4+0x004B0000]
    xor DWORD PTR DS:[edi], 0x00000004

    lea edi, DWORD PTR DS:[3+0x004B0000]
    xor DWORD PTR DS:[edi], 0x00000003

    lea edi, DWORD PTR DS:[2+0x004B0000]
    xor DWORD PTR DS:[edi], 0x00000002

    lea edi, DWORD PTR DS:[1+0x004B0000]
    xor DWORD PTR DS:[edi], 0x00000001

    xor ecx,ecx

Which can be optimised more, so it becomes:

    xor BYTE PTR DS:[0x004B0004], 0x04
    xor BYTE PTR DS:[0x004B0003], 0x03
    xor BYTE PTR DS:[0x004B0002], 0x02
    xor BYTE PTR DS:[0x004B0001], 0x01

    xor eax, eax       ;May be unnecessary if value unused by later code
    mov edi,0x004B0001 ;May be unnecessary if value unused by later code
    xor ecx, ecx       ;May be unnecessary if value unused by later code
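
Why narrowing DWORD to BYTE is safe here: each immediate is smaller than 0x100, so the upper three bytes of every dword get XORed with zero and keep their values. A rough Python check of that claim follows; xor_dword and xor_byte are hypothetical helpers written for this sketch, and the trailing AA BB CC bytes are arbitrary test values rather than anything from the question.

```python
def xor_dword(mem: bytearray, off: int, imm: int) -> None:
    """XOR the little-endian 32-bit value at mem[off:off+4] with imm."""
    value = int.from_bytes(mem[off:off + 4], "little") ^ imm
    mem[off:off + 4] = value.to_bytes(4, "little")

def xor_byte(mem: bytearray, off: int, imm: int) -> None:
    """XOR the single byte at mem[off] with imm."""
    mem[off] ^= imm

# Same starting bytes for both variants; the last three bytes are arbitrary.
wide = bytearray(b"smgft\xAA\xBB\xCC")
narrow = bytearray(wide)

for n in (4, 3, 2, 1):          # same order as the unrolled code
    xor_dword(wide, n, n)       # xor DWORD PTR [base+n], n
    xor_byte(narrow, n, n)      # xor BYTE PTR [base+n], n

same = wide == narrow
print(same)
```

The one difference in behaviour is that the dword form also reads and rewrites the three following bytes (with unchanged values), which the byte form never touches.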

Which can be optimised a little more by combining the XORs:

    xor DWORD PTR DS:[0x004B0001], 0x04030201

    xor eax, eax       ;May be unnecessary if value unused by later code
    mov edi,0x004B0001 ;May be unnecessary if value unused by later code
    xor ecx, ecx       ;May be unnecessary if value unused by later code

Note: Yes, this is a misaligned XOR, but it is likely faster than multiple smaller aligned XORs on modern CPUs, as the four bytes at 0x004B0001 to 0x004B0004 don't cross a cache line boundary.

Essentially, the entire loop can be reduced to a single instruction.
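
The constant 0x04030201 works because x86 is little-endian: the low byte 0x01 lands on the lowest address (004B0001) and 0x04 on the highest (004B0004). Here is a small Python sketch of that single XOR using the standard struct module; the variable names are made up for the example.

```python
import struct

mem = bytearray(b"smgft")            # the five heap bytes from the question
imm = 0x04030201                     # immediate from the combined XOR

# How the immediate is laid out in memory on a little-endian machine.
imm_bytes = imm.to_bytes(4, "little")

# xor DWORD PTR [0x004B0001], 0x04030201  (offset 1 within our buffer)
(value,) = struct.unpack_from("<I", mem, 1)
struct.pack_into("<I", mem, 1, value ^ imm)

print(imm_bytes.hex(" "))            # prints "01 02 03 04"
print(mem.decode("ascii"))           # prints "sleep"
```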

2 Comments

Peter Cordes Over a year ago

Even if it crossed a cache line boundary, it would probably be faster than 4 memory-destination byte XORs on most CPUs for most kinds of surrounding code. (memory-destination ALU insns always decode to multiple uops on Intel CPUs). If it crossed a page boundary (on pre-Skylake), then separate XORs would win for latency, but still not for throughput unless that slow-to-retire instruction stalled out-of-order execution later on.

Ped7g Over a year ago

which means the content of memory will turn into 004B0000 73 6C 65 65 70 sleep
