x86 32bit Assembly Parser | logical problem

Question

I'm currently working on an Obfuscator for assembled x86 assembly (working with the raw bytes).

To do that I first need to build a simple parser to "understand" the bytes. I'm using a database that I built myself, mostly with the help of this website: https://defuse.ca/online-x86-assembler.htm

Now my question: Some bytes can be interpreted in two ways, for example (intel syntax):

1. f3 00 00                repz add BYTE PTR [eax],al
2. f3                      repz

My idea was to loop through the bytes and handle each instruction on its own, but when I reach the byte '0xf3' I have two ways of interpreting it.

I know there are working x86 disassemblers out there. How do I know which case this is?

Both ways are invalid instructions, so I'm not sure why it matters. A rep prefix has to be followed by one of the specific instructions for which it's defined, and add isn't one of them.
– Nate Eldredge, Commented Sep 6, 2021 at 18:21
Related: How does an instruction decoder tell the difference between a prefix and a primary opcode?
– Peter Cordes, Commented Sep 6, 2021 at 19:09
Also related: "mandatory prefixes" as part of encoding instructions like SSE2 movdqa: Combining prefixes in SSE
– Peter Cordes, Commented Sep 6, 2021 at 19:13

Alex Guteniev · Accepted Answer · 2021-09-06 18:27:00Z

4

Prefixes, including the repz prefix, are not meaningful without a subsequent instruction. The subsequent instruction may incorporate the prefix (repz nop is pause), change its meaning (repz is xrelease if used before some interlocked instructions), or the prefix may simply be invalid.
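To illustrate how the byte after the prefix decides what f3 means, here is a minimal Python sketch. The table `F3_MEANING` and the helper `describe_f3` are hypothetical names made up for this example, and the table covers only three well-known encodings, not the full instruction set.

```python
# Toy table (an assumption for illustration): what F3 means in front of a few opcodes.
F3_MEANING = {
    0x90: "pause",       # F3 90: the prefix is absorbed into a different instruction
    0xA4: "rep movsb",   # F3 A4: F3 acts as a repeat prefix on a string instruction
    0xA6: "repz cmpsb",  # F3 A6: same prefix byte, read as repeat-while-equal
}


def describe_f3(code: bytes) -> str:
    """Describe an F3-prefixed instruction using only the toy table above."""
    if len(code) < 2 or code[0] != 0xF3:
        raise ValueError("expected F3 followed by an opcode byte")
    # Anything outside the table is simply unknown to this sketch.
    return F3_MEANING.get(code[1], "not in this toy table")


print(describe_f3(bytes.fromhex("f390")))    # pause
print(describe_f3(bytes.fromhex("f3a4")))    # rep movsb
print(describe_f3(bytes.fromhex("f30000")))  # not in this toy table
```

A real database would key on the opcode (and sometimes the ModRM byte) first and treat the prefix as a modifier of that entry, rather than treating f3 as an instruction of its own.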

Decoding is always unambiguous; otherwise the CPU could not execute instructions. It can be ambiguous only if you don't know the exact byte offset at which to begin decoding, since x86 instructions have variable length.
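The point about the starting offset can be shown with a small Python sketch. It assumes a toy decoder that knows only three one-byte opcodes (mov eax, imm32; nop; ret); `LENGTHS`, `NAMES`, and `decode_from` are hypothetical names for this example, not part of any real disassembler.

```python
# Toy tables (assumption: only these three opcodes ever appear).
LENGTHS = {
    0xB8: 5,  # mov eax, imm32: opcode byte + 4 immediate bytes
    0x90: 1,  # nop
    0xC3: 1,  # ret
}
NAMES = {0xB8: "mov eax, imm32", 0x90: "nop", 0xC3: "ret"}


def decode_from(code: bytes, start: int) -> list:
    """Linearly decode `code` beginning at byte offset `start`."""
    decoded = []
    pos = start
    while pos < len(code):
        opcode = code[pos]
        if opcode not in LENGTHS:
            raise ValueError(f"opcode {opcode:#04x} is not in the toy table")
        if pos + LENGTHS[opcode] > len(code):
            raise ValueError("instruction runs past the end of the buffer")
        decoded.append(NAMES[opcode])
        pos += LENGTHS[opcode]  # the length comes from the opcode, not from the next byte
    return decoded


code = bytes.fromhex("b890909090c3")
print(decode_from(code, 0))  # ['mov eax, imm32', 'ret']
print(decode_from(code, 1))  # ['nop', 'nop', 'nop', 'nop', 'ret']
```

The same six bytes give two different, each internally consistent, instruction streams. Nothing in the bytes themselves says which is right; only the entry point does.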


9 Comments

Peter Cordes Over a year ago

decoding is always unambiguous - or at least, any given CPU will pick one way of decoding. Intel's manual says it's "illegal" to have multiple REX prefixes on one instruction, but their Skylake CPUs for example will take the last one (like with other repeated prefixes), not #UD fault. There is AFAIK no Intel documentation that says this is what will happen. But yes, they're still REX prefixes, so unambiguous in that sense.

Peter Cordes Over a year ago

Finally found the Q&A where I'd tested repeated REX prefixes: Segmentation fault when using DB (define byte) inside a function

Happy Jerry Over a year ago

@PeterCordes Just clarifying, when parsing the subsequent instructions, all you need to do is look for the prefix bytes? To get all the bytes for an instruction, you simply go from the prefix to the next prefix - 1?

Peter Cordes Over a year ago

@HappyJerry: Yeah, any number of prefixes can be part of one instruction. The first non-prefix byte is the opcode. (There's a length limit of 15 bytes per instruction, so #UD if you don't get to the end of an instruction before then, even if you've seen opcode + modrm which tell you how many more bytes of disp32 and/or imm32 there are.)
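A minimal Python sketch of "skip prefixes until the first non-prefix byte", assuming 32-bit mode (so no REX prefixes) and the usual set of legacy prefix bytes. `LEGACY_PREFIXES` and `split_prefixes` are hypothetical names for this example. It only finds where the opcode starts; finding where the instruction ends still requires decoding the opcode and ModRM.

```python
# Legacy prefix bytes in 32-bit mode: lock, repne, rep/repe,
# six segment overrides, operand-size and address-size overrides.
LEGACY_PREFIXES = frozenset(
    {0xF0, 0xF2, 0xF3, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0x66, 0x67}
)
MAX_INSTRUCTION_LENGTH = 15  # longer encodings fault with #UD


def split_prefixes(code: bytes, start: int = 0):
    """Return (prefix_bytes, opcode_offset) for the instruction at `start`."""
    pos = start
    while pos < len(code) and code[pos] in LEGACY_PREFIXES:
        pos += 1
        # 15 prefix bytes leave no room for an opcode inside the limit.
        if pos - start >= MAX_INSTRUCTION_LENGTH:
            raise ValueError("no opcode within 15 bytes")
    if pos >= len(code):
        raise ValueError("ran out of bytes before reaching an opcode")
    return code[start:pos], pos


print(split_prefixes(bytes.fromhex("f30000")))  # (b'\xf3', 1)
print(split_prefixes(bytes.fromhex("66f3a5")))  # (b'f\xf3', 2)
```

For the bytes from the question, the opcode is the 00 at offset 1, so f3 is never decoded as an instruction by itself.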

Happy Jerry Over a year ago

@PeterCordes So the delimiter for an instruction may look like: current_position == prefix AND current_position - 1 != prefix. Once these conditions are met, I could assume that I've reached the end of an instruction?
