I'm trying to learn Assembly language. I noticed that it's totally different compared to high-level programming languages like Java.

So I read that a data transfer instruction follows this syntax:

mnemonic destination, source

which I read as destination = source. In other words, an assignment of a value to memory.

I saw this data segment declaration as an example in the book:

.data 
var1 SBYTE -4,-2,3,1 
var2 WORD 1000h,2000h,3000h,4000h 
var3 SWORD -16,-42 
var4 DWORD 1,2,3,4,5

How come there is more than one value for each variable? What exactly does that mean?

I'd appreciate any explanation.

Thanks.

  • It means assemble that list of bytes, or list of DWORDs, into the output file. Maybe you should keep reading the book that contains the example, since it probably explains it very soon after. Commented Oct 25, 2016 at 8:57

2 Answers


To define multiple WORD-sized variables in assembly we can use

var1 WORD 1000h
var2 WORD 2000h
var3 WORD 3000h
var4 WORD 4000h 

Often the programmer doesn't need to name every single variable: naming just the first one is enough, and pointer arithmetic reaches the others.
In this case we can use

var1 WORD 1000h
     WORD 2000h
     WORD 3000h
     WORD 4000h 

This is especially handy when the variables have different sizes. When they all share one size, repeating the keyword WORD is annoying, and the declaration can be simplified into the final form

var1 WORD 1000h, 2000h, 3000h, 4000h

This is equivalent to the second form and (names aside) to the first form.
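To see why the forms are equivalent, here is a rough Python sketch (a model, not real assembler output) of the bytes a WORD directive emits. The helper name assemble_words is made up for illustration, and the sketch assumes 16-bit little-endian words as on x86.

```python
import struct

def assemble_words(values):
    """Model of `WORD v1, v2, ...`: emit each value as 2 little-endian bytes."""
    return b"".join(struct.pack("<H", v) for v in values)

# One WORD directive per line: the bytes are simply emitted one after another.
separate = (assemble_words([0x1000]) + assemble_words([0x2000])
            + assemble_words([0x3000]) + assemble_words([0x4000]))

# A single directive with a comma-separated list.
combined = assemble_words([0x1000, 0x2000, 0x3000, 0x4000])

print(separate == combined)  # both spellings produce the same byte stream
print(combined.hex())
```

Only the symbol names differ between the forms; the memory content is the same.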


2 Comments

Thanks for the example and explanation. It's just a bit different from a language like Java, which I originally learned, where most of the time each variable is given its own name, like in your first example. I appreciate it. Thanks. :)
I understand it now.

I wouldn't even use the word "variable" much in Assembly.

You can certainly think like that, and it will mostly work, but technically it's more low level.

var1 DWORD 12345678h assembles the value 12345678h into four bytes (on little-endian x86 the bytes are 78 56 34 12). Those bytes "land" somewhere in the .data segment, which becomes the content of memory after the OS loads the executable. The OS picks some free memory for it, so it also determines the starting address of the .data segment, and it adjusts the loaded code to reflect the real addresses after the binary is loaded and before it is executed.

That means the bytes 78 56 34 12 will be available at some particular address (memory on x86 is byte-addressable).

And var1 becomes a "symbol" in the symbol table, marking the address of the first of those four bytes.

Then you can write Assembly instructions like mov cl,[var1], which means "load cl with the byte from memory at the address of symbol var1". This instruction is marked in the executable (with a relocation entry), and the OS patches it with the real address, so that during execution it points to the memory where the .data segment was actually loaded.

Or, when relative addressing is used as in x86_64, the mov instruction is assembled as mov cl,[rip+offset_between_next_instruction_and_var1]. Then the OS doesn't need to adjust the instruction at all: because the offset is relative, it works at any memory location.

Loading a BYTE from memory at address [var1] will load 78h, an example of a manipulation that breaks "variable" thinking. And loading a WORD from [var1+1] will load the value 3456h (on some non-x86 platforms an unaligned memory access may crash the program; on x86 it works, with only a possible performance penalty).
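The two loads above can be checked with a small Python model. This is only a sketch: the bytes object stands in for the .data segment, var1 is assumed to sit at offset 0, and load is a hypothetical helper, not anything the assembler provides.

```python
import struct

# var1 DWORD 12345678h, stored little-endian as on x86.
data = struct.pack("<I", 0x12345678)
var1 = 0  # assumed offset of the symbol inside this tiny model segment

def load(mem, addr, size):
    """Read `size` bytes at `addr` and interpret them as a little-endian integer."""
    return int.from_bytes(mem[addr:addr + size], "little")

byte_at_var1 = load(data, var1, 1)             # like mov cl,[var1]
word_at_var1_plus_1 = load(data, var1 + 1, 2)  # like mov cx,[var1+1]

print(data.hex())
print(hex(byte_at_var1), hex(word_at_var1_plus_1))
```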

Then var1 DWORD 1,2,3,4 just means you assemble more byte values after the var1 address, 01 00 00 00 02 00 00 00 ... in this case. And you can work with them by addressing them as in mov eax,[var1 + 2*4], which loads the var1[2] value (in Java-like array notation), which is 3, into eax.
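As a rough illustration of that address arithmetic, the following Python sketch models the four DWORDs as raw bytes and reads element 2 through a byte offset. The names load_dword and ELEMENT_SIZE are made up for the example, and var1 is assumed to start at offset 0.

```python
import struct

# var1 DWORD 1,2,3,4 -> sixteen bytes, each value little-endian.
data = struct.pack("<4I", 1, 2, 3, 4)
var1 = 0          # assumed offset of the symbol in this model
ELEMENT_SIZE = 4  # a DWORD is 4 bytes

def load_dword(mem, addr):
    """Read 4 bytes at `addr` as an unsigned little-endian integer."""
    return struct.unpack_from("<I", mem, addr)[0]

# Equivalent of mov eax,[var1 + 2*4]
eax = load_dword(data, var1 + 2 * ELEMENT_SIZE)

print(data[:8].hex())
print(eax)
```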

Note that var1 is just a memory address, so you can even address data from var2 through it if you add the right offset. That is why unintentionally overwriting other variables is so easy in Assembly: for example, it's enough to write a DWORD into a BYTE variable by mistake, and you have already overwritten 3 bytes of memory beyond the variable, probably used by some other variable.
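A minimal Python model of that accident, assuming a made-up layout in which var1 is one byte followed directly by four bytes belonging to var2 (the values 11h, 22h, ... are arbitrary):

```python
import struct

# Model segment: var1 BYTE 11h, then var2 BYTE 22h,33h,44h,55h right behind it.
segment = bytearray([0x11, 0x22, 0x33, 0x44, 0x55])
var1, var2 = 0, 1  # assumed symbol offsets in this model

# Reading var2's first byte "through" var1 by adding an offset.
via_var1 = segment[var1 + 1]

# The mistake: storing a DWORD at the address of a BYTE variable.
struct.pack_into("<I", segment, var1, 0xAABBCCDD)

print(hex(via_var1))
print(segment.hex())
```

After the store, only the last byte of var2 still holds its original value; the first three were overwritten along with var1.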

Also note that you always have to do the *1, *2, *4, ... scaling of the index manually when accessing arrays! So you must always be aware of the size of the array element.

The basic power-of-two sizes (1, 2, 4, 8) can be encoded directly into the extended addressing mode of an x86 instruction, as in mov eax,[ebx + esi*4 - 44], which addresses a dword array at address ebx with index "esi-11". In Java-like terms it would resemble eax = ebx[esi-11];, saving you from calculating the multiplication separately.
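The arithmetic behind that addressing mode can be sketched in Python. The function effective_address and the register values are hypothetical; the point is only that base + index*4 - 44 is the same address as element index-11 of a dword array at base.

```python
def effective_address(base, index, scale, disp):
    """Model of the x86 [base + index*scale + disp] address computation."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only encodes scale factors 1, 2, 4 and 8")
    return base + index * scale + disp

ebx, esi = 0x2000, 13  # arbitrary example register values

addr = effective_address(ebx, esi, 4, -44)  # [ebx + esi*4 - 44]
same = ebx + (esi - 11) * 4                 # address of ebx[esi-11]

print(hex(addr), hex(same))
```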

This is another common source of bugs: forgetting that the "index" equals the "memory byte offset" only when the array element is a single byte in size. In all other cases you have to multiply the index by the element size to get the byte offset for memory addressing.
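The following Python sketch shows what that bug looks like on a small dword array. The values 10, 20, 30, 40 and the helper load_dword are invented for the example.

```python
import struct

# Model of: arr DWORD 10,20,30,40
data = struct.pack("<4I", 10, 20, 30, 40)

def load_dword(mem, byte_offset):
    """Read 4 bytes at `byte_offset` as an unsigned little-endian integer."""
    return struct.unpack_from("<I", mem, byte_offset)[0]

index = 2
correct = load_dword(data, index * 4)  # index scaled by element size
buggy = load_dword(data, index)        # index misused as a byte offset

print(correct)
print(hex(buggy))
```

The buggy load straddles two elements: it picks up the upper two bytes of the first dword and the lower two bytes of the second.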

Finally, when you write those things into the .data segment, they get assembled sequentially, as a concatenated stream of bytes (check your assembler's documentation for any automatic padding by particular directives, such as dword inserting padding bytes ahead as needed to align the resulting data). So you don't actually need any var1 there: if you want to go hardcore, you can calculate all the offsets on your own and address those bytes through .data + offset. It's possible (this is just a mental exercise to show what writing "more values" on a line means, not a recommendation :) ).
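Applied to the declarations from the question, that mental exercise might look like the Python sketch below. It assumes the assembler inserts no padding between the directives (an assumption; check your assembler), and the directives table and symbols dictionary are just a model of what the assembler tracks.

```python
import struct

# (symbol, struct format code, values) for each line of the question's .data.
# "b" = signed byte, "H" = unsigned 16-bit, "h" = signed 16-bit, "I" = unsigned 32-bit.
directives = [
    ("var1", "b", [-4, -2, 3, 1]),
    ("var2", "H", [0x1000, 0x2000, 0x3000, 0x4000]),
    ("var3", "h", [-16, -42]),
    ("var4", "I", [1, 2, 3, 4, 5]),
]

segment = bytearray()
symbols = {}
for name, code, values in directives:
    symbols[name] = len(segment)  # the symbol marks where its first byte lands
    segment += struct.pack("<%d%s" % (len(values), code), *values)

# "Hardcore" access without the symbol: third DWORD of var4 at offset 16 + 2*4.
third_dword = struct.unpack_from("<I", segment, 16 + 2 * 4)[0]

print(symbols)
print(len(segment), third_dword)
```

Each name ends up as nothing more than the offset of its first byte; the remaining values on the line simply follow it in memory.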


edit: "irvine" ... so you are probably using MASM as your assembler? That one keeps not only the address symbol around during assembly, but also remembers the size of the first declaration (like "DWORD"), so it tries to cover some use cases with a more "variable-like" approach.

I personally would suggest you ignore that, think of them only as addresses, and avoid all the MASM quirk syntax, as 1) it doesn't work in other x86 assemblers and 2) it can be quite confusing in larger sources once you get used to low-level assembly.

I mean, in MASM mov eax,var1 is assembled into machine code as mov eax,[address_of_var1] (load eax with the memory content at var1, i.e. "load eax with variable var1" from a human perspective).

But when I read an instruction in source without visible [], I'm used to thinking it's not accessing memory and is working only with registers or immediates (as in mov eax,esi vs mov eax,[esi]). Even in MASM you can write mov eax,[var1]; it will work too. But extracting the address of var1 itself requires extra syntax, like mov eax,OFFSET var1 IIRC.


final note: those WORD 1,2,3,4 definitions are usually used in places where in Java you would use an array, like short wordArrayVar[] = {1, 2, 3, 4};. That's one possible interpretation of those multi-value definitions. But in some cases they are used at an even lower level, just defining particular byte values in the .data segment, not to be used as an array but in some different way.

One other common pattern is the initialization of a "structure" instance. In Java there's no good example, as class member variables are not guaranteed to be stored in memory one after another. But in C++ all the class/struct member variables written in source can be imagined as a byte-by-byte piece of memory, with each member variable having a particular offset and alignment decided by its type and position in the source. At that point you can create a pre-initialized instance of such a structure by defining values for each byte, in a block of the size of the full structure.
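As a hedged illustration of that pattern, the Python sketch below lays out one instance of an invented C-style struct { uint8_t kind; uint16_t count; uint32_t id; } byte by byte. The field names, values, and the single explicit padding byte after kind are assumptions chosen for the example; a real compiler decides the actual layout.

```python
import struct

# "<" = little-endian with no implicit padding, "B" = uint8, "x" = one pad byte,
# "H" = uint16, "I" = uint32. The pad byte keeps `count` on an even offset.
LAYOUT = "<BxHI"
OFFSET_KIND, OFFSET_COUNT, OFFSET_ID = 0, 2, 4

# The pre-initialized instance, as it could be written out in a data segment.
instance = struct.pack(LAYOUT, 7, 0x0102, 0xDEADBEEF)

# Reading one member back through its offset.
count = struct.unpack_from("<H", instance, OFFSET_COUNT)[0]

print(struct.calcsize(LAYOUT))
print(instance.hex())
```

In assembly the same block would be a short run of BYTE, WORD and DWORD values under one label.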
