I'm aware that with GCC inline assembly, unless you specify otherwise, the compiler assumes that you consume all your inputs before you write any output operand. If you actually want to write to an output operand before consuming all inputs, you must mark it as early-clobber (&) so that GCC doesn't reuse its register for an input.
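To make the rule concrete, here is a minimal sketch of a case where early-clobber matters on a plain output operand. The function name add_via_copy is made up for illustration, the asm is AT&T-syntax x86-64 only, and a plain C fallback is used on other targets. The output is written by the first instruction while the input b is still needed by the second, so without & GCC would be free to place out and b in the same register.

```c
#include <stdint.h>

/* Hypothetical helper: returns a + b, computed as "copy a, then add b". */
static uint64_t add_via_copy(uint64_t a, uint64_t b)
{
    uint64_t out;
#if defined(__x86_64__)
    __asm__ ("mov %1, %0\n\t"   /* out is written here ...                 */
             "add %2, %0"       /* ... but b (%2) is only consumed here    */
             : "=&r" (out)      /* & keeps out away from the input regs    */
             : "r" (a), "r" (b)
             : "cc");
#else
    out = a + b;                /* fallback so the sketch builds elsewhere */
#endif
    return out;
}
```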
My question arose when I saw this example from the authoritative reference:
void
dscal (size_t n, double *x, double alpha)
{
  asm ("/* lots of asm here */"
       : "+m" (*(double (*)[n]) x), "+&r" (n), "+b" (x) // <-- There's the "+&r" (n)
       : "d" (alpha), "b" (32), "b" (48), "b" (64),
         "b" (80), "b" (96), "b" (112)
       : "cr0",
         "vs32","vs33","vs34","vs35","vs36","vs37","vs38","vs39",
         "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47");
}
What? Why does it early-clobber an input/output operand? Isn't it the same register anyway?
There is no explanation of the matter on that page.
Digging further I found this, which states:
An operand which is read by the instruction can be tied to an earlyclobber operand if its only use as an input occurs before the early result is written. Adding alternatives of this form often allows GCC to produce better code when only some of the read operands can be affected by the earlyclobber. See, for example, the ‘mulsi3’ insn of the ARM.
Furthermore, if the earlyclobber operand is also a read/write operand, then that operand is written only after it’s used.
That last sentence covers the +&r case, but I honestly don't get what it says: I don't know what "used" means here.
A quick grep -r '+&' on the Linux kernel yielded very few results, and only one file where it is used in x86 code (the only architecture I'm somewhat familiar with), arch/x86/crypto/curve25519-x86_64.c:
/* Computes the addition of four-element f1 with value in f2
 * and returns the carry (if any) */
static inline u64 add_scalar(u64 *out, const u64 *f1, u64 f2)
{
    u64 carry_r;

    asm volatile(
        /* Clear registers to propagate the carry bit */
        " xor %%r8d, %%r8d;"
        " xor %%r9d, %%r9d;"
        " xor %%r10d, %%r10d;"
        " xor %%r11d, %%r11d;"
        " xor %k1, %k1;"

        /* Begin addition chain */
        " addq 0(%3), %0;"
        " movq %0, 0(%2);"
        " adcxq 8(%3), %%r8;"
        " movq %%r8, 8(%2);"
        " adcxq 16(%3), %%r9;"
        " movq %%r9, 16(%2);"
        " adcxq 24(%3), %%r10;"
        " movq %%r10, 24(%2);"

        /* Return the carry bit in a register */
        " adcx %%r11, %1;"
        : "+&r"(f2), "=&r"(carry_r)
        : "r"(out), "r"(f1)
        : "%r8", "%r9", "%r10", "%r11", "memory", "cc");

    return carry_r;
}
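For readers less familiar with the ADX instructions, here is a rough plain-C sketch of what add_scalar appears to compute, based only on its comment and instruction sequence. The name add_scalar_ref and the use of uint64_t instead of the kernel's u64 are assumptions made for this sketch; it is not kernel code. The point relevant to the question is that in the asm f2 (%0) is overwritten by the first addq, while out (%2) and f1 (%3) are still read afterwards.

```c
#include <stdint.h>

/* Hypothetical reference version: out = f1 + f2 over four 64-bit limbs,
 * least significant limb first; returns the final carry (0 or 1). */
static uint64_t add_scalar_ref(uint64_t out[4], const uint64_t f1[4],
                               uint64_t f2)
{
    uint64_t carry = f2;            /* the scalar enters at limb 0 */

    for (int i = 0; i < 4; i++) {
        uint64_t sum = f1[i] + carry;

        carry = sum < carry;        /* 1 if the addition wrapped around */
        out[i] = sum;
    }
    return carry;
}
```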
I really don't get why using +r wouldn't be enough.
f2 and f1 are known by the compiler to contain the same value? Can it use the same register for both? That might work (thus saving a register) if f1 is only used before f2 gets written. But if that can't be guaranteed, earlyclobber ensures they use separate registers.
+& made a difference. These sorts of details seem fuzzy and not well known by many. For anyone interested I found this thread concerning my question.
By the way, why not make that an answer? It answered my question perfectly!
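The same-value scenario from the comments can be sketched as follows. The function add_twice is hypothetical and x86-64 only (with a C fallback elsewhere). Here acc is a read/write operand that gets written by the first add, and x is read again afterwards. As far as I understand, with a plain "+r" GCC would be allowed to give acc and x the same register when it knows they hold the same value, as in a call like add_twice(v, v), and the second add would then read an already modified register. With "+&r" the two must live in different registers.

```c
#include <stdint.h>

/* Hypothetical helper: returns acc + 2 * x using two separate adds. */
static uint64_t add_twice(uint64_t acc, uint64_t x)
{
#if defined(__x86_64__)
    __asm__ ("add %1, %0\n\t"   /* acc is written here ...               */
             "add %1, %0"       /* ... and x is read again afterwards    */
             : "+&r" (acc)      /* & forbids sharing a register with x   */
             : "r" (x)
             : "cc");
#else
    acc = acc + x + x;          /* fallback so the sketch builds elsewhere */
#endif
    return acc;
}
```

With separate registers, add_twice(5, 5) yields 15. If both operands shared one register, the same two instructions would double it twice and produce 20 instead.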