
Consider the following snippet:

ldr q0, [x0]
cmeq v0.16b, v0.16b, #0
shrn v0.8b, v0.8h, #4
fcmp d0, #0.0

This is a common way to implement functions such as strlen with SIMD. According to the Arm Architecture Reference Manual (version L.b), fcmp can generate an Invalid Operation floating-point exception if d0 is a signaling NaN. A binary64 signaling NaN has bits 62-52 all set, bit 51 clear, and at least one other mantissa bit set. Because this way of using shrn generates bands of 4 equal bits, bit 52 and bit 51 can end up with different values, so it is possible to have a signaling NaN in d0.

I checked on Linux Arm64: the IOC bit of the FPSR register is set after performing fcmp with a signaling NaN (and it is clear before). However, it did not raise an exception or crash the program. Is this a characteristic of Linux? Could this code raise an exception on some other OS or distro? If so, is it really portable to use these instructions to implement functions such as strlen?

  • Note that as discussed in this question, the approach fails to work if FTZ mode is set. So unless you have full control over the FP environment, don't use it. Commented Nov 7 at 10:07

1 Answer


This is unsafe if you don't have full control of the FP environment, both because of exception masking and because it breaks in fast-math mode. FTZ mode treats subnormals (aka denormals) as 0.0 on input to all FP ops, including compare. It would ignore zero bytes near the start of a vector when the later bytes are all non-zero, because then the compare mask, viewed as a binary64, has an all-zero exponent field. You'd also have to check that no AArch64 CPUs are really slow with subnormal compares in non-FTZ mode; some really old Intel CPUs do take microcode assists for that (e.g. Core 2 from 2006).

(Programs linked with gcc -ffast-math will set FTZ mode in their CRT startup code.)

It's potentially also slower than fmov x0, d0 / cbz integer compare; getting data forwarded from the FP execution units to integer should be about the same cost whether that's the flags result of a compare or a 64-bit register value. An FP compare takes more cycles than an integer compare/test, as you can see from the latency of SIMD packed-compare instructions that produce a mask in a vector register.

(Update: @fuz says this fcmp / branch is more efficient in general, especially on some CPUs which have slow FP->integer data transfer. But of course you need full control of the FP environment to make it safe. So not usable in a library strlen, but potentially in your own programs.)

If you want to know the exact length down to the byte rather than just somewhere in a 16-byte vector, you need fmov / rbit / clz (the latter two insns finding the position of the lowest non-zero bit). Doing that fmov as part of your loop condition saves code-size later.


Your actual question about FP exceptions

FP exceptions are masked by default, so raising one only sets a bit in the FP environment as you found. Only if you unmask them is there an actual trap (branch to an exception handler in kernel code), which the kernel would handle and deliver SIGFPE. You can use glibc feenableexcept to unmask some or all exceptions.
(On AArch64, support for unmasked FP exceptions is optional; not all CPUs even support it. On x86 it's mandatory.)

For the same reason, dividing by zero (e.g. 1.0 / 0.0) silently produces +inf, and taking the square root of a negative number silently produces NaN, with the exceptions raised just setting bits in the FP status register.

Same for other FP exception types, like the inexact (precision) exception, which is raised any time the bits discarded by rounding weren't all zero.

Some operations like comparison don't normally trigger FP exceptions even with quiet-NaN as an input. Signalling NaN can create an FP exception even in operations that are normally "quiet", but doesn't override the exception-mask.


Fun fact: Glibc strlen doesn't use fcmp; it always uses fmov to a GPR for rbit / clz, or cmeq v0.8b, v0.8b, 0 / fmov/cbz to keep looping after using 2x uminp to pack 2 vectors (32 bytes) down to 8 bytes. https://codebrowser.dev/glibc/glibc/sysdeps/aarch64/multiarch/strlen_asimd.S.html#158
(startup for the first 32 bytes is done with scalar bithacks to make the short-string case fast.)


Anyway, if you care about not trapping in code that has unmasked some FP exceptions, and/or not raising spurious FP exceptions to pollute the FP environment, yes use fmov x0, d0 and cbz or whatever instead of fcmp / bne.

Reducing the compare mask down to 8 bytes means it can fit in an integer register, so that's a good option vs. treating it as an IEEE binary64 double FP value.

Code size is equal for fmov/cb[n]z vs. fcmp/bne to branch on it. Replacing fcmp/csel or similar costs 3 instructions, like fmov/tst/csel, but that wouldn't be normal as part of strlen or memcmp or similar loops.



7 Comments

Please also mention that this fails if FTZ mode is set.
Oh right! Yeah, that tips the balance from this being just kinda weird and probably inefficient to actually bad. Updated my answer to start with that. Do you happen to know if any AArch64 CPUs have microcode assists for subnormal FP compares the way Intel Core 2 did? (Later Intel doesn't take microcode assists for subnormal compares, only for some other cases of operations.)
Also, am I missing something or is this always just worse than fmov x0, d0/cbz x0 for branching on the result? I guess if you have no spare integer registers, but that seems pretty unlikely. I made that assumption in my answer.
In our experiments, we found fcmp to be a lot faster than fmov followed by cbz, as FP-to-scalar moves have a fairly high latency on some chips. No microcode assist was noticed. It's the better approach if you have control over the floating point environment.
Weird; on x86 CPUs, FP compares which set integer FLAGs include the same (low) FP-to-scalar transfer cost as movd/movq. It doesn't make much sense to me that there'd be a low-latency path for compare results but not for fmov. But it's possible they throw extra HW at it since branching on scalar FP compares is probably more common than fmov, and/or maybe there's some other effect going on.
On Cortex-A72 for instance, fmov x0, d0 is 5 cycles latency, whereas fcmp is quoted as 3.
Thanks. And that's not a CPU where slow SIMD->integer data transfer stalls all SIMD/FP progress while an fmov is pending; it can happen speculatively with OoO exec, so the scheduler can track an op that depends on an FP input and produces an integer output. Maybe integer flags have a different forwarding network from integer data, and fcmp's pipeline stages connect between them? I'm so used to x86, where most integer insns produce a FLAGS result, and Intel uses one integer PRF with room for both an integer and a FLAGS result. Although IIRC Zen has a separate FLAGS PRF.
