Source Link
user555045

Sneaky AVX512 dependence

Note that _mm256_loadu_epi8 and such ("typed integer loads" or whatever you want to call them) are AVX512. It looks like this code is aimed at AVX2 though, and the corresponding AVX2 load is _mm256_loadu_si256, regardless of the element size. _mm256_loadu_epi8 and friends probably exist (I think) because there are masked versions of them; the element size matters for the masked versions, but not for the "basic" version that just loads 32 bytes.

Passing vectors by const-reference

That's unnecessary. Think of vectors as "slightly bigger integers": they fit in registers, and you can copy them almost for free. Passing them by const-reference is fine as long as the compiler optimizes away the indirection, but it's not really a good thing, more of a "not necessarily a bad thing, when the compiler cooperates".

Passing uint32_t by const-reference is even more unusual, and once again the best you can hope for is that the compiler doesn't do what you told it to do, which is not a great position to be in.

reinterpret_cast<vec256f>

Does that work? The usual way to express vector reinterpretation is with the "cast" family of intrinsics such as _mm256_castsi256_ps.

__builtin_ctz

__builtin_ctz is fine, but you may like to know that as of C++20 there is a <bit> header that defines std::countr_zero. Or, looking at your user name, perhaps that's not your preference ;)

Head and tail handling

This is a common problem when working with SIMD, and there is nothing inherently wrong with using some scalar code at the start and end, but it's not the only option.

You can also consider one of these tricks:

  • Use an unaligned load at the start. Once upon a time unaligned loads were really quite bad, worth avoiding even at significant cost, but not today, and there would be only a couple of them here. You're not stuck with unaligned loads throughout, because it's fine if the first aligned(*) load (in the main loop) partially overlaps the unaligned load. There is some wasted work there, but also the potential to still be faster than a scalar loop.
    By the way, be careful: this technique does not handle an input array that's shorter than a vector.
  • Similarly, use an unaligned load at the end, partially overlapping with the last aligned load.
  • You can use aligned loads, but then ignore/discard the data that was loaded from before the start of the input and after the end of the input. Be careful with zeroing out the invalid parts when the value you're searching for is zero: the zeroed-out garbage would look like a match.

*: by "aligned load" I mean that the address is aligned. Back in the old days, the unaligned load instruction was always slow, even if the address was actually aligned. That hasn't been the case for over a decade. A load can be considered aligned if the address is aligned; the type of instruction doesn't really matter. Some compilers refuse to emit the aligned instruction even when you use its intrinsic, opting to always emit the unaligned instruction.
