
Say I have a structure like this:

struct tmp {
    unsigned char arr1[10];
    unsigned char arr2[10];
    int  i1;
    int  i2;
    unsigned char arr3[10];
    unsigned char arr4[10];
};

Which of these would be faster?

(1) Memset entire struct to 0 and then fill members as:

struct tmp t1;
memset(&t1, 0, sizeof(struct tmp));

t1.i1 = 10;
t1.i2 = 20;
memcpy(t1.arr1, "ab", sizeof("ab"));
// arr2, arr3 and arr4 will be filled later.

OR

(2) Memset separate variables:

struct tmp t1;
t1.i1 = 10;
t1.i2 = 20;
memcpy(t1.arr1, "ab", sizeof("ab"));

memset(t1.arr2, 0, sizeof(t1.arr2)); // will be filled later
memset(t1.arr3, 0, sizeof(t1.arr3)); // will be filled later
memset(t1.arr4, 0, sizeof(t1.arr4)); // will be filled later

Purely in terms of performance, are multiple calls to memset (on separate members of a structure) faster or slower than a single call to memset (on the entire structure)?

  • One global memset is slightly faster (unless you have hundreds of non-array fields), but you might also consider just t1.arr1[0] = '\0';. If the memory image is saved, a global memset is needed to zero the gaps between fields, which would otherwise contain garbage. Commented Apr 24, 2020 at 8:54
  • Null terminator is not required in my case since it's network bytes I'm dealing with. Memory image is not being saved, but yes, it could be garbage that can be read by some other function. Commented Apr 24, 2020 at 8:58
  • sizeof("ab") counts the null terminator though Commented Apr 24, 2020 at 9:11
  • If the initialization needs to happen only once, the fastest is to do it at compile time with a definition like struct tmp t1 = { .i1 = 10, .i2 = 20, .arr1 = { 'a', 'b' } };. Most compilers would then place t1 in an initialized data section. Commented Apr 24, 2020 at 10:45
  • @Jens I see, thank you for the idea! Commented Apr 24, 2020 at 13:32

1 Answer


It isn't really meaningful to discuss this without a specific system in mind, nor is it fruitful to ponder these things unless you actually have a performance bottleneck. I can still give it a try.

For a "general computer", you would have to consider:

  • Aligned access
    Accessing a chunk of data in one go is usually better. In case of potential misalignment, the overhead code to deal with that is roughly the same no matter how large the data is. Assuming theoretically that all access in this code happens to be misaligned, then 1 memset call is better than 3.

    Also, we can assume that the first member of a struct is aligned, but we cannot assume that for any individual member inside the struct. The compiler will place the struct at an aligned address, then potentially add padding between members to keep each one aligned.

    Your struct has been declared without any explicit consideration of alignment, so padding could be a concern here - although for this particular layout the int members happen to land on 4-byte-aligned offsets (20 and 24), so a typical compiler inserts no padding at all.

    (On the other hand, a memset on the whole struct will also overwrite padding bytes, which is a tiny bit of overhead code.)

  • Data cache use
    Accessing an area of adjacent memory from top to bottom is much more "cache-friendly" than accessing fragments of it from multiple places in your code. Sequential access of contiguous memory means that the computer can load a lot of data into cache in one go, instead of fetching each piece from RAM, which is slower.

  • Instruction cache use and branch prediction
    Not very relevant in this case, since the code is basically just doing raw copies and doing so branch-free.

  • The amount of machine instructions generated
    This is always a good, rough indication of how fast the code is. Obviously some instructions are a lot slower than others, but fewer instructions very often means faster code. Disassembling your two functions with gcc x86_64 -O3, I get this:

    func1:
        movabs  rax, 85899345930
        pxor    xmm0, xmm0
        movups  XMMWORD PTR [rdi+16], xmm0
        mov     QWORD PTR [rdi+20], rax
        mov     eax, 25185
        movups  XMMWORD PTR [rdi], xmm0
        movups  XMMWORD PTR [rdi+32], xmm0
        mov     WORD PTR [rdi], ax
        ret
    
    func2:
        movabs  rax, 85899345930
        xor     edx, edx
        xor     ecx, ecx
        xor     esi, esi
        mov     QWORD PTR [rdi+20], rax
        mov     eax, 25185
        mov     WORD PTR [rdi], ax
        mov     BYTE PTR [rdi+2], 0
        mov     QWORD PTR [rdi+10], 0
        mov     WORD PTR [rdi+18], dx
        mov     QWORD PTR [rdi+28], 0
        mov     WORD PTR [rdi+36], cx
        mov     QWORD PTR [rdi+38], 0
        mov     WORD PTR [rdi+46], si
        ret
    

    This is a pretty good indication that the former code is more efficient, and it should also be more data cache-friendly, so it would surprise me if (1) isn't significantly faster.

Also note that if you declared this struct with static storage duration, you would "outsource" the zeroing to the CRT start-up code, which clears .bss before main() is even called. Then none of these memset calls would be needed. At the expense of slightly slower start-up, but a faster program overall.


2 Comments

This is for a "generic computer", x86-like. If you'd done the same exercise on a low-end 8-bit MCU, then only the machine code would matter. Alignment would be a non-issue and a cache wouldn't be available.
I see, thank you for all this info. I guess optimising this will not be as meaningful as optimising the actual algorithm.
