
Say I have a structure like this:

struct tmp {
    unsigned char arr1[10];
    unsigned char arr2[10];
    int  i1;
    int  i2;
    unsigned char arr3[10];
    unsigned char arr4[10];
};

Which of these would be faster?

(1) Memset entire struct to 0 and then fill members as:

struct tmp t1;
memset(&t1, 0, sizeof(struct tmp));

t1.i1 = 10;
t1.i2 = 20;
memcpy(t1.arr1, "ab", sizeof("ab"));
// arr2, arr3 and arr4 will be filled later.

OR

(2) Memset separate variables:

struct tmp t1;
t1.i1 = 10;
t1.i2 = 20;
memcpy(t1.arr1, "ab", sizeof("ab"));

memset(t1.arr2, 0, sizeof(t1.arr2)); // will be filled later
memset(t1.arr3, 0, sizeof(t1.arr3)); // will be filled later
memset(t1.arr4, 0, sizeof(t1.arr4)); // will be filled later

Purely in terms of performance, are multiple calls to memset (on separate members of a structure) faster or slower than a single call to memset (on the entire structure)?

  • One global memset is slightly faster (unless you have hundreds of non-array fields), but you might also consider just t1.arr1[0] = '\0';. If the memory image is saved, a global memset is needed to zero the gaps between fields, which would otherwise contain garbage. Commented Apr 24, 2020 at 8:54
  • Null terminator is not required in my case since it's network bytes I'm dealing with. Memory image is not being saved, but yes, it could be garbage that can be read by some other function. Commented Apr 24, 2020 at 8:58
  • sizeof("ab") counts the null terminator though Commented Apr 24, 2020 at 9:11
  • If the initialization needs to happen only once, the fastest is to do it at compile time with a definition like struct tmp t1 = { .i1 = 10, .i2 = 20, .arr1 = { 'a', 'b' } };. Most compilers would then place t1 in an initialized data section. Commented Apr 24, 2020 at 10:45
  • @Jens I see, thank you for the idea! Commented Apr 24, 2020 at 13:32

1 Answer


It isn't really meaningful to discuss this without a specific system in mind, nor is it fruitful to ponder these things unless you actually have a performance bottleneck. I can still give it a try.

For a "general computer", you would have to consider:

  • Aligned access
    Accessing a chunk of data in one go is usually better. In case of potential misalignment, the overhead code to deal with that is roughly the same no matter how large the data is. Assuming theoretically that all access in this code happens to be misaligned, then 1 memset call is better than 3.

    Also, we can assume that the first member of a struct is aligned, but we cannot assume that for any individual member inside the struct. The compiler will place the struct at an aligned address, then potentially add padding between members to keep each one aligned.

    Your struct has been declared without any explicit consideration of alignment, so padding could be a concern here - although for this particular layout the int members happen to land on 4-byte-aligned offsets (20 and 24), so a typical compiler inserts no padding at all.

    (On the other hand, a memset on the whole struct will also overwrite padding bytes, which is a tiny bit of overhead code.)

  • Data cache use
    Accessing an area of adjacent memory from top to bottom is much more "cache-friendly" than accessing fragments of it from multiple places in your code. Sequential access of contiguous memory means that the computer can load a lot of data into cache in one go, instead of fetching each piece from RAM, which is slower.

  • Instruction cache use and branch prediction
    Not very relevant in this case, since the code is basically just doing raw copies and doing so branch-free.

  • The amount of machine instructions generated
    This is always a good, rough indication of how fast the code is. Obviously some instructions are a lot slower than others, but fewer instructions very often means faster code. Disassembling your two functions with gcc x86_64 -O3, I get this:

    func1:
        movabs  rax, 85899345930
        pxor    xmm0, xmm0
        movups  XMMWORD PTR [rdi+16], xmm0
        mov     QWORD PTR [rdi+20], rax
        mov     eax, 25185
        movups  XMMWORD PTR [rdi], xmm0
        movups  XMMWORD PTR [rdi+32], xmm0
        mov     WORD PTR [rdi], ax
        ret
    
    func2:
        movabs  rax, 85899345930
        xor     edx, edx
        xor     ecx, ecx
        xor     esi, esi
        mov     QWORD PTR [rdi+20], rax
        mov     eax, 25185
        mov     WORD PTR [rdi], ax
        mov     BYTE PTR [rdi+2], 0
        mov     QWORD PTR [rdi+10], 0
        mov     WORD PTR [rdi+18], dx
        mov     QWORD PTR [rdi+28], 0
        mov     WORD PTR [rdi+36], cx
        mov     QWORD PTR [rdi+38], 0
        mov     WORD PTR [rdi+46], si
        ret
    

    This is a pretty good indication that the former code is more efficient, and it should also be more data cache-friendly, so it would surprise me if (1) isn't significantly faster.

Also note that if you declared this struct with static storage duration, you would "outsource" the zeroing to the CRT start-up code, which clears .bss before main() is even called. Then none of these memset calls would be needed. At the expense of slightly slower start-up, but a faster program overall.


2 Comments

This is for a "generic computer", x86-like. If you'd done the same exercise on a low-end 8-bit MCU, then only the machine code would matter. Alignment would be a non-issue and a cache wouldn't be available.
I see, thank you for all this info. I guess optimising this will not be as meaningful as optimising the actual algorithm.
