
I'm trying to measure the write bandwidth of my memory. I created an 8 GB char array and call memset on it with 128 threads. Below is the code snippet.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> /* needed for int64_t and uint8_t */
#include <time.h>
#include <string.h>
#include <pthread.h>
int64_t char_num = 8000000000;
int threads = 128;
int res_num = 62500000;

uint8_t* arr;

static inline double timespec_to_sec(struct timespec t)
{
    return t.tv_sec * 1.0 + t.tv_nsec / 1000000000.0;
}

void* multithread_memset(void* val) {
    int thread_id = *(int*)val;
    /* cast to int64_t: res_num * thread_id overflows a 32-bit int for large thread ids */
    memset(arr + ((int64_t)res_num * thread_id), 1, res_num);
    return NULL;
}

void start_parallel()
{
    int* thread_id = malloc(sizeof(int) * threads);
    for (int i = 0; i < threads; i++) {
        thread_id[i] = i;
    }
    pthread_t* thread_array = malloc(sizeof(pthread_t) * threads);
    for (int i = 0; i < threads; i++) {
        pthread_create(&thread_array[i], NULL, multithread_memset, &thread_id[i]);
    }
    for (int i = 0; i < threads; i++) {
        pthread_join(thread_array[i], NULL);
    }
}

int main(int argc, char *argv[])
{
    struct timespec before;
    struct timespec after;
    double time = 0;
    arr = malloc(char_num);

    clock_gettime(CLOCK_MONOTONIC, &before);
    start_parallel();
    clock_gettime(CLOCK_MONOTONIC, &after);
    double before_time = timespec_to_sec(before);
    double after_time = timespec_to_sec(after);
    time = after_time - before_time;
    printf("elapsed = %10.8f\n", time);
    return 0;
}

According to the output, it took 0.6 seconds to finish all the memset calls, which, to my understanding, implies a memory write bandwidth of 8 GB / 0.6 s ≈ 13 GB/s. However, I have 2667 MHz DDR4, which should have a 21.3 GB/s bandwidth. Is there anything wrong with my code or my calculation? Thanks for any help!!

  • You are assuming that all threads run on different CPUs and that all threads are CPU bound. But also, you have provided only one decimal place of precision, so 0.6 might be anything from 0.550 to 0.649, i.e. anything between 12.3 GB/s and 14.5 GB/s. Measuring to only one decimal place thus gives over 2 GB/s of variation. Commented Sep 28, 2021 at 0:09
    For one thing, memset won't do only write cycles. The first write instruction in each cache line will necessarily read that line into cache, because the CPU doesn't know you will later overwrite all of it. Commented Sep 28, 2021 at 1:02
    Also, 128 threads is a lot, unless you have 128 cores. The time spent context switching between them may be significant. Commented Sep 28, 2021 at 1:05
  • 8e9 bytes is not 8G. 8G is 8*1024*1024*1024 Commented Sep 28, 2021 at 1:18
    If you want to prevent reading the cache line into the CPU cache, you may want to take a look at non-temporal writes. You don't have to write assembler code for this. You can also use compiler intrinsics. Commented Sep 28, 2021 at 1:45

1 Answer


TL;DR: Measuring memory bandwidth is not easy. In your case, the performance problem probably comes from page faults.

If you want to measure the memory write bandwidth, you need to be careful about multiple things:

  • On Intel/AMD x86 platforms, a memory write to a location that is not in the cache causes a write allocate: the data at the missed write location is loaded into the cache first. See this page for more information. This strategy enables the processor to fill the part of the cache line that is not written, in order to keep the CPU caches consistent. However, it also means that half of the memory throughput is "wasted". In practice, the situation is even worse, because the interleaved memory reads and writes often introduce additional overhead. One way to avoid this is to use non-temporal write instructions. With SSE, you can use the _mm_stream_* intrinsics (typically _mm_stream_si128); with AVX, the _mm256_stream_* intrinsics (typically _mm256_stream_si256). Note that such instructions are only worthwhile if the data chunks do not fit in the cache or are not reused soon afterwards. A good libc implementation should use such instructions in memset and memcpy for big chunks.

  • Most operating systems do not actually map allocated pages to physical ones at allocation time: the memory is only allocated virtually, not physically. The first touch of an allocated memory page causes a page fault, which is fairly expensive; the full page is generally mapped physically at that point and, on most systems, zeroed for security reasons. To measure the memory throughput, you should not include this overhead in the benchmark: pre-allocate the memory and write into the memory chunks ahead of time (ideally with random values).

  • CPU caches can be quite big; the buffer written should be much larger than they are, so that you are not measuring the throughput of the caches themselves (cache associativity can also come into play).

  • One thread is often not enough to saturate the bandwidth of the main memory; a few are usually needed to reach the optimal throughput (this is very platform-dependent, and many threads are often needed on server processors such as Intel Xeons). With too many threads, complex effects (e.g. contention) can appear and reduce the overall throughput.

  • On NUMA systems, memory accesses are generally faster when a core accesses its own memory. This means threads should be pinned to cores and should read/write into per-thread buffers to achieve the best throughput. This is especially true, for example, on AMD Ryzen desktop/server processors and on dual-socket server systems.

  • Modern processors often run at a variable frequency (see frequency scaling). Moreover, threads can take some time to be created and actually started. Thus, it is important to loop multiple times over the same buffer, with a synchronization barrier between iterations, to minimize the bias these effects introduce. It is also important to check that the time taken by each thread is approximately the same (otherwise, some unwanted effect is at work, such as a NUMA one).

  • The amount of memory used should not be too big, since some operating systems use memory-compression strategies (e.g. zswap) to keep memory usage down; in the worst case, a swap device may even be used.

Note that you can use OpenMP to write the parallel code more easily (the resulting code will be smaller and easier to read). OpenMP also lets you control thread pinning and spawn the right number of threads for the target architecture. OpenMP is supported by most compilers, including GCC, Clang, ICC and MSVC (only version 2.0 of the standard for MSVC so far).
