Using custom malloc implementation within MKL

Question

I am writing a program that uses Intel's MKL to do some matrix multiplications. I have a frustrating requirement that only a custom version of dynamic memory allocation is utilized. I'm aware this is usually considered a terrible idea, but I am using the linker's --wrap functionality to wrap malloc and free with my own custom implementation. In general, this has gone well so far.

However, it seems that some of the MKL code is performing dynamic allocation, and it is not invoking my custom malloc. I understand that MKL has also replaced the system malloc with its own custom malloc, but in my program, I am calling mkl_disable_fast_mm() which, to my understanding, should turn off the use of the MKL-custom malloc and revert to the system malloc. Now, since I've --wrapped the system malloc with my custom malloc, I was expecting to see my custom malloc called when MKL does its dynamic allocation.

When I run my program normally (as described above), I can see my custom malloc getting called everywhere that malloc is used except for the calls from inside MKL.

To add another level of complication, if I run the program with valgrind, then I do see my custom malloc invoked everywhere, including from within MKL. I realize that valgrind is ALSO replacing malloc with its own custom malloc, so there are several levels of malloc replacement going on in this case.

My question, then, is: how can I get MKL to call my custom malloc when it does dynamic allocation? It must be possible, since running under valgrind makes it happen, but I haven't been able to track down a way to do it without valgrind.

I put together a very minimal example that demonstrates what I tried to describe above:

#include <stdio.h>
#include <stdlib.h>

#include "mkl.h"

//With --wrap, the linker resolves __real_malloc and __real_free to the
//original functions, so they must be declared as functions rather than as
//function-pointer variables (they are unused in this minimal demo)
extern "C" void* __real_malloc(size_t);
extern "C" void __real_free(void*);

extern "C" void* __wrap_malloc(size_t numBytes)
{
  //Yes, this is a dumb way to do to this, keeping it minimal for demo
  static char heapSpace[10000000] = {0};
  static size_t heapOff = 0;

  fprintf(stderr, "In __wrap_malloc, cur offset: %zu requesting %zu!\n", heapOff, numBytes);

  void* heapLoc = heapSpace + heapOff;
  heapOff += numBytes;

  return heapLoc;
}

extern "C" void __wrap_free(void* ptrToFree)
{
  fprintf(stderr, "In __wrap_free!\n");
  //just a no-op for the minimal demo
}

int main()
{
  fprintf(stderr, "Disabling fast memory management for MKL in order to use system malloc and free instead\n");
  int disableFastMMReturnVal = mkl_disable_fast_mm();
  fprintf(stderr, "  --> Reports value of %d (1 should mean MKL memory management turned off successfully)\n", disableFastMMReturnVal);

  //Use malloc to allocate a small array of chars
  char* tmpPtr;
  tmpPtr = (char*)malloc(4 * sizeof(char));
  tmpPtr[0] = 'f'; tmpPtr[1] = 'o'; tmpPtr[2] = 'o'; tmpPtr[3] = '\0';

  //Use malloc to allocate another small array of chars
  char* diffPtr;
  diffPtr = (char*)malloc(3 * sizeof(char));
  diffPtr[0] = 'h'; diffPtr[1] = 'i'; diffPtr[2] = '\0';

  //See that data is as expected
  fprintf(stderr, "TEMPPTR: %s DIFFPTR: %s\n", tmpPtr, diffPtr);
  //Just a no-op for this demo, but see that the wrapped free gets called
  free(diffPtr);
  free(tmpPtr);

  //Now, set up a MKL matrix multiply call:
  const int M = 128;
  const int K = 128;
  const int N = 128;
  const float alpha = 1.0;
  const float beta = 0.0;

  float A[M * K];
  float B[K * N];
  float C[M * N];

  //Initialize the input matrices to known values
  for (int r = 0; r < M; r++)
    for (int c = 0; c < K; c++)
      A[r * K + c] = r * c;

  for (int r = 0; r < K; r++)
    for (int c = 0; c < N; c++)
      B[r * N + c] = r + c;

  fprintf(stderr, "START CALL TO cblas_sgemm\n");
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, alpha, A, K, B, N, beta, C, N);
  fprintf(stderr, "FINISHED CALL TO cblas_sgemm\n");

  //Print some values from the output to check consistent results from run to run
  fprintf(stderr, "[0][0]: %f\n", C[0 * N + 0]);
  fprintf(stderr, "[20][20]: %f\n", C[20 * N + 20]);
  fprintf(stderr, "[40][40]: %f\n", C[40 * N + 40]);
  fprintf(stderr, "[100][100]: %f\n", C[100 * N + 100]);

  return 0;
}

Here's the output when I run without valgrind:

# ./demo.exe 
Disabling fast memory management for MKL in order to use system malloc and free instead
  --> Reports value of 1 (1 should mean MKL memory management turned off successfully)
In __wrap_malloc, cur offset: 0 requesting 4!
In __wrap_malloc, cur offset: 4 requesting 3!
TEMPPTR: foo DIFFPTR: hi
In __wrap_free!
In __wrap_free!
START CALL TO cblas_sgemm
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000

And here's the output when I run with valgrind:

# valgrind ./demo.exe 
==487722== Memcheck, a memory error detector
==487722== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==487722== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==487722== Command: ./demo.exe
==487722== 
Disabling fast memory management for MKL in order to use system malloc and free instead
  --> Reports value of 1 (1 should mean MKL memory management turned off successfully)
In __wrap_malloc, cur offset: 0 requesting 4!
In __wrap_malloc, cur offset: 4 requesting 3!
TEMPPTR: foo DIFFPTR: hi
In __wrap_free!
In __wrap_free!
START CALL TO cblas_sgemm
In __wrap_malloc, cur offset: 7 requesting 4344376!
In __wrap_malloc, cur offset: 4344383 requesting 69664!
In __wrap_malloc, cur offset: 4414047 requesting 256!
In __wrap_free!
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000
==487722== 
==487722== HEAP SUMMARY:
==487722==     in use at exit: 0 bytes in 0 blocks
==487722==   total heap usage: 1 allocs, 1 frees, 72,704 bytes allocated
==487722== 
==487722== All heap blocks were freed -- no leaks are possible
==487722== 
==487722== For lists of detected and suppressed errors, rerun with: -s
==487722== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Notice that without valgrind, there are no calls to __wrap_malloc during the call to cblas_sgemm, but with valgrind, there are three calls to it.

EDIT #2: As @Andrew Henle suggested, there may be other allocation functions in play, so I added wrappers for calloc and realloc as well. With these two new wrappers, the output is exactly the same as shown above. Next, I ran nm on my executable (everything is linked statically), and I get the following:

nm -C demo.exe |& grep alloc
00000000004021f0 T cblas_xerbla_malloc_error
0000000000c7b008 D i_calloc
0000000000c7b000 D i_malloc
0000000000c7b010 D i_realloc
0000000001f94d40 b mkl_hbw_malloc_psize
00000000004438f0 T mkl_serv_allocate
000000000044a550 T mkl_serv_calloc
0000000000446030 T mkl_serv_deallocate
000000000044a660 T mkl_serv_jit_alloc
0000000000444b10 T mkl_serv_malloc
0000000000449860 T mkl_serv_realloc
0000000000443710 t mm_internal_malloc
0000000000442df0 t mm_internal_realloc
0000000001f94d50 b sys_alloc
0000000001f94d68 b sys_allocate
0000000001f94d70 b sys_deallocate
0000000001f94d58 b sys_realloc
0000000000401442 T __wrap_calloc
00000000004013e6 T __wrap_malloc
00000000004014b2 T __wrap_realloc
0000000001f8dd80 b __wrap_calloc::heapOff
0000000001604700 b __wrap_calloc::heapSpace
00000000016046e0 b __wrap_malloc::heapOff
0000000000c7b060 b __wrap_malloc::heapSpace
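
For illustration, wrappers that cover calloc and realloc alongside malloc and free could look roughly like the sketch below. This is not the code that produced the output above; it is a minimal sketch assuming a single shared arena, and the names kArenaSize, kAlign, arena, arenaOff, bumpAlloc and blockSize are hypothetical. Unlike the demo allocator, it returns 16-byte-aligned blocks, refuses requests that do not fit, and stores each block's size so that realloc can copy the old contents.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace {
constexpr std::size_t kArenaSize = 1u << 20;  // 1 MiB, an arbitrary demo size
constexpr std::size_t kAlign = 16;            // alignment of every returned block
alignas(16) unsigned char arena[kArenaSize];
std::size_t arenaOff = 0;                     // always a multiple of kAlign

// Bump allocation: the requested size is stored in a kAlign-byte header
// placed just before the returned block.
void* bumpAlloc(std::size_t n) {
  if (n > kArenaSize) return nullptr;
  std::size_t total = kAlign + (n + kAlign - 1) / kAlign * kAlign;
  if (total > kArenaSize - arenaOff) return nullptr;  // arena exhausted
  unsigned char* base = arena + arenaOff;
  arenaOff += total;
  std::memcpy(base, &n, sizeof n);
  return base + kAlign;
}

// Reads back the size recorded by bumpAlloc for a block it returned.
std::size_t blockSize(const void* p) {
  std::size_t n;
  std::memcpy(&n, static_cast<const unsigned char*>(p) - kAlign, sizeof n);
  return n;
}
}  // namespace

extern "C" void* __wrap_malloc(std::size_t n) { return bumpAlloc(n); }

extern "C" void __wrap_free(void*) { /* no-op: the arena is never reused */ }

extern "C" void* __wrap_calloc(std::size_t count, std::size_t size) {
  if (size != 0 && count > SIZE_MAX / size) return nullptr;  // overflow check
  void* p = bumpAlloc(count * size);
  if (p) std::memset(p, 0, count * size);
  return p;
}

extern "C" void* __wrap_realloc(void* old, std::size_t n) {
  void* p = bumpAlloc(n);
  if (p && old) {
    std::size_t oldSize = blockSize(old);
    std::memcpy(p, old, oldSize < n ? oldSize : n);  // keep the old contents
  }
  return p;
}
```

Each of these would need its own --wrap option on the link line (for example --wrap=calloc and --wrap=realloc in addition to the two already present).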

EDIT #3: My original post didn't have a small demo program, and I was anxious to update my post to include one, and in doing so, neglected to include the build information.

Here's the small Makefile:

MKL := /opt/intel/oneapi/mkl/2024.1

all:
    g++ -pthread -g demo.cpp -I ${MKL}/include ${MKL}/lib/libmkl_intel_lp64.a ${MKL}/lib/libmkl_sequential.a ${MKL}/lib/libmkl_core.a -ldl -Wl,--wrap=malloc,--wrap=free -o demo.exe

Environment and versions:

OS: CentOS Linux release 8.5.2111
g++: g++ (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1)
ld: GNU ld version 2.30-108.el8_5.1

I realize those versions are kind of strange, so I also ran in Docker using Ubuntu 22.04.4 with g++ 11.4.0 and ld 2.38 and the results are the same as shown above.

What's the full set of functions you're interposing? Functions such as realloc(), calloc(), posix_memalign(), valloc(), memalign(), aligned_alloc(), pvalloc() and probably others are all used by various implementations along with malloc() to dynamically allocate memory.
– Andrew Henle, Commented May 10, 2024 at 17:44
Ahh, that's a good point. Right now, I'm ONLY doing malloc and free. I should definitely try the others; I was hoping it wasn't necessary, but if it is, so be it. Maybe MKL uses realloc or something, and when valgrind replaces it, it uses regular malloc in its place? Something to think about. I just added a "small" demo to the question in case it helps or clarifies anything.
– daroo, Commented May 10, 2024 at 17:50
How is MKL provided? You can use nm or even strings -a to see the symbols it's using if it's a library. strings -a /path/to/.../libXXX.so | grep alloc would show most of the dynamic memory allocating symbols.
– Andrew Henle, Commented May 10, 2024 at 17:56
Another good point. I ran "nm -C" on my executable (since everything is linked in statically), and it looks like calloc and realloc might be used. I wrapped those as well using the same approach. When I build and run (with prints in the new wrapped functions), it produces exactly the same results as above, both without and with valgrind, so it doesn't look like those are getting called after all.
– daroo, Commented May 10, 2024 at 18:27

Employed Russian · Accepted Answer · 2024-05-14 00:24:49Z

From David Agans' book on the nine rules of debugging: quit thinking and look.

Step 0, replicate your output

MKL=/opt/intel/oneapi/mkl/2024.1
g++ -pthread -g mkl.c -I ${MKL}/include -static -L ${MKL}/lib \
  -lmkl_intel_lp64 -lmkl_intel_thread -liomp5 -lmkl_core -ldl \
  $(for f in malloc free; do echo -Wl,--wrap,$f; done)

./a.out
...
START CALL TO cblas_sgemm
In __wrap_malloc, cur offset: 1189 requesting 47!
In __wrap_malloc, cur offset: 1236 requesting 24!
In __wrap_free!
In __wrap_malloc, cur offset: 1260 requesting 51!
In __wrap_free!
...
In __wrap_malloc, cur offset: 777399 requesting 256!
In __wrap_free!
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000
...

Hmm, I've failed at that step :-(

Update: With the updated build command:

g++ -pthread -g mkl.cc -I ${MKL}/include \
  ${MKL}/lib/libmkl_intel_lp64.a ${MKL}/lib/libmkl_sequential.a \
  ${MKL}/lib/libmkl_core.a -ldl -Wl,--wrap=malloc,--wrap=free

I am still failing to reproduce the problem. My output is:

$ ./a.out
Disabling fast memory management for MKL in order to use system malloc and free instead
  --> Reports value of 1 (1 should mean MKL memory management turned off successfully)
In __wrap_malloc, cur offset: 0 requesting 4!
In __wrap_malloc, cur offset: 4 requesting 3!
TEMPPTR: foo DIFFPTR: hi
In __wrap_free!
In __wrap_free!
START CALL TO cblas_sgemm
In __wrap_malloc, cur offset: 7 requesting 213688!
In __wrap_malloc, cur offset: 213695 requesting 69664!
In __wrap_malloc, cur offset: 283359 requesting 256!
In __wrap_free!
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000

When I set a breakpoint on __wrap_free, I observe that it's getting called from this stack:

(gdb) bt
#0  __wrap_free (ptrToFree=0xa7d067 <__wrap_malloc::heapSpace+7>) at mkl.cc:29
#1  0x0000000000448724 in mkl_serv_free ()
#2  0x00000000004478c0 in mkl_serv_deallocate ()
#3  0x0000000000848c70 in mkl_blas_def_xsgemm_bdz ()
#4  0x000000000043663d in mkl_blas_def_xsgemm ()
#5  0x0000000000403fd6 in mkl_blas_sgemm ()
#6  0x0000000000402e04 in sgemm_ ()
#7  0x0000000000402912 in cblas_sgemm ()
#8  0x0000000000402697 in main () at mkl.cc:76

And the disassembly of mkl_serv_free near the call is:

(gdb) up
#1  0x0000000000448724 in mkl_serv_free ()
(gdb) x/2i $pc-5
   0x44871f <mkl_serv_free+431>:        call   0x402442 <__wrap_free(void*)>
=> 0x448724 <mkl_serv_free+436>:        jmp    0x448732 <mkl_serv_free+450>

In your case, the call at mkl_serv_free+431 probably targets something like __libc_free instead, for some reason.

End update.

Possible causes:

You have not said exactly how you are building your test, and it matters.
You are using a different version of MKL.
You are using a different version of GCC / binutils (I used g++ (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7) and GNU ld version 2.40-14.fc39).
Something else.

If I were able to replicate your behavior, my next step would be: run the program under GDB, set breakpoints on all the allocation functions, and disable them. Set a breakpoint on cblas_sgemm. Once that breakpoint is hit, re-enable the other breakpoints, and when one of them is hit, use (gdb) where to find out where the un-intercepted call is coming from.

After that, to figure out why it was not intercepted I would examine the calling function, look at its relocation records in the .o file using readelf -Wr foo.o, etc.
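
As a lighter-weight complement to breakpoints, the wrapper itself can report who called it. The sketch below is only an illustration: it assumes GCC or Clang (for __builtin_return_address and the noinline attribute), and the variable lastFreeCaller is hypothetical. It shows only calls that do reach the wrapper, so it confirms interception rather than explaining a missing one.

```cpp
#include <cstdio>

// Hypothetical variable: remembers the most recent caller for inspection.
static void* lastFreeCaller = nullptr;

// noinline keeps the return address meaningful even with optimization on.
extern "C" __attribute__((noinline)) void __wrap_free(void* ptrToFree)
{
  // Address of the instruction that follows the caller's "call" to us.
  lastFreeCaller = __builtin_return_address(0);
  std::fprintf(stderr, "In __wrap_free(%p), return address %p\n",
               ptrToFree, lastFreeCaller);
  // Still a no-op, as in the minimal demo.
}
```

In a non-PIE executable the printed address can be matched against the nm listing or passed to addr2line -f -e on the binary; in the backtrace earlier in this answer, the frame #1 address inside mkl_serv_free is such a return address.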

You're right - I was so concerned with updating my original post to include a small demo program that I completely forgot to show how I built it when I edited it in. I'll update the post with the build command (which is fairly similar to yours) and the OS and compiler/linker versions (which are fairly different from yours). To add to the confusion, I misspoke a bit when I said "everything is linked in statically" - really, it's only the MKL stuff that is linked statically. Interesting that during the cblas_sgemm call your first offset is 1189; it seems like a lot more allocations occurred beforehand?
@daroo I am still failing to repro. I suggest setting a breakpoint on mkl_serv_free and verifying that it's getting hit.
