I am writing a program that uses Intel's MKL to do some matrix multiplications. I have a frustrating requirement that only a custom version of dynamic memory allocation is utilized. I'm aware this is usually considered a terrible idea, but I am using the linker's --wrap functionality to wrap malloc and free with my own custom implementation. In general, this has gone well so far.
However, it seems that some of the MKL code is performing dynamic allocation, and it is not invoking my custom malloc. I understand that MKL has also replaced the system malloc with its own custom malloc, but in my program, I am calling mkl_disable_fast_mm() which, to my understanding, should turn off the use of the MKL-custom malloc and revert to the system malloc. Now, since I've --wrapped the system malloc with my custom malloc, I was expecting to see my custom malloc called when MKL does its dynamic allocation.
When I run my program normally (as described above), I can see my custom malloc getting called everywhere that malloc is used except for the calls from inside MKL.
To add another level of complication, if I run the program with valgrind, then I do see my custom malloc invoked everywhere, including from within MKL. I realize that valgrind is ALSO replacing malloc with its own custom malloc, so there are several levels of malloc replacement going on in this case.
My question, then, is: how can I get MKL to call my custom malloc when it does dynamic allocation? It must be possible, since running under valgrind makes it happen, but I haven't been able to find a way to do it without valgrind.
I put together a very minimal example that demonstrates what I tried to describe above:
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"
//Declare the real allocator entry points that the linker's --wrap option provides
//(they are functions, so they are declared as functions rather than as pointer variables)
extern "C" void* __real_malloc(size_t);
extern "C" void __real_free(void*);
extern "C" void* __wrap_malloc(size_t numBytes)
{
    //Yes, this is a dumb way to do this, keeping it minimal for demo
    static char heapSpace[10000000] = {0};
    static size_t heapOff = 0;
    //size_t values are printed with %zu
    fprintf(stderr, "In __wrap_malloc, cur offset: %zu requesting %zu!\n", heapOff, numBytes);
    void* heapLoc = heapSpace + heapOff;
    heapOff += numBytes;
    return heapLoc;
}
extern "C" void __wrap_free(void* ptrToFree)
{
    fprintf(stderr, "In __wrap_free!\n");
    //just a no-op for the minimal demo
}
int main()
{
    fprintf(stderr, "Disabling fast memory management for MKL in order to use system malloc and free instead\n");
    int disableFastMMReturnVal = mkl_disable_fast_mm();
    fprintf(stderr, " --> Reports value of %d (1 should mean MKL memory management turned off successfully)\n", disableFastMMReturnVal);
    //Use malloc to allocate a small array of chars
    char* tmpPtr;
    tmpPtr = (char*)malloc(4 * sizeof(char));
    tmpPtr[0] = 'f'; tmpPtr[1] = 'o'; tmpPtr[2] = 'o'; tmpPtr[3] = '\0';
    //Use malloc to allocate another small array of chars
    char* diffPtr;
    diffPtr = (char*)malloc(3 * sizeof(char));
    diffPtr[0] = 'h'; diffPtr[1] = 'i'; diffPtr[2] = '\0';
    //See that data is as expected
    fprintf(stderr, "TEMPPTR: %s DIFFPTR: %s\n", tmpPtr, diffPtr);
    //Just a no-op for this demo, but see that the wrapped free gets called
    free(diffPtr);
    free(tmpPtr);
    //Now, set up a MKL matrix multiply call:
    const int M = 128;
    const int K = 128;
    const int N = 128;
    const float alpha = 1.0;
    const float beta = 0.0;
    float A[M * K];
    float B[K * N];
    float C[M * N];
    //Initialize the input matrices to known values
    for (int r = 0; r < M; r++)
        for (int c = 0; c < K; c++)
            A[r * K + c] = r * c;
    for (int r = 0; r < K; r++)
        for (int c = 0; c < N; c++)
            B[r * N + c] = r + c;
    fprintf(stderr, "START CALL TO cblas_sgemm\n");
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, alpha, A, K, B, N, beta, C, N);
    fprintf(stderr, "FINISHED CALL TO cblas_sgemm\n");
    //Print some values from the output to check consistent results from run to run
    fprintf(stderr, "[0][0]: %f\n", C[0 * N + 0]);
    fprintf(stderr, "[20][20]: %f\n", C[20 * N + 20]);
    fprintf(stderr, "[40][40]: %f\n", C[40 * N + 40]);
    fprintf(stderr, "[100][100]: %f\n", C[100 * N + 100]);
    return 0;
}
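As a side note on the demo allocator itself: it hands out addresses with no alignment and never checks whether the static buffer is exhausted, which is fine for tracing calls but could matter once a numerical library actually uses the memory. A minimal sketch of a bounded, aligned variant follows. Every name in it (`sketch::bumpAlloc`, `kArenaSize`, `kAlign`) is hypothetical, and the 64-byte alignment is an assumption chosen to match a typical cache line, not something MKL is known to require from malloc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; this is not part of the original demo.
namespace sketch {

constexpr std::size_t kArenaSize = 1 << 20; // 1 MiB arena, arbitrary for the sketch
constexpr std::size_t kAlign = 64;          // assumed alignment; must be a power of two

alignas(kAlign) static unsigned char arena[kArenaSize];
static std::size_t arenaOff = 0;            // always a multiple of kAlign

// Round n up to the next multiple of kAlign.
inline std::size_t alignUp(std::size_t n)
{
    return (n + kAlign - 1) & ~(kAlign - 1);
}

// Returns a kAlign-aligned block, or nullptr when the arena cannot satisfy the request.
void* bumpAlloc(std::size_t numBytes)
{
    if (numBytes > kArenaSize) return nullptr;  // also keeps alignUp from overflowing
    std::size_t need = alignUp(numBytes == 0 ? 1 : numBytes);
    if (need > kArenaSize - arenaOff) return nullptr;
    void* p = arena + arenaOff;
    arenaOff += need;
    return p;
}

} // namespace sketch
```

In the demo, the body of `__wrap_malloc` could call something like `sketch::bumpAlloc(numBytes)` in place of the raw pointer arithmetic; a failed request then returns a null pointer rather than an address past the end of the buffer.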
Here's the output when I run without valgrind:
# ./demo.exe
Disabling fast memory management for MKL in order to use system malloc and free instead
--> Reports value of 1 (1 should mean MKL memory management turned off successfully)
In __wrap_malloc, cur offset: 0 requesting 4!
In __wrap_malloc, cur offset: 4 requesting 3!
TEMPPTR: foo DIFFPTR: hi
In __wrap_free!
In __wrap_free!
START CALL TO cblas_sgemm
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000
And here's the output when I run with valgrind:
# valgrind ./demo.exe
==487722== Memcheck, a memory error detector
==487722== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==487722== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==487722== Command: ./demo.exe
==487722==
Disabling fast memory management for MKL in order to use system malloc and free instead
--> Reports value of 1 (1 should mean MKL memory management turned off successfully)
In __wrap_malloc, cur offset: 0 requesting 4!
In __wrap_malloc, cur offset: 4 requesting 3!
TEMPPTR: foo DIFFPTR: hi
In __wrap_free!
In __wrap_free!
START CALL TO cblas_sgemm
In __wrap_malloc, cur offset: 7 requesting 4344376!
In __wrap_malloc, cur offset: 4344383 requesting 69664!
In __wrap_malloc, cur offset: 4414047 requesting 256!
In __wrap_free!
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000
==487722==
==487722== HEAP SUMMARY:
==487722== in use at exit: 0 bytes in 0 blocks
==487722== total heap usage: 1 allocs, 1 frees, 72,704 bytes allocated
==487722==
==487722== All heap blocks were freed -- no leaks are possible
==487722==
==487722== For lists of detected and suppressed errors, rerun with: -s
==487722== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Notice that without valgrind, there are no calls to __wrap_malloc during the call to cblas_sgemm, but with valgrind, there are three calls to it.
EDIT #2:
As @Andrew Henle suggested, there may be other allocation functions, so I added wrappers for calloc and realloc as well. After adding these two new wrappers, the program produces exactly the same output as shown above. Next, I ran nm on my executable (everything is linked statically), and I get the following:
nm -C demo.exe |& grep alloc
00000000004021f0 T cblas_xerbla_malloc_error
0000000000c7b008 D i_calloc
0000000000c7b000 D i_malloc
0000000000c7b010 D i_realloc
0000000001f94d40 b mkl_hbw_malloc_psize
00000000004438f0 T mkl_serv_allocate
000000000044a550 T mkl_serv_calloc
0000000000446030 T mkl_serv_deallocate
000000000044a660 T mkl_serv_jit_alloc
0000000000444b10 T mkl_serv_malloc
0000000000449860 T mkl_serv_realloc
0000000000443710 t mm_internal_malloc
0000000000442df0 t mm_internal_realloc
0000000001f94d50 b sys_alloc
0000000001f94d68 b sys_allocate
0000000001f94d70 b sys_deallocate
0000000001f94d58 b sys_realloc
0000000000401442 T __wrap_calloc
00000000004013e6 T __wrap_malloc
00000000004014b2 T __wrap_realloc
0000000001f8dd80 b __wrap_calloc::heapOff
0000000001604700 b __wrap_calloc::heapSpace
00000000016046e0 b __wrap_malloc::heapOff
0000000000c7b060 b __wrap_malloc::heapSpace
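One detail in this listing may be worth noting: i_malloc, i_calloc and i_realloc are marked `D`, meaning they are initialized data objects rather than functions. That is consistent with the "Redefining Memory Functions" mechanism in the MKL Developer Guide, where, as far as the documentation describes it, `i_malloc.h` declares replaceable function pointers (`i_malloc`, `i_calloc`, `i_realloc`, `i_free`) that the application assigns before its first MKL call. The sketch below only illustrates that function-pointer pattern. The `standin::hook_*` pointers and the `libraryInternal*` functions are hypothetical stand-ins so the example builds without MKL, and whether assigning the real pointers changes the behaviour shown above is untested here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Stand-ins for a library's replaceable allocator hooks. With MKL these
// would presumably be the i_malloc / i_free pointers from i_malloc.h; the
// names below are hypothetical so the sketch compiles on its own.
namespace standin {

void* (*hook_malloc)(std::size_t) = std::malloc;
void  (*hook_free)(void*) = std::free;

// Models library-internal code: it allocates only through the hooks.
void* libraryInternalAlloc(std::size_t n) { return hook_malloc(n); }
void  libraryInternalFree(void* p) { hook_free(p); }

} // namespace standin

// Custom allocator functions that count how often they are reached.
static int customMallocCalls = 0;
static int customFreeCalls = 0;

void* countingMalloc(std::size_t n)
{
    ++customMallocCalls;
    return std::malloc(n);
}

void countingFree(void* p)
{
    ++customFreeCalls;
    std::free(p);
}

// Redirect the hooks. For a real library this has to happen before the
// library performs its first allocation.
void installHooks()
{
    standin::hook_malloc = countingMalloc;
    standin::hook_free = countingFree;
}
```

After `installHooks()`, every `standin::libraryInternalAlloc` call reaches `countingMalloc`. With the real headers, the analogous step would be assigning `__wrap_malloc` and `__wrap_free` to the corresponding `i_` pointers at the top of `main`, assuming the documented mechanism applies to this build.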
EDIT #3: My original post didn't have a small demo program, and I was anxious to update my post to include one, and in doing so, neglected to include the build information.
Here's the small Makefile:
MKL := /opt/intel/oneapi/mkl/2024.1
all:
	g++ -pthread -g demo.cpp -I ${MKL}/include ${MKL}/lib/libmkl_intel_lp64.a ${MKL}/lib/libmkl_sequential.a ${MKL}/lib/libmkl_core.a -ldl -Wl,--wrap=malloc,--wrap=free -o demo.exe
Environment and versions:
- OS: CentOS Linux release 8.5.2111
- g++: g++ (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1)
- ld: GNU ld version 2.30-108.el8_5.1
I realize those versions are kind of strange, so I also ran in Docker using Ubuntu 22.04.4 with g++ 11.4.0 and ld 2.38 and the results are the same as shown above.
realloc(), calloc(), posix_memalign(), valloc(), memalign(), aligned_alloc(), pvalloc() and probably others are all used by various implementations along with malloc() to dynamically allocate memory. Use nm or even strings -a to see the symbols it's using if it's a library. strings -a /path/to/.../libXXX.so | grep alloc would show most of the dynamic memory allocating symbols.
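To make the point about the other allocation entry points concrete, here is a rough sketch of calloc-style and realloc-style functions layered on a bump allocator like the one in the demo. All names (`arenasketch::arenaMalloc`, `arenaCalloc`, `arenaRealloc`, `Header`) are hypothetical. The size header in front of each block is one possible design, needed because realloc must know how many bytes to copy. Aligned variants such as posix_memalign() would need similar treatment and are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Hypothetical names throughout; a sketch, not the wrappers from the question.
namespace arenasketch {

constexpr std::size_t kArena = 1 << 16;
constexpr std::size_t kAlign = alignof(std::max_align_t);

alignas(std::max_align_t) static unsigned char arena[kArena];
static std::size_t off = 0; // always a multiple of kAlign

// Stored immediately before each block so realloc knows the old size.
struct alignas(std::max_align_t) Header { std::size_t size; };

void* arenaMalloc(std::size_t n)
{
    if (n > kArena) return nullptr;                   // keeps the rounding from overflowing
    std::size_t payload = (n + kAlign - 1) & ~(kAlign - 1);
    std::size_t need = sizeof(Header) + payload;
    if (need > kArena - off) return nullptr;          // arena exhausted
    Header* h = new (arena + off) Header{n};
    off += need;
    return h + 1;                                     // payload starts right after the header
}

void* arenaCalloc(std::size_t count, std::size_t size)
{
    if (size != 0 && count > SIZE_MAX / size) return nullptr; // count * size would overflow
    void* p = arenaMalloc(count * size);
    if (p) std::memset(p, 0, count * size);
    return p;
}

void* arenaRealloc(void* old, std::size_t n)
{
    if (!old) return arenaMalloc(n);
    std::size_t oldSize = (static_cast<Header*>(old) - 1)->size;
    if (n <= oldSize) return old;                     // shrinking: reuse the block as-is
    void* p = arenaMalloc(n);
    if (p) std::memcpy(p, old, oldSize);              // old block is simply abandoned
    return p;
}

} // namespace arenasketch
```

With the linker approach from the question, functions like these would be exposed as `__wrap_calloc` and `__wrap_realloc` and enabled with the matching `--wrap=calloc` and `--wrap=realloc` options.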