I am using the LightGBM C API in our ML model hosting service, which is written in Go. I've written a CGO wrapper around the C API, and I am using the "lib_lightgbm.so" library file provided on GitHub.
I am on go1.20.4 in a Linux environment.
I raised an issue on the official LightGBM GitHub repository as well; it contains a more detailed analysis of the situation.
Context:
I load a few LightGBM models in our model hosting service in production and refresh them as soon as new ones are available. The new models are loaded via the LGBM_BoosterCreateFromModelfile method provided by the API, and the old ones are released with LGBM_BoosterFree.
I am hosting this service on GKE pods which have a fixed amount of memory.
Issue:
I see a gradual uptick in the RSS (Resident Set Size) of the service every time the models are refreshed. To debug the issue, I stripped the problematic code down to the bare minimum; the result follows.
package main

// #cgo LDFLAGS: -L/home/ayush.goya/minimalExample/ -l_lightgbm
// #include "c_api.h"
// #include <stdio.h>
// #include <stdlib.h>
import "C"

import (
	"runtime/debug"
)

var predictor C.BoosterHandle

func Load() {
	outNumIterations := C.int(0)
	res := int(C.LGBM_BoosterCreateFromModelfile(C.CString("model.txt"), &outNumIterations, &predictor))
	if res != 0 {
		println("Load failed")
		return
	}
	debug.FreeOSMemory()
	println("Load Success")
}

func Release() {
	res := int(C.LGBM_BoosterFree(predictor))
	if res != 0 {
		println("Release failed")
		return
	}
	debug.FreeOSMemory()
	println("Release Success")
}
I am measuring RSS with the following code, and I trust the values because they match what htop reports:
func GetRssMB() string {
	// Read memory statistics from /proc/self/statm
	data, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		fmt.Println("Error reading /proc/self/statm:", err)
		return "0"
	}
	// Extract resident memory size (in pages)
	fields := strings.Fields(string(data))
	if len(fields) < 2 {
		fmt.Println("Unexpected format of /proc/self/statm")
		return "0"
	}
	rssPages, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		fmt.Println("Error parsing resident memory size:", err)
		return "0"
	}
	// Convert pages to bytes (assuming 4 KB page size)
	rssBytes := rssPages * 4096
	// Convert bytes to MB
	rssBytes /= 1024 * 1024
	return strconv.FormatUint(rssBytes, 10)
}
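The same measurement can be written without the hard-coded 4 KB page size by asking the runtime for the real value with os.Getpagesize(). Below is a minimal sketch; the helper name parseStatmRSSMB is my own, introduced here only to make the conversion testable:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStatmRSSMB converts the second field of /proc/self/statm (resident
// pages) into MiB, given the system page size.
func parseStatmRSSMB(statm string, pageSize uint64) (uint64, error) {
	fields := strings.Fields(statm)
	if len(fields) < 2 {
		return 0, fmt.Errorf("unexpected statm format: %q", statm)
	}
	pages, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		return 0, err
	}
	return pages * pageSize / (1024 * 1024), nil
}

func main() {
	data, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		fmt.Println("error reading /proc/self/statm:", err)
		return
	}
	// os.Getpagesize() returns the actual page size instead of assuming 4096.
	mb, err := parseStatmRSSMB(string(data), uint64(os.Getpagesize()))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("RSS MiB:", mb)
}
```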
Let's say the starting RSS is x bytes. After calling Load(), when the model has finished loading, the RSS increases to y, which is expected. After trying to free the memory by calling Release(), the RSS is z.
My expectation is that z should be very close to x. Instead, z and x differ hugely (z is almost 100 times x for my model size).
This happens every time I go through the cycle of Load() and Release(), so the RSS gradually increases. This is causing my GKE pods to get OOM-killed.
What is holding on to the memory and not returning it? I profiled the code, and heapSys, heapIdle, and heapInuse are all very low.
I am at a loss on how to figure this out. Is it something about Go memory management that I am missing, or something about how CGo should be handled? Requesting help.
Answer:
debug.FreeOSMemory has nothing to do with memory allocated in C via CGO, since that memory was not allocated by the Go runtime. You also can't profile memory allocated in C with the Go profiler, again because that memory is entirely outside of the Go runtime; calling debug.FreeOSMemory only ensures your measurements don't pick up any footprint from the Go side of the service.
If the C library hands you memory that you are responsible for, you have to release it with the appropriate call, such as C.free. That all depends on how the C library is designed, of course. If the C code is leaking memory, then you need to fix it from within the C code; it's not something related to Go.
You should also C.free() the value returned by C.CString("model.txt"), though I doubt that's the issue causing you to go OOM.