So I'm trying to toss together a little demo that is essentially: 1) generate some text live and save it to a file (I've got this working), 2) have a local instance of an LLM running (Llama3 in this case), 3) pass chunks of the generated text to the LLM to clean it up (it has lots of little typos and errors), using running context, and re-save the cleaned text, 4) have another LLM instance up so I can ask it questions about the complete text file as it's generated.
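For step 3, my cleanup loop is roughly this shape. This is just a sketch of the chunking side: the actual LLM call (which would hit a local OpenAI-compatible endpoint like the one LlamaEdge's API server exposes) is injected as `clean_fn`, and the chunk/overlap sizes are placeholder values:

```python
# Sketch of step 3: clean noisy text chunk-by-chunk while carrying a
# running context (the tail of the previous chunk). clean_fn is where
# the real call to the local LLM would go; it's injected here so the
# chunking logic stands alone.

def iter_chunks(text, chunk_size=800, overlap=200):
    """Yield (context, chunk) pairs; context is the tail of the previous chunk."""
    start = 0
    prev_tail = ""
    while start < len(text):
        chunk = text[start:start + chunk_size]
        yield prev_tail, chunk
        prev_tail = chunk[-overlap:]
        start += chunk_size

def clean_text(text, clean_fn, chunk_size=800, overlap=200):
    """Run clean_fn(context, chunk) over every chunk and rejoin the results."""
    return "".join(
        clean_fn(ctx, chunk)
        for ctx, chunk in iter_chunks(text, chunk_size, overlap)
    )
```

In the real loop, `clean_fn` would send the context plus the chunk to the local server and return only the cleaned chunk, and the joined result gets re-saved to the text file.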
For portability, and so that I can eventually access it remotely through a web portal, I've got the text generation happening in Docker, but I'm trying to run Llama in WasmEdge using the LlamaEdge project (https://github.com/LlamaEdge/LlamaEdge). I've tried both following the manual instructions to bring up the API server and just using their run-llm.sh, but in both cases it seems to ignore my GPU and runs painfully slowly on my CPU (like 2-3 minutes of processing for a two-word conversational response to a simple question). I've run Llama3 locally on an old laptop with an 8th-gen Core i7 and it was way faster than this.
Given the speed and the GPU problem, I could have multiple issues here, but I know for sure the GPU isn't being used.
- When I run `wasmedge` with Llama3.1-8B, I see `[info]` entries that nvcc and CUDA (12) are detected (and `nvidia-smi` works for me) but that `libcudart.so` is not found in the default installation path of CUDA.
- My `/usr/lib/cuda` is indeed empty, so I added `/usr/lib/x86_64-linux-gnu` to my bashrc `PATH` and ran `source`, but the error is the same. Since their .wasm is already compiled, I don't know where it's looking for CUDA files.
- I can see `[WASI-NN]` entries talking about "backend: llama_system_info: CPU" and a bunch of buffer-size info. I'm not sure which is relevant. Examples: it tells me that the CPU output buffer size is 0.5 MB, the KV buffer size is 3500 MB, and that my CPU compute buffer size is ~1600 MB. I also see an entry saying the context size is limited: `n_ctx_per_seq (32000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized`.
- Inside LlamaEdge's `run-llm.sh`, I tried setting `backend=gpu` and `ctx_size=2048`, then deleted the .wasm and GGUF and ran the script again, but it doesn't make a difference.
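One thing I realize may matter: `PATH` only affects where the shell looks for executables, not where the dynamic linker looks for shared libraries like `libcudart.so`, so the `bashrc` change probably didn't do anything. Here's the diagnostic I've been trying instead (assuming CUDA was installed via Ubuntu's apt packages, which put the runtime under `/usr/lib/x86_64-linux-gnu`):

```shell
# 1. Confirm where the CUDA runtime actually lives (|| true so the
#    pipeline doesn't abort if it's not in the linker cache):
ldconfig -p | grep libcudart || true

# 2. Shared libraries are resolved by the dynamic linker, not PATH,
#    so export LD_LIBRARY_PATH (or add the directory under
#    /etc/ld.so.conf.d/ and run ldconfig) instead of editing PATH:
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH:-}"

# 3. Re-run wasmedge from this same shell and check whether the
#    libcudart.so [info] complaint goes away.
```

Even with the library found, I gather the "llama_system_info: CPU" line might mean the installed WASI-NN GGML plugin build is itself CPU-only (WasmEdge ships separate CPU and CUDA plugin builds), in which case no library-path fix would help and the plugin would need reinstalling. Not sure if that's my situation.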
Does anyone have insight into what to test/fix here or, in the absence of that, a quick backup tool?
I'm running this on Ubuntu 24.04 with an RTX 4060 and an i5-13400F. Once I fix this or find an alternative, I'll worry about how to get the text files moving back and forth, but one problem at a time.
Thanks!