
So I'm trying to toss together a little demo that is essentially: 1) generate some text live and save it to a file (I've got this working), 2) have a local instance of an LLM running (Llama3 in this case), 3) pass chunks of the generated text to the LLM to clean it up (it has lots of little typos and errors) using running context, and re-save the result to a text file, 4) have another instance of the LLM up so I can ask it questions about the complete text file being generated.
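For step 3, this is roughly the shape of what I have in mind — a sketch, not a finished implementation. It assumes LlamaEdge's api-server is listening on localhost:8080 with its OpenAI-style /v1/chat/completions endpoint; the URL, model name, chunk size, and file names are all placeholders to adjust:

```python
# Sketch: split raw generated text into chunks and ask a local model to
# clean each one up, carrying a short tail of the previous cleaned chunk
# as running context. Endpoint/model names below are assumptions.
import json
import urllib.request

API_URL = "http://localhost:8080/v1/chat/completions"  # assumed LlamaEdge default


def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    """Split on paragraph boundaries, packing paragraphs up to max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def clean_chunk(chunk: str, prev_tail: str) -> str:
    """Send one chunk (plus the tail of the previous cleaned chunk) to the model."""
    payload = {
        "model": "llama-3-8b",  # whatever name the server was started with
        "messages": [
            {"role": "system",
             "content": "Fix typos and small errors. Return only the corrected text."},
            {"role": "user",
             "content": f"Previous context:\n{prev_tail}\n\nText to clean:\n{chunk}"},
        ],
    }
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


def clean_file(src: str, dst: str) -> None:
    """Clean src chunk by chunk and write the result to dst (not auto-run)."""
    cleaned, tail = [], ""
    for chunk in chunk_text(open(src).read()):
        fixed = clean_chunk(chunk, tail)
        cleaned.append(fixed)
        tail = fixed[-300:]  # running context for the next chunk
    open(dst, "w").write("\n\n".join(cleaned))
```

The Q&A part (step 4) would hit the same endpoint from a second server instance, just with the full file in context instead of chunks.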

For portability, and so that I can eventually access it remotely on a web portal, I've got the text generation happening in Docker, but I'm trying to run Llama in WasmEdge using the LlamaEdge project (https://github.com/LlamaEdge/LlamaEdge). I've tried both following the manual instructions to get the API server up and just using their run-llm.sh, but in both cases it seems to ignore my GPU and run painfully slowly on my CPU (like 2-3 minutes of processing for a two-word conversational response to a simple question). I've run Llama3 locally on an old laptop with an 8th Gen Core i7 and it was way faster than this.

Given the speed and the GPU problem, I could have multiple issues here, but I know for sure the GPU isn't being used.

  • When I run wasmedge with Llama3.1-8B, I see [info] entries saying that nvcc and CUDA (12) are detected (and nvidia-smi works for me), but that libcudart.so is not found in the default CUDA installation path.
  • My /usr/lib/cuda is indeed empty, so I added /usr/lib/x86_64-linux-gnu to PATH in my .bashrc and sourced it, but the error is the same. Since their .wasm is already compiled, I don't know where it's looking for the CUDA files.
  • I can see [WASI-NN] entries saying "backend: llama_system_info: CPU" along with a bunch of buffer-size info, and I'm not sure which parts are relevant. Examples: it tells me that the CPU output buffer size is 0.5 MB, the KV buffer size is 3500 MB, and the CPU compute buffer size is ~1600 MB. I also see an entry saying the context size is limited: n_ctx_per_seq (32000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized.
  • Inside LlamaEdge's run-llm.sh, I tried setting backend=gpu and ctx_size=2048, then deleted the .wasm and GGUF files and ran the script again, but it doesn't make a difference.
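As a quick sanity check on the libcudart error, something like the following can show whether the dynamic linker can see the library at all — a sketch, with the Ubuntu multiarch path as an assumed location. (Note that .so resolution goes through the linker cache and LD_LIBRARY_PATH, not $PATH, so adding the directory to PATH would not affect library lookup.)

```shell
# Check whether the dynamic linker can find libcudart anywhere.
found=$( (ldconfig -p 2>/dev/null || true) | grep libcudart || true)
if [ -n "$found" ]; then
  echo "libcudart visible to the linker:"
  echo "$found"
else
  echo "libcudart NOT in the linker cache"
fi

# If the library exists in a non-default location, export its directory
# for the current shell (path below is the usual Ubuntu multiarch dir):
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```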

Does anyone have insight into what to test/fix here or, in the absence of that, a quick backup tool?

I'm running this on Ubuntu 24.04, RTX 4060, i5-13400F. Once I fix this or find an alternative, then I'll worry about how to get the text files moving back and forth, but one problem at a time.

Thanks!

1 Answer


So I ended up figuring this out, although I'm not sure if it was the best solution or not.

I had to reinstall WasmEdge using an updated install script and specify the path to that .so directly via the script's LIBCUDART_PATH variable. That made it recognize CUDA and the GPU; after that I had to reduce the context size for memory reasons, and then it was up and running.
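Roughly, the fix looked like this — treat it as a sketch, since the exact installer URL and flags may differ for your WasmEdge version, and the path below is just where Ubuntu's distro packages put libcudart on my machine:

```shell
# Point the installer at the directory that actually contains libcudart.so
# (adjust the path for your system).
export LIBCUDART_PATH=/usr/lib/x86_64-linux-gnu
echo "using LIBCUDART_PATH=$LIBCUDART_PATH"

# Then re-run the WasmEdge installer with the GGML plugin so the CUDA
# runtime gets picked up (commented out here; verify the script name and
# flags against the current WasmEdge/LlamaEdge docs before running):
#   curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh \
#     | bash -s -- --plugins wasi_nn-ggml
```

After reinstalling, passing a smaller context size when starting the model (e.g. the ctx_size setting in run-llm.sh) kept it inside the 4060's VRAM.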
