[ANALYSE] wrong perf on quantized model with rocm 7.10/7.11 #17596

@Djip007

Description

I don't know what is going wrong, ROCm or llama.cpp, so for now this is just a question.

I have two builds of llama.cpp, one with ROCm 7.9 and one with ROCm 7.11 (from TheRock). When I benchmark with Mistral-Small-2506-Q6_K.gguf I get:

Hardware: a Ryzen AI MAX+ with 128 GB
OS: Fedora 43

  • with ROCm 7.9 (same with ROCm 6.4.4):

⬢ [philou@toolbx LLM]$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON ./build_ref/${THEROCK_VER}/bin/llama-bench -ngl 999 --mmap 0 -ub 4096 -b 8192 -fa 1 -r 3 -p "1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16" -n 16 -pg "512,64" -m ${LLM_MODEL}
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32

| model          | size      | params  | backend | ngl | n_batch | n_ubatch | fa | mmap | test | t/s           |
| -------------- | --------: | ------: | ------- | --: | ------: | -------: | -: | ---: | ---: | ------------: |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp1  | 11.34 ± 0.01  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp1  | 11.34 ± 0.00  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp2  | 22.20 ± 0.01  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp3  | 32.85 ± 0.01  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp4  | 42.87 ± 0.05  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp5  | 51.47 ± 0.03  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp6  | 57.74 ± 0.06  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp7  | 62.06 ± 0.06  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp8  | 66.04 ± 0.03  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp9  | 77.15 ± 0.05  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp10 | 85.57 ± 0.10  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp11 | 94.00 ± 0.11  |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp12 | 102.18 ± 0.04 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp13 | 110.20 ± 0.11 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp14 | 118.40 ± 0.08 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp15 | 126.76 ± 0.09 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp16 | 135.08 ± 0.13 |
  • with ROCm 7.11:
| model          | size      | params  | backend | ngl | n_batch | n_ubatch | fa | mmap | test       | t/s          |
| -------------- | --------: | ------: | ------- | --: | ------: | -------: | -: | ---: | ---------: | -----------: |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp1        | 11.41 ± 0.00 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp1        | 11.41 ± 0.01 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp2        | 22.35 ± 0.00 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp3        | 33.01 ± 0.04 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp4        | 43.16 ± 0.02 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp5        | 52.03 ± 0.04 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp6        | 58.99 ± 0.08 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp7        | 62.81 ± 0.10 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp8        | 66.02 ± 0.07 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp9        | 26.60 ± 0.02 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp10       | 29.53 ± 0.02 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp11       | 32.47 ± 0.02 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp12       | 35.39 ± 0.01 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp13       | 38.35 ± 0.02 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp14       | 41.27 ± 0.02 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp15       | 44.21 ± 0.01 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp16       | 47.16 ± 0.03 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | tg16       | 11.40 ± 0.01 |
| llama 13B Q6_K | 18.31 GiB | 23.57 B | ROCm    | 999 | 8192    | 4096     | 1  | 0    | pp512+tg64 | 81.44 ± 0.04 |

Now I know it is not a runtime problem: if I run the binary built with ROCm 7.9 against the ROCm 7.11 runtime, I get the same result as with ROCm 7.9.

With BF16/FP16 models everything looks good on both releases.

I don't know if it is because of a different build path with ROCm 7.11, or because of the compiler (hipcc/clang).

  • ROCm 7.9 appears to use LLVM 20
  • ROCm 7.11 appears to use LLVM 22

For now I'm trying to figure out what's going on, so I can report the right bug to the right place.

Does anyone have an idea? What happens between pp8 and pp9, a different code path? (Note: I'll keep looking, but I haven't found anything so far.)
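One possible explanation for the sharp pp8/pp9 boundary (an assumption on my part, worth checking against the actual ggml sources): the CUDA/HIP backend picks a matmul kernel for quantized weights based on batch size, using a vector-kernel ("MMVQ") path for very small batches and the MMQ/BLAS path for larger ones. A minimal sketch of that dispatch, where `MMVQ_MAX_BATCH_SIZE` and `pick_kernel` are hypothetical names, not the real implementation:

```python
# Simplified, hypothetical model of quantized-matmul kernel dispatch in the
# ggml CUDA/HIP backend (not the actual source): small batches take a
# matrix-vector quantized path, larger ones take MMQ or the BLAS library.
MMVQ_MAX_BATCH_SIZE = 8  # assumed cutoff, matching the observed pp8/pp9 cliff

def pick_kernel(n_tokens: int, weights_quantized: bool) -> str:
    if not weights_quantized:
        return "blas"            # BF16/FP16 models: unaffected in the benchmarks
    if n_tokens <= MMVQ_MAX_BATCH_SIZE:
        return "mmvq"            # pp1..pp8: fast with both ROCm builds
    return "mmq_or_blas"         # pp9+: the range that regresses on ROCm 7.11

print(pick_kernel(8, True), pick_kernel(9, True))  # mmvq mmq_or_blas
```

If this model is right, re-running the pp8/pp9 cases with GGML_CUDA_FORCE_MMQ or GGML_CUDA_FORCE_CUBLAS (both visible in the init log above) could narrow down which kernel path regressed.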

If anyone wants to test, I can provide more build details.
