
I’m trying to load gpt-oss-20b locally with Hugging Face transformers on CPU only. Minimal code:

from transformers import pipeline
model_path = "/mnt/d/Projects/models/gpt-oss-20b"
pipe = pipeline("text-generation", model=model_path, torch_dtype="auto", device_map="auto")
pipe("Hello", max_new_tokens=20)

I get:

KeyError: 'model.layers.5.mlp.experts.gate_up_proj'

Here are some more details from the traceback:

Using MXFP4 quantized models requires a GPU, we will default to dequantizing the model to bf16
Loading checkpoint shards: 100%
Some parameters are on the meta device because they were offloaded to the cpu and disk.
Device set to use cpu
Traceback (most recent call last):
  File "/home/dev/projects/wolf-in-ai-clothing/convo_test.py", line 19, in invoke
    response = model(user_message, max_new_tokens=20, num_return_sequences=1)
  File ".../transformers/pipelines/text_generation.py", line 419, in _forward
    output = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File ".../transformers/models/gpt_oss/modeling_gpt_oss.py", line 375, in forward
    hidden_states, _ = self.mlp(hidden_states)  # diff with llama: router scores
  File ".../transformers/models/gpt_oss/modeling_gpt_oss.py", line 159, in forward
    routed_out = self.experts(hidden_states, router_indices=router_indices, routing_weights=router_scores)
  File ".../accelerate/utils/offload.py", line 118, in __getitem__
    return self.dataset[f"{self.prefix}{key}"]
  File ".../accelerate/utils/offload.py", line 165, in __getitem__
    weight_info = self.index[key]
KeyError: 'model.layers.5.mlp.experts.gate_up_proj'
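My reading of the traceback (an assumption on my part, not something I have confirmed): the MXFP4 checkpoint stores the expert weights under quantized key names (`gate_up_proj_blocks` / `gate_up_proj_scales`), but after the forced dequantization to bf16 the forward pass looks up the fused `gate_up_proj` key, which the disk-offload index accelerate built never contained. A toy sketch of that lookup mismatch (illustrative names, not accelerate's real classes):

```python
# Hypothetical offload index with quantized MXFP4 key names -- an assumption
# about what accelerate wrote to disk, not an inspected file.
offload_index = {
    "model.layers.5.mlp.experts.gate_up_proj_blocks": "shard-00001",
    "model.layers.5.mlp.experts.gate_up_proj_scales": "shard-00001",
}

# The key the dequantized bf16 forward pass asks for.
requested_key = "model.layers.5.mlp.experts.gate_up_proj"

try:
    offload_index[requested_key]
except KeyError as exc:
    print(f"KeyError: {exc}")  # prints: KeyError: 'model.layers.5.mlp.experts.gate_up_proj'
```

If that reading is right, the failure is specific to the offload path, not to the model files themselves.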

I verified that the directory exists and contains the model files. A similar problem appears in the Hugging Face discussions, where I followed the steps suggested by @noobaymax:

pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

pip install git+https://github.com/huggingface/transformers.git

pip install kernels

but the output remains the same.
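Is pinning the whole model to CPU the right approach here? Something like this sketch (unverified; `device_map={"": "cpu"}` should keep every module in RAM so accelerate never builds the disk-offload index that raises the KeyError, but dequantized bf16 weights for 20B parameters are roughly 40 GB, which may not fit in my 32 GB anyway):

```python
from transformers import pipeline
import torch


def load_cpu_pipeline(model_path: str):
    # device_map={"": "cpu"} pins all modules to CPU, avoiding accelerate's
    # disk-offload path entirely. Memory caveat: ~2 bytes/param at bf16,
    # so ~40 GB for a 20B model.
    return pipeline(
        "text-generation",
        model=model_path,
        torch_dtype=torch.bfloat16,
        device_map={"": "cpu"},
    )


# Usage (not run here -- requires the local checkpoint):
# pipe = load_cpu_pipeline("/mnt/d/Projects/models/gpt-oss-20b")
# pipe("Hello", max_new_tokens=20)
```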

Environment:

  • Python 3.12.3

  • transformers 4.56.0.dev0 (also tried 4.55.1)

  • torch 2.8.0

  • accelerate 1.10.0

  • Ubuntu 22.04 on WSL2, no GPU, 32GB RAM

How can I load this model correctly on CPU?
