I'm trying to load gpt-oss-20b locally with Hugging Face transformers on CPU only. Minimal code:
from transformers import pipeline
model_path = "/mnt/d/Projects/models/gpt-oss-20b"
pipe = pipeline("text-generation", model=model_path, torch_dtype="auto", device_map="auto")
pipe("Hello", max_new_tokens=20)
I get:
KeyError: 'model.layers.5.mlp.experts.gate_up_proj'
More details from the console output and traceback:
Using MXFP4 quantized models requires a GPU, we will default to dequantizing the model to bf16
Loading checkpoint shards: 100%
Some parameters are on the meta device because they were offloaded to the cpu and disk.
Device set to use cpu
Traceback (most recent call last):
File "/home/dev/projects/wolf-in-ai-clothing/convo_test.py", line 19, in invoke
response = model(user_message, max_new_tokens=20, num_return_sequences=1)
File ".../transformers/pipelines/text_generation.py", line 419, in _forward
output = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File ".../transformers/models/gpt_oss/modeling_gpt_oss.py", line 375, in forward
hidden_states, _ = self.mlp(hidden_states) # diff with llama: router scores
File ".../transformers/models/gpt_oss/modeling_gpt_oss.py", line 159, in forward
routed_out = self.experts(hidden_states, router_indices=router_indices, routing_weights=router_scores)
File ".../accelerate/utils/offload.py", line 118, in __getitem__
return self.dataset[f"{self.prefix}{key}"]
File ".../accelerate/utils/offload.py", line 165, in __getitem__
weight_info = self.index[key]
KeyError: 'model.layers.5.mlp.experts.gate_up_proj'
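My reading of the last two frames: accelerate resolves each offloaded weight through an index dict keyed by parameter name, and the fused MoE tensor name is simply not in that index. A simplified sketch of that failure mode (the index contents below are my guess for illustration, not taken from my machine):

```python
# Simplified model of the offload lookup (my sketch, not accelerate's real code):
# if the offload index only knows per-expert tensors but the model requests a
# fused "experts.gate_up_proj" tensor, the lookup raises exactly this KeyError.
offload_index = {
    # hypothetical per-expert entries; note there is no fused key
    "model.layers.5.mlp.experts.0.gate_up_proj": "weights_0.dat",
    "model.layers.5.mlp.experts.1.gate_up_proj": "weights_1.dat",
}

key = "model.layers.5.mlp.experts.gate_up_proj"
try:
    info = offload_index[key]
except KeyError as err:
    print("KeyError:", err)  # mirrors the failure above
```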
I verified that the directory exists and contains the model files. A similar problem appears in the Hugging Face discussions, and I followed the steps suggested there by @noobaymax:
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
pip install git+https://github.com/huggingface/transformers.git
pip install kernels
but the error remains the same.
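To rule out the reinstalls silently not taking effect, I can at least confirm the packages are importable (stdlib-only check; `triton_kernels` as the module name is my assumption based on the pip subdirectory path):

```python
import importlib.util

# Report which of the freshly installed packages Python can actually find.
for mod in ("transformers", "kernels", "triton_kernels"):
    spec = importlib.util.find_spec(mod)
    print(f"{mod}: {'found' if spec else 'MISSING'}")
```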
Environment:
Python 3.12.3
transformers 4.56.0.dev0 (also tried 4.55.1)
torch 2.8.0
accelerate 1.10.0
Ubuntu 22.04 on WSL2, no GPU, 32GB RAM
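For scale, my back-of-envelope on why the bf16 fallback can't stay in RAM on this box (the ~21B parameter count is approximate):

```python
# Rough memory estimate for gpt-oss-20b dequantized to bf16 (~21B params, 2 bytes each).
params = 21e9
bytes_per_param = 2
gib_needed = params * bytes_per_param / 2**30
print(f"~{gib_needed:.0f} GiB")  # well above my 32 GB RAM, hence the disk offload
```

So the "offloaded to the cpu and disk" message in the log is expected on 32 GB; the question is why the offload then breaks on the expert weights.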
How can I load this model correctly on CPU?