I'm trying to run the Qwen2.5-Coder-3B model locally with 8-bit quantization using BitsAndBytes.
While loading the model, I noticed that some examples also specify torch_dtype=torch.float16.
From my understanding, torch_dtype mainly affects the dtype of activations and of the non-quantized modules, not the quantized weights themselves.
However, I'm not completely sure whether setting torch_dtype=torch.float16 actually overrides the quantization, or whether the two can safely coexist.
With torch_dtype=torch.float16:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B")
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-Coder-3B",
quantization_config=bnb_config,
torch_dtype=torch.float16, # <-- does this override quantization?
device_map="auto"
)
Without specifying torch_dtype:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-Coder-3B",
quantization_config=bnb_config,
device_map="auto"
)
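One way I thought of to check this empirically (a minimal sketch, assuming torch is installed and one of the models above has been loaded as model) is to count the dtypes of the model's parameters. My expectation is that with load_in_8bit=True the quantized Linear weights show up as torch.int8 either way, while the remaining parameters (e.g. norm layers) follow torch_dtype:

```python
from collections import Counter

import torch


def dtype_histogram(model: torch.nn.Module) -> Counter:
    # Count how many parameter tensors are stored in each dtype.
    return Counter(str(p.dtype) for p in model.parameters())


# With one of the models loaded above:
# print(dtype_histogram(model))
# If quantization is in effect, torch.int8 should appear for the
# Linear weights regardless of the torch_dtype setting.
```

Is comparing the two histograms (with and without torch_dtype=torch.float16) a valid way to confirm whether the settings interfere?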
What is the difference between these two setups in terms of:
- How model weights are stored and loaded (INT8 vs FP16)
- The dtype used for activations and outputs during inference
- Whether setting torch_dtype=torch.float16 can override or interfere with the 8-bit quantization applied by BitsAndBytes