I'm trying to run the Qwen2.5-Coder-3B model locally with 8-bit quantization using BitsAndBytes.

While loading the model, I noticed that some examples also specify torch_dtype=torch.float16. From my understanding, torch_dtype mainly affects the activation and output dtypes, not the quantized weights themselves.

However, I’m not completely sure whether setting torch_dtype=torch.float16 actually overrides the quantization, or whether the two can safely coexist.

With torch_dtype=torch.float16

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B")

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,   # <-- does this override quantization?
    device_map="auto"
)

Without specifying torch_dtype

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B",
    quantization_config=bnb_config,
    device_map="auto"
)

What is the difference between these two setups in terms of:

  1. How model weights are stored and loaded (INT8 vs FP16)
  2. The dtype used for activations and outputs during inference
  3. Whether setting torch_dtype=torch.float16 can override or interfere with 8-bit quantization applied by BitsAndBytes
