
Not sure if this is the right forum to ask, but:

Assuming I have a GPTQ model that is 4-bit, how does using from_pretrained(torch_dtype=torch.float16) work? In my understanding, 4-bit means the weights were converted from 16- or 32-bit precision down to 4-bit precision using quantization methods.

However, passing torch_dtype=torch.float16 would mean the weights are in 16 bits? Am I missing something here?

1 Answer

GPTQ is a Post-Training Quantization method. This means a GPTQ model was created in full precision and then compressed afterwards. Not all values are stored in 4 bits; that would require quantizing every weight and every activation.

The GPTQ method does not do this:

Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16.

As these values need to be multiplied together, this means that,

during inference, weights are dequantized on the fly and the actual compute is performed in float16.
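Here is a minimal pure-Python sketch of that storage/compute split. It is illustrative only: real GPTQ uses per-group scales, error-compensating rounding, and fused kernels, and the `quantize_int4`/`linear` helpers below are made up for the example. The point is simply that what is *stored* is 4-bit integers plus a scale, while the multiply itself runs in floating point (the role float16 plays on the GPU):

```python
# Toy sketch of the int4-weight / fp16-activation scheme (NOT the real GPTQ
# algorithm). Weights are stored as 4-bit integers plus one scale factor and
# are dequantized back to floating point just before the multiply.

def quantize_int4(weights):
    """Quantize a list of floats to symmetric int4 (range -8..7) with one scale."""
    scale = max(abs(w) for w in weights) / 7  # 7 = largest positive int4 value
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; real kernels do this on the fly."""
    return [x * scale for x in q]

def linear(q_weights, scale, activations):
    """Dequantize, then do the actual compute in floating point (stand-in for fp16)."""
    w = dequantize(q_weights, scale)
    return sum(wi * ai for wi, ai in zip(w, activations))

weights = [0.9, -0.35, 0.02, 0.61]
q, s = quantize_int4(weights)           # stored form: 4-bit ints + one scale
y = linear(q, s, [1.0, 2.0, 3.0, 4.0])  # activations never get quantized
```

Note that `q` holds small integers that fit in 4 bits each, so memory drops roughly 4x versus float16, yet the arithmetic in `linear` is ordinary floating-point math.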

In a Hugging Face quantization blog post from Aug 2023, they talk about the possibility of quantizing activations as well in the Room for Improvement section. However, at that time there were no open source implementations.

Since then, they have released Quanto, which does support quantizing activations. It looks promising, and the accuracy and perplexity benchmarks hold up well for most models, but it is still in beta: the docs warn of breaking changes to the API and serialization, and, surprisingly, it is currently slower than 16-bit models due to a lack of optimized kernels, which they appear to be working on.

So this does not just apply to GPTQ. You will find yourself using float16 with any of the popular quantization methods at the moment. For example, Activation-aware Weight Quantization (AWQ) also preserves in full precision a small percentage of the weights that are important for performance. This is a useful blog post comparing GPTQ with other quantization methods.


7 Comments

Hey, thanks for the answer. Does it mean that quantization refers to, e.g., the format in which the weights are stored, while the dtype is the data type of the weights when they are dequantized during a computation? Thanks in advance for the help.
@thenac. Not quite. Quantization refers to how the weights are stored and used during computations, typically in lower precision (e.g., int8) to save memory and improve speed. The dtype refers to the format of the weights at any given point—quantized or dequantized. In AWQ, weights stay quantized even during computation, meaning dequantization isn’t typically involved in the process.
Thanks for the reply. I am not sure I get it though, sorry. Libraries like transformers, e.g. through bitsandbytes, allow you to define that you want to (1) load the weights in 4 bits, and (2) set the compute_dtype to fp16. -- Do you know what this means? Is it 4 bits or 16 at the end of the day?
@thenac I'm not entirely sure I understand what you're asking. Do you think it might be worth asking a new question, with some code? Asking "what does this code do" type questions often don't go down that well, but if you phrase it as something you're trying to do, i.e. how do I load the weights in 4 bits, I think that would be on-topic (as long as you're clear you're not asking for library recommendations).
Yes, that was precisely what I meant! Thanks for the answer, I appreciate the help :)
