I want to use the 'Salesforce/SFR-Embedding-Mistral' embedding model, but it is too large for the GPU partition I have access to. I therefore considered quantizing the model myself, since I couldn't find a pre-quantized version.
When I attempted to quantize it with bitsandbytes, it still tried to load the entire model onto the GPU, which resulted in the same out-of-memory error:
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Load the model and quantize it to 4-bit (NF4) at load time
model = AutoModel.from_pretrained(
    'Salesforce/SFR-Embedding-Mistral',
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
)
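In case it clarifies what I am after, below is a sketch of the same load with a GPU memory cap and CPU offload, which is how I understood device_map='auto' is supposed to avoid running out of GPU memory. The max_memory values are placeholders rather than my actual limits, and I am not sure whether llm_int8_enable_fp32_cpu_offload is meant to be combined with 4-bit loading:

import torch
from transformers import AutoModel, BitsAndBytesConfig

# Sketch only: cap how much of the model may be placed on GPU 0 and let
# accelerate offload the remaining modules to CPU RAM. The limits below are
# placeholders; offloaded modules stay un-quantized (fp32) on the CPU.
model = AutoModel.from_pretrained(
    'Salesforce/SFR-Embedding-Mistral',
    trust_remote_code=True,
    device_map='auto',
    max_memory={0: '10GiB', 'cpu': '48GiB'},  # placeholder limits
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
        llm_int8_enable_fp32_cpu_offload=True,  # needed when some modules land on the CPU
    ),
)

Even if this loads, I expect the CPU-offloaded layers to make inference very slow, so I am not sure it is a real option.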
Next, I tried loading the model onto the CPU first, quantizing it there, and then moving the quantized model to the GPU:
model.to('cpu')
if torch.cuda.is_available():
    model.to('cuda')  # this is the call that fails
However, bitsandbytes does not support changing devices for quantized models:
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and cast to the correct `dtype`.
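If I read the error correctly, the device has to be chosen at load time via device_map rather than afterwards with .to. A minimal sketch of what I mean is below (it places the whole quantized model on GPU 0, which of course still does not fit on my partition):

import torch
from transformers import AutoModel, BitsAndBytesConfig

# Sketch only: with bitsandbytes, placement is decided at load time through
# device_map, so everything is put on GPU 0 here instead of calling .to() later.
# This still requires the quantized model to fit on that single GPU.
model = AutoModel.from_pretrained(
    'Salesforce/SFR-Embedding-Mistral',
    trust_remote_code=True,
    device_map={'': 0},  # whole model on GPU 0
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)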
The solutions I found, such as this GitHub issue and this blog post, were either unhelpful or outdated.