I want to use the 'Salesforce/SFR-Embedding-Mistral' embedding model, but it is too large for the GPU partition I have access to. I therefore considered quantizing the model myself, since I couldn't find a pre-quantized version.
When I attempted to quantize it with bitsandbytes, it still tried to load the entire model onto the GPU, which resulted in the same out-of-memory error:
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Load the model and quantize it to 4-bit (NF4) at load time
model = AutoModel.from_pretrained(
    'Salesforce/SFR-Embedding-Mistral',
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
)
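In case it clarifies what I am after, below is a sketch of the same load with a GPU memory cap and CPU offload, which is how I understood device_map='auto' is supposed to avoid running out of GPU memory. The max_memory values are placeholders rather than my actual limits, and I am not sure whether llm_int8_enable_fp32_cpu_offload is meant to be combined with 4-bit loading:

import torch
from transformers import AutoModel, BitsAndBytesConfig

# Sketch only: cap how much of the model may be placed on GPU 0 and let
# accelerate offload the remaining modules to CPU RAM. The limits below are
# placeholders; offloaded modules stay un-quantized (fp32) on the CPU.
model = AutoModel.from_pretrained(
    'Salesforce/SFR-Embedding-Mistral',
    trust_remote_code=True,
    device_map='auto',
    max_memory={0: '10GiB', 'cpu': '48GiB'},  # placeholder limits
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
        llm_int8_enable_fp32_cpu_offload=True,  # needed when some modules land on the CPU
    ),
)

Even if this loads, I expect the CPU-offloaded layers to make inference very slow, so I am not sure it is a real option.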
Next, I tried loading the model onto the CPU first, quantizing it there, and then moving the quantized model to the GPU:
model.to('cpu')
if torch.cuda.is_available():
    model.to('cuda')  # this is the call that fails
However, bitsandbytes does not support changing devices for quantized models:
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and cast to the correct `dtype`.
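If I read the error correctly, the device has to be chosen at load time via device_map rather than afterwards with .to. A minimal sketch of what I mean is below (it places the whole quantized model on GPU 0, which of course still does not fit on my partition):

import torch
from transformers import AutoModel, BitsAndBytesConfig

# Sketch only: with bitsandbytes, placement is decided at load time through
# device_map, so everything is put on GPU 0 here instead of calling .to() later.
# This still requires the quantized model to fit on that single GPU.
model = AutoModel.from_pretrained(
    'Salesforce/SFR-Embedding-Mistral',
    trust_remote_code=True,
    device_map={'': 0},  # whole model on GPU 0
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)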
The solutions I found, such as this GitHub issue and this blog post, were either unhelpful or outdated.