478 questions
2
votes
0
answers
66
views
Issue Replicating TF-Lite Conv2D Quantized Inference Output
I am trying to reproduce the exact layer-wise output of a quantized EfficientNet model (TFLite model, TensorFlow 2.17) by re-implementing Conv2D, DepthwiseConv2D, FullyConnected, Add, Mul, Sub and ...
0
votes
1
answer
164
views
Why does TFLite INT8 quantization decompose BatchMatMul (from Einsum) into many FullyConnected layers?
I’m debugging a model conversion (using onnx2tf) and a post-training quantization issue involving Einsum, BatchMatMul, and FullyConnected layers across different model formats.
Pipeline:
ONNX → TF ...
0
votes
0
answers
33
views
Error while converting quantized Torch model to ONNX
I’m applying QAT to a YOLOv8n model with the following configuration:
QConfig(
activation=FakeQuantize.with_args(
observer=MovingAverageMinMaxObserver,
quant_min=0,
quant_max=...
1
vote
0
answers
36
views
Quantization In Tensorflow2, Instance error
I am trying to quantize a model in tensorflow using tfmot.
This is a sample model,
inputs = keras.layers.Input(shape=(512, 512, 1))
x = keras.layers.Conv2D(3, kernel_size=1, padding='same')(inputs)
x =...
0
votes
1
answer
240
views
RuntimeError: CUDA error: named symbol not found when using TorchAoConfig with Qwen2.5-VL-7B-Instruct model
I'm trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to how it's mentioned in the documentation here), but I'm getting ...
1
vote
0
answers
112
views
Fine-tuned LLaMA 2–7B with QLoRA, but reloading fails: missing 4bit metadata. Likely saved after LoRA+resize. Need proper 4bit save method
I’ve been working on fine-tuning LLaMA 2–7B using QLoRA with bitsandbytes 4-bit quantization and ran into a weird issue. I did adaptive pretraining on Arabic data with a custom tokenizer (vocab size ~...
0
votes
2
answers
59
views
Straight-Through estimation for vector quantization inside a recurrent neural network
In my model, I use vector quantization (VQ) inside a recurrent neural network. The VQ is trained using straight-through estimation with that particular code being identical to [1]:
...
0
votes
0
answers
179
views
Cannot use bitsandbytes for quantization of LLM
I am using an LLM, and I want to use quantization to speed up inference. I am using the Nvidia Jetson AGX Orin GPU, which is an ARM-based architecture. I use this code
model_name = "tiiuae/...
0
votes
0
answers
37
views
Mismatch between PyTorch inference and manual implementation
I’m trying to manually reproduce the inference forward-pass to understand exactly how quantized inference works. To do so, I trained and quantized a model in PyTorch using QAT, manually simulate the ...
1
vote
0
answers
92
views
How to convert a QAT (quantization-aware trained) TensorFlow graph into a TFLite model?
I am quantizing a neural network using QAT and I want to convert it into TFLite.
Quantization nodes get added to the skeleton graph and we get a new graph.
I am able to load the trained QAT ...
0
votes
0
answers
37
views
Stable Diffusion v1.4 PTQ on both weight and activation
I'm currently working on quantizing the Stable Diffusion v1.4 checkpoint without relying on external libraries such as torch.quantization or other quantization toolkits. I’m exploring two scenarios:
...
0
votes
0
answers
138
views
Error about bitsandbytes from Huggingface
For the code below, which is a standard snippet from the Hugging Face website, I'm getting the error:
ImportError: Using `bitsandbytes` 4-bit quantization requires the latest version
of bitsandbytes: `pip ...
0
votes
0
answers
73
views
sub-4 bit quantized model on nvidia gpu
I was trying to run deepseek-r1-distill-llama70b-bf16.gguf (131gb on disk) on two A6000 gpus (48gb vram each) with llama.cpp. It runs with partial gpu offload but the gpu utilization is at 9-10% and ...
0
votes
0
answers
167
views
How do I resolve ImportError Using bitsandbytes 4bit quantization requires the latest version of bitsandbytes despite having version 0.45.3 installed?
I am trying to use the bitsandbytes library for 4-bit quantization in my model loading function, but I keep encountering an ImportError. The error message says, "Using bitsandbytes 4-bit ...
0
votes
0
answers
89
views
Onnxruntime quantization script for MatMulNbits, what is the type after conversion?
In the onnxruntime documentation, for quantization here:
https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#quantize-to-int4uint4
It sets accuracy_level=4 which means it's a ...
1
vote
0
answers
48
views
Issues with MP3-like Compression: Quantization and File Size
I’m trying to implement an MP3-like compression algorithm for audio and have followed the general steps, but I’m encountering a few issues with the quantization step. Here's the overall process I'm ...
0
votes
0
answers
26
views
Kernel Dies When Testing a Quantized ResNet101 Model in PyTorch
Issue: I am encountering a "kernel dies" problem specifically during inference when using a quantized ResNet101 model in PyTorch. The model trains and is quantized successfully, but the kernel dies when ...
0
votes
1
answer
263
views
Trying to quantize YOLOv11 in tensorflow, is this topology normal?
I'm trying to quantize the YOLO v11 model in tensorflow and get this as a result:
The target should be int8. Is this normal behaviour? When running it with tflite micro on an esp32 I quickly run out of ...
0
votes
0
answers
25
views
Unit testing PNG quantization by Sharp in Jest
I am fetching PNG files from a 3rd party API endpoint and quantizing them using Sharp before sending the response to the client. How can I unit test the quantization process?
My intention was to have ...
1
vote
0
answers
95
views
Transforming a picture into a posterized image with matching grid overlay and symbols
First of all, I want to help my mom with her embroidery projects and secondly, I want to get better at Python. So I don't need an exact solution. But it would be great to be pointed in the right ...
3
votes
1
answer
736
views
RuntimeError: "Unused kwargs" and "frozenset object has no attribute discard" with BitsAndBytes bf16 Quantized Model in Hugging Face Gradio App
I'm encountering a RuntimeError while running a BitsAndBytes bf16 quantized Gemma-2-2b model on Hugging Face Spaces with a Gradio UI. The error specifically mentions unused kwargs and an ...
0
votes
1
answer
556
views
Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?
I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:
File Name | Size
model.onnx | 654 MB
model_fp16.onnx | 327 MB
model_q4.onnx | 200 MB
model_q4f16.onnx | 134 MB
I understand ...
1
vote
0
answers
34
views
pytorch quantized linear function gives shape invalid error
I am trying to write a simple quantized tensor linear multiplication. Assuming the weight matrix w3 of shape (14336, 4096) and the input tensor x of shape (2, 512, 4096) where first dim is ...
3
votes
2
answers
3k
views
How to Load a 4-bit Quantized VLM Model from Hugging Face with Transformers?
I’m new to quantization and working with visual language models (VLMs). I’m trying to load a 4-bit quantized version of the Ovis1.6-Gemma model from Hugging Face using the transformers library. I ...
1
vote
0
answers
98
views
Inference speed for tflite fp16 converted model is slow on Intel Core i5 CPU
I converted an existing TensorFlow EfficientNet model built on TensorFlow version 2.3.1 to a tflite fp16 version to reduce its size. I want to run it on CPU and use it in my API. But while testing I ...
1
vote
1
answer
1k
views
ValueError: Supplied state dict for layers does not contain `bitsandbytes__*` and possibly other `quantized_stats` (when loading a saved quantized model)
We are trying to deploy a quantized Llama 3.1 70B model (from Hugging Face, using bitsandbytes). The quantizing part works fine: we checked the model memory, which is correct, and also tested getting ...
1
vote
0
answers
2k
views
Quantize and fine-tune Llama 3.1 8B for Ollama
I want to fine-tune Meta's Llama 3.1 8B Instruct model locally with custom data and then save it in a format compatible with Ollama for further inference. As I do everything locally and don't have ...
1
vote
0
answers
131
views
Openvino set_input_tensor() must be called on a function with exactly one parameter
RuntimeError: 'inputs.size() == 1' when setting input tensor for OpenVINO model with multiple inputs
I'm trying to use an OpenVINO model that was originally designed for PyTorch, and I'm running into ...
2
votes
1
answer
4k
views
How to quantize a HF safetensors model and save it to llama.cpp GGUF format with less than q8_0 quantization?
I'm developing LLM agents using llama.cpp as the inference engine. Sometimes I want to use models in safetensors format and there is a Python script (https://github.com/ggerganov/llama.cpp/blob/master/...
0
votes
1
answer
2k
views
cannot import name 'AliasGenerator' from 'pydantic'
I'm stuck on this issue. Any idea how I can fix it?
I tried installing openbb and upgrading pydantic. However, I am unable to resolve this issue. Please help me with any suggestions. Thank you ...
1
vote
0
answers
64
views
ValueError: ('Expected `model` argument to be a `Model` instance, got ', <keras.engine.sequential.Sequential object at 0x7f234263dfd0>)
I want to do Quantization Aware Training,
Here's my model architecture.
Model: "sequential_4"
_________________________________________________________________
Layer (type) ...
0
votes
1
answer
120
views
Unable to build interpreter for TFLITE ViT-based image classifiers on Dart / Flutter: Didn't find op for builtin opcode 'CONV_2D' version '6'
We are trying to deploy vision transformer models (EfficientViT_B0, MobileViT_V2_175, and RepViT_M11) in our Flutter application using the tflite_flutter_plus and tflite_flutter_plus_helper ...
1
vote
0
answers
127
views
Convert Quantization to Onnx
I am new and want to try converting models to Onnx format and I have the following issue. I have a model that has been quantized to 4-bit, and then I converted this model to Onnx. My quantized model ...
0
votes
1
answer
129
views
What is the difference, if any, between model.half() and model.to(dtype=torch.float16) in huggingface-transformers?
Example:
# pip install transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer
# Load model
model_path = 'huawei-noah/TinyBERT_General_4L_312D'
model = ...
5
votes
1
answer
1k
views
ONNX-Python: Can someone explain the Calibration_Data_Reader requested by the static_quantization-function?
I am using the ONNX Python library. I am trying to quantize AI models statically using the quantize_static() function imported from onnxruntime.quantization.
This function takes a ...
2
votes
0
answers
1k
views
Cannot Export HuggingFace Model to ONNX with Optimum-CLI
Summary
I am trying to export the CIDAS/clipseg-rd16 model to ONNX using optimum-cli as given in the HuggingFace documentation. However, I get an error saying
ValueError: Unrecognized configuration ...
3
votes
2
answers
1k
views
Speeding up load time of LLMs
I am currently only able to play around with a V100 on GCP. I understand that I can load an LLM in 4-bit quantization as shown below. However, (assuming due to the quantization) it is taking up to 10 ...
0
votes
1
answer
1k
views
How to resolve Import Error when using quantization in bitsandbytes
I am trying to make a gradio chatbot in Hugging Face Spaces using Mistral-7B-v0.1 model. As this is a large model, I have to quantize, else the free 50G storage gets full. I am using bitsandbytes to ...
0
votes
0
answers
63
views
Fixed point vs Float point number
I have a project that is basically to analyze the effects of quantization on orientation estimation algorithms. I have sensor data from gyroscope that looks like this when using float datatype:
gx=-0....
0
votes
1
answer
197
views
How to set training=False for keras-model/layer outside of the __call__ method?
I’m using Keras with tensorflow-model-optimization (tfmot) for quantization aware training (QAT). My model is based on a pre-trained backbone from keras.applications. As mentioned in the transfer ...
0
votes
1
answer
564
views
Difference between GGUF and LoRA
Does the GGUF format perform model quantization even though it's already quantized with LoRA?
Hello! I'm new to LLMs, and I've fine-tuned the CodeLlama model on Kaggle using LoRA. I've merged and ...
1
vote
0
answers
252
views
error: 'tf.TensorListSetItem' op is neither a custom op nor a flex op while trying to quantize a model
I am trying to learn about quantization, so I was playing with a GitHub repo, trying to quantize it into int8 format. I have used the following code to quantize the model.
modelClass = DTLN_model()
...
2
votes
0
answers
1k
views
On onnxruntime-gpu, CUDAProvider, Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on perf
I have been facing an issue when trying to run inference using a dynamically quantized yolov8s onnx model on GPU.
I have used yolov8s.pt and exported it to yolov8.onnx using onnx export. Then I ...
3
votes
1
answer
3k
views
Quantization and torch_dtype in huggingface transformer
Not sure if this is the right forum to ask, but:
Assuming I have a GPTQ model that is 4-bit, how does using from_pretrained(torch_dtype=torch.float16) work? In my understanding, 4-bit means changing the ...
5
votes
2
answers
6k
views
ValueError: You can't pass `load_in_4bit`or `load_in_8bit` as a kwarg when passing `quantization_config` argument at the same time
I'm currently fine-tuning the Mistral 7B model and encountered the following error:
ValueError: You cannot simultaneously pass the load_in_4bit or load_in_8bit arguments while also passing the ...
0
votes
1
answer
1k
views
Quantization 4 bit and 8 bit - error in 'quantization_config'
I am using model = 'filipealmeida/Mistral-7B-Instruct-v0.1-sharded' and quantizing it in 4-bit with the following function.
def load_quantized_model(model_name: str):
"""
:param ...
1
vote
2
answers
664
views
How to manually dequantize the output of a layer and requantize it for the next layer in Pytorch?
I am working on a school project that requires me to perform manual quantization of each layer of a model. Specifically, I want to implement manually:
Quantized activation, combined with quantized ...
0
votes
1
answer
156
views
Image quantization with Numpy
I wanted to have a look at the example code for image quantization from here
However, it's rather old, and Python and NumPy have changed since then.
from pylab import imread,imshow,figure,show,subplot
...
0
votes
0
answers
98
views
Is there a way to make the tflite converter cut the tails of the distributions when using the representative dataset?
I am in the process of quantizing a model to int8 in order to make it run on the Coral Edge TPU. In order to do that I am using the tflite converter. My code looks like this:
class ...
2
votes
1
answer
3k
views
How to quantize sentence-transformer model on CPU to use it on GPU?
I wanted to use the 'Salesforce/SFR-Embedding-Mistral' embedding model, but it is too large for the GPU partition I have access to. Therefore, I considered quantizing the model, but I couldn't find a ...