Newest 'quantization' Questions

2 votes

0 answers

66 views

Issue Replicating TF-Lite Conv2D Quantized Inference Output

I am trying to reproduce the exact layer-wise output of a quantized EfficientNet model (TFLite model, TensorFlow 2.17) by re-implementing Conv2D, DepthwiseConv2D, FullyConnected, Add, Mul, Sub and ...

Jolverine

1

asked Nov 17 at 8:58

0 votes

1 answer

164 views

Why does TFLite INT8 quantization decompose BatchMatMul (from Einsum) into many FullyConnected layers?

I’m debugging a model conversion (using onnx2tf) and a post-training quantization issue involving Einsum, BatchMatMul, and FullyConnected layers across different model formats. Pipeline: ONNX → TF ...

Saurav Rai

2,197

asked Nov 13 at 11:26

0 votes

0 answers

33 views

Error while converting quantized Torch model to ONNX

I’m applying QAT to a YOLOv8n model with the following configuration: QConfig( activation=FakeQuantize.with_args( observer=MovingAverageMinMaxObserver, quant_min=0, quant_max=...

Matteo

111

asked Sep 5 at 14:39

1 vote

0 answers

36 views

Quantization In Tensorflow2, Instance error

I am trying to quantize a model in tensorflow using tfmot. This is a sample model, inputs = keras.layers.Input(shape=(512, 512, 1)) x = keras.layers.Conv2D(3, kernel_size=1, padding='same')(inputs) x =...

Sai

11

asked Aug 29 at 17:03

0 votes

1 answer

240 views

RuntimeError: CUDA error: named symbol not found when using TorchAoConfig with Qwen2.5-VL-7B-Instruct model

I'm trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to how it's mentioned in the documentation here), but I'm getting ...

Sankalp Dhupar

73

asked Jul 21 at 23:41

1 vote

0 answers

112 views

Fine-tuned LLaMA 2–7B with QLoRA, but reloading fails: missing 4bit metadata. Likely saved after LoRA+resize. Need proper 4bit save method

I’ve been working on fine-tuning LLaMA 2–7B using QLoRA with bitsandbytes 4-bit quantization and ran into a weird issue. I did adaptive pretraining on Arabic data with a custom tokenizer (vocab size ~...

orchid Ali

11

asked Jun 26 at 17:50

0 votes

2 answers

59 views

Straight-Through estimation for vector quantization inside a recurrent neural network

In my model, I use vector quantization (VQ) inside a recurrent neural network. The VQ is trained using straight-through estimation with that particular code being identical to [1]: ...

Cola Lightyear

23

asked Jun 11 at 11:46

0 votes

0 answers

179 views

Cannot use bitsandbytes for quantization of LLM

I am using an LLM, and I want to use quantization to speed up inference. I am using the Nvidia Jetson AGX Orin GPU, which is an ARM-based architecture. I use this code model_name = "tiiuae/...

Chawki-Hjaiji

23

asked May 14 at 13:03

0 votes

0 answers

37 views

Mismatch between PyTorch inference and manual implementation

I’m trying to manually reproduce the inference forward-pass to understand exactly how quantized inference works. To do so, I trained and quantized a model in PyTorch using QAT, manually simulate the ...

greifswald

1

asked Apr 28 at 19:06

1 vote

0 answers

92 views

How to convert a QAT (quantization-aware trained) TensorFlow graph into a TFLite model?

I am quantizing a neural network using QAT and I want to convert it into tflite. Quantization nodes get added to the skeleton graph and we get a new graph. I am able to load the trained QAT ...

Prateek Sharma

11

asked Apr 8 at 9:08

0 votes

0 answers

37 views

Stable Diffusion v1.4 PTQ on both weight and activation

I'm currently working on quantizing the Stable Diffusion v1.4 checkpoint without relying on external libraries such as torch.quantization or other quantization toolkits. I’m exploring two scenarios: ...

DOGLOPER

1

asked Apr 4 at 10:06

0 votes

0 answers

138 views

Error about bitsandbytes from Huggingface

For the below code, which is a standard snippet from Huggingface website, I'm getting the error: ImportError: Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip ...

Aryan Bhusari

1

asked Mar 31 at 22:33

0 votes

0 answers

73 views

sub-4 bit quantized model on nvidia gpu

I was trying to run deepseek-r1-distill-llama70b-bf16.gguf (131gb on disk) on two A6000 gpus (48gb vram each) with llama.cpp. It runs with partial gpu offload but the gpu utilization is at 9-10% and ...

afsara_ben

691

asked Mar 28 at 4:40

0 votes

0 answers

167 views

How do I resolve ImportError Using bitsandbytes 4bit quantization requires the latest version of bitsandbytes despite having version 0.45.3 installed?

I am trying to use the bitsandbytes library for 4-bit quantization in my model loading function, but I keep encountering an ImportError. The error message says, "Using bitsandbytes 4-bit ...

from

1

asked Mar 11 at 10:54

0 votes

0 answers

89 views

Onnxruntime quantization script for MatMulNbits, what is the type after conversion?

In the onnxruntime documentation, for quantization here: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#quantize-to-int4uint4 It sets accuracy_level=4 which means it's a ...

Owen Zhang

23

asked Feb 17 at 9:55

1 vote

0 answers

48 views

Issues with MP3-like Compression: Quantization and File Size

I’m trying to implement an MP3-like compression algorithm for audio and have followed the general steps, but I’m encountering a few issues with the quantization step. Here's the overall process I'm ...

Muchacho

17

asked Jan 6 at 13:18

0 votes

0 answers

26 views

Kernel Dies When Testing a Quantized ResNet101 Model in PyTorch

Issue: I am encountering a "kernel dies" problem specifically during inference when using a quantized ResNet101 model in PyTorch. The model trains and quantizes successfully, but the kernel dies when ...

Pavan Pandya

1

asked Dec 13, 2024 at 5:28

0 votes

1 answer

263 views

Trying to quantize YOLOv11 in tensorflow, is this topology normal?

I'm trying to quantize the YOLO v11 model in tensorflow and get this as a result: The target should be int8. Is this normal behaviour? When running it with tflite micro on an esp32 I quickly run out of ...

gillo04

148

asked Dec 11, 2024 at 7:00

0 votes

0 answers

25 views

Unit testing PNG quantization by Sharp in Jest

I am fetching PNG files from a 3rd party API endpoint and quantizing them using Sharp before sending the response to the client. How can I unit test the quantization process? My intention was to have ...

Ben Sullivan

1

asked Dec 2, 2024 at 16:19

1 vote

0 answers

95 views

Transforming a picture into a posterized image with matching grid overlay and symbols

First of all, I want to help my mom with her embroidery projects and secondly, I want to get better at Python. So I don't need an exact solution. But it would be great to be pointed in the right ...

Ricked

11

asked Nov 12, 2024 at 16:33

3 votes

1 answer

736 views

RuntimeError: "Unused kwargs" and "frozenset object has no attribute discard" with BitsAndBytes bf16 Quantized Model in Hugging Face Gradio App

I'm encountering a RuntimeError while running a BitsAndBytes bf16 quantized Gemma-2-2b model on Hugging Face Spaces with a Gradio UI. The error specifically mentions unused kwargs and an ...

doniker99

56

asked Nov 10, 2024 at 16:07

0 votes

1 answer

556 views

Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?

I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx the following file names and sizes: model.onnx (654 MB), model_fp16.onnx (327 MB), model_q4.onnx (200 MB), model_q4f16.onnx (134 MB). I understand ...

Franck Dernoncourt

84.7k

asked Nov 7, 2024 at 17:52

1 vote

0 answers

34 views

pytorch quantized linear function gives shape invalid error

I am trying to write a simple quantized tensor linear multiplication. Assuming the weight matrix w3 of shape (14336, 4096) and the input tensor x of shape (2, 512, 4096) where first dim is ...

hafezmg48

99

asked Oct 30, 2024 at 20:32

3 votes

2 answers

3k views

How to Load a 4-bit Quantized VLM Model from Hugging Face with Transformers?

I’m new to quantization and working with visual language models (VLM). I’m trying to load a 4-bit quantized version of the Ovis1.6-Gemma model from Hugging Face using the transformers library. I ...

meysam

194

asked Oct 27, 2024 at 9:31

1 vote

0 answers

98 views

Inference speed for tflite fp16 converted model is slow on intel core i5 cpu

I converted an existing TensorFlow EfficientNet model built on TensorFlow version 2.3.1 to a tflite fp16 version to reduce its size. I want to run it on CPU and use it in my API. But while testing I ...

Harry Ali

11

asked Oct 13, 2024 at 17:13

1 vote

1 answer

1k views

ValueError: Supplied state dict for layers does not contain `bitsandbytes__*` and possibly other `quantized_stats` (when loading a saved quantized model)

We are trying to deploy a quantized Llama 3.1 70B model (from Hugging Face, using bitsandbytes). The quantizing part works fine: we checked the model memory, which is correct, and also tested getting ...

Luis Leal

3,554

asked Oct 9, 2024 at 2:25

1 vote

0 answers

2k views

Quantize and fine-tune Llama 3.1 8B for Ollama

I want to fine-tune Meta's Llama 3.1 8B Instruct model locally with custom data and then save it in a format compatible with Ollama for further inference. As I do everything locally and don't have ...

Adrien

13

asked Aug 28, 2024 at 8:55

1 vote

0 answers

131 views

Openvino set_input_tensor() must be called on a function with exactly one parameter

RuntimeError: 'inputs.size() == 1' when setting input tensor for OpenVINO model with multiple inputs. I'm trying to use an OpenVINO model that was originally designed for PyTorch, and I'm running into ...

Framefact

11

asked Aug 20, 2024 at 11:08

2 votes

1 answer

4k views

How to quantize a HF safetensors model and save it to llama.cpp GGUF format with less than q8_0 quantization?

I'm developing LLM agents using llama.cpp as inference engine. Sometimes I want to use models in safetensors format and there is a python script (https://github.com/ggerganov/llama.cpp/blob/master/...

arkuzo

41

asked Aug 7, 2024 at 6:10

0 votes

1 answer

2k views

cannot import name 'AliasGenerator' from 'pydantic'

I'm stuck on this issue. Any idea how I can fix it? I tried installing openbb and upgrading pydantic, but I am still unable to resolve it. Any suggestions would be appreciated. Thank you ...

milner pch

11

asked Aug 4, 2024 at 9:26

1 vote

0 answers

64 views

ValueError: ('Expected `model` argument to be a `Model` instance, got ', <keras.engine.sequential.Sequential object at 0x7f234263dfd0>)

I want to do quantization-aware training. Here's my model architecture. Model: "sequential_4" _________________________________________________________________ Layer (type) ...

Vina

27

asked Jul 22, 2024 at 6:14

0 votes

1 answer

120 views

Unable to build interpreter for TFLITE ViT-based image classifiers on Dart / Flutter: Didn't find op for builtin opcode 'CONV_2D' version '6'

We are trying to deploy vision transformer models (EfficientViT_B0, MobileViT_V2_175, and RepViT_M11) on our Flutter application using the tflite_flutter_plus and tflite_flutter_plus_helper ...

D.Varam

1

asked Jul 17, 2024 at 13:08

1 vote

0 answers

127 views

Convert Quantization to Onnx

I am new to this and want to try converting models to ONNX format, and I have the following issue. I have a model that has been quantized to 4-bit, and then I converted this model to ONNX. My quantized model ...

Toàn Nguyễn Phúc

11

asked Jul 11, 2024 at 3:03

0 votes

1 answer

129 views

What is the difference, if any, between model.half() and model.to(dtype=torch.float16) in huggingface-transformers?

Example: # pip install transformers from transformers import AutoModelForTokenClassification, AutoTokenizer # Load model model_path = 'huawei-noah/TinyBERT_General_4L_312D' model = ...

Franck Dernoncourt

84.7k

asked Jul 7, 2024 at 23:33

5 votes

1 answer

1k views

ONNX-Python: Can someone explain the Calibration_Data_Reader requested by the static_quantization-function?

I am using the ONNX Python library. I am trying to quantize AI models statically using the quantize_static() function imported from onnxruntime.quantization. This function takes a ...

Zylon

51

asked Jun 18, 2024 at 12:13

2 votes

0 answers

1k views

Cannot Export HuggingFace Model to ONNX with Optimum-CLI

Summary: I am trying to export the CIDAS/clipseg-rd16 model to ONNX using optimum-cli as given in the HuggingFace documentation. However, I get an error saying ValueError: Unrecognized configuration ...

Sattwik Kumar Sahu

21

asked Jun 18, 2024 at 6:10

3 votes

2 answers

1k views

Speeding up load time of LLMs

I am currently only able to play around with a V100 on GCP. I understand that I can load an LLM in 4bit quantization as shown below. However, (assuming due to the quantization) it is taking up to 10 ...

sachinruk

10k

asked Jun 3, 2024 at 12:30

0 votes

1 answer

1k views

How to resolve Import Error when using quantization in bitsandbytes

I am trying to make a Gradio chatbot in Hugging Face Spaces using the Mistral-7B-v0.1 model. As this is a large model, I have to quantize it; otherwise the free 50G storage gets full. I am using bitsandbytes to ...

Anish

13

asked May 23, 2024 at 5:02

0 votes

0 answers

63 views

Fixed-point vs floating-point numbers

I have a project that is basically to analyze the effects of quantization on orientation estimation algorithms. I have sensor data from a gyroscope that looks like this when using float datatype: gx=-0....

user3662181

11

asked May 3, 2024 at 17:18

0 votes

1 answer

197 views

How to set training=False for keras-model/layer outside of the call method?

I’m using Keras with tensorflow-model-optimization (tf_mot) for quantization aware training (QAT). My model is based on a pre-trained backbone from keras.application. As mentioned in the transfer ...

Никита Шубин

28

asked Apr 22, 2024 at 20:29

0 votes

1 answer

564 views

Difference between GGUF and LoRA

Does the GGUF format perform model quantization even though it's already quantized with LoRA? Hello! I'm new to LLMs, and I've fine-tuned the CodeLlama model on Kaggle using LoRA. I've merged and ...

Samar

3

asked Apr 17, 2024 at 10:30

1 vote

0 answers

252 views

error: 'tf.TensorListSetItem' op is neither a custom op nor a flex op while trying to quantize a model

I am trying to learn about quantization, so I was playing with a GitHub repo, trying to quantize it into int8 format. I have used the following code to quantize the model. modelClass = DTLN_model() ...

Niaz Palak

327

asked Apr 13, 2024 at 7:02

2 votes

0 answers

1k views

On onnxruntime-gpu, CUDAProvider, Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on perf

I have been facing an issue when trying to run inference using a dynamically quantized yolov8s onnx model on GPU. I have used yolov8s.pt and exported it to yolov8.onnx using onnx export. Then I ...

Suraj Rao

21

asked Apr 4, 2024 at 6:40

3 votes

1 answer

3k views

Quantization and torch_dtype in huggingface transformer

Not sure if it's the right forum to ask, but assuming I have a GPTQ model that is 4-bit, how does using from_pretrained(torch_dtype=torch.float16) work? In my understanding, 4-bit means changing the ...

aceminer

4,365

asked Apr 3, 2024 at 12:48

5 votes

2 answers

6k views

ValueError: You can't pass `load_in_4bit`or `load_in_8bit` as a kwarg when passing `quantization_config` argument at the same time

I'm currently fine-tuning the Mistral 7B model and encountered the following error: ValueError: You cannot simultaneously pass the load_in_4bit or load_in_8bit arguments while also passing the ...

Jyoti yadav

300

asked Apr 1, 2024 at 13:55

0 votes

1 answer

1k views

Quantization 4 bit and 8 bit - error in 'quantization_config'

I am using model = 'filipealmeida/Mistral-7B-Instruct-v0.1-sharded' and quantize it in 4_bit with the following function. def load_quantized_model(model_name: str): """ :param ...

Gabriele Castaldi

3

asked Mar 31, 2024 at 12:54

1 vote

2 answers

664 views

How to manually dequantize the output of a layer and requantize it for the next layer in Pytorch?

I am working on a school project that requires me to perform manual quantization of each layer of a model. Specifically, I want to implement manually: Quantized activation, combined with quantized ...

longbow

11

asked Mar 28, 2024 at 17:17

0 votes

1 answer

156 views

Image quantization with Numpy

I wanted to have a look at the example code for image quantization from here. However, it's rather old, and Python and NumPy have changed since then. from pylab import imread,imshow,figure,show,subplot ...

Ghoul Fool

7,027

asked Mar 26, 2024 at 15:34

0 votes

0 answers

98 views

Is there a way to make the tflite converter cut the tails of the distributions when using the representative dataset?

I am in the process of quantizing a model to int8 in order to make it run on the Coral Edge TPU. In order to do that I am using the tflite converter. My code looks like this: class ...

Kilian Tiziano Le Creurer

1

asked Mar 21, 2024 at 19:23

2 votes

1 answer

3k views

How to quantize sentence-transformer model on CPU to use it on GPU?

I wanted to use the 'Salesforce/SFR-Embedding-Mistral' embedding model, but it is too large for the GPU partition I have access to. Therefore, I considered quantizing the model, but I couldn't find a ...

Firevince

21

asked Mar 7, 2024 at 18:27
