2 votes
0 answers
66 views

I am trying to reproduce the exact layer-wise output of a quantized EfficientNet model (TFLite model, TensorFlow 2.17) by re-implementing Conv2D, DepthwiseConv2D, FullyConnected, Add, Mul, Sub and ...
Jolverine's user avatar
0 votes
1 answer
164 views

I’m debugging a model-conversion (onnx2tf) and post-training quantization issue involving Einsum, BatchMatMul, and FullyConnected layers across different model formats. Pipeline: ONNX → TF ...
Saurav Rai's user avatar
  • 2,197
0 votes
0 answers
33 views

I’m applying QAT to a YOLOv8n model with the following configuration: QConfig( activation=FakeQuantize.with_args( observer=MovingAverageMinMaxObserver, quant_min=0, quant_max=...
Matteo's user avatar
  • 111
1 vote
0 answers
36 views

I am trying to quantize a model in TensorFlow using tfmot. This is a sample model: inputs = keras.layers.Input(shape=(512, 512, 1)) x = keras.layers.Conv2D(3, kernel_size=1, padding='same')(inputs) x =...
Sai's user avatar
  • 11
0 votes
1 answer
240 views

I'm trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to how it's mentioned in the documentation here), but I'm getting ...
Sankalp Dhupar's user avatar
1 vote
0 answers
112 views

I’ve been working on fine-tuning LLaMA 2–7B using QLoRA with bitsandbytes 4-bit quantization and ran into a weird issue. I did adaptive pretraining on Arabic data with a custom tokenizer (vocab size ~...
orchid Ali's user avatar
0 votes
2 answers
59 views

In my model, I use vector quantization (VQ) inside a recurrent neural network. The VQ is trained using straight-through estimation with that particular code being identical to [1]: ...
Cola Lightyear's user avatar
0 votes
0 answers
179 views

I am using an LLM, and I want to use quantization to speed up inference. I am using the Nvidia Jetson AGX Orin GPU, which is an ARM-based architecture. I use this code: model_name = "tiiuae/...
Chawki-Hjaiji's user avatar
0 votes
0 answers
37 views

I’m trying to manually reproduce the inference forward-pass to understand exactly how quantized inference works. To do so, I trained and quantized a model in PyTorch using QAT, manually simulate the ...
greifswald's user avatar
1 vote
0 answers
92 views

I am quantizing a neural network using QAT and I want to convert it to TFLite. Quantization nodes get added to the skeleton graph and we get a new graph. I am able to load the trained QAT ...
Prateek Sharma's user avatar
0 votes
0 answers
37 views

I'm currently working on quantizing the Stable Diffusion v1.4 checkpoint without relying on external libraries such as torch.quantization or other quantization toolkits. I’m exploring two scenarios: ...
DOGLOPER's user avatar
0 votes
0 answers
138 views

For the code below, which is a standard snippet from the Hugging Face website, I'm getting the error: ImportError: Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip ...
Aryan Bhusari's user avatar
0 votes
0 answers
73 views

I was trying to run deepseek-r1-distill-llama70b-bf16.gguf (131 GB on disk) on two A6000 GPUs (48 GB VRAM each) with llama.cpp. It runs with partial GPU offload but the GPU utilization is at 9-10% and ...
afsara_ben's user avatar
0 votes
0 answers
167 views

I am trying to use the bitsandbytes library for 4-bit quantization in my model loading function, but I keep encountering an ImportError. The error message says, "Using bitsandbytes 4-bit ...
from's user avatar
  • 1
0 votes
0 answers
89 views

In the onnxruntime documentation, for quantization here: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#quantize-to-int4uint4 It sets accuracy_level=4 which means it's a ...
Owen Zhang's user avatar
1 vote
0 answers
48 views

I’m trying to implement an MP3-like compression algorithm for audio and have followed the general steps, but I’m encountering a few issues with the quantization step. Here's the overall process I'm ...
Muchacho's user avatar
0 votes
0 answers
26 views

Issue: I am encountering a "kernel dies" problem specifically during inference when using a quantized ResNet101 model in PyTorch. The model trains and quantizes successfully, but the kernel dies when ...
Pavan Pandya's user avatar
0 votes
1 answer
263 views

I'm trying to quantize the YOLO v11 model in TensorFlow and get this as a result: The target should be int8. Is this normal behaviour? When running it with TFLite Micro on an ESP32 I quickly run out of ...
gillo04's user avatar
  • 148
0 votes
0 answers
25 views

I am fetching PNG files from a 3rd party API endpoint and quantizing them using Sharp before sending the response to the client. How can I unit test the quantization process? My intention was to have ...
Ben Sullivan's user avatar
1 vote
0 answers
95 views

First of all, I want to help my mom with her embroidery projects and secondly, I want to get better at Python. So I don't need an exact solution. But it would be great to be pointed in the right ...
Ricked's user avatar
  • 11
3 votes
1 answer
736 views

I'm encountering a RuntimeError while running a BitsAndBytes bf16 quantized Gemma-2-2b model on Hugging Face Spaces with a Gradio UI. The error specifically mentions unused kwargs and an ...
doniker99's user avatar
0 votes
1 answer
556 views

I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx: File Name Size model.onnx 654 MB model_fp16.onnx 327 MB model_q4.onnx 200 MB model_q4f16.onnx 134 MB I understand ...
Franck Dernoncourt's user avatar
1 vote
0 answers
34 views

I am trying to write a simple quantized tensor linear multiplication. Assume the weight matrix w3 has shape (14336, 4096) and the input tensor x has shape (2, 512, 4096), where the first dim is ...
hafezmg48's user avatar
3 votes
2 answers
3k views

I’m new to quantization and working with visual language models (VLM). I’m trying to load a 4-bit quantized version of the Ovis1.6-Gemma model from Hugging Face using the transformers library. I ...
meysam's user avatar
  • 194
1 vote
0 answers
98 views

I converted an existing TensorFlow EfficientNet model built on TensorFlow version 2.3.1 to a TFLite fp16 version to reduce its size. I want to run it on CPU and use it in my API. But while testing I ...
Harry Ali's user avatar
1 vote
1 answer
1k views

We are trying to deploy a quantized Llama 3.1 70B model (from Hugging Face, using bitsandbytes). The quantizing part works fine: we check the model memory, which is correct, and also test getting ...
Luis Leal's user avatar
  • 3,554
1 vote
0 answers
2k views

I want to fine-tune Meta's Llama 3.1 8B Instruct model locally with custom data and then save it in a format compatible with Ollama for further inference. As I do everything locally and don't have ...
Adrien's user avatar
  • 13
1 vote
0 answers
131 views

RuntimeError: 'inputs.size() == 1' when setting input tensor for OpenVINO model with multiple inputs I'm trying to use an OpenVINO model that was originally designed for PyTorch, and I'm running into ...
Framefact's user avatar
2 votes
1 answer
4k views

I'm developing LLM agents using llama.cpp as the inference engine. Sometimes I want to use models in safetensors format and there is a Python script (https://github.com/ggerganov/llama.cpp/blob/master/...
arkuzo's user avatar
  • 41
0 votes
1 answer
2k views

I'm stuck on this issue. Any idea how I can fix it? I tried installing openbb and upgrading pydantic, but I am unable to resolve it. Any suggestions would be appreciated. Thank you ...
milner pch's user avatar
1 vote
0 answers
64 views

I want to do quantization-aware training. Here's my model architecture. Model: "sequential_4" _________________________________________________________________ Layer (type) ...
Vina's user avatar
  • 27
0 votes
1 answer
120 views

We are trying to deploy vision transformer models (EfficientViT_B0, MobileViT_V2_175, and RepViT_M11) in our Flutter application using the tflite_flutter_plus and tflite_flutter_plus_helper ...
D.Varam's user avatar
1 vote
0 answers
127 views

I am new to this and want to try converting models to ONNX format, and I have the following issue. I have a model that has been quantized to 4-bit, and then I converted this model to ONNX. My quantized model ...
Toàn Nguyễn Phúc's user avatar
0 votes
1 answer
129 views

Example: # pip install transformers from transformers import AutoModelForTokenClassification, AutoTokenizer # Load model model_path = 'huawei-noah/TinyBERT_General_4L_312D' model = ...
Franck Dernoncourt's user avatar
5 votes
1 answer
1k views

I am using the ONNX Python library. I am trying to quantize AI models statically using the quantize_static() function imported from onnxruntime.quantization. This function takes a ...
Zylon's user avatar
  • 51
2 votes
0 answers
1k views

Summary I am trying to export the CIDAS/clipseg-rd16 model to ONNX using optimum-cli as given in the HuggingFace documentation. However, I get an error saying ValueError: Unrecognized configuration ...
Sattwik Kumar Sahu's user avatar
3 votes
2 answers
1k views

I am currently only able to play around with a V100 on GCP. I understand that I can load an LLM in 4-bit quantization as shown below. However, (I assume due to the quantization) it is taking up to 10 ...
sachinruk's user avatar
  • 10k
0 votes
1 answer
1k views

I am trying to make a Gradio chatbot in Hugging Face Spaces using the Mistral-7B-v0.1 model. As this is a large model, I have to quantize it, else the free 50G storage gets full. I am using bitsandbytes to ...
Anish's user avatar
  • 13
0 votes
0 answers
63 views

I have a project that analyzes the effects of quantization on orientation estimation algorithms. I have sensor data from a gyroscope that looks like this when using the float datatype: gx=-0....
user3662181's user avatar
0 votes
1 answer
197 views

I’m using Keras with tensorflow-model-optimization (tf_mot) for quantization-aware training (QAT). My model is based on a pre-trained backbone from keras.application. As mentioned in the transfer ...
Никита Шубин's user avatar
0 votes
1 answer
564 views

Does the GGUF format perform model quantization even though it's already quantized with LoRA? Hello! I'm new to LLMs, and I've fine-tuned the CODELLAMA model on Kaggle using LoRA. I've merged and ...
Samar's user avatar
  • 3
1 vote
0 answers
252 views

I am trying to learn about quantization, so I was playing with a GitHub repo, trying to quantize it into int8 format. I have used the following code to quantize the model. modelClass = DTLN_model() ...
Niaz Palak's user avatar
2 votes
0 answers
1k views

I have been facing an issue when trying to run inference with a dynamically quantized yolov8s ONNX model on GPU. I have used yolov8s.pt and exported it to yolov8.onnx using onnx export. Then I ...
Suraj Rao's user avatar
3 votes
1 answer
3k views

Not sure if this is the right forum to ask, but assuming I have a GPTQ model that is 4-bit, how does using from_pretrained(torch_dtype=torch.float16) work? In my understanding, 4-bit means changing the ...
aceminer's user avatar
  • 4,365
5 votes
2 answers
6k views

I'm currently fine-tuning the Mistral 7B model and encountered the following error: ValueError: You cannot simultaneously pass the load_in_4bit or load_in_8bit arguments while also passing the ...
Jyoti yadav's user avatar
0 votes
1 answer
1k views

I am using model = 'filipealmeida/Mistral-7B-Instruct-v0.1-sharded' and quantizing it in 4-bit with the following function. def load_quantized_model(model_name: str): """ :param ...
Gabriele Castaldi's user avatar
1 vote
2 answers
664 views

I am working on a school project that requires me to perform manual quantization of each layer of a model. Specifically, I want to implement manually: Quantized activation, combined with quantized ...
longbow's user avatar
  • 11
0 votes
1 answer
156 views

I wanted to have a look at the example code for image quantization from here. However, it's rather old and Python and NP have changed since then. from pylab import imread,imshow,figure,show,subplot ...
Ghoul Fool's user avatar
  • 7,027
0 votes
0 answers
98 views

I am in the process of quantizing a model to int8 in order to make it run on the Coral Edge TPU. To do that, I am using the TFLite converter. My code looks like this: class ...
Kilian Tiziano Le Creurer's user avatar
2 votes
1 answer
3k views

I wanted to use the 'Salesforce/SFR-Embedding-Mistral' embedding model, but it is too large for the GPU partition I have access to. Therefore, I considered quantizing the model, but I couldn't find a ...
Firevince's user avatar
