9,136 questions
1
vote
0
answers
34
views
Microsoft.ML C#: GPU not found in K8s/Docker container
I have created a .NET app that uses Microsoft.ML.OnnxRuntime.Gpu for inference. Now I'm trying to integrate it with Azure Kubernetes.
We have made the setup with Tesla T4 GPU and we confirmed it's ...
-3
votes
0
answers
64
views
Intel ARC GPU hangs when performing an untyped surface read [closed]
I am currently writing a driver for the Intel ARC GPU series (specifically I use the A750 for testing purposes) for my own operating system.
I am already able to execute compute kernels that use ...
-2
votes
0
answers
66
views
Slow collision detection in Python [closed]
I am making a particle simulator in Python, and noticed that my collision detection hurts performance the most. I am not even sure if this is a thing, but is it possible to tell the GPU to do ...
0
votes
0
answers
46
views
Taking advantage of memory contiguousness in HLSL
This is a bit of a slog so bear with me.
I'm currently writing a 3D S(moothed) P(article) H(ydrodynamics) simulation in Unity with a parallel HLSL backend. It's a Lagrangian method of fluid simulation,...
Tooling
0
votes
0
replies
24
views
ComfyUI + Flux 1 dev + limited RAM + same workflow: With 2 GPUs?
I am running Flux 1 dev text to image model through ComfyUI in Kaggle. Everything works but I noticed that Kaggle offers a second GPU inside the notebook. If I try to run two instances of the ComfyUI ...
-4
votes
0
answers
31
views
Tensorflow GPU use in python 3.13 [duplicate]
I'm trying to use TensorFlow with GPU on my Windows device; I have a Python 3.13 venv. Do newer versions of TensorFlow support GPU acceleration on Windows? I've read that it stopped in TensorFlow version 2....
Tooling
0
votes
0
replies
85
views
Which LLMs can I run locally on RTX 1080 8GB with 48GB RAM?
I'm exploring options for running large language models locally on my workstation and would appreciate guidance on suitable models given my hardware constraints.
Hardware specifications:
CPU: Intel ...
1
vote
1
answer
72
views
Is passing ray resources as options when calling the function equivalent to setting them in the function's decorator?
Is
@ray.remote
def run_experiment(...):
    (...)
if __name__ == '__main__':
    ray.init()
    exp_config = sys.argv[1]
    params_tuples, num_cpus, num_gpus = load_exp_config(exp_config)
    ray.get(...
0
votes
0
answers
52
views
Problems with fencing sporadic command buffer submission in Vulkan
I am invoking a compute shader, writing to it, then reading it to then write to disk.
According to renderdoc the image is properly generated. Additionally, when compiled in debug mode I get the right ...
2
votes
0
answers
115
views
Implementing Arbitrary Precision Arithmetic in CubeCL for Infinite Zoom Fractals
Context
I'm implementing a Julia set fractal renderer using CubeCL (a Rust GPU compute framework). I want to achieve "infinite zoom" similar to deep Mandelbrot zoom videos, which requires ...
3
votes
0
answers
112
views
How does one log the operations done on a GPU during the execution of Python code?
I have encountered a particular problem while executing a function from the transformers library of huggingface on an Intel GPU wheel of torch. Since I am doing something I normally shouldn't be ...
1
vote
0
answers
65
views
How to force allocated D3D12 resource to reside in VRAM and not be demoted to shared RAM?
For testing purposes I need a tool that will occupy some amount of VRAM, leaving a reduced available VRAM to the rest of the applications. I implemented a version that somewhat works using D3D12 API, ...
0
votes
0
answers
64
views
Utilizing GPU with RNN models which take their output as input [torch]
I have a machine-translation model. In this model, I calculate a vector for a given sentence and I take this vector, aggregate with each generated output of RNN and put it into RNN again for ...
0
votes
0
answers
90
views
i686 compiler with GNU and SDL3 failing to claim window
I am using an i686 system, with the compiler mingw G++. I can run code that creates a GPU device and attaches it to a window fine on that machine. However, when I attempt to run it on my i686 Windows ...
0
votes
1
answer
99
views
CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop
I am trying to implement the producer-consumer problem between GPU and CPU, which is required for another project. The GPU requests some data from the CPU via unified memory. The CPU copies that data to a specific location in global ...
3
votes
1
answer
429
views
GPU supported Jax Installation [closed]
I am quite new to JAX. I am trying to make use of it to do some optimization work. I have tried using a CPU-only version of JAX, and it has worked well. As expected, the speed is not impressive, so ...
-3
votes
1
answer
73
views
How to set up WebGPU work groups with fully independent tasks?
I am new to using WebGPU. I'm using the rust wgpu crate with compute shaders to run a cryptographic task at high speed.
My shader is quite simple: Take a common input state, append a unique per-thread ...
1
vote
0
answers
370
views
Encountering an AttributeError: module 'torch' has no attribute 'xpu'
I'm encountering an error
AttributeError: module 'torch' has no attribute 'xpu'
when running the diffusers library in a Google Colab environment with a CUDA GPU. I'm trying to use DiffusionPipeline....
1
vote
1
answer
107
views
Is CPU multithreading affected by divergence?
Building on this question here
The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...
0
votes
1
answer
97
views
How can I empty the GPU memory when I get an error like OOM?
Sometimes when I get an OOM error, the parameters of the LLM have already been loaded onto the GPU, and the memory is not cleared automatically.
So I tried this:
torch.cuda.empty_cache()
but it didn't work. So every time I must ...
0
votes
1
answer
141
views
How to correctly monitor a program’s GPU memory bandwidth utilization and SM utilization? (DCGM DRAM_ACTIVE vs in-program bandwidth differs a lot)
I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing.
My approach so far:
Compute the theoretical ...
2
votes
1
answer
125
views
ILGPU kernel silently not compiling
I am trying to debug a kernel written for ILGPU which does not compile.
My application has 2 big kernels.
The first (that loads and does the right thing):
/// <summary>
/// Unified GPU kernel ...
0
votes
0
answers
53
views
cuda commands from python in opencv fail
I've compiled opencv with cuda support and am trying to use some cuda functions - to that end I'm trying the following test:
import cv2
if cv2.cuda.getCudaEnabledDeviceCount() > 0:
    print("...
3
votes
1
answer
106
views
JavaFX app freezes or flickers after Intel Iris Xe driver update [closed]
I have a JavaFX desktop application that started having rendering issues after updating the Intel Iris Xe graphics driver.
On Java 11 + JavaFX (Zulu distribution):
openjdk version "11.0.25" ...
-6
votes
1
answer
104
views
Host to GPU in terms of endianness discrepancy [closed]
I have been told that when it comes to GPU APIs like Vulkan and DirectX and the host is for example little-endian and the GPU is big-endian that you can read for example a 32-bit integer and the ...
1
vote
0
answers
94
views
Using different sources depending on architecture
Why?
The reason why I need to do this is, I am using rustgpu to compile shader crates with their own dependencies. Many of these dependencies need to compile on both the CPU and GPU, this means huge ...
2
votes
1
answer
69
views
How to correctly pass float4 vector to kernel using PyCUDA?
I am trying to pass a float4 as argument to my cuda kernel (by value) using PyCUDA’s make_float4(). But there seems to be some misalignment when the data is transferred to the kernel. If I read the ...
5
votes
1
answer
165
views
Is it expected that vmapping over different input sizes for the same function impacts the accuracy of the result?
I was surprised to see that depending on the size of an input matrix, which is vmapped over inside of a function, the output of the function changes slightly. That is, not only does the size of the ...
1
vote
1
answer
129
views
Fortran OpenMP offloading painfully slow on NVIDIA architectures
I am currently trying to port a big portion of a Fortran code to GPU devices with OpenMP. I have a working version for AMD, specifically for the MI300A which features unified shared memory. I ...
0
votes
0
answers
52
views
HOOMD-blue GPU error: "invalid device ordinal" in WSL2 despite CUDA and GPU detection
I am trying to run HOOMD-blue 5.2.0 with GPU support inside WSL2 (Windows Subsystem for Linux), but I keep getting the following error:
RuntimeError: CUDA Error: invalid device ordinal before /hoomd/...
0
votes
1
answer
336
views
Distinction between CuTe and NVIDIA Cutlass
I'm confused about what exactly is handled by CuTe and by Cutlass.
From my understanding Cutlass handles the following:
Gemm computation of CuTe Tensors
Communication between CPU and GPU
Abstract memory ...
0
votes
0
answers
60
views
How to enable Metal for tensorflow.js with Node/Bun
Please comment on how to enable Metal with tfjs-node on macOS
+Metal isn't ready with tensorflow (c++) on the server side.
bun ./verify-backend.js
const tf = require('@tensorflow/tfjs-node');
async ...
0
votes
1
answer
94
views
GPU out of memory issue when training in PyTorch [closed]
I am trying to train my AI/ML model and got a CUDA out of memory error:
Any solution would be a great help.
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 3.52 GB....
0
votes
0
answers
29
views
Tracking energy consumption while executing code on a Cloud VM (Codecarbon)
I would like to know if codecarbon, a Python package measuring the energy consumption of the GPU with pyNVML, would be able to access this information in the case of cloud computing. And if ...
1
vote
0
answers
163
views
How do CUDA stream, DMA engine, and Async Engine work and interact with each other?
I am trying to overlap data transfer and kernel execution using CUDA C++.
I created an array, split it into 8 chunks, and then assigned each chunk to a corresponding CUDA stream using the following ...
0
votes
1
answer
170
views
OpenCV-python and GPU
Is it possible (as of today) to use OpenCV-Python with GPU? I'm trying to implement an OpenCV-based script for image processing using a GPU on AWS SageMaker, but there seems to be a problem on ...
1
vote
0
answers
43
views
How to optimize CPU tensor slicing and asynchronous transfer to the GPU?
My code involves slicing large tensors on the CPU by index and asynchronously transmitting them back to the GPU. However, through the Profiler debugging tool, I found that this step would seriously ...
1
vote
1
answer
90
views
Using evaluate library to evaluate BertScore only uses 1 busy GPU
I'm using the evaluate library to evaluate the BertScore. Here is my code:
import evaluate
bertscore = evaluate.load("bertscore")
bertscore_result = bertscore.compute(predictions=[sentence], ...
1
vote
0
answers
67
views
Understanding Nsight Graphics output
I am trying to find bottlenecks in some shaders through NVIDIA Nsight Graphics.
Right now I am focusing on trying to understand one result that seems impossible. The profiling UI shows that on each ...
0
votes
0
answers
56
views
Unknown DDP error when running multi-processing with pytorch on a Google GPU based colab kernel
I have created a colab (link: https://colab.research.google.com/drive/1gg57PS7KMLKvvx9wgDKLMDiyjhRICplp#scrollTo=zG2D7JO2OdEC) to play with the gpt2 fine-tuning.
And I was trying to practice the DDP ...
1
vote
0
answers
133
views
GPU to GPU direct data transfer with connectX and RDMA
We are trying to connect two GPUs located on two servers via RDMA and InfiniBand. The GPUs are NVIDIA RTX 6000 Ada and the InfiniBand adapters are NVIDIA ConnectX-6.
Server configuration
Our server has the ...
3
votes
0
answers
164
views
An error occurs when creating queues with gpu selector in SYCL program
This is my SYCL program on Windows to check if SYCL works fine on my system
#include <iostream>
#include <cl/sycl.hpp>
#include <Windows.h>
int main()
{
int array[5];
{
...
0
votes
0
answers
125
views
Segmentation fault when waiting for queue in Vulkan
I have some code that was working on a prior driver that now segfaults when compiling.
In particular, it segfaults here:
unsafe {
let graphics_queue = self.hardware_interface....
0
votes
1
answer
86
views
How does cudaMallocPitch help avoid bank conflict?
I learned from this nice SO answer that the CUDA function cudaMallocPitch creates padded memory that helps avoid bank conflicts.
I can understand well how the padding helps alignment, as it very much ...
2
votes
0
answers
195
views
How to reuse cuDSS factors when solving a system of linear equations Ax=b
I am using cuDSS to solve a set of Ax=b equations as follows
cudssMatrixType_t mtype = CUDSS_MTYPE_SPD;
cudssMatrixViewType_t mview = CUDSS_MVIEW_UPPER;
cudssIndexBase_t base = CUDSS_BASE_ZERO;
...
1
vote
0
answers
32
views
Guarantee GPU on JupyterHub
I'm very new to JupyterHub in general and hope my question is not too naive, but I would appreciate guidance in the right direction:
Problem statement:
If one user uses a process that involves a GPU, the ...
0
votes
0
answers
55
views
Segmentation issue, running PyTorch on GPU supported GKE node pool
I have a node pool of n1-highmem-4 machines with 1 NVIDIA Tesla T4 attached with a COS_CONTAINERD image. I am running a transformer model in Python on a pod to execute the model on GPU. I get an ...
0
votes
0
answers
144
views
Unable to get WebGL GPU support in GCP CloudRun container using Playwright/Chromium
I'm trying to deploy a container to Google CloudRun that lets me use GPU hardware-accelerated WebGL.
I have the following front-end code (using node) to initialize WebGL and query its vendor ...
1
vote
0
answers
45
views
Image Tensors Return As Zero When num_workers > 0
I am facing an issue with multiprocessing. I am trying to load my .pt data as dataloaders. Everything works fine when I set the num_workers = 0. But when I set it to a value greater than 0, the tensor ...
0
votes
0
answers
50
views
XGBoost GPU version not outperforming CPU on small dataset despite parameter tuning – suggestions needed
I'm currently working on a parallel and distributed computing project where I'm comparing the performance of XGBoost running on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve ...