Newest 'gpu' Questions

1 vote

0 answers

34 views

Microsoft.ML C#: GPU not found in K8s/Docker container

I have created a .NET app that uses Microsoft.ML.OnnxRuntime.Gpu for inference. Now I'm trying to integrate it with Azure Kubernetes. We have set it up with a Tesla T4 GPU and we confirmed it's ...

ervin

555

asked 2 days ago

-3 votes

0 answers

64 views

Intel ARC GPU hangs when performing an untyped surface read [closed]

I am currently writing a driver for the Intel ARC GPU series (specifically I use the A750 for testing purposes) for my own operating system. I am already able to execute compute kernels that use ...

Joel Marker

40

asked Nov 26 at 0:54

-2 votes

0 answers

66 views

Slow collision detection in Python [closed]

I am making a particle simulator in Python and noticed that my collision detection hurts performance the most. I am not even sure if this is a thing, but is it possible to tell the GPU to do ...

Dario

1

asked Nov 25 at 15:44

0 votes

0 answers

46 views

Taking advantage of memory contiguousness in HLSL

This is a bit of a slog, so bear with me. I'm currently writing a 3D S(moothed) P(article) H(ydrodynamics) simulation in Unity with a parallel HLSL backend. It's a Lagrangian method of fluid simulation,...

Ben Williams

13

asked Nov 18 at 14:50

Tooling

0 votes

0 replies

24 views

ComfyUI + Flux 1 dev + limited RAM + same workflow: With 2 GPUs?

I am running Flux 1 dev text to image model through ComfyUI in Kaggle. Everything works but I noticed that Kaggle offers a second GPU inside the notebook. If I try to run two instances of the ComfyUI ...

Bram Fran

113

asked Nov 17 at 15:03

-4 votes

0 answers

31 views

TensorFlow GPU use in Python 3.13 [duplicate]

I'm trying to use TensorFlow with a GPU on my Windows device, in a Python 3.13 venv. Do newer versions of TensorFlow support GPU acceleration on Windows? I've read that support stopped in TensorFlow version 2....

Med Yassine Ghaoui

1

asked Nov 16 at 21:57

Tooling

0 votes

0 replies

85 views

Which LLMs can I run locally on RTX 1080 8GB with 48GB RAM?

I'm exploring options for running large language models locally on my workstation and would appreciate guidance on suitable models given my hardware constraints. Hardware specifications: CPU: Intel ...

GbreH

13

asked Nov 10 at 15:46

1 vote

1 answer

72 views

Is passing ray resources as options when calling the function equivalent to setting them in the function's decorator?

Is @ray.remote def run_experiment(...): (...) if __name__ == '__main__': ray.init() exp_config = sys.argv[1] params_tuples, num_cpus, num_gpus = load_exp_config(exp_config) ray.get(...

Blupon

1,091

asked Nov 10 at 14:51

0 votes

0 answers

52 views

Problems with fencing sporadic command buffer submission in Vulkan

I am invoking a compute shader, writing to it, then reading it to then write to disk. According to renderdoc the image is properly generated. Additionally, when compiled in debug mode I get the right ...

Makogan

9,991

asked Oct 28 at 4:41

2 votes

0 answers

115 views

Implementing Arbitrary Precision Arithmetic in CubeCL for Infinite Zoom Fractals

Context I'm implementing a Julia set fractal renderer using CubeCL (a Rust GPU compute framework). I want to achieve "infinite zoom" similar to deep Mandelbrot zoom videos, which requires ...

Marco Fanelli

41

asked Oct 27 at 18:52

3 votes

0 answers

112 views

How does one log the operations done on a GPU during the execution of Python code?

I have encountered a particular problem while executing a function from the transformers library of huggingface on an Intel GPU wheel of torch. Since I am doing something I normally shouldn't be ...

Logarithmnepnep

31

asked Oct 17 at 11:19

1 vote

0 answers

65 views

How to force allocated D3D12 resource to reside in VRAM and not be demoted to shared RAM?

For testing purposes I need a tool that will occupy some amount of VRAM, leaving a reduced available VRAM to the rest of the applications. I implemented a version that somewhat works using D3D12 API, ...

Virgileo

99

asked Oct 16 at 10:50

0 votes

0 answers

64 views

Utilizing GPU with RNN models which take their output as input [torch]

I have a machine-translation model. In this model, I calculate a vector for a given sentence and I take this vector, aggregate with each generated output of RNN and put it into RNN again for ...

cuneyttyler

1,395

asked Oct 15 at 14:20

0 votes

0 answers

90 views

i686 compiler with GNU and SDL3 failing to claim window

I am using an i686 system, with the MinGW G++ compiler. I can run code that creates a GPU device and attaches it to a window fine on that machine. However, when I attempt to run it on my i686 Windows ...

Jasper Stocks

1

asked Oct 2 at 12:05

0 votes

1 answer

99 views

CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop

I am trying to implement the producer-consumer problem between GPU and CPU; it is required for another project. The GPU requests some data from the CPU via unified memory. The CPU copies that data to a specific location in global ...

Chinmaya Bhat K K

1

asked Sep 30 at 18:38

3 votes

1 answer

429 views

GPU supported Jax Installation [closed]

I am quite new to JAX. I am trying to use it for some optimization work. I have tried the CPU-only version of JAX, and it has worked well. As expected, the speed is not impressive, so ...

Newbee

39

asked Sep 29 at 20:40

-3 votes

1 answer

73 views

How to set up WebGPU work groups with fully independent tasks?

I am new to using WebGPU. I'm using the rust wgpu crate with compute shaders to run a cryptographic task at high speed. My shader is quite simple: Take a common input state, append a unique per-thread ...

conduition

13

asked Sep 29 at 17:46

1 vote

0 answers

370 views

Encountering an AttributeError: module 'torch' has no attribute 'xpu'

I'm encountering the error AttributeError: module 'torch' has no attribute 'xpu' when running the diffusers library in a Google Colab environment with a CUDA GPU. I'm trying to use DiffusionPipeline....

Beverly Sellers-Robinson

11

asked Sep 23 at 17:41

1 vote

1 answer

107 views

Is CPU multithreading affected by divergence?

Building on this question here: The term thread divergence is used in CUDA; from my understanding, it's a situation where different threads are assigned to do different tasks and this results in a big ...

bigcodeszzer

960

asked Sep 18 at 1:37

0 votes

1 answer

97 views

How can I empty the GPU memory when I get an error like OOM?

Sometimes when I get an OOM error, the parameters of the LLM have already been loaded onto the GPU and are not cleared automatically. So I tried torch.cuda.empty_cache(), but it didn't work. So every time I must ...

zddisworkinghard

1

asked Sep 6 at 8:13

0 votes

1 answer

141 views

How to correctly monitor a program’s GPU memory bandwidth utilization and SM utilization? (DCGM DRAM_ACTIVE vs in-program bandwidth differs a lot)

I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing. My approach so far: Compute the theoretical ...

plznobug

143

asked Sep 5 at 10:48

2 votes

1 answer

125 views

ILGPU kernel silently not compiling

I am trying to debug a kernel written for ILGPU which does not compile. My application has 2 big kernels. The first (that loads and does the right thing): /// <summary> /// Unified GPU kernel ...

AlessandroParma

161

asked Aug 29 at 12:23

0 votes

0 answers

53 views

CUDA commands from Python in OpenCV fail

I've compiled OpenCV with CUDA support and am trying to use some CUDA functions. To that end I'm trying the following test: import cv2 if cv2.cuda.getCudaEnabledDeviceCount() > 0: print("...

jeremy_rutman

6,170

asked Aug 27 at 17:36

3 votes

1 answer

106 views

JavaFX app freezes or flickers after Intel Iris Xe driver update [closed]

I have a JavaFX desktop application that started having rendering issues after updating the Intel Iris Xe graphics driver. On Java 11 + JavaFX (Zulu distribution): openjdk version "11.0.25" ...

Guilherme Almeida

31

asked Aug 25 at 19:21

-6 votes

1 answer

104 views

Host to GPU in terms of endianness discrepancy [closed]

I have been told that when it comes to GPU APIs like Vulkan and DirectX and the host is for example little-endian and the GPU is big-endian that you can read for example a 32-bit integer and the ...

Zebrafish

16.3k

asked Aug 18 at 18:32

1 vote

0 answers

94 views

Using different sources depending on architecture

Why? The reason why I need to do this is, I am using rustgpu to compile shader crates with their own dependencies. Many of these dependencies need to compile on both the CPU and GPU, this means huge ...

Makogan

9,991

asked Aug 18 at 6:13

2 votes

1 answer

69 views

How to correctly pass float4 vector to kernel using PyCUDA?

I am trying to pass a float4 as an argument to my CUDA kernel (by value) using PyCUDA’s make_float4(). But there seems to be some misalignment when the data is transferred to the kernel. If I read the ...

Dodilei

308

asked Aug 7 at 19:49

5 votes

1 answer

165 views

Is it expected that vmapping over different input sizes for the same function impacts the accuracy of the result?

I was surprised to see that depending on the size of an input matrix, which is vmapped over inside of a function, the output of the function changes slightly. That is, not only does the size of the ...

hvater

100

asked Aug 5 at 12:17

1 vote

1 answer

129 views

Fortran OpenMP offloading painfully slow on NVIDIA architectures

I am currently trying to port a big portion of a Fortran code to GPU devices with OpenMP. I have a working version for AMD, specifically for the MI300A which features unified shared memory. I ...

Giorgio Daneri

11

asked Jul 29 at 18:05

0 votes

0 answers

52 views

HOOMD-blue GPU error: "invalid device ordinal" in WSL2 despite CUDA and GPU detection

I am trying to run HOOMD-blue 5.2.0 with GPU support inside WSL2 (Windows Subsystem for Linux), but I keep getting the following error: RuntimeError: CUDA Error: invalid device ordinal before /hoomd/...

Eric Ortiz Vazquez

9

asked Jul 23 at 22:10

0 votes

1 answer

336 views

Distinction between CuTe and NVIDIA CUTLASS

I'm confused about what exactly is handled by CuTe and what by CUTLASS. From my understanding, CUTLASS handles the following: GEMM computation of CuTe tensors; communication between CPU and GPU; abstract memory ...

jonithani123

254

asked Jul 2 at 14:23

0 votes

0 answers

60 views

How to enable Metal to be used by TensorFlow.js with Node/Bun

Please comment how to enable Metal with tfjs-node on MacOS +Metal isn't ready with tensorflow (c++) on the server side. bun ./verify-backend.js const tf = require('@tensorflow/tfjs-node'); async ...

madeinQuant

1,823

asked Jun 28 at 6:34

0 votes

1 answer

94 views

GPU out of memory issue when training in PyTorch [closed]

I am trying to train my AI/ML model on my data and got a CUDA out of memory error. Any solution would be a great help. torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.52 GB....

Boopathi Muthuraman

11

asked Jun 27 at 8:34

0 votes

0 answers

29 views

Tracking energy consumption while executing code on a Cloud VM (Codecarbon)

I would like to know if codecarbon, a Python package measuring the energy consumption of the GPU with pyNVML, would be able to access this information in the case of cloud computing. And if ...

Titou

1

asked Jun 23 at 14:08

1 vote

0 answers

163 views

How do CUDA stream, DMA engine, and Async Engine work and interact with each other?

I am trying to overlap data transfer and kernel execution using CUDA C++. I created an array, split it into 8 chunks, and then assigned each chunk to a corresponding CUDA stream using the following ...

NPnothard

19

asked Jun 23 at 7:20

0 votes

1 answer

170 views

OpenCV-python and GPU

Is it possible (as of today) to use OpenCV-Python with a GPU? I'm trying to implement an OpenCV-based script for image processing using a GPU on AWS SageMaker, but there seems to be a problem on ...

AlternativeWaltz

93

asked Jun 20 at 16:04

1 vote

0 answers

43 views

How to optimize CPU tensor slicing and asynchronous transfer to the GPU?

My code involves slicing large tensors on the CPU by index and asynchronously transmitting them back to the GPU. However, through the Profiler debugging tool, I found that this step would seriously ...

Ponytail

11

asked Jun 19 at 16:19

1 vote

1 answer

90 views

Using evaluate library to evaluate BertScore only uses 1 busy GPU

I'm using the evaluate library to compute the BertScore. Here is my code: import evaluate bertscore = evaluate.load("bertscore") bertscore_result = bertscore.compute(predictions=[sentence], ...

Raptor

54.4k

asked May 28 at 1:53

1 vote

0 answers

67 views

Understanding Nsight Graphics output

I am trying to find bottlenecks in some shaders through NVIDIA Nsight Graphics. Right now I am focusing on trying to understand one result that seems impossible. The profiling UI shows that on each ...

Makogan

9,991

asked May 18 at 23:02

0 votes

0 answers

56 views

Unknown DDP error when running multi-processing with pytorch on a Google GPU based colab kernel

I have created a colab (link: https://colab.research.google.com/drive/1gg57PS7KMLKvvx9wgDKLMDiyjhRICplp#scrollTo=zG2D7JO2OdEC) to play with the gpt2 fine-tuning. And I was trying to practice the DDP ...

novakwang

11

asked May 18 at 22:01

1 vote

0 answers

133 views

GPU to GPU direct data transfer with ConnectX and RDMA

We are trying to connect two GPUs located on two servers via RDMA and InfiniBand. The GPUs are NVIDIA RTX 6000 Ada and the InfiniBand adapters are NVIDIA ConnectX-6. Server configuration Our server has the ...

Alba Delgado

11

asked May 16 at 19:23

3 votes

0 answers

164 views

An error occurs when creating queues with gpu selector in SYCL program

This is my SYCL program on Windows to check if SYCL works fine on my system: #include <iostream> #include <cl/sycl.hpp> #include <Windows.h> int main() { int array[5]; { ...

Supergamer

425

asked May 14 at 10:27

0 votes

0 answers

125 views

Segmentation fault when waiting for queue in Vulkan

I have some code that was working on a prior driver that now segfaults when compiling. In particular, it segfaults here: unsafe { let graphics_queue = self.hardware_interface....

Makogan

9,991

asked May 13 at 2:47

0 votes

1 answer

86 views

How does cudaMallocPitch help avoid bank conflicts?

I learned from this nice SO answer that the CUDA function cudaMallocPitch creates padded memory that helps avoid bank conflicts. I can understand well how the padding helps alignment, as it very much ...

PkDrew

2,301

asked May 11 at 3:56

2 votes

0 answers

195 views

How to reuse cuDSS factors when solving a system of linear equations Ax=b

I am using cuDSS to solve a set of Ax=b equations as follows cudssMatrixType_t mtype = CUDSS_MTYPE_SPD; cudssMatrixViewType_t mview = CUDSS_MVIEW_UPPER; cudssIndexBase_t base = CUDSS_BASE_ZERO; ...

pk68

83

asked May 8 at 17:55

1 vote

0 answers

32 views

Guarantee GPU on JupyterHub

I'm very new to JupyterHub in general and hope my question is not too naive, but I would appreciate being pointed in the right direction. Problem statement: If one user uses a process that involves a GPU, the ...

Klemens Lechner

11

asked May 8 at 10:42

0 votes

0 answers

55 views

Segmentation issue, running PyTorch on GPU supported GKE node pool

I have a node pool of n1-highmem-4 machines with 1 NVIDIA Tesla T4 attached with a COS_CONTAINERD image. I am running a transformer model in Python on a pod to execute the model on the GPU. I get an ...

Rayhaan Iqbal

1

asked May 7 at 16:37

0 votes

0 answers

144 views

Unable to get WebGL GPU support in GCP CloudRun container using Playwright/Chromium

I'm trying to deploy a container to Google CloudRun which lets me use WebGL which is GPU hardware-accelerated. I have the following front-end code (using node) to initialize WebGL and query its vendor ...

Lenny

143

asked May 7 at 12:52

1 vote

0 answers

45 views

Image Tensors Return As Zero When num_workers > 0

I am facing an issue with multiprocessing. I am trying to load my .pt data as dataloaders. Everything works fine when I set the num_workers = 0. But when I set it to a value greater than 0, the tensor ...

jobayer

11

asked May 5 at 5:15

0 votes

0 answers

50 views

XGBoost GPU version not outperforming CPU on small dataset despite parameter tuning – suggestions needed

I'm currently working on a parallel and distributed computing project where I'm comparing the performance of XGBoost running on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve ...

Mxneeb

19

asked May 2 at 16:17
