1 vote
0 answers
34 views

I have created a .NET app that uses Microsoft.ML.OnnxRuntime.Gpu for inference. Now I'm trying to integrate it with Azure Kubernetes. We have set it up with a Tesla T4 GPU and confirmed it's ...
ervin's user avatar
  • 555
-3 votes
0 answers
64 views

I am currently writing a driver for the Intel ARC GPU series (specifically I use the A750 for testing purposes) for my own operating system. I am already able to execute compute kernels that use ...
Joel Marker's user avatar
-2 votes
0 answers
66 views

I am making a particle simulator in Python and noticed that my collision detection hurts performance the most. I am not even sure if this is a thing, but is it possible to tell the GPU to do ...
Dario's user avatar
  • 1
0 votes
0 answers
46 views

This is a bit of a slog, so bear with me. I'm currently writing a 3D S(moothed) P(article) H(ydrodynamics) simulation in Unity with a parallel HLSL backend. It's a Lagrangian method of fluid simulation,...
Ben Williams's user avatar
Tooling
0 votes
0 replies
24 views

I am running Flux 1 dev text to image model through ComfyUI in Kaggle. Everything works but I noticed that Kaggle offers a second GPU inside the notebook. If I try to run two instances of the ComfyUI ...
Bram Fran's user avatar
  • 113
-4 votes
0 answers
31 views

I'm trying to use TensorFlow with a GPU on my Windows device; I have a Python 3.13 venv. Do newer versions of TensorFlow support GPU acceleration on Windows? I've read that it stopped in TensorFlow version 2....
Med Yassine Ghaoui's user avatar
Tooling
0 votes
0 replies
85 views

I'm exploring options for running large language models locally on my workstation and would appreciate guidance on suitable models given my hardware constraints. Hardware specifications: CPU: Intel ...
GbreH's user avatar
  • 13
1 vote
1 answer
72 views

Is @ray.remote def run_experiment(...): (...) if __name__ == '__main__': ray.init() exp_config = sys.argv[1] params_tuples, num_cpus, num_gpus = load_exp_config(exp_config) ray.get(...
Blupon's user avatar
  • 1,091
0 votes
0 answers
52 views

I am invoking a compute shader, writing to it, then reading it to then write to disk. According to renderdoc the image is properly generated. Additionally, when compiled in debug mode I get the right ...
Makogan's user avatar
  • 9,991
2 votes
0 answers
115 views

Context I'm implementing a Julia set fractal renderer using CubeCL (a Rust GPU compute framework). I want to achieve "infinite zoom" similar to deep Mandelbrot zoom videos, which requires ...
Marco Fanelli's user avatar
3 votes
0 answers
112 views

I have encountered a particular problem while executing a function from the transformers library of huggingface on an Intel GPU wheel of torch. Since I am doing something I normally shouldn't be ...
Logarithmnepnep's user avatar
1 vote
0 answers
65 views

For testing purposes I need a tool that will occupy some amount of VRAM, leaving a reduced available VRAM to the rest of the applications. I implemented a version that somewhat works using D3D12 API, ...
Virgileo's user avatar
0 votes
0 answers
64 views

I have a machine-translation model. In this model, I calculate a vector for a given sentence and I take this vector, aggregate with each generated output of RNN and put it into RNN again for ...
cuneyttyler's user avatar
  • 1,395
0 votes
0 answers
90 views

I am using an i686 system, with the compiler mingw G++. I can run code that creates a GPU device and attaches it to a window fine on that machine. However, when I attempt to run it on my i686 windows ...
Jasper Stocks's user avatar
0 votes
1 answer
99 views

I am trying to implement the producer-consumer problem between GPU and CPU; it is required for another project. The GPU requests some data from the CPU via Unified Memory. The CPU copies that data to a specific location in global ...
Chinmaya Bhat K K's user avatar
3 votes
1 answer
429 views

I am quite new to jax. I am trying to use it for some optimization work. I have tried the CPU-only version of jax, and it has worked well. As expected, the speed is not impressive, so ...
Newbee's user avatar
  • 39
-3 votes
1 answer
73 views

I am new to using WebGPU. I'm using the rust wgpu crate with compute shaders to run a cryptographic task at high speed. My shader is quite simple: Take a common input state, append a unique per-thread ...
conduition's user avatar
1 vote
0 answers
370 views

I'm encountering the error AttributeError: module 'torch' has no attribute 'xpu' when running the diffusers library in a Google Colab environment with a CUDA GPU. I'm trying to use DiffusionPipeline....
Beverly Sellers-Robinson's user avatar
1 vote
1 answer
107 views

Building on this question here The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...
bigcodeszzer's user avatar
0 votes
1 answer
97 views

Sometimes when I get an OOM error, the parameters of the LLM have already been loaded onto the GPU and cannot be cleared automatically. So I tried torch.cuda.empty_cache(), but it didn't work. So every time I must ...
zddisworkinghard's user avatar
0 votes
1 answer
141 views

I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing. My approach so far: Compute the theoretical ...
plznobug's user avatar
  • 143
2 votes
1 answer
125 views

I am trying to debug a kernel written for ILGPU which does not compile. My application has 2 big kernels. The first (that loads and does the right thing): /// <summary> /// Unified GPU kernel ...
AlessandroParma's user avatar
0 votes
0 answers
53 views

I've compiled opencv with cuda support and am trying to use some cuda functions - to that end I'm trying the following test: import cv2 if cv2.cuda.getCudaEnabledDeviceCount() > 0: print("...
jeremy_rutman's user avatar
3 votes
1 answer
106 views

I have a JavaFX desktop application that started having rendering issues after updating the Intel Iris Xe graphics driver. On Java 11 + JavaFX (Zulu distribution): openjdk version "11.0.25" ...
Guilherme Almeida's user avatar
-6 votes
1 answer
104 views

I have been told that when it comes to GPU APIs like Vulkan and DirectX and the host is for example little-endian and the GPU is big-endian that you can read for example a 32-bit integer and the ...
Zebrafish's user avatar
  • 16.3k
1 vote
0 answers
94 views

Why? The reason why I need to do this is, I am using rustgpu to compile shader crates with their own dependencies. Many of these dependencies need to compile on both the CPU and GPU, this means huge ...
Makogan's user avatar
  • 9,991
2 votes
1 answer
69 views

I am trying to pass a float4 as argument to my cuda kernel (by value) using PyCUDA’s make_float4(). But there seems to be some misalignment when the data is transferred to the kernel. If I read the ...
Dodilei's user avatar
  • 308
5 votes
1 answer
165 views

I was surprised to see that depending on the size of an input matrix, which is vmapped over inside of a function, the output of the function changes slightly. That is, not only does the size of the ...
hvater's user avatar
  • 100
1 vote
1 answer
129 views

I am currently trying to port a big portion of a Fortran code to GPU devices with OpenMP. I have a working version for AMD, specifically for the MI300A which features unified shared memory. I ...
Giorgio Daneri's user avatar
0 votes
0 answers
52 views

I am trying to run HOOMD-blue 5.2.0 with GPU support inside WSL2 (Windows Subsystem for Linux), but I keep getting the following error: RuntimeError: CUDA Error: invalid device ordinal before /hoomd/...
Eric Ortiz Vazquez's user avatar
0 votes
1 answer
336 views

I'm confused about what exactly is handled by CuTe and what by Cutlass. From my understanding Cutlass handles the following: Gemm computation of CuTe Tensors Communication between CPU and GPU Abstract memory ...
jonithani123's user avatar
0 votes
0 answers
60 views

Please comment on how to enable Metal with tfjs-node on macOS. Metal isn't ready with TensorFlow (C++) on the server side. bun ./verify-backend.js const tf = require('@tensorflow/tfjs-node'); async ...
madeinQuant's user avatar
  • 1,823
0 votes
1 answer
94 views

I am trying to train my AI/ML model and got a CUDA out-of-memory error. Any solution would be a great help. torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.52 GB....
Boopathi Muthuraman's user avatar
0 votes
0 answers
29 views

I would like to know whether codecarbon, a Python package that measures the energy consumption of the GPU with pyNVML, would be able to access this information in the case of cloud computing. And if ...
Titou's user avatar
  • 1
1 vote
0 answers
163 views

I am trying to overlap data transfer and kernel execution using CUDA C++. I created an array, split it into 8 chunks, and then assigned each chunk to a corresponding CUDA stream using the following ...
NPnothard's user avatar
0 votes
1 answer
170 views

Is it possible (as of today) to use OpenCV-Python with a GPU? I'm trying to implement an OpenCV-based script for image processing using a GPU on AWS SageMaker, but there seems to be a problem on ...
AlternativeWaltz's user avatar
1 vote
0 answers
43 views

My code involves slicing large tensors on the CPU by index and asynchronously transmitting them back to the GPU. However, through the Profiler debugging tool, I found that this step would seriously ...
Ponytail's user avatar
1 vote
1 answer
90 views

I'm using the evaluate library to compute BERTScore. Here is my code: import evaluate bertscore = evaluate.load("bertscore") bertscore_result = bertscore.compute(predictions=[sentence], ...
Raptor's user avatar
  • 54.4k
1 vote
0 answers
67 views

I am trying to find bottlenecks in some shaders through NVIDIA Nsight Graphics. Right now I am focusing on trying to understand one result that seems impossible. The profiling UI shows that on each ...
Makogan's user avatar
  • 9,991
0 votes
0 answers
56 views

I have created a colab (link: https://colab.research.google.com/drive/1gg57PS7KMLKvvx9wgDKLMDiyjhRICplp#scrollTo=zG2D7JO2OdEC) to play with the gpt2 fine-tuning. And I was trying to practice the DDP ...
novakwang's user avatar
1 vote
0 answers
133 views

We are trying to connect two GPUs located on two servers via RDMA over InfiniBand. The GPUs are NVIDIA RTX 6000 Ada and the InfiniBand adapters are NVIDIA ConnectX-6. Server configuration Our server has the ...
Alba Delgado's user avatar
3 votes
0 answers
164 views

This is my SYCL program on Windows to check whether SYCL works fine on my system #include <iostream> #include <cl/sycl.hpp> #include <Windows.h> int main() { int array[5]; { ...
Supergamer's user avatar
0 votes
0 answers
125 views

I have some code that was working on a prior driver that now segfaults when compiling. In particular, it segfaults here: unsafe { let graphics_queue = self.hardware_interface....
Makogan's user avatar
  • 9,991
0 votes
1 answer
86 views

I learned from this nice SO answer that the CUDA function cudaMallocPitch creates padded memory that helps avoid bank conflicts. I can understand well how the padding helps alignment, as it very much ...
PkDrew's user avatar
  • 2,301
2 votes
0 answers
195 views

I am using cuDSS to solve a set of Ax=b equations as follows cudssMatrixType_t mtype = CUDSS_MTYPE_SPD; cudssMatrixViewType_t mview = CUDSS_MVIEW_UPPER; cudssIndexBase_t base = CUDSS_BASE_ZERO; ...
pk68's user avatar
  • 83
1 vote
0 answers
32 views

I'm very new to JupyterHub in general and hope my question is not too naive, but I would appreciate guidance in the right direction. Problem statement: If one user uses a process that involves a GPU, the ...
Klemens Lechner's user avatar
0 votes
0 answers
55 views

I have a node pool of n1-highmem-4 machines, each with 1 NVIDIA Tesla T4 attached, using a COS_CONTAINERD image. I am running a transformer model in Python on a pod to execute the model on GPU. I get an ...
Rayhaan Iqbal's user avatar
0 votes
0 answers
144 views

I'm trying to deploy a container to Google Cloud Run that lets me use GPU hardware-accelerated WebGL. I have the following front-end code (using node) to initialize WebGL and query its vendor ...
Lenny's user avatar
  • 143
1 vote
0 answers
45 views

I am facing an issue with multiprocessing. I am trying to load my .pt data as dataloaders. Everything works fine when I set the num_workers = 0. But when I set it to a value greater than 0, the tensor ...
jobayer's user avatar
  • 11
0 votes
0 answers
50 views

I'm currently working on a parallel and distributed computing project where I'm comparing the performance of XGBoost running on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve ...
Mxneeb's user avatar
  • 19
