I have a node pool of n1-highmem-4 machines, each with one NVIDIA Tesla T4 attached, using the COS_CONTAINERD image. I am running a transformer model in Python in a pod and want to execute the model on the GPU. I get a segmentation fault whenever I try to move the model to the GPU.
Pod Image:
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=1
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3-pip python3-dev build-essential \
    && rm -rf /var/lib/apt/lists/*
RUN ln -sf /usr/bin/python3 /usr/local/bin/python
RUN pip install --upgrade pip \
    && pip install --no-cache-dir \
        --extra-index-url https://download.pytorch.org/whl/cu121 \
        torch==2.1.2
WORKDIR /app
COPY requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
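One thing I noticed while debugging: the cu121 torch wheel bundles its own CUDA user-mode libraries as pip dependencies, so what torch loads at runtime isn't necessarily the toolkit from the base image. A quick sketch to list whichever NVIDIA wheels pip actually installed (run inside the pod; this only reads installed-package metadata):

```python
from importlib import metadata

# torch's cu121 wheel declares nvidia-* packages (cuBLAS, cuDNN, the CUDA
# runtime, NCCL, etc.) as dependencies; list the ones pip installed so any
# mismatch with the image's CUDA toolkit is visible.
nvidia_pkgs = sorted(
    (dist.metadata["Name"] or "")
    for dist in metadata.distributions()
    if (dist.metadata["Name"] or "").startswith("nvidia-")
)
print(nvidia_pkgs)
```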
My requirements file contains the basic Python modules I need, including transformers==4.37.1. It does not include torch, nor any nvidia/cuda-specific modules (I'm assuming the base image covers any drivers required). In the pod I can see the following:
:/app# nvidia-smi
Wed May 7 16:13:16 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.2 |
:/app# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
12.2 and 12.1 seem compatible to me, but when I check for CUDA through torch it gives me a segmentation fault.
>>> import torch
>>> print(torch.__version__, torch.version.cuda)
2.1.2+cu121 12.1
>>> torch.cuda.is_available()
Segmentation fault (core dumped)
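To narrow down whether the crash is inside torch or already in the driver stack, I also tried a minimal probe that loads libcuda directly via ctypes, bypassing torch entirely (a sketch; it assumes the GKE device plugin has mounted the driver libraries into the container, typically under /usr/local/nvidia/lib64):

```python
import ctypes
import os

def probe_cuda_driver() -> str:
    """Load libcuda.so.1 directly, bypassing torch entirely.

    Returns a status string instead of crashing, so each failure mode
    (library missing, cuInit error) is distinguishable.
    """
    # On GKE/COS the driver libs are mounted in by the device plugin; if
    # their path is missing here, torch may be loading the wrong libraries.
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    try:
        lib = ctypes.CDLL("libcuda.so.1")
    except OSError as exc:
        return f"libcuda.so.1 not loadable: {exc}"
    rc = lib.cuInit(0)  # CUDA driver API; returns 0 (CUDA_SUCCESS) on success
    if rc != 0:
        return f"cuInit failed with driver error code {rc}"
    count = ctypes.c_int()
    lib.cuDeviceGetCount(ctypes.byref(count))
    return f"driver OK, {count.value} device(s) visible"

print(probe_cuda_driver())
```

If this probe succeeds but torch still segfaults, that would point at the CUDA libraries torch loads rather than the driver itself.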
I've tried switching base images and torch versions, but nothing seems to work. Thanks in advance to anyone who can help.