I have a node pool of n1-highmem-4 machines, each with one NVIDIA Tesla T4 attached, using the COS_CONTAINERD image. I am running a transformer model in Python in a pod and want to execute the model on the GPU. I get a segmentation fault whenever I try to move the model to the GPU.
Pod Image:
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=1
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3-pip python3-dev build-essential \
    && rm -rf /var/lib/apt/lists/*
RUN ln -sf /usr/bin/python3 /usr/local/bin/python
RUN pip install --upgrade pip \
    && pip install --no-cache-dir \
        --extra-index-url https://download.pytorch.org/whl/cu121 \
        torch==2.1.2
WORKDIR /app
COPY requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
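One thing I noticed while debugging: the cu121 torch wheel bundles its own CUDA user-mode libraries as pip dependencies, so what torch loads at runtime isn't necessarily the toolkit from the base image. A quick sketch to list whichever NVIDIA wheels pip actually installed (run inside the pod; this only reads installed-package metadata):

```python
from importlib import metadata

# torch's cu121 wheel declares nvidia-* packages (cuBLAS, cuDNN, the CUDA
# runtime, NCCL, etc.) as dependencies; list the ones pip installed so any
# mismatch with the image's CUDA toolkit is visible.
nvidia_pkgs = sorted(
    (dist.metadata["Name"] or "")
    for dist in metadata.distributions()
    if (dist.metadata["Name"] or "").startswith("nvidia-")
)
print(nvidia_pkgs)
```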
My requirements file contains the basic Python modules I need, including transformers==4.37.1. It does not include torch, nor any nvidia/cuda-specific modules (I'm assuming the base image covers any drivers required). In the pod I can see the following:
:/app# nvidia-smi
Wed May 7 16:13:16 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.2 |
:/app# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
12.2 and 12.1 seem compatible to me, but when I check for CUDA through torch it gives me a segmentation fault.
>>> import torch
>>> print(torch.__version__, torch.version.cuda)
2.1.2+cu121 12.1
>>> torch.cuda.is_available()
Segmentation fault (core dumped)
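To narrow down whether the crash is inside torch or already in the driver stack, I also tried a minimal probe that loads libcuda directly via ctypes, bypassing torch entirely (a sketch; it assumes the GKE device plugin has mounted the driver libraries into the container, typically under /usr/local/nvidia/lib64):

```python
import ctypes
import os

def probe_cuda_driver() -> str:
    """Load libcuda.so.1 directly, bypassing torch entirely.

    Returns a status string instead of crashing, so each failure mode
    (library missing, cuInit error) is distinguishable.
    """
    # On GKE/COS the driver libs are mounted in by the device plugin; if
    # their path is missing here, torch may be loading the wrong libraries.
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    try:
        lib = ctypes.CDLL("libcuda.so.1")
    except OSError as exc:
        return f"libcuda.so.1 not loadable: {exc}"
    rc = lib.cuInit(0)  # CUDA driver API; returns 0 (CUDA_SUCCESS) on success
    if rc != 0:
        return f"cuInit failed with driver error code {rc}"
    count = ctypes.c_int()
    lib.cuDeviceGetCount(ctypes.byref(count))
    return f"driver OK, {count.value} device(s) visible"

print(probe_cuda_driver())
```

If this probe succeeds but torch still segfaults, that would point at the CUDA libraries torch loads rather than the driver itself.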
I've tried switching base images and torch versions, but nothing seems to work. Thanks in advance to anyone who can help.