Newest 'hpc' Questions

Advice

0 votes

3 replies

39 views

HPC: /usr/lib folder not accessible to nodes

I am using an HPC system in which the folder /usr/ is not on NFS. Therefore, the libraries installed on the master node do not seem to be available on the compute nodes; that is, if I ssh to a compute node ...

mancolric

111

asked Nov 21 at 16:06

1 vote

1 answer

107 views

Best practices for SLURM job pipeline with wrapper scripts - avoiding complex job ID extraction

I'm building a SLURM pipeline where each stage is a bash wrapper script that generates and submits SLURM jobs. Currently I'm doing complex job ID extraction which feels clunky: # Current approach ...

desert_ranger

1,859

asked Sep 23 at 20:20

1 vote

0 answers

45 views

How to run Neo4j Docker container using Singularity on HPC without shutdown during data import?

I'm trying to run the Neo4j Docker container using Singularity on an HPC system. The container starts successfully, but it shuts down automatically when I try to add data to the database (e.g., via ...

prasad

13

asked Sep 18 at 8:59

1 vote

1 answer

64 views

Debugging parallel python program in interruptible sleep

I have an mpi4py program that runs well with mpiexec -np 30 python3 -O myscript.py at 100% CPU usage on each of the 30 CPUs. Now I am launching 8 instances with mpiexec -np 16 python3 -O myscript.py. ...

j13r

2,731

asked Aug 25 at 15:26

1 vote

0 answers

88 views

Slurm: salloc gets allocated then fails immediately with ExitCode=1:0 (Start=End same second), while equivalent sbatch works

I’ve been using salloc to allocate compute nodes without issues before. Recently, after switching to another user account (same .bashrc config, only the conda path changed), salloc stopped working. I ...

Calculus007

10

asked Aug 11 at 3:09

0 votes

0 answers

49 views

PostgreSQL, PostGIS, QGIS in a container launched from Charliecloud

I need to migrate my geospatial processing work (mainly using QGIS processing and PostGIS functions from Python scripts) to an HPC cluster. As neither QGIS nor PostGIS is installed on the HPC, I ...

Felix_geospatial

1

asked Jul 16 at 13:38

0 votes

1 answer

176 views

Spack `spack load` not setting LD_LIBRARY_PATH or CPATH environment variables as expected

I'm using Spack on Linux Mint to manage scientific libraries, including Armadillo. I have installed Armadillo and its dependencies via Spack in an environment. Problem: When I run spack load armadillo, ...

jorge isaac rubiano

1

asked Jul 11 at 22:54

0 votes

0 answers

56 views

slurmstepd: error: execve(): mkdir: No such file or directory

I tried to use the sbatch file from this link (Running WindNinja on an HPC Cluster) to run the WindNinja software (WindNinja introduction) installed on HPC. However, it always produces the "...

Kaiyuan Zheng

33

asked Jul 7 at 9:58

0 votes

0 answers

78 views

How to force Slurm to pack GPU jobs onto partially occupied nodes to free full nodes?

When users request 1-2 GPUs via sbatch --gres=gpu:1, Slurm locks the entire 8-GPU node. This fragments our cluster: Multiple small requests spread across nodes (e.g., four 1-GPU jobs occupy four ...

train-server

1

asked Jun 26 at 19:16

0 votes

1 answer

58 views

How to use mkl_dcsrgemv or other oneAPI functions to calculate the scalar product between a very high-dimensional sparse matrix and a vector?

I program in Fortran with the Intel oneAPI compiler ifx and the MKL packages. I want to calculate the scalar product between a very high-dimensional sparse matrix and a vector. When the dimension of the sparse matrix could be ...

River Chandler

1

asked Jun 17 at 8:56

0 votes

1 answer

79 views

How can I run snakemake jobs 'remotely'?

I love snakemake and have used it locally as well as on HPC with SLURM! However, now we have a particular setup where it is not as easy to use snakemake as we have done before: We need to run some ...

Sebastian Beyer

170

asked Jun 16 at 10:39

0 votes

0 answers

49 views

Sample UCP AM client failing with error "Destination is unreachable" for localhost

I'm learning UCX by creating a basic wrapper for both the client and server. I am using AM communication. When I run my client, I get the error below: [1749297901.816001] [prateek:19822:0] ...

Prateek Joshi

4,093

asked Jun 7 at 18:49

0 votes

0 answers

89 views

Can I use MPI_File_read_all to read non-contiguous datatypes directly (as opposed to setview)?

I'm trying to read different subsets of non-contiguous data from a file to different processes. I.e., I have a file with the data: a b c d e f g h i j and two processes that want to read the data from ...

Subject303

15

asked May 28 at 15:06

1 vote

2 answers

95 views

What is the difference between an MPI nonblocking collective write, iwrite_all vs a "nonblocking" noncollective iwrite combined with a file sync?

I'm setting up IO for a large-scale CFD code using the MPI library, and the file IO is starting to eat into computation time as my problems scale. As far as I can find the "done" thing in the ...

Subject303

15

asked Apr 29 at 4:36

0 votes

0 answers

50 views

Slurm partitions on same node overallocating CPUs

I have a single computation node with 32 CPUs. I have defined two different partitions that both use this node. If I for example send two jobs on partition A requesting 20 CPUs and 25 CPUs, the second ...

Daniel

1

asked Apr 14 at 14:46

0 votes

1 answer

70 views

Snakemake access snakemake.config in profile config.yaml file

I want to run a pipeline on a cluster where the job names are of the form: smk-{config["simulation"]}-{rule}-{wildcards}. Can I just do: snakemake --profile slurm --configfile ...

Kiffikiffe

153

asked Apr 2 at 16:02

1 vote

1 answer

96 views

Different ways of running Snakemake on a cluster

When running Snakemake on a cluster, if we don't have specific requirements for some rules about the number of cores or memory, what is the difference between: using the classic way, i.e. calling ...

Kiffikiffe

153

asked Apr 2 at 10:17

0 votes

1 answer

89 views

Slurm only running 6 out of 12 array jobs concurrently on my 12-core PC system

I have a 12-core laptop (6 physical cores with hyperthreading) running Slurm for local job scheduling. When I submit job arrays requesting all 12 cores to be used simultaneously, Slurm consistently ...

desert_ranger

1,859

asked Mar 24 at 22:42

0 votes

0 answers

87 views

Automating a resource allocation in bash

I want to automate resource allocation on an HPC server, node forwarding, and opening JupyterLab on the same node. Individually I have to go through the following steps: user@login1>salloc -A ...

Ep1c1aN

743

asked Mar 21 at 12:45

0 votes

0 answers

22 views

How can I let the processes run on different ranks, not only on rank 0?

I got an error when trying to bind the program with Intel MPI. #define _GNU_SOURCE #include <stdio.h> #include <unistd.h> #include <string.h> #include <sched.h> #include <...

user26958921

1

asked Mar 20 at 12:54

0 votes

0 answers

85 views

How can I specify for R to look in one directory for the dependencies of a package installed in another?

I'm trying to make R scripts run on an HPC cluster (with the SLURM workload manager). They need a specific package that I installed in a personal directory, since I can't install packages in the server-...

legabgob

1

asked Mar 14 at 17:57

0 votes

1 answer

40 views

Should I loop a container or loop inside a container?

I want to call genetic variants with DeepVariant on an HPC for about 1000 cereal lines. I successfully ran DV for one line with the docker image they provide using Apptainer/Singularity, but for the ...

skranz

65

asked Mar 12 at 8:58

6 votes

3 answers

244 views

Using inclusive scan syntax in OpenMP in the C language

I want to use the inclusive scan operation in OpenMP to implement an algorithm. What follows is a description of my attempt at doing so, and failing to get more than a tepid speedup. The inclusive ...

smilingbuddha

14.9k

asked Mar 9 at 2:04

0 votes

0 answers

143 views

Trouble finding runner for Ollama 0.5.13

I have Ollama version 0.5.13 installed on my university's HPC cluster. Because of lack of sudo access, I have a custom script that runs ollama for me. I am reproducing it below: # Set the custom ...

Ryan Hendricks

111

asked Mar 8 at 21:15

0 votes

0 answers

46 views

AWS PCS cluster creation failed with CloudFormation

I'm creating a complete HPC architecture on AWS using the AWS PCS service. In my CloudFormation template, the creation of every resource succeeds except AWS PCS. Cluster: Type: AWS::PCS::Cluster ...

parthraj panchal

121

asked Mar 7 at 16:40

0 votes

0 answers

75 views

Speed up read access of large (~300 MB) samples with h5py

I have a large .h5 file of high-resolution images (~300 MB each, 200 images per .h5 file) and need to load samples in Python. The current setup uses a separate dataset for each sample. data_group....

gekrone

179

asked Mar 7 at 16:11

0 votes

1 answer

155 views

MPI waitall error "The supplied request in array element 0 was invalid (kind=0)"

I'm trying to implement parallelization in a flow solver code for my PhD. I've inherited a subroutine that sends data between predefined subdomains. The subroutine is sending data through the ...

Subject303

15

asked Mar 6 at 17:02

0 votes

0 answers

95 views

Unrecognised compiler commands in a compiler config file run using the mpiifort command

Hi, I'm trying to compile and run a .f90 code using the Intel Fortran compiler (ifx) and the Intel MPI library on a Linux HPC. I'm invoking the compiler through a .sh script with the following lines: ...

Subject303

15

asked Mar 6 at 1:34

0 votes

0 answers

68 views

Job getting killed on HPC cluster, why?

I am trying to solve a nonlinear optimization problem in AMPL. It is quite large but not ridiculously so. I solved a similar problem on my home PC (about one order of magnitude smaller, though). I ...

apg

101

asked Mar 4 at 16:22

0 votes

0 answers

52 views

How to run software installed in my home folder on a compute node

I have some software (AMPL) installed in my home folder on a Grid Engine based HPC cluster at a university. I just want to source AMPL properly when I run my jobscript in the queue. I need to run ...

apg

101

asked Mar 1 at 13:08

0 votes

0 answers

42 views

SLURM GPU Allocation

I'm brand new to Linux / Slurm / HPC, so apologies if this seems trivial. I have access to a node, consisting of 4 GPUs, of an HPC. I have a job that runs out of memory when running on a single GPU, so ...

Paul

41

asked Feb 24 at 8:19

0 votes

1 answer

54 views

XmlBinaryNodeWriter failing to serialize unicode Group Managed Service Account password for web service transmission

Backstory: We are submitting an HPC job using the Microsoft HPC Pack 2019 SP3 SDK. HPC doesn't natively support Active Directory gMSA accounts, so we obtain the gMSA account password via AD. The MSA ...

Jon Barker

1,828

asked Feb 21 at 23:08

0 votes

0 answers

21 views

MLP Speed-Up in PySpark fluctuates with more cores – possible cache memory issue?

I have conducted experiments running the MLP (Multi-Layer Perceptron) algorithm on a PC cluster with Apache Spark, with configurations ranging from small data to large ...

Syahel Razaba

1

asked Feb 16 at 22:23

0 votes

0 answers

65 views

Can I use VS Code Remote for Multi-Hop Interactive HPC Sessions?

Without an IDE, I can log in to an HPC interactive node by first sshing in to the server using: ssh servername Then I request an interactive node using qrsh # Sun Grid Engine # OR qsub -I # Slurm ...

David LeBauer

31.9k

asked Feb 4 at 17:29

0 votes

1 answer

124 views

Mental Model for Hybrid MPI/OpenMP with SLURM

Question I am trying to develop a clear mental model for using SLURM to request resources on HPC systems for hybrid MPI/OpenMP jobs. In thinking about it more, I realized there are some gaps in my ...

Jared

714

asked Jan 30 at 13:34

0 votes

0 answers

50 views

MPI Collective communication along axes with uneven data distribution per rank

I am attempting to implement a method in MPI for a well established particle simulation program that involves image processing. The program runs a loop for millions of iterations that performs a ...

William Betancourt

1

asked Jan 24 at 5:29

0 votes

1 answer

81 views

Assessing the contribution of communication to the runtime of an MPI program

Background Let's say I have a complex MPI program with multiple message passing events and computations. The communication pattern is that of bidirectional ring messaging as shown in the figure below. ...

Nitin Malapally

648

asked Jan 17 at 11:20

1 vote

0 answers

77 views

Simple MS-MPI program fails with mixed AMD/Intel CPUs

The following code example simply calls MPI_Barrier in a loop. On a 2-computer cluster of Intel machines, it runs correctly. When run from an Intel machine together with an AMD machine, it completes the first ...

Jeffrey Faust

653

asked Jan 9 at 22:31

3 votes

1 answer

299 views

Easiest way to run SLURM on multiple files

I have a Python script that processes approximately 10,000 FITS files one by one. For each file, the script generates an output in the same directory as the input files and creates a single CSV file ...

Falco Peregrinus

607

asked Dec 16, 2024 at 12:43

1 vote

2 answers

145 views

Do programmers need to manually implement optimizations such as loop unfolding, etc., when writing Python code?

I have recently been learning some HPC topics and found out that modern C/C++ compilers are able to detect places where optimization is applicable and perform it using corresponding techniques such as SIMD, ...

PkDrew

2,301

asked Dec 12, 2024 at 14:15

0 votes

0 answers

45 views

Are the allocated nodes of the login node supposed to be empty?

Because I am trying to find the reasons and solve another problem (this one with mpirun saying I have a problem with my current allocation), I tried to find the allocations of my nodes in a multinode ...

KansaiRobot

10.6k

asked Dec 12, 2024 at 6:29

1 vote

0 answers

93 views

How to solve the issue with getting free ports in PyTorch DDP?

I am facing issues with getting a free port in the DDP setup block of PyTorch for parallelizing my deep learning training job across multiple GPUs on a Linux HPC cluster. I am trying to submit a deep ...

Shataneek Banerjee

11

asked Dec 6, 2024 at 18:30

1 vote

0 answers

69 views

C++ Hypre - Solver returns unexpected result

I'm trying to use Hypre to solve a system of linear equations: #include <stdio.h> #include <stdlib.h> #include <string.h> #include <math.h> #include "HYPRE_krylov.h" #...

Huy Hoàng Nguyễn

11

asked Dec 3, 2024 at 19:24

0 votes

0 answers

62 views

Submit too many commands from different files as a single batch job

I want to use bash to run a batch job on an HPC. The commands to be executed are saved to a text file. Previously, I used the following to run each line of the text file separately as a batch job. ...

Ahmed El-Gabbas

537

asked Dec 3, 2024 at 19:06

1 vote

0 answers

36 views

Netlogo-headless.sh error when running on HPC

Half of the jobs I submit to my HPC return the following error message in the out file, and the job ends: /sw/rl8/zen/app/NetLogo/6.4.0-64/netlogo-headless.sh: line 34: 111089 Killed "$JAVA" &...

Bart de Bruin

71

asked Nov 30, 2024 at 2:12

7 votes

3 answers

235 views

OpenMP Tasks for Recursion

I am new to OpenMP programming and I have a question regarding task parallelism on recursion. Let's consider this demo C code: #include <stdio.h> #include <stdlib.h> #include <sys/time....

hpc_beginner

71

asked Nov 23, 2024 at 7:15

0 votes

1 answer

529 views

How can I know if NCCL is installed?

Very simple question. I have access to a multi-node machine and I have to do some NCCL tests. In the readme it says If CUDA is not installed in /usr/local/cuda, you may specify CUDA_HOME. Similarly, ...

KansaiRobot

10.6k

asked Nov 17, 2024 at 13:17

0 votes

1 answer

65 views

Slurm: Use cores from multiple nodes for Python parallelization

This question is somewhat similar to this one, "Slurm: Use cores from multiple nodes for R parallelization", but it is for Python. I have a Python program which can use multiple cores on a PC, it does ...

Quantum Monte Carlo

1

asked Nov 14, 2024 at 10:13

1 vote

0 answers

47 views

MPI_Bcast not Bcasting

I am running an MPI application on 32 processes. The stdout of the rank 0 process gets sent to a separate file for startup error logging; we will call this file STARTUP_ERROR, while the stdout of all ...

Defcon97

121

asked Nov 8, 2024 at 16:07

1 vote

0 answers

70 views

Overlaying OpenMP onto an MPI program causes slowdown of the region parallelised with OpenMP

I have a particle simulation in C which is split over 4 MPI processes and running fast (compared to serial). However, one region of my implementation is N^2 complexity, where I need to compare each ...

Luna Morrow

11

asked Nov 3, 2024 at 23:16
