1,609 questions
Advice
0
votes
3
replies
39
views
HPC: /usr/lib folder not accessible to nodes
I am using an HPC system in which the folder /usr/ is not shared over NFS. Therefore, the libraries installed on the master node do not seem to be available on the compute nodes; that is, if I ssh to a compute node ...
1
vote
1
answer
107
views
Best practices for SLURM job pipeline with wrapper scripts - avoiding complex job ID extraction
I'm building a SLURM pipeline where each stage is a bash wrapper script that generates and submits SLURM jobs. Currently I'm doing complex job ID extraction which feels clunky:
# Current approach
...
1
vote
0
answers
45
views
How to run Neo4j Docker container using Singularity on HPC without shutdown during data import?
I'm trying to run the Neo4j Docker container using Singularity on an HPC system. The container starts successfully, but it shuts down automatically when I try to add data to the database (e.g., via ...
1
vote
1
answer
64
views
Debugging parallel python program in interruptible sleep
I have a mpi4py program, which runs well with mpiexec -np 30 python3 -O myscript.py at 100% CPU usage on each of the 30 CPUs.
Now I am launching 8 instances with mpiexec -np 16 python3 -O myscript.py. ...
1
vote
0
answers
88
views
Slurm: salloc gets allocated then fails immediately with ExitCode=1:0 (Start=End same second), while equivalent sbatch works
I’ve been using salloc to allocate compute nodes without issues before. Recently, after switching to another user account (same .bashrc config, only the conda path changed), salloc stopped working. I ...
0
votes
0
answers
49
views
PostgreSQL, PostGIS, QGIS in container launched from Charliecloud
I need to migrate my geospatial processing work (mainly QGIS processing and PostGIS functions called from Python scripts) to an HPC cluster. As neither QGIS nor PostGIS is installed on the HPC I ...
0
votes
1
answer
176
views
Spack `spack load` not setting LD_LIBRARY_PATH or CPATH environment variables as expected
I'm using Spack on Linux Mint to manage scientific libraries, including Armadillo. I have installed Armadillo and its dependencies via Spack in an environment.
Problem:
When I run spack load armadillo, ...
0
votes
0
answers
56
views
slurmstepd: error: execve(): mkdir: No such file or directory
I tried to use the sbatch file from this link (Running WindNinja on an HPC Cluster) to run the WindNinja software (WindNinja introduction) installed on HPC. However, it always produces the "...
0
votes
0
answers
78
views
How to force Slurm to pack GPU jobs onto partially occupied nodes to free full nodes?
When users request 1-2 GPUs via sbatch --gres=gpu:1, Slurm locks the entire 8-GPU node. This fragments our cluster:
Multiple small requests spread across nodes (e.g., four 1-GPU jobs occupy four ...
0
votes
1
answer
58
views
How to use mkl_dcsrgemv or other functions in OneAPI to calculate the scalar product between a very high-dimensional sparse matrix and a vector?
I program in Fortran with the Intel OneAPI compiler ifx and the MKL packages.
I want to calculate the scalar product between a very high-dimensional sparse matrix and a vector.
When the dim of the sparse matrix could be ...
0
votes
1
answer
79
views
How can I run snakemake jobs 'remotely'?
I love snakemake and have used it locally as well as on HPC with SLURM!
However, now we have a particular setup where it is not as easy to use snakemake as we have done before:
We need to run some ...
0
votes
0
answers
49
views
Sample UCP AM client failing with error "Destination is unreachable" for localhost
I'm learning UCX by creating a basic wrapper for both the client and server. I am using AM communication. When I run my client, I get the error below:
[1749297901.816001] [prateek:19822:0] ...
0
votes
0
answers
89
views
Can I use MPI_File_read_all to read non contiguous datatypes directly (as opposed to setview)?
I'm trying to read different subsets of non-contiguous data from a file to different processes.
Ie:
I have a file with the data:
a b c d e f g h i j
and two processes who want to read the data from ...
1
vote
2
answers
95
views
What is the difference between an MPI nonblocking collective write, iwrite_all vs a "nonblocking" noncollective iwrite combined with a file sync?
I'm setting up IO for a large-scale CFD code using the MPI library, and the file IO is starting to eat into computation time as my problems scale.
As far as I can find the "done" thing in the ...
0
votes
0
answers
50
views
Slurm partitions on same node overallocating CPUs
I have a single computation node with 32 CPUs. I have defined two different partitions that both use this node. If I for example send two jobs on partition A requesting 20 CPUs and 25 CPUs, the second ...
0
votes
1
answer
70
views
Snakemake access snakemake.config in profile config.yaml file
I want to run a pipeline on a cluster where the job names are of the form: smk-{config["simulation"]}-{rule}-{wildcards}. Can I just do:
snakemake --profile slurm --configfile ...
1
vote
1
answer
96
views
Snakemake in cluster different ways
When running snakemake on a cluster, and if we don't have specific requirements for some rules about number of cores/memory, then what is the difference between :
Using the classic way, i.e. calling ...
0
votes
1
answer
89
views
Slurm only running 6 out of 12 array jobs concurrently on my 12-core PC system
I have a 12-core laptop (6 physical cores with hyperthreading) running Slurm for local job scheduling. When I submit job arrays requesting all 12 cores to be used simultaneously, Slurm consistently ...
0
votes
0
answers
87
views
Automating a resource allocation in bash
I want to automate resource allocation on an HPC server, node forwarding, and opening JupyterLab on the same node. Individually I have to go through the following steps:
user@login1>salloc -A ...
0
votes
0
answers
22
views
How can I let the processes run on different ranks, not only on rank 0?
I got some errors when trying to bind the program with Intel MPI.
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sched.h>
#include <...
0
votes
0
answers
85
views
How can I specify for R to look in one directory for the dependencies of a package installed in another?
I'm trying to make R scripts run on a HPC cluster (with SLURM workload manager), which need a specific package that I installed in a personal directory since I can't install packages in the server-...
0
votes
1
answer
40
views
Should I loop a container or loop inside a container?
I want to call genetic variants with DeepVariant on an HPC for about 1000 cereal lines. I successfully ran DV for one line with the docker image they provide using Apptainer/Singularity, but for the ...
6
votes
3
answers
244
views
Using inclusive scan syntax in OpenMP in the C language
I want to use the inclusive scan operation in OpenMP to implement an algorithm. What follows is a description of my attempt at doing so, and failing to get more than a tepid speedup.
The inclusive ...
0
votes
0
answers
143
views
Trouble finding runner for Ollama 0.5.13
I have Ollama version 0.5.13 installed on my university's HPC cluster.
Because of lack of sudo access, I have a custom script that runs ollama for me. I am reproducing it below:
# Set the custom ...
0
votes
0
answers
46
views
AWS PCS cluster creation failed with CloudFormation
I'm creating a complete HPC architecture on AWS using the AWS PCS service.
In my CloudFormation template, literally every resource is created successfully except AWS PCS.
Cluster:
Type: AWS::PCS::Cluster
...
0
votes
0
answers
75
views
Speed up read access of large (~300mb) samples with H5py
I have a large .h5 file of high resolution images (~300MB each, 200 images per .h5 file) and need to load samples in python. The current setup uses a separate dataset for each sample.
data_group....
0
votes
1
answer
155
views
MPI waitall error "The supplied request in array element 0 was invalid (kind=0)"
I'm trying to implement parallelization in a flow solver code for my PhD. I've inherited a subroutine that sends data between predefined subdomains.
The subroutine is sending data through the ...
0
votes
0
answers
95
views
Unrecognised compiler commands in a compiler config file run using the mpiifort command
Hi, I'm trying to compile and run a .f90 code using the Intel Fortran compiler (ifx) and the Intel MPI library on a Linux HPC.
I'm invoking the compiler through a .sh script with the following lines:
...
0
votes
0
answers
68
views
Job getting killed on HPC cluster, why?
I am trying to solve a nonlinear optimization problem in AMPL. It is quite large but not ridiculously so. I solved a similar problem on my home PC (about 1 order of magnitude less in size though).
I ...
0
votes
0
answers
52
views
How to run software installed in my home folder on a compute node
I have some software (AMPL) installed on my home folder on a Grid Engine based HPC cluster at a university.
I'm looking just to source AMPL properly when I run my jobscript in the queue.
I need to run ...
0
votes
0
answers
42
views
SLURM GPU Allocation
I'm brand new to Linux / Slurm / HPC, so apologies if this seems trivial. I have access to a node, consisting of 4 GPUs, of an HPC. I have a job that when running on a single GPU runs out of memory so ...
0
votes
1
answer
54
views
XmlBinaryNodeWriter failing to serialize unicode Group Managed Service Account password for web service transmission
Backstory: We are submitting an HPC job using the Microsoft HPC Pack 2019 SP3 SDK. HPC doesn't natively support Active Directory gMSA accounts, so we obtain the gMSA account password via AD. The MSA ...
0
votes
0
answers
21
views
MLP Speed-Up in PySpark fluctuates with more cores – possible cache memory issue?
I have conducted experiments running the MLP (Multi-Layer Perceptron) algorithm on a PC cluster with Apache Spark, with configurations ranging from small data to large ...
0
votes
0
answers
65
views
Can I use VS Code Remote for Multi-Hop Interactive HPC Sessions?
Without an IDE, I can log in to an HPC interactive node by first sshing in to the server using:
ssh servername
Then I request an interactive node using
qrsh # Sun Grid Engine
# OR
qsub -I # Slurm
...
0
votes
1
answer
124
views
Mental Model for Hybrid MPI/OpenMP with SLURM
Question
I am trying to develop a clear mental model for using SLURM to request resources on HPC systems for hybrid MPI/OpenMP jobs. In thinking about it more, I realized there are some gaps in my ...
0
votes
0
answers
50
views
MPI Collective communication along axes with uneven data distribution per rank
I am attempting to implement a method in MPI for a well established particle simulation program that involves image processing. The program runs a loop for millions of iterations that performs a ...
0
votes
1
answer
81
views
Assessing the contribution of communication to the runtime of an MPI program
Background
Let's say I have a complex MPI program with multiple message passing events and computations. The communication pattern is that of bidirectional ring messaging as shown in the figure below.
...
1
vote
0
answers
77
views
Simple MS-MPI program fails with mixed AMD/Intel CPUs
The following code example simply calls MPI_Barrier in a loop. On a 2 computer cluster of Intel machines, it runs correctly. When run from an Intel machine, with an AMD machine, it completes the first ...
3
votes
1
answer
299
views
Easiest way to run SLURM on multiple files
I have a Python script that processes approximately 10,000 FITS files one by one. For each file, the script generates an output in the same directory as the input files and creates a single CSV file ...
1
vote
2
answers
145
views
Do programmers need to manually implement optimizations such as loop unfolding when writing Python code?
I have recently been learning some HPC topics and found that modern C/C++ compilers are able to detect places where optimization is possible and apply it using corresponding techniques such as SIMD, ...
0
votes
0
answers
45
views
Are the allocated nodes of the login node supposed to be empty?
Because I am trying to find the reasons and solve another problem (this one with mpirun saying I have a problem with my current allocation), I tried to find the allocations of my nodes in a multinode ...
1
vote
0
answers
93
views
How to solve the issue with getting free ports in PyTorch DDP?
I am facing issues with getting a free port in the DDP setup block of PyTorch for parallelizing my deep learning training job across multiple GPUs on a Linux HPC cluster.
I am trying to submit a deep ...
1
vote
0
answers
69
views
C++ Hypre - Solver returns unexpected result
I'm trying to use Hypre to solve a system of linear equations:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "HYPRE_krylov.h"
#...
0
votes
0
answers
62
views
Submit too many commands from different files as a single batch job
I want to use bash to run a batch job on an HPC. The commands to be executed are saved to a text file. Previously, I used the following to run each line of the text file separately as a batch job.
...
1
vote
0
answers
36
views
Netlogo-headless.sh error when running on HPC
Half of the jobs I submit to my HPC return the following error message in the out file, and the job ends:
/sw/rl8/zen/app/NetLogo/6.4.0-64/netlogo-headless.sh: line 34: 111089 Killed "$JAVA" &...
7
votes
3
answers
235
views
OpenMP Tasks for Recursion
I am new to OpenMP programming and I have a question regarding task parallelism in recursion.
Let's consider this demo C code:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time....
0
votes
1
answer
529
views
How can I know if NCCL is installed?
Very simple question. I have access to a multi-node machine and I have to do some NCCL tests.
In the readme it says
If CUDA is not installed in /usr/local/cuda, you may specify
CUDA_HOME. Similarly, ...
0
votes
1
answer
65
views
Slurm: Use cores from multiple nodes for Python parallelization
This question is somewhat similar to this one:
Slurm: Use cores from multiple nodes for R parallelization
But it is for Python.
I have a Python program which can use multiple cores on a PC, it does ...
1
vote
0
answers
47
views
MPI_Bcast not Bcasting
I am running an MPI application on 32 processes.
The stdout of the rank 0 process gets sent to a separate file for startup error logging (we will call this file STARTUP_ERROR), while the stdout of all ...
1
vote
0
answers
70
views
Overlaying OpenMP onto MPI program causes slowdown of the region parallelised with OpenMP
I have a particle simulation in C which is split over 4 MPI processes and running fast (compared to serial). However, one region of my implementation is N^2 complexity, where I need to compare each ...