I am looking for a solution to an issue with managing and making full use of the resources I am accessing on a national HPC service.
The service has 2 main queues of relevance: 1) Intel Xeon Gold 6148 Skylake processors (2x20 cores per node), and to a lesser extent 2) Intel Xeon Phi 7210 KNL processors.
When I run my code [which is an ocean physics/biogeochemistry model] on 1 node, I experience no issues. However, when I run on more than 1 node, I get a segmentation fault. The only way I have found to avoid this segmentation fault is to request fewer than the full complement of cores per node, which is prohibitive: 1 node has 40 cores, and the more nodes I request, the fewer cores per node I can use. The cap I have hit is 6 nodes with 15 cores per node, i.e. 90 cores overall.
This naturally means that the speed-up I had hoped for is considerably curtailed.
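For reference, this is a minimal sketch of the kind of under-subscribed submission I am describing (the job name, partition, module, and executable names are placeholders, not my actual script):

```bash
#!/bin/bash
#SBATCH --job-name=ocean_model       # placeholder name
#SBATCH --partition=skylake          # placeholder for the 2x20-core Skylake queue
#SBATCH --nodes=6                    # the most nodes I can use...
#SBATCH --ntasks-per-node=15         # ...and only by under-subscribing each 40-core node (90 ranks total)
#SBATCH --time=24:00:00

module load intel-mpi                # placeholder module name
mpirun ./ocean_model.exe             # or srun, depending on the site's recommended launcher
```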
I understand that the launch node has an unlimited stack size, but that every additional node uses a lower default stack size. Consequently, I tried the command `ulimit -s unlimited`, but had no luck.
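For completeness, this is roughly how I tried it (a sketch rather than my exact script): the `ulimit` call sits in the batch script before the launch, so presumably it only raises the limit in the shell on the launch node.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40

ulimit -s unlimited        # raises the stack limit only in this shell on the launch node
mpirun ./ocean_model.exe   # ranks on the other nodes may still start with their default stack limit
```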
[UPDATE: In response to the first comment below from Gilles, I have tried this suggestion previously, to no avail.]
I suspect that some site-specific setting on the HPC service I am using is limiting things, as the same model run on similar HPC services in other countries apparently works fine and achieves the desired scale-up.
I would appreciate any suggestions relating to Slurm configuration and HPC usage. Due to time constraints, I can't engage in extensive rewriting of the MPI aspects of this code, which has been thoroughly developed by multiple research agencies.
I reiterate that this code works fine on other HPC systems under both Slurm and PBS scripting.
[For reference, the suggestions from Gilles's comment:] Put `ulimit -s unlimited; a.out` in a wrapper script and then `mpirun <that script>`. If you use `mpirun`, you might want to give direct launch (e.g. `srun a.out`) a try (or the other way around). There is also the `KMP_STACKSIZE` environment variable with Intel compilers. If you use Infiniband, you might also have to `ulimit -l unlimited`.
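For anyone else following along, this is how I read the wrapper-script suggestion (a sketch under my own assumptions; the executable name and the `KMP_STACKSIZE` value are guesses, and I have not yet confirmed it works on this system):

```bash
#!/bin/bash
# wrapper.sh - raise limits in the shell that actually execs each MPI rank
ulimit -s unlimited           # per-process stack size
ulimit -l unlimited           # locked memory, relevant on Infiniband
export KMP_STACKSIZE=512m     # per-thread stack for Intel OpenMP (value is a guess)
exec ./ocean_model.exe "$@"
```

launched either directly with `srun ./wrapper.sh` or via `mpirun ./wrapper.sh` from the batch script.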