
I am running a simulation on a Linux machine with the following specs:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:              4
CPU MHz:               3099.902
CPU max MHz:           3700.0000
CPU min MHz:           1000.0000
BogoMIPS:              4800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              28160K

This is the run command line script for my solver.

/path/to/meshfree/installation/folder/meshfree_run.sh    # on 1 (serial) worker
/path/to/meshfree/installation/folder/meshfree_run.sh N  # on N parallel MPI processes

I share the system with another colleague of mine. He uses 10 cores for his solution. What would be the fastest option for me in this case? Using 30 MPI processes?

I am a Mechanical Engineer with very little knowledge of parallel computing. So please excuse me if the question is too stupid.

  • 1
    Is the question - Am I better off running this process one time, or kicking it off 30 times? This is a very application specific question and depends on too many variables to answer conclusively. In summary it's a case of 'try it and find out' Commented Feb 28, 2020 at 13:52
  • It's just one job. I need to allocate the right resources to it. So on a 40-core machine, with 10 cores already being used, am I better off running the code on 30 processes? Also considering hyperthreading. Commented Feb 28, 2020 at 13:58
  • What's the alternative option you suggest? Why can't you run it and find out what's best? Commented Feb 28, 2020 at 13:59
  • The alternative would be to use 60 processes, but I am not sure how the processes are split between the processors. It takes around 4 days for one whole simulation and I am running short on time. I am already in the middle of a simulation. Commented Feb 28, 2020 at 14:09
  • Well you'd think before kicking off a 4 day long simulation you'd know the best way to maximize your compute power - usually via using a cut down version (e.g. 10% sample). Again - not a question any one can answer for you as it's too bespoke. Commented Feb 28, 2020 at 14:18

1 Answer


Q : "What would be the fastest option for me in this case? ...running short on time. I am already in the middle of a simulation."

Salutes to Aachen. Were it not for the ex-post remark ( you are already in the middle of a simulation ), the fastest option would be to pre-configure the computing eco-system so that you:

  • check the full details of your NUMA topology - using lstopo, or lstopo-no-graphics -.ascii, not just lscpu
  • initiate your jobs with as many MPI-worker processes as possible mapped onto physical CPU-cores ( best each one pinned exclusively onto its own private core ), as these deserve it - they carry the core FEM / meshing processing workload
  • if your FH policy does not forbid doing so, ask the system administrator to introduce CPU-affinity mapping ( that will protect your in-cache data from eviction and expensive re-fetches ): 10 CPU-cores mapped exclusively for use by your colleague, the said 30 CPU-cores mapped exclusively for your application runs, and the remaining ~40 logical CPUs "shared"-for-use by both, via the respective CPU-affinity masks
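The three steps above could be sketched as follows. The core ranges are hypothetical ( they assume your colleague's 10 processes sit on cores 0-9, which you would first have to verify ), and any MPI-level binding options inside meshfree_run.sh may override a plain taskset mask:

```shell
# 1) Inspect the topology (hwloc package): sockets, NUMA nodes, caches,
#    and which logical CPUs are hyper-thread siblings of the same core.
command -v lstopo-no-graphics >/dev/null && lstopo-no-graphics -.ascii

# 2) Count real physical cores, as opposed to the 80 logical CPUs:
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l    # 40 on this box

# 3) Hypothetical affinity split: colleague on cores 0-9, your 30 MPI
#    ranks pinned onto the remaining physical cores 10-39 (child
#    processes inherit the mask):
# taskset -c 10-39 /path/to/meshfree/installation/folder/meshfree_run.sh 30
```

Pinning onto cores 10-39 keeps each rank on its own physical core and leaves the hyper-thread siblings 40-79 idle, which is usually what a memory-bandwidth-hungry FEM solver wants.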

Q : "Using 30 MPI processes?"

No - for ASAP processing, use as many CPU-cores for workers as possible. Already-MPI-parallelised FEM simulations have a high degree of parallelism and most often a by-nature "narrow" locality ( whether represented as a sparse-matrix / N-band-matrix solver ), so the parallel portion is often very high compared to other numerical problems - Amdahl's Law explains why.
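As a rough illustration of why more workers pay off when the parallel portion is high, here is a minimal sketch of Amdahl's Law; the parallel fraction p = 0.95 is an illustrative assumption, not a measured value for your solver:

```shell
# Amdahl's Law: speedup S(N) = 1 / ( (1 - p) + p / N )
# amdahl N p -> prints the theoretical speedup on N workers
amdahl() { awk -v n="$1" -v p="$2" 'BEGIN { printf "%.2f", 1 / ((1 - p) + p / n) }'; }

# illustrative parallel fraction p = 0.95 (an assumption, not measured)
for N in 10 20 30; do
  echo "N=$N  speedup=$(amdahl "$N" 0.95)x"
done
# N=10  speedup=6.90x
# N=20  speedup=10.26x
# N=30  speedup=12.24x
```

Note that even at 95 % parallel work, going from 20 to 30 workers only gains about 20 % more speedup - the serial fraction, not the core count, becomes the limit.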

Sure, there might be some academic objections that a slight difference is possible in cases where communication overheads get slightly reduced with one fewer worker; yet the need for brute-force processing rules in FEM / meshed solvers: the communication costs ( sending but a small amount of the neighbouring blocks' "boundary"-node state data ) are typically far less expensive than the large-scale, FEM-segmented numerical computing part.

htop will show you the actual state ( you may notice process-to-CPU-core wandering, due to HT / CPU-core thermal-balancing tricks, which decreases the resulting performance ).
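A quick way to snapshot the same thing without htop: the PSR column of ps shows the logical CPU each process last ran on ( the PID 12345 below is hypothetical - substitute one of your solver's rank PIDs ):

```shell
# PID, last-used logical CPU (PSR), CPU% and command, busiest first;
# re-run a few times: stable PSR values mean the ranks stay put,
# changing values mean the scheduler is migrating them between cores.
ps -eo pid,psr,pcpu,comm --sort=-pcpu | head -15

# Inspect (or, with -p alone, modify) the affinity mask of one running
# rank - hypothetical PID:
# taskset -cp 12345
```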


And do consult the meshfree Support for their Knowledge Base sources on Best Practices.


Next time, the best option would be to acquire a less restrictive computing infrastructure for processing critical workloads ( under business-critical conditions, consider restricted compute a risk to smooth BAU, the more so if it impacts your business continuity ).


2 Comments

Thank you. That helps a lot.
Always welcome, @vikingd - you may like stackoverflow.com/a/60427809 for both the performance-impacting details and an interactive graphical tool ( a simulator ) of the Amdahl's Law net-speedups on real-world [SERIAL]-[PARALLEL] workloads
