MPI

SLURM

In this section, we'll focus on MPI applications that run under SLURM (or a similar job scheduler). On most systems, the scheduler sets the affinity mask of each Julia process (MPI rank) based on the options specified by the user (e.g. via #SBATCH). Consequently, one has to do little to nothing on the Julia side to achieve the desired pinning pattern.
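
To check what the scheduler has actually handed your Julia process, you can inspect the affinity mask directly from Julia. A minimal sketch, using the same ThreadPinning functions that appear in the examples below:

using ThreadPinning

# Print the affinity mask of each Julia thread (as set by SLURM).
print_affinity_masks()
# Show the CPU-threads that the Julia threads are currently running on.
println("CPUs: ", getcpuids())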

MPI only

If your MPI-parallel application is single-threaded (i.e. one Julia thread per MPI rank), you likely don't have to do anything on the Julia side to pin the MPI ranks; the SLURM options alone are sufficient.

Multinode example, 1 MPI rank per socket:

#!/usr/bin/env sh
#SBATCH -N 2 # two nodes
#SBATCH -n 4 # four MPI ranks in total
#SBATCH --ntasks-per-socket 1 # one MPI rank per socket
#SBATCH -o sl_mpi_multinode_%j.out
#SBATCH -A pc2-mitarbeiter
#SBATCH -p all
#SBATCH -t 00:02:00
#=
ml lang JuliaHPC # load Julia module (system specific!)
srun -n 4 julia --project -t 1 $(scontrol show job $SLURM_JOBID | awk -F= '/Command=/{print $2}')
exit
# =#
using MPI
using ThreadPinning

MPI.Init()
nranks = MPI.Comm_size(MPI.COMM_WORLD)
rank = MPI.Comm_rank(MPI.COMM_WORLD)
MPI.Barrier(MPI.COMM_WORLD)
sleep(2*rank) # crude way to order the output by rank
println("Rank $rank:")
println("\tHost: ", gethostname())
println("\tCPUs: ", getcpuids())
print_affinity_masks()

Output (manually cleaned up a bit):

Rank 0:
    Host: n2fpga19
    CPUs: [0]
1:   |1000000000000000000000000000000000000000000000000000000000000000|0000000000000000000000000000000000000000000000000000000000000000|

Rank 1:
    Host: n2fpga19
    CPUs: [64]
1:   |0000000000000000000000000000000000000000000000000000000000000000|1000000000000000000000000000000000000000000000000000000000000000|

Rank 2:
    Host: n2fpga33
    CPUs: [0]
1:   |1000000000000000000000000000000000000000000000000000000000000000|0000000000000000000000000000000000000000000000000000000000000000|

Rank 3:
    Host: n2fpga33
    CPUs: [64]
1:   |0000000000000000000000000000000000000000000000000000000000000000|1000000000000000000000000000000000000000000000000000000000000000|

Hybrid: MPI + Threads

If your MPI-parallel application is multithreaded (i.e. multiple Julia threads per MPI rank), you can use pinthreads(:affinitymask) to pin the Julia threads of each MPI rank according to the affinity mask set by SLURM (based on the user-specified options). If you don't use pinthreads(:affinitymask), the Julia threads are merely bound to a range of CPU-threads: they can migrate within that range and may even overlap, i.e. occupy the same CPU-thread. See Process Affinity Mask for more information.

Multinode example, 1 MPI rank per socket, 25 threads per rank:

#!/usr/bin/env sh
#SBATCH -N 2 # two nodes
#SBATCH -n 4 # four MPI ranks in total
#SBATCH --ntasks-per-socket 1 # one MPI rank per socket
#SBATCH --cpus-per-task 25 # 25 CPU-threads per MPI rank
#SBATCH -o sl_hybrid_multinode_affinitymask_%j.out
#SBATCH -A pc2-mitarbeiter
#SBATCH -p all
#SBATCH -t 00:02:00
#=
ml lang JuliaHPC # load Julia module (system specific!)
srun -n 4 julia --project -t 25 $(scontrol show job $SLURM_JOBID | awk -F= '/Command=/{print $2}')
exit
# =#
using MPI
using ThreadPinning
pinthreads(:affinitymask)

MPI.Init()
nranks = MPI.Comm_size(MPI.COMM_WORLD)
rank = MPI.Comm_rank(MPI.COMM_WORLD)
MPI.Barrier(MPI.COMM_WORLD)
sleep(2*rank) # crude way to order the output by rank
println("Rank $rank:")
println("\tHost: ", gethostname())
println("\tCPUs: ", getcpuids())

Output:

Rank 0:
    Host: n2cn0853
    CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
Rank 1:
    Host: n2cn0853
    CPUs: [64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88]
Rank 2:
    Host: n2cn0854
    CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
Rank 3:
    Host: n2cn0854
    CPUs: [64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88]
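
If you want a more visual check of the resulting pinning, one option is to let each rank print ThreadPinning's threadinfo() summary. A minimal sketch, using the same output-ordering trick as above:

using MPI
using ThreadPinning
pinthreads(:affinitymask)

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
MPI.Barrier(MPI.COMM_WORLD)
sleep(2*rank) # crude way to order the output by rank
println("Rank $rank:")
threadinfo() # visualize which CPU-threads the Julia threads of this rank occupy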

Manual

In this section, we describe how you can pin the Julia threads of your MPI ranks manually, that is, without any "help" from an external affinity mask (e.g. as set by SLURM, see above).

TODO: pinthreads_mpi
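
Until pinthreads_mpi is documented here, the following is a minimal sketch of one manual approach: determine the node-local rank via a shared-memory sub-communicator and pin the Julia threads of each rank to a contiguous block of CPU-threads. The block size (cpus_per_rank = 64, i.e. one full socket on the nodes used above) is an assumption and must be adapted to your system.

using MPI
using ThreadPinning
using Base.Threads: nthreads

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Determine the node-local rank via a shared-memory sub-communicator.
localcomm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
localrank = MPI.Comm_rank(localcomm)

# Pin the Julia threads of this rank to a contiguous block of CPU-threads,
# offset by the node-local rank. Adjust cpus_per_rank to your system
# (here assumed: 64 CPU-threads per socket, as on the nodes above).
cpus_per_rank = 64
offset = localrank * cpus_per_rank
pinthreads(offset:(offset + nthreads() - 1))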