
Scalability of BigDFT with MPI and OpenMP

This lesson has been created for the current stable version. Any version more recent than 1.5.0 is fully capable of running this tutorial, although the input files may have to be adapted to the formats used by earlier versions.

Introduction

Figure: the boron 80 (B80) cluster studied in this lesson.

There are two levels of parallelization implemented in BigDFT: MPI and OpenMP. The former targets distributed memory architectures while the latter corresponds to shared memory architectures. Each has some particular advantages as well as disadvantages, and BigDFT can benefit from selected advantages of both MPI and OpenMP if a proper combination of the two is chosen. The aim of this lesson is to give the user an idea of the differences in behaviour of the two parallelization schemes, in order to learn how to use the code for bigger systems and/or architectures.

The MPI parallelization in BigDFT relies on the orbital distribution scheme, in which the orbitals of the system under investigation are distributed over the assigned MPI processes. This scheme reaches its limit when the number of MPI processes is equal to the number of orbitals in the simulation. To equally distribute the orbitals, the number of processors must be a factor (divisor) of the number of orbitals. If this is not the case, the orbital distribution is not optimal, but BigDFT tries to balance the load over the processors. For example, if we have 5 orbitals and 4 processors, the orbitals will have the distribution: 2/1/1/1.
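As an illustration of this balancing (only of the counting, not necessarily of the exact assignment performed by BigDFT), the following bash sketch, with hypothetical values for the number of orbitals and of MPI processes, computes how the orbitals would be shared:

#!/bin/bash
# Hypothetical example: share norb orbitals among nproc MPI processes.
norb=120    # number of orbitals of the B80 system
nproc=16    # chosen number of MPI processes (not a divisor of 120)
base=$((norb / nproc)) ; extra=$((norb % nproc))
echo "$extra processes treat $((base + 1)) orbitals, $((nproc - extra)) processes treat $base orbitals"

With 5 orbitals and 4 processes this gives the 2/1/1/1 distribution mentioned above.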

Within each orbital, BigDFT uses OpenMP for the parallelization, provided that this functionality has been activated by the proper compiler option. For a given MPI task, most of the operations in BigDFT are OpenMP parallelized, and they are indeed sped up very well.

When a part of the code is parallelized by OpenMP, the corresponding work is performed in parallel by a number of OpenMP threads. To obtain optimal behaviour, each MPI process must be assigned a number of cores (or CPUs), one core per thread, which will execute its OpenMP threads. If we plan to use, for example, OpenMP with 6 threads, we have to assign at least 6 cores (CPUs) to each MPI process.
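For instance (a hypothetical setting, not tied to a particular machine or submission script), running with 6 OpenMP threads per MPI process means reserving at least 6 cores per process:

export OMP_NUM_THREADS=6    # 6 OpenMP threads per MPI process
# at least 6 cores (cpus) must then be reserved for each MPI task,
# e.g. with --cpus-per-task=6 in a SLURM submission script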

Lesson details

Material for the lesson

In this lesson, we will explain how to run BigDFT with MPI and/or OpenMP. We will show some examples of how to interpret the scaling of BigDFT with the number of MPI processes and OpenMP threads. We will also provide a small bash script to extract total or partial times. For this purpose, we will run BigDFT jobs for a particular system, a boron cluster with 80 atoms and 120 orbitals, shown in the figure at the top left. The files which can be used are the following: the atomic positions posinp.xyz, the input parameters input.dft, and the job submission script go.sub.

Job submission script

Since the format of the posinp.xyz and input.dft files has been discussed in previous lessons, we will only have a look at the script for submitting a BigDFT job (go.sub) with MPI and OpenMP. The given example is valid for the NICS Keeneland machine, but it is of course not general. For example, on Todi, a Swiss Cray XE6, it looks like:

#!/bin/sh
#SBATCH --job-name="handson"
#SBATCH --ntasks=12            # total number of MPI tasks
#SBATCH --ntasks-per-node=6    # MPI tasks per node
#SBATCH --cpus-per-task=2      # cores per MPI task, one core per OpenMP thread
#SBATCH --time=00:19:00
export OMP_NUM_THREADS=2       # number of OpenMP threads per MPI task
cd /users/huantd/handon
aprun -n 12 -N 6 -d 2 bigdft | tee Out_mpi12_omp2   # -n tasks, -N tasks per node, -d threads per task

In general, the parameters which are relevant to the MPI and OpenMP parallelization are similar on other machines: the total number of MPI tasks, the number of tasks per node, the number of cores reserved for each task, and the OMP_NUM_THREADS environment variable which sets the number of OpenMP threads per task.
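As a sketch of a scaling study (hypothetical, modelled on the Todi example above), one could scan the number of OpenMP threads for a fixed number of MPI tasks, provided that the product of tasks per node and threads per task does not exceed the number of cores available on a node:

#!/bin/sh
# Hypothetical scan over the number of OpenMP threads for 12 MPI tasks, 6 per node.
for nthreads in 1 2 3 4 ; do
    export OMP_NUM_THREADS=$nthreads
    aprun -n 12 -N 6 -d $nthreads bigdft | tee Out_mpi12_omp${nthreads}
done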

After preparing the submission script, it is submitted with the syntax of the local queuing system, e.g.,

ccc_msub go.sub
or (in other systems)
qsub go.sub
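
Once submitted, the status of the job can be monitored with the status command of the local queuing system, e.g. (depending on the machine):

squeue -u $USER    # SLURM-based systems
qstat -u $USER     # PBS-based systems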

Timing data to be analyzed

The timing data of a BigDFT job is saved in the file time.yaml. If you resubmit the run, the new content is appended at the bottom of the pre-existing file, so that the entire run history can be recovered. If other data have to be written, this file can be found in the data-*/ directory. A quick look into the timing file reveals the time needed for the different classes of operations performed in the run. Below you can find an example for the counter associated to the wavefunction optimization of a B80 run (the WFN_OPT counter):

WFN_OPT:          #     % ,  Time (s), Max, Min Load (relative) 
  Classes:  
    Communications: [  17.8,  1.94E+01,  1.29,  0.90]
    Convolutions:   [  63.7,  6.95E+01,  1.02,  0.97]
    Linear Algebra: [   2.7,  3.00E+00,  1.02,  0.98]
    Other:          [   8.4,  9.18E+00,  1.11,  0.90]
    Potential:      [   5.1,  5.55E+00,  1.28,  0.07]
    Initialization: [   0.0,  0.00E+00,  0.00,  0.00]
    Finalization:   [   0.0,  0.00E+00,  0.00,  0.00]
    Total:          [  97.7,  1.09E+02,  1.00,  1.00]

In this lesson, we will analyze the five most time-consuming classes: Communications, Convolutions, Linear Algebra, Potential, and the remaining operations (Other, essentially nonlocal PSP applications).
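
The small bash script mentioned above for extracting total or partial times is not reproduced here; a minimal sketch of such an extraction (assuming the time.yaml layout shown above and a hypothetical script name extract_time.sh) could be:

#!/bin/bash
# Hypothetical sketch: print the time (in seconds) spent in a given class,
# one value per matching entry (counter and run) recorded in the timing file.
# Usage: ./extract_time.sh Convolutions [path/to/time.yaml]
#        ./extract_time.sh "Linear Algebra"
class=$1
file=${2:-time.yaml}
grep "^[[:space:]]*${class}:[[:space:]]*\[" "$file" | awk -F'[][,]' '{print $3+0}'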

Steps to be done

Discussions

The speedup of BigDFT on Keeneland is provided, as an example, in the figures on the bottom right of this page. Lots of information can be extracted from these curves.
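For instance, the speedup and the parallel efficiency of a run can be computed from the total times of the WFN_OPT counter of two runs; the numbers below are hypothetical and serve only to show the arithmetic:

# Hypothetical values: t_ref obtained on n_ref cores, t_par on n_par cores.
t_ref=109.0 ; n_ref=12
t_par=32.0  ; n_par=48
echo "$t_ref $t_par $n_ref $n_par" | \
  awk '{s = $1/$2; e = s/($4/$3); printf "speedup = %.2f, efficiency = %.2f\n", s, e}'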

Figures: intranode and internode behaviour of the B80 run on Keeneland.