It is often possible to run calculations side-by-side. One can request 64 processors from a supercomputer and run two VASP calculations simultaneously in the same PBS job. There are a fair number of steps to get this part of Pylada running:
Set up Pylada to run a single MPI program, as described above.
Set the Pylada configuration variable do_multiple_mpi_programs to True.
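For instance, assuming the standard Pylada configuration file mechanism (e.g. a ~/.pylada file read at startup; adapt the name and location to your installation), this is a one-line addition:

  # Allow Pylada to run several MPI programs side-by-side within one job.
  do_multiple_mpi_programs = True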
Set up figure_out_machines. This is a string containing a small Python script which Pylada runs at the start of a PBS/Slurm job to figure out the hostname of each machine allocated to the job. It is launched as an MPI program on all available cores, and for each core it should print out a line starting with “PYLADA MACHINE HOSTNAME”. By default, it is the following simple program:
  from socket import gethostname
  from mpi import gather, world
  hostname = gethostname()
  results = gather(world, hostname, 0)
  if world.rank == 0:
      for hostname in results:
          print "PYLADA MACHINE HOSTNAME:", hostname
  world.barrier()

Note
  This is one of two places where boost.mpi is used. By replacing this script with one that calls, say, mpi4py methods instead, one could remove the boost.mpi dependency from Pylada for most use cases.

It is important that this script prints to the standard output one line per core (not one line per machine).
The script is launched by pylada.process.mpi.create_global_comm().
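Following the note above, an mpi4py-based alternative might look like the sketch below. This is illustrative only and not part of the Pylada sources; it would be stored in the figure_out_machines string just like the default script:

  from socket import gethostname
  from mpi4py import MPI

  hostname = gethostname()
  # Gather every core's hostname on rank 0.
  results = MPI.COMM_WORLD.gather(hostname, root=0)
  if MPI.COMM_WORLD.rank == 0:
      # One line per core, as required.
      for hostname in results:
          print "PYLADA MACHINE HOSTNAME:", hostname
  MPI.COMM_WORLD.Barrier()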
The names of the machines determined in the previous step are stored in default_comm's pylada.process.mpi.Communicator.machines attribute. This is simply a dictionary mapping the hostnames determined above to the number of cores on each. It is possible, however, to modify default_comm after the figure_out_machines script has been launched and its results parsed. This is done via the method modify_global_comm(), which takes a pylada.process.mpi.Communicator instance on input and modifies it in-place. By default, this method does nothing.
On a Cray, one could set it up as follows:

  def modify_global_comm(comm):
      """ Modifies global communicator to work on a Cray.

          Replaces hostnames with the host number.
      """
      for key, value in comm.machines.items():
          del comm.machines[key]
          comm.machines[str(int(key[3:]))] = value

This would replace the hostnames with something aprun can use for MPI placement. modify_global_comm() is run once, at the beginning of a Pylada PBS/Slurm script.
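As a hypothetical illustration (not taken from the Pylada sources), on a Cray where hostnames look like "nid00123" the modifier above strips the prefix and leading zeroes, leaving the bare node numbers that aprun's -L option expects:

  class FakeComm(object):
      """ Minimal stand-in exposing only the 'machines' attribute used above. """
      def __init__(self, machines):
          self.machines = machines

  comm = FakeComm({"nid00123": 32, "nid00124": 32})
  modify_global_comm(comm)
  # Hostnames have been replaced by bare node numbers.
  assert comm.machines == {"123": 32, "124": 32}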
To test that the hostnames were determined correctly, one should copy the file “process/tests/globalcomm.py” somewhere, edit it, and launch it. The names of the machines should be printed out correctly, with the right number of cores:
  > cd testhere
  > cp /path/to/pylada/source/process/tests/globalcomm.py .
  > vi globalcomm.py
  # This is a PBS script.
  # Modify it so it can be launched.
  > qsub globalcomm.py
  # Then, when it finishes:
  > cat global_comm_out
  EXPECTED N=64 PPN=32
  FOUND n 64 ppn 32 placement ""
  MACHINES
  PYLADA MACHINE HOSTNAME hector.006 32
  PYLADA MACHINE HOSTNAME hector.006 32
  ...

The above is an example output. One should try launching this script on more than one node, with varying numbers of processes per node, and so forth.
At this point, Pylada knows the name of each machine participating in a PBS/Slurm job. It still needs to be told how to run an MPI job on a subset of these machines. This will depend on the actual MPI implementation installed on the machine. Please first read the manual for your machine’s MPI implementation.
Pylada takes care of MPI placement by formatting the mpirun_exe string appropriately. For this reason, mpirun_exe is expected to contain a “{placement}” tag which will be replaced with the correct value at runtime.
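For instance, a generic setting might look like the following. This is illustrative only: the exact command line depends on your MPI implementation, and {n} and {program} are assumed here to be the usual format keys for the core count and the program being launched:

  mpirun_exe = "mpirun -n {n} {placement} {program}"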
At runtime, just before the call to an external MPI program is placed, the method pylada.machine_dependent_call_modifier() is called. It takes three arguments: a dictionary with which to format the mpirun_exe string, a dictionary or pylada.process.mpi.Communicator instance containing information relating to MPI, and a dictionary containing the environment variables in which to run the MPI program. The first and second dictionaries are merged and used to format the mpirun_exe string. By default, this method creates a nodefile listing only those machines involved in the current job, and then sets “placement” to “-machinefile filename”, where filename is the path to that nodefile.
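A minimal sketch of such a modifier is given below, assuming a generic mpirun that accepts a -machinefile option. It only mimics the default behaviour described above; the actual default shipped with Pylada differs in its details (nodefile naming, cleanup, and so on):

  def machine_dependent_call_modifier(formatter=None, comm=None, env=None):
      """ Points mpirun at a nodefile restricted to this job's machines. """
      from tempfile import NamedTemporaryFile
      # Serial programs also pass through here; comm may then be None.
      if formatter is None or comm is None:
          return
      machines = getattr(comm, 'machines', {})
      if len(machines) == 0:
          formatter['placement'] = ""
          return
      # Write one line per core, a format most mpirun machinefiles accept.
      nodefile = NamedTemporaryFile(delete=False, suffix='.nodefile')
      for hostname, cores in machines.items():
          nodefile.write("{0}\n".format(hostname) * cores)
      nodefile.close()
      formatter['placement'] = "-machinefile {0}".format(nodefile.name)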
On Crays, one could use the following:
  def machine_dependent_call_modifier(formatter=None, comm=None, env=None):
      """ Placement modifications for aprun MPI processes.

          aprun expects the machines (not cores) to be given on the
          command line as a list of "-Ln" with n the machine number.
      """
      from pylada import default_comm
      if formatter is None: return
      if len(getattr(comm, 'machines', [])) == 0:
          placement = ""
      elif sum(comm.machines.itervalues()) == sum(default_comm.machines.itervalues()):
          placement = ""
      else:
          l = [m for m, v in comm.machines.iteritems() if v > 0]
          placement = "-L{0}".format(','.join(l))
      formatter['placement'] = placement

Note that the above requires the pylada.modify_global_comm() setup described earlier, so that the machine names are the bare node numbers aprun understands.
Warning
All external program calls are routed through this function, whether or not they are MPI programs. Hence it is necessary to check whether the program is to be launched as an MPI program or not. In the case of serial programs, “comm” may be None.
The whole setup can be tested using “process/tests/placement.py”. This is a PBS job which performs MPI placement on a fake job. It should be copied somewhere, edited, and launched.
At least two arguments should be set prior to running this script; check the bottom of the script. “ppn” specifies the number of processors per node, and the job should be launched with 2*“ppn” cores. “path” should point to the source directory of Pylada, so that a small test program (pifunc) can be found and used for testing. The program can be compiled by Pylada by setting “compile_test” to True in cmake.
“placement.py” will launch several simultaneous instances of the “pifunc” program: one long calculation on three quarters of the allocated cores, and two smaller calculations on one eighth of the cores each.
One should check the output to make sure that the programs are running side-by-side (not all piled up on the same node), that they are running simultaneously, and that they run successfully (e.g. that mpirun does launch them).