              

PRESTA MPI Bandwidth, Latency and Collective Benchmark v1.4 README File
***********************************************************************


Table of contents
=================

 o Code Description 
 o Benchmark Test Summary
 o Benchmark Test Parameters
 o Benchmark Test Descriptions
 o Building Instructions 
 o Files in this Distribution


Code Description 
=================

This benchmark consists of 3 tests intended to evaluate the
performance of inter-process communication with MPI.  The provided
applications test inter-process communication latency and bandwidth
for standard MPI message passing operations as well as timing for
collective operations.

These benchmarks are written in C with MPI message-passing primitives.


Benchmark Test Summary
======================

This section identifies each included test and defines the measurements
provided by each test.


com : 

This test measures total unidirectional and bidirectional
inter-process bandwidth for pairs of MPI processes, varying both
message size and number of concurrently communicating process pairs.
Intra-SMP or inter-SMP communication, as well as bandwidth out of node
and bisectional bandwidth, can be measured with proper allocation of
the processes.  This benchmark also includes a latency test.


glob :

This test times several collective MPI operations with various
datatypes, reduction operations, and root configurations.


globalop :

This test, also referred to as the Global-Op benchmark, is provided
for historical reference and times MPI operation loops with and
without a simulated compute phase for MPI_Barrier, MPI_Reduce,
MPI_Bcast, MPI_Allreduce, and MPI_Reduce/MPI_Bcast.



Benchmark Test Parameters
=========================

Operations Per Time Sample :

Timing results are obtained in each of these tests through the use of
the MPI_Wtime function.  To address variation in MPI_Wtime resolution,
all of the tests, with the exception of globalop, allow the
specification of the number of operations to be performed between
MPI_Wtime calls as a command-line argument.  The number of MPI_Wtime
clock ticks per measurement is provided in the output of each test.
This mechanism is provided as a direct means of tuning the test
behavior to the granularity of the MPI_Wtime function.

Note that for the com test, including its latency measurement, each
time sample will include the overhead of a barrier operation performed
by the included butterfly barrier.  This default behavior is a means of
ensuring concurrent communication during the test.  Use of the '-n'
command-line argument will suppress the use of the barrier.  When
using the default behavior, the effect of the barrier can be minimized
by using larger numbers of operations per sample.
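
The effect of the operation count on barrier overhead can be
illustrated with a small cost model.  This is only a sketch: it assumes
the barrier adds a fixed cost once per time sample, and the function
name and parameters are hypothetical, not taken from the benchmark
source.

```c
#include <assert.h>

/* Hypothetical cost model (not taken from the PRESTA source): one time
 * sample covers `ops` operations of `op_time` seconds each plus a single
 * barrier of `barrier_time` seconds.  Returns the per-operation time
 * that such a sample would report. */
double reported_op_time(double op_time, double barrier_time, int ops)
{
    assert(ops > 0);
    return (ops * op_time + barrier_time) / ops;
}
```

Under this model, a 10 us operation with a 50 us barrier would be
reported as about 15 us at 10 operations per sample, but as about
10.05 us at 1000 operations per sample.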


Participating Processes and Allocation : 

For the com test, the allocation of concurrently communicating process
pairs and the number of these pairs can be specified in several ways.
The intent of the test is to measure communication for the following
configurations:

1) Communication within an SMP, increasing the number of concurrently
communicating process pairs.

2) Communication between two SMPs, increasing the number of concurrently
communicating process pairs.

3) Communication between multiple SMPs, increasing the number of SMPs
with concurrently communicating process pairs.  

4) Communication between multiple SMPs, with active processes on each SMP,
increasing the number of communicating processes on each SMP.


By default, process pairs are formed by selecting one process from each
half of the rank IDs in MPI_COMM_WORLD.  The partner of a rank is:

  (MPI_COMM_WORLD Rank ID + Size of MPI_COMM_WORLD/2) % Size of MPI_COMM_WORLD
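
The partner expression above can be written as a small C function.
This is a minimal sketch: the function name is hypothetical, and it
assumes an even number of processes so that every rank has a distinct
partner.

```c
#include <assert.h>

/* Default partner rule quoted above: (rank + size/2) % size.
 * Assumes world_size is even; the name is illustrative only. */
int default_partner(int rank, int world_size)
{
    assert(world_size > 0 && world_size % 2 == 0);
    assert(rank >= 0 && rank < world_size);
    return (rank + world_size / 2) % world_size;
}
```

For a 16 process job this pairs rank 0 with rank 8 and rank 1 with
rank 9, and the mapping is symmetric: rank 8 maps back to rank 0.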

However, if the number of processes per SMP is provided, the processes
can be paired by nearest off-SMP rank through the use of the '-r'
flag.

The default behavior is to allocate process pairs incrementally based
on rank, beginning with 1 process pair.  For example, an iteration for
4 processes of a 16 process job with 8 processes per SMP would include
the rank pairs (0,8) and (1,9).  Test configuration #3 mentioned above
uses this allocation pattern and begins with 2*processes-per-SMP
active processes.

Since test configuration #4 increases the number of active processes
per SMP, the participating processes must be identified in a cyclic
manner.  Specifying the number of processes per SMP and the '-p c'
option results in cyclic allocation of the active processes. For
example, an iteration for 8 processes of a 32 process job with 8
processes per SMP would include rank pairs (0,16), (8,24), (1,17), and
(9,25).  Test configuration #4 starts with the number of communicating
processes equal to the number of SMPs and increases the process count
by the number of SMPs each iteration.
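
The two allocation patterns can be sketched as follows.  This is not
the benchmark's own code: the function name, its parameters, and the
index arithmetic are assumptions chosen to reproduce the rank pairs in
the two examples above, with default partnering and ranks numbered
consecutively within each SMP.

```c
#include <assert.h>

/* Hypothetical helper: fills first[] and second[] with npairs rank pairs.
 * cyclic == 0: block allocation, ranks 0,1,2,... of the lower half.
 * cyclic != 0: cyclic allocation, taking one rank from each SMP of the
 *              lower half before taking a second rank from any SMP.
 * Returns 0 on success, -1 if the arguments are inconsistent. */
int select_pairs(int world_size, int procs_per_smp, int npairs, int cyclic,
                 int *first, int *second)
{
    if (world_size <= 0 || world_size % 2 != 0 || procs_per_smp <= 0)
        return -1;
    int half = world_size / 2;
    if (npairs < 0 || npairs > half)
        return -1;
    if (cyclic && half % procs_per_smp != 0)
        return -1;
    int smps_in_half = cyclic ? half / procs_per_smp : 1;

    for (int i = 0; i < npairs; i++) {
        int rank = i;                      /* block: next rank in order */
        if (cyclic) {
            int smp  = i % smps_in_half;   /* which SMP of the lower half */
            int slot = i / smps_in_half;   /* which process on that SMP   */
            rank = smp * procs_per_smp + slot;
        }
        assert(rank < half);
        first[i]  = rank;
        second[i] = (rank + half) % world_size;  /* default partner rule */
    }
    return 0;
}
```

Calling select_pairs(32, 8, 4, 1, a, b) yields the pairs (0,16),
(8,24), (1,17), and (9,25), and select_pairs(16, 8, 2, 0, a, b) yields
(0,8) and (1,9), matching the examples in the text.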

The provided manage_targets.py script can generate alternative process
pairing source files which can be used with the pair source file
directory command-line option (-s).  The benchmark will iterate through
each file found in the provided directory.

The process counts can also be specified in a source
file with the '-f' option.  Each count must be on a separate line.


Message Sizes :

The com and glob tests take their message sizes either from a source
file or from the command-line arguments message size start, message
size stop and message size factor.  Please see the file message.sizes for
an example of the proper manner to specify message sizes and operation
iterations per measurement.  The start and stop command line arguments
are specified in bytes.  The message size factor is the value by which
the current message size is multiplied to derive the message size for
the next iteration.
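
The start/stop/factor rule can be sketched as a short loop.  The
function name is hypothetical, and two details are assumptions not
stated in this README: that the stop size itself is included, and that
the factor is an integer greater than 1 (2 is used in the example
below).

```c
#include <assert.h>
#include <stddef.h>

/* Writes the sequence start, start*factor, start*factor^2, ... into
 * sizes[], stopping once the next value would exceed stop or max_sizes
 * entries have been written.  Returns the number of sizes written. */
size_t message_sizes(size_t start, size_t stop, size_t factor,
                     size_t *sizes, size_t max_sizes)
{
    assert(start > 0 && factor > 1);
    size_t n = 0;
    for (size_t s = start; s <= stop && n < max_sizes; s *= factor) {
        sizes[n++] = s;
        if (s > stop / factor)   /* next size would pass stop; also */
            break;               /* avoids overflow in s *= factor  */
    }
    return n;
}
```

With the default start of 32 bytes and stop of 8388608 bytes, a factor
of 2 gives 19 message sizes: 32, 64, 128, and so on up to 8388608.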


Benchmark Test Descriptions 
===========================

com :

The "com" test is intended to illustrate at what point the
interconnect between communicating MPI processes becomes saturated for
both unidirectional and bidirectional communication.

As indicated above, com iterates over a range of message sizes for
each set of communicating process pairs.  The entire size/pair
iteration is performed first for unidirectional and then for
bidirectional communication.  The bandwidth and average operation time
are calculated based on the longest time sample of any of the
participating processes.

Intra- and inter-SMP performance can be evaluated based on process allocation.


com : Point-to-Point MPI bandwidth and latency benchmark

  syntax: com [OPTION]...

    -b [message start size]                        default=32
    -d [task target list source file]
    -e [message stop  size]                        default=8388608
    -f [process count source file]
    -h print use information
    -i print process pairs for each measurement    default=false
    -l print hostname information
    -m [message size source file]
    -n use barrier within measurement              default=no barrier
    -o [number of operations between measurements] default=10
    -p [allocate processes: c(yclic) or b(lock)]   default=b
    -q print test names
    -r partner processes with nearby rank          default=false
    -s [directory of task target source files]
    -t [processes per SMP]
    -v print individual rank times                 default=false
    -w '[list of full test names]'
    -x calculate BW by volume/longest task time    default=false


com Test Names :

  Unidirectional
  UnidirAsync
  Bidirectional
  BidirAsync
  Latency


glob :

glob : MPI collective operation benchmark

  syntax: glob [OPTION]...

    -b [message start size]                        default=32
    -e [message stop  size]                        default=8388608
    -f [process count source file]
    -m [message size source file]
    -o [number of operations between measurements]
    -w '[test list]'
    -h : print use information
    -l : print hostname information
    -n : do not use barrier within measurement     default=barrier not used
    -q : print test names
    -r : partner processes with nearby rank        default=false
    -v : print individual rank times               default=false


glob Test Names :

  Reduce:Double-SUM-RX      : MPI_Reduce, MPI_DOUBLE, MPI_SUM, Root cycles
  Reduce:Double-SUM-R0      : MPI_Reduce, MPI_DOUBLE, MPI_SUM, Root is rank 0
  Reduce:Float-SUM-R0       : MPI_Reduce, MPI_FLOAT, MPI_SUM, Root is rank 0
  Reduce:Float-SUM-RX       : MPI_Reduce, MPI_FLOAT, MPI_SUM, Root cycles
  Broadcast:Double          : MPI_Bcast, MPI_DOUBLE, Root is rank 0
  Reduce-Bcast:Double-MIN   : MPI_Reduce/MPI_Bcast, MPI_DOUBLE, MPI_MIN, Root is rank 0
  Allreduce:Double-MIN      : MPI_Allreduce, MPI_DOUBLE, MPI_MIN
  Barrier                   : MPI_Barrier




Building Instructions
=====================

Configure the Makefile to use the appropriate commands and flags as
required to compile MPI applications on the target system.


Files in this Distribution
==========================

Makefile
README
com.c
glob.c
globalop.c
manage_targets.py
message.sizes
util.c
util.h


Last modified on June 15, 2006 by Chris Chambreau
For information contact:
Chris Chambreau -- chcham@llnl.gov 

UCRL-CODE-2001-028
