[MITgcm-support] Performance monitoring in MITgcm

Constantinos Evangelinos ce107 at ocean.mit.edu
Tue May 9 13:30:41 EDT 2006


Quite belatedly, and as promised to some of you: to use the
performance monitoring described below one needs the latest CVS
updates for model/src/the_model_main.F, model/src/the_main_loop.F,
model/src/solve_for_pressure.F, eesup/src/timers.F and
pkg/ecco/the_main_loop.F (for ECCO users only).

For the supporting software, PAPI is installed on the ACESgrid machines (IA32 
and IA64), the HPM Toolkit should be available on most AIX SPs and I will 
make sure that there is a PAPI build on GFDL and NASA Altixes.
 
Constantinos
-- 
Dr. Constantinos Evangelinos
Department of Earth, Atmospheric and Planetary Sciences
Massachusetts Institute of Technology

-------------- next part --------------
Performance monitoring in MITgcm

Performance monitoring in MITgcm falls into two basic categories:
a) Performance information per code section
b) Performance information per timestep
Moreover, one may use external tools (profilers like prof, qprof, pat
etc.; command-line summary tools like perfex, hpm etc.; MPI profiling
tools like mpiP, Vampir etc.; and unified tools like Paradyn, Pablo
and TAU) to get further performance information down to the
subroutine or even line level (at ever increasing cost in terms of
time and inaccuracy introduced by the instrumentation). Some of the
solutions discussed below are orthogonal to such external tools;
others cannot co-exist with them.

0) Quick recipes:

a) To get time per timestep information:
	genmake2 -ts
b) To get MFlop/s and IPC per timestep with PAPI:
	genmake2 -papis
c) To get MFlop/s (and possibly more) per timestep with PCL:
	genmake2 -pcls
d) To get performance counter information per code section on IBMs:
	genmake2 -hpmt

1) Performance information per code section

1.1) Default monitoring:

By default MITgcm is set up to record user (time spent in user code),
system (time spent in kernel code on behalf of the user) and wallclock
(self-explanatory) time for code sections (usually calls to a
subroutine) marked by TIMER_START and TIMER_STOP calls. This method of
timing is rather heavyweight (and therefore not recommended for timing
code sections - or inner loops - that take very little time) as it
involves a search, using string matching on the name passed to the
routines as an argument, to locate the section being timed. You can
always add sections to be timed by introducing a pair of TIMER_START
and TIMER_STOP calls. The timing is cumulative over all calls to the
routines during the program's execution and the results are printed in
summary form at the very end. Please note that the adjoint compiler
TAF will strip the code of TIMER_* calls, so adjoint code will have
none left and only the top-level calls that TAF never sees will be
used for monitoring. This method of monitoring gives users a good idea
of where the code spent its time on a per-"important code section"
basis, as opposed to the per-subroutine basis that standard profiling
would provide. MPI codes will output this information on a per-process
basis.

Unfortunately for adjoint runs, TAF will kill this timing information,
and one would need to re-introduce the calls manually into
ad_taf_output.f to time more than the very basic code sections in
adjoint code.

Use of this type of monitoring does not conflict with use of any
external performance monitoring tool.

1.2) Performance counter information

1.2.1) Using HPM Toolkit

By using the "-hpmt" flag to genmake2 one can produce performance
counter information for each of the sections timed with the TIMER_*
calls. The HPM Toolkit for Power/AIX, Power/Linux, and Blue Gene
(http://www.research.ibm.com/actc/projects/hardwareperf.shtml) 
is a cross-platform performance counter toolkit that allows the
monitoring of performance counters in predefined groups (e.g. 61
groups for Power4). At the end of the program's execution two files
are created per process (per MPI process in the case of parallel
runs): a perfhpm*.* file contains an ASCII text summary report of the
run, and an hpm*_*_*.viz file contains the data that allow the
peekperf program to graphically display HPM output.

For more information on configuring HPM Toolkit usage please consult
the manual. Note that an early STOP statement will prevent any HPM
information from being produced. Keep in mind that if your code has
been compiled to stop on division by zero, the run may stop slightly
prematurely: the generation of the summary information at the end can
itself divide by zero when the timing resolution was too low to record
any time for a code section that executed very quickly.

HPM counter collection should not be expected to co-exist happily
with other external tools employing performance counters, or with the
PAPI/PCL options ("-papi", "-papis", "-pcl" or "-pcls").

1.2.2) Using PAPI

By using the "-papi" flag to genmake2 one can produce performance
counter information for each of the sections timed with the TIMER_*
calls. PAPI (http://icl.cs.utk.edu/papi/) is a cross-platform
performance counter library that allows the monitoring of performance
counters. At the end of the program's execution, alongside the timing
information usually produced for the program, the PAPI counter events
and their cumulative values for the duration of the timed code
sections are displayed.

To use PAPI for timing code sections one needs to provide a file
called data.papi in the MITgcm working directory whose format is as
follows:

4
PAPI_FP_INS
PAPI_TOT_CYC
PAPI_LST_INS
PAPI_L2_LDM

That is, the first line specifies how many events are to be monitored
and the following lines specify the events by their PAPI names. A
list of the events available on the platform can be obtained by
running the command "papi_avail | grep -i yes" (papi_avail is to be
found in the share/papi/utils subdirectory of a PAPI installation).
For more information please look at the PAPI documentation.
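As a sketch, such a data.papi can be written and given a quick
structural sanity check from the shell. This only checks that the
count on the first line matches the number of event lines; checking
that the events themselves are valid and countable together is the job
of the papi_events.F utility mentioned below:

```shell
# Write a data.papi listing the events to monitor (the example set above).
cat > data.papi <<'EOF'
4
PAPI_FP_INS
PAPI_TOT_CYC
PAPI_LST_INS
PAPI_L2_LDM
EOF

# Sanity check: the count on the first line must match the number of
# event lines that follow (no comment lines are allowed in the file).
declared=$(head -n 1 data.papi)
actual=$(tail -n +2 data.papi | grep -c '^PAPI_')
if [ "$declared" -eq "$actual" ]; then
    echo "data.papi OK: $actual events"
else
    echo "data.papi mismatch: declared $declared, found $actual" >&2
fi
```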

To help users generate a valid data.papi file, a utility can be found
in MITgcm_contrib/PAPI/papi_events.F that takes a data.papi file and
tests it for correctness, letting you know after which line the
problems begin. Please note that currently no comment lines are
allowed in the data.papi file.

Unfortunately for adjoint runs, TAF will kill this counter
information, and one would need to re-introduce the calls manually
into ad_taf_output.f to get counter information for more than the
very basic code sections in adjoint code.

PAPI counter collection should not be expected to co-exist happily
with other external tools employing performance counters, with the
HPM/PCL options ("-hpmt", "-pcl" or "-pcls"), or with PAPI
per-timestep information ("-papis").

Please read the FAQ on Floating point counts on the Pentium4 before
trying to interpret any Pentium4 numbers.
http://icl.cs.utk.edu/papi/faq/index.html#213
and
http://icl.cs.utk.edu/papi/faq/index.html#214

Note that an early STOP statement will prevent any PAPI counter
information from being produced. Keep in mind that if your code has
been compiled to stop on division by zero, the run may stop slightly
prematurely: the generation of the summary information at the end can
itself divide by zero when the timing resolution was too low to record
any time for a code section that executed very quickly.

1.2.3) Using PCL

By using the "-pcl" flag to genmake2 one can produce performance
counter information for each of the sections timed with the TIMER_*
calls. PCL (http://www.fz-juelich.de/zam/PCL/) is a cross-platform
performance counter library that allows the monitoring of performance
counters. At the end of the program's execution, alongside the timing
information usually produced for the program, the PCL counter events
and their cumulative values for the duration of the timed code
sections are displayed.

To use PCL for timing code sections one needs to provide a file
called data.pcl in the MITgcm working directory whose format is as
follows:

4
39
40
41
42

That is, the first line specifies how many events are to be monitored
and the following lines specify the events by their PCL numbers, to be
found in the PCL include file pclh.f. A list of the events available
on the platform can be obtained by running the PCL command "hpm -s".
For more information please look at the PCL documentation.

To help users generate a valid data.pcl file, a utility can be found
in MITgcm_contrib/PCL/pcl_events.F that takes a data.pcl file and
tests it for correctness, letting you know which lines cause problems
(given the success of the previous lines - remember that not all
events can be counted together). Please note that currently no comment
lines are allowed in the data.pcl file.
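The same kind of quick structural check applies to data.pcl; here the
event lines must be plain integers. As before, this is only a sketch
of a count/format check - validating the events themselves is what the
pcl_events.F utility is for:

```shell
# Write a data.pcl: the first line is the event count, the following
# lines are PCL event numbers (the example numbers above; see pclh.f).
cat > data.pcl <<'EOF'
4
39
40
41
42
EOF

# Sanity check: the declared count must match the number of event
# lines, and every event line must be a plain integer (no comments).
declared=$(head -n 1 data.pcl)
actual=$(tail -n +2 data.pcl | grep -c '^[0-9][0-9]*$')
if [ "$declared" -eq "$actual" ]; then
    echo "data.pcl OK: $actual events"
else
    echo "data.pcl mismatch: declared $declared, found $actual" >&2
fi
```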

Unfortunately for adjoint runs, TAF will kill this counter
information, and one would need to re-introduce the calls manually
into ad_taf_output.f to get counter information for more than the
very basic code sections in adjoint code.

PCL counter collection should not be expected to co-exist happily
with other external tools employing performance counters, with the
HPM/PAPI options ("-hpmt", "-papi" or "-papis"), or with PCL
per-timestep information ("-pcls").

The PAPI FAQ on Floating point counts on the Pentium4 also applies in
the case of PCL - it is assumed that only x87 flop/s are reported.

Note that an early STOP statement will prevent any PCL counter
information from being produced. Keep in mind that if your code has
been compiled to stop on division by zero, the run may stop slightly
prematurely: the generation of the summary information at the end can
itself divide by zero when the timing resolution was too low to record
any time for a code section that executed very quickly.

2) Performance information per timestep

2.1) Time per timestep

By using the "-ts" flag to genmake2 one can produce summary time
information (user, system and wallclock time) for each timestep.
Ideally user+system ~ wallclock; if user+system << wallclock one has
one of the following problems:

i) I/O problems reading or writing files (serial or parallel code)
ii) Competition with other processes on the same CPU (serial or
parallel code)
iii) Lack of memory causing extreme swapping effects (serial or
parallel code)
iv) Severe load imbalance (parallel code)
v) A very slow network where message transit time is so slow that
the process relinquishes control of the CPU while waiting for a
message to arrive. (parallel code)
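As a rough illustration of this diagnosis (with hypothetical sample
numbers, since the exact "-ts" output format is not reproduced here),
one can compute the ratio of CPU time to wallclock time for a
timestep and flag the suspicious case:

```shell
# Hypothetical user, system and wallclock seconds for one timestep.
user=1.20; system=0.05; wall=4.80

# Compute (user+system)/wallclock; a ratio well below 1 points to one
# of the problems listed above (I/O, contention, swapping, imbalance).
awk -v u="$user" -v s="$system" -v w="$wall" 'BEGIN {
    ratio = (u + s) / w
    printf "cpu/wallclock ratio: %.2f\n", ratio
    if (ratio < 0.5)
        print "warning: user+system << wallclock - investigate"
}'
```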

A part of this functionality is offered by Alistair's "runclock" package
but the focus of the latter is to offer a way to stop an execution
before a queuing system kills the model run.

The output ends up in the STDOUT stream and can be statistically
postprocessed using the mitgcm_time script in MITgcm_contrib/timing.
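The actual postprocessing is done by the mitgcm_time script; as a
sketch of the idea only, here is a minimal awk pass over hypothetical
per-timestep wallclock times computing some of the same summary
statistics:

```shell
# Hypothetical per-timestep wallclock times in seconds; in practice
# these would be extracted from the STDOUT stream of a "-ts" run.
printf '2.10\n2.05\n2.20\n2.15\n2.00\n' | awk '
{ n++; sum += $1; sumsq += $1 * $1
  if (n == 1 || $1 < min) min = $1
  if (n == 1 || $1 > max) max = $1 }
END {
  mean = sum / n
  sd = sqrt(sumsq / n - mean * mean)   # population standard deviation
  printf "n=%d mean=%.3f sd=%.3f min=%.2f max=%.2f\n", n, mean, sd, min, max
}'
```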

To bypass the fact that TAF will kill this timing information, use
the "-foolad" flag in addition to "-ts" when using the adjoint code.
Unfortunately the information connecting the timestep number with the
timing information is then lost, and if MITgcm is not set up to do a
pressure solve no timing information will be produced at all (while
"-ts" alone produces it in all cases).

Use of "-ts" does not conflict with use of any external performance
monitoring tool or the default per section monitoring in MITgcm.

2.2) Mflop/s (and other performance counter information) per timestep

2.2.1) Using PAPI

By using the "-papis" flag to genmake2 one can produce summary
Mflop/s as well as IPC (Instructions Per Cycle) information for each
timestep. The numbers are produced with respect to both user time and
wallclock time. The scripts mitgcm_mflops(2) and mitgcm_mflops_w(2)
(and mitgcm_ipc(2)/mitgcm_ipc_w(2)), to be found in
MITgcm_contrib/timing, provide an easy way to do a statistical
analysis (arithmetic mean, standard deviation, geometric mean,
minimum and maximum) of this type of output. (The versions ending in
2 are to be used for adjoint runs.) Once again, a vast disparity
between the user-time and wallclock-time based figures points to
problems such as the ones identified above. One should aim to achieve
above 5% of the peak Mflop/s of a given platform; MITgcm has been
known to achieve 10% or even better in some cases.
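The 5% rule of thumb is simple arithmetic; a sketch with hypothetical
numbers (320 Mflop/s measured, against an assumed 6000 Mflop/s
theoretical peak for the platform):

```shell
# Percentage of theoretical peak achieved (both numbers hypothetical).
awk -v measured=320 -v peak=6000 'BEGIN {
    pct = 100 * measured / peak
    printf "%.1f%% of peak\n", pct
    msg = (pct >= 5.0) ? "above the 5% rule of thumb" \
                       : "below 5% of peak - worth investigating"
    print msg
}'
```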

To bypass the fact that TAF will kill this performance counter
information, use the "-foolad" flag in addition to "-papis" when
using the adjoint code. Unfortunately the information connecting the
timestep number with the counter information is then lost, and if
MITgcm is not set up to do a pressure solve no counter information
will be produced at all (while "-papis" alone produces it in all
cases).

Successful compilation using "-papis" requires that the locations of
the PAPI include and library directories be specified in the optfile,
in the variables PAPIINC and PAPILIB respectively. Most
supercomputing centers (at least in the USA) should have a version of
PAPI installed for one of the many supported platforms.

A further note, taken from the PAPI manual:

"Note that on many platforms there may be subtle differences between
floating point instructions and operations. Instructions are typically
those execution elements most directly measured by the hardware
counters. They may include floating point load and store instructions,
and may count instructions such as FMA as one, even though two
floating point operations have occurred. Consult the hardware
documentation for your system for more details. Operations represent a
derived value where an attempt is made, when possible, to more closely
map to the expected definition of a floating point event."

In our case we try to report what PAPI provides as Flop/s; if that is
not available on a given platform we try to use Flip/s (floating
point instructions per second) instead.

Moreover please read the FAQ on Floating point counts on the Pentium4
before trying to interpret any Pentium4 numbers:
http://icl.cs.utk.edu/papi/faq/index.html#213
and
http://icl.cs.utk.edu/papi/faq/index.html#214

PAPI counter collection should not be expected to co-exist happily
with other external tools employing performance counters, with the
HPM/PCL options ("-hpmt", "-pcl" or "-pcls"), or with PAPI code
section information ("-papi").

2.2.2) Using PCL

By using the "-pcls" flag to genmake2 one can produce summary Mflop/s
and possibly IPC (Instructions Per Cycle), level one (L1) and level
two (L2) cache miss rates, and floating point to memory operation
ratio information for each timestep. Depending on the platform, its
available number of hardware performance counters and the allowed
combinations of countable hardware events, at least one of the
aforementioned quantities should be provided as user output (not
necessarily Mflop/s, as some platforms, such as systems based on the
AMD Athlon processor, do not have this capability). The numbers
produced are with respect to user+system time (which in a serial run
with exclusive use of the processor should be close to wallclock
time). The scripts mitgcm_mflops(2) and mitgcm_ipc(2), to be found in
MITgcm_contrib/timing, provide an easy way to do a statistical
analysis (arithmetic mean, standard deviation, geometric mean,
minimum and maximum) of this type of output. (The versions ending in
2 are to be used for adjoint runs.) One should aim to achieve above
5% of the peak Mflop/s of a given platform; MITgcm has been known to
achieve 10% or even better in some cases.

To bypass the fact that TAF will kill this performance counter
information, use the "-foolad" flag in addition to "-pcls" when using
the adjoint code. Unfortunately the information connecting the
timestep number with the counter information is then lost, and if
MITgcm is not set up to do a pressure solve no counter information
will be produced at all (while "-pcls" alone produces it in all
cases).

Successful compilation using "-pcls" requires that the locations of
the PCL include and library directories be specified in the optfile,
in the variables PCLINC and PCLLIB respectively. Many supercomputing
centers (especially in Germany) should have a version of PCL
installed for one of the many supported platforms.

In our case we report what PCL provides as Flop/s unless it is not
available on a given platform. The question of whether what is
provided is an accurate representation of floating point operations
per second, and not, say, some Flop/s subset or Flip/s, is left to
the user to clarify by looking at the PCL documentation for the
particular platform.

PCL counter collection should not be expected to co-exist happily
with other external tools employing performance counters, with the
HPM/PAPI options ("-hpmt", "-papi" or "-papis"), or with PCL code
section information ("-pcl").

The PAPI FAQ on Floating point counts on the Pentium4 also applies in
the case of PCL - it is assumed that only x87 flop/s are reported.

