
Commit 8bd74c0

Documentation updates for Frontier/Crusher installation and usage, including use of cxi transport
Documented new single-launch script for services and driver components
1 parent 5fcc994 commit 8bd74c0

2 files changed

Lines changed: 162 additions & 41 deletions


sphinx/source/install_usage/install.rst

Lines changed: 96 additions & 36 deletions
@@ -43,10 +43,14 @@ Once installed, the unit and integration tests can be run as:

A note on libfabric providers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We recommend using the system-installed version of libfabric wherever possible. However, if a Spack-based manual installation is required, please read this section.

The Mercury library used for the provenance database requires a libfabric provider that supports the **FI_EP_RDM** endpoint. By default Spack installs libfabric with the **sockets**, **tcp** and **udp** providers, of which only **sockets** supports this endpoint. However, **sockets** is being deprecated because its performance is worse than that of dedicated providers. For most purposes we recommend installing the **rxm** utility provider alongside **tcp**, by appending the Spack spec with :code:`^libfabric fabrics=sockets,tcp,rxm`.

For network hardware supporting the Linux Verbs API (such as InfiniBand) the **verbs** provider (with **rxm**) may provide better performance. This can be added to the spec as, for example, :code:`^libfabric fabrics=sockets,tcp,rxm,verbs`.

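As an illustration only (the top-level package name here follows the Chimbuko Spack packages referenced later in these instructions), the provider list is appended to the install spec on the command line:

.. code:: bash

   # Sketch: build Chimbuko with libfabric providers suitable for the tcp/rxm path
   spack install chimbuko-performance-analysis ^libfabric fabrics=sockets,tcp,rxm

   # Or, on Verbs-capable hardware such as InfiniBand
   spack install chimbuko-performance-analysis ^libfabric fabrics=sockets,tcp,rxm,verbs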

For Slingshot networks (e.g. on Frontier/Crusher), the **cxi** provider may provide better performance. However, manual installation of libfabric with cxi does not appear to be possible because the provider is closed source; we therefore recommend using the system installation of libfabric on these machines.

Details of how to choose the libfabric provider used by Mercury can be found :ref:`here <online_analysis>`. For further information consult the `Mercury documentation <https://mercury-hpc.github.io/documentation/#network-abstraction-layer>`_ .

Integrating with system-installed MPI
@@ -79,103 +83,159 @@ Chimbuko can be built without MPI by disabling the **mpi** Spack variant as foll
When used in this mode the user is responsible for manually assigning a "rank" index to each instance of the online AD module, and also for ensuring that an instance of this module is created alongside each instance or rank of the target application (e.g. using a wrapper script that is launched via mpirun). We discuss how this can be achieved :ref:`here <non_mpi_run>`.

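As a minimal sketch of such a wrapper (the AD module command and its arguments are placeholders, and the rank environment variables assume an Open MPI- or MPICH-style launcher):

.. code:: bash

   #!/bin/bash
   # Hypothetical per-rank wrapper, launched via mpirun: derive a rank index from the
   # launcher environment, start an AD instance with that index, then run the application.
   rank=${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-0}}
   <AD MODULE> <AD OPTIONS> -rank ${rank} &
   <YOUR APPLICATION> <YOUR ARGUMENTS>
   wait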
Frontier/Crusher
~~~~~~~~~~~~~~~~

In the PerformanceAnalysis source we also provide a Spack environment yaml for use on Frontier/Crusher, :code:`spack/environments/frontier.yaml` (the same installation and environment can be used on both machines). This environment is designed for the AMD programming environment with ROCm 5.2.0. Installation instructions follow.

First, download the Chimbuko and Mochi repositories:

.. code:: bash

   git clone https://github.com/mochi-hpc/mochi-spack-packages.git
   git clone https://github.com/CODARcode/PerformanceAnalysis.git

Copy the file :code:`spack/environments/frontier.yaml` from the PerformanceAnalysis git repository to a convenient location and edit the paths in the :code:`repos` section to point to the paths at which you downloaded the repositories, e.g.:

.. code:: yaml

   repos:
   - /autofs/nccs-svm1_home1/ckelly/install/mochi-spack-packages
   - /autofs/nccs-svm1_home1/ckelly/src/AD/PerformanceAnalysis/spack/repo/chimbuko

This environment uses the following modules, which must be loaded prior to installation and running:

.. code:: bash

   module reset
   module load PrgEnv-amd/8.3.3
   module swap amd amd/5.2.0
   module load cray-python/3.9.13.1
   module load cray-mpich/8.1.25
   module load gmp/6.2.1
   module load craype-accel-amd-gfx90a
   module unload darshan-runtime
   export LD_LIBRARY_PATH=/opt/gcc/mpfr/3.1.4/lib:$LD_LIBRARY_PATH

   # For some reason not set by the cray-mpich module?
   export PATH=${CRAY_MPICH_PREFIX}/bin:${PATH}
   export PATH=${ROCM_COMPILER_PATH}/bin:${PATH}

To install the environment:

.. code:: bash

   spack env create my_chimbuko_env frontier.yaml
   spack env activate my_chimbuko_env
   spack install

To load the environment:

.. code:: bash

   # For some reason not set by the cray-mpich module?
   export PATH=${CRAY_MPICH_PREFIX}/bin:${PATH}
   export PATH=${ROCM_COMPILER_PATH}/bin:${PATH}

   export LD_LIBRARY_PATH=/opt/gcc/mpfr/3.1.4/lib:$LD_LIBRARY_PATH

   # Spack doesn't appear to pick up the cray-xpmem pkg-config location; add it at the end so it is only used as a last resort
   export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/usr/lib64/pkgconfig

   spack env activate my_chimbuko_env
   spack load tau chimbuko-performance-analysis chimbuko-visualization2

GPU support for TAU C++ compilers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While the above installation includes TAU and its support for the ROCm runtime API for GPU tracing, the TAU compiler wrappers it builds do not call the ROCm compiler **hipcc** and are therefore unable to instrument mixed C++ and HIP codes. As a workaround, we recommend manually building TAU against the Spack-built dependencies as follows.

Clone the TAU git repository in a new directory:

.. code:: bash

   git clone https://github.com/UO-OACISS/tau2.git

Load the Spack environment and create a configuration script (e.g. *config.sh*) with the following content:

.. code:: bash

   #!/bin/bash
   # Recover the TAU configure command used by spack from its build log
   new_inst=$(pwd)/install #or change to preferred install directory
   tau_inst=$(spack location -i tau)
   spack_conf=$(grep ./configure ${tau_inst}/.spack/spack-build-out.txt | awk '{$1=$2=""; print $0}')

   # Swap the compilers for the ROCm toolchain and point the install prefix at the new location
   spack_conf=$(echo $spack_conf | sed 's/-c++=clang++/-c++=hipcc/' | sed -E "s|-prefix=[^']+'|-prefix=${new_inst}'|")
   spack_conf=$(echo $spack_conf | sed 's/-cc=clang/-cc=amdclang/')
   spack_conf=$(echo $spack_conf | sed 's/-fortran=flang/-fortran=amdflang/')

   # Drop the original -useropt and -rocprofiler options, and take the MPI paths from the cray-mpich module
   spack_conf=$(echo $spack_conf | sed -E "s|'-useropt=([^']+)'||")
   spack_conf=$(echo $spack_conf | sed -E "s|-rocprofiler=([^']+)'||")
   spack_conf=$(echo $spack_conf | sed -E "s|-mpiinc=([^']+)'|-mpiinc=\${MPICH_DIR}/include'|")
   spack_conf=$(echo $spack_conf | sed -E "s|-mpilib=([^']+)'|-mpilib=\${MPICH_DIR}/lib'|")

   # Strip the remaining quotes and append OpenMP tools support plus our own -useropt flags
   spack_conf=$(echo $spack_conf | sed "s/'//g")
   spack_conf="${spack_conf} -ompt -useropt=-g#-O2#-DTAU_MPI_DISABLE_COMM_WRAPPERS"

   echo $spack_conf | tee conf_cmd.log

   # Configure and build TAU, logging the output
   eval "$spack_conf 2>&1 | tee conf.log"
   make install 2>&1 | tee build.log

Executing this script will build and install TAU in the *install* subdirectory of the working directory. Finally, add the TAU installation path to the Linux environment:

.. code:: bash

   export PATH=$(pwd)/install/craycnl/bin:${PATH}
   export LD_LIBRARY_PATH=$(pwd)/install/craycnl/lib:${LD_LIBRARY_PATH}
   export TAU_MAKEFILE=$(pwd)/install/craycnl/lib/Makefile.tau-rocm-roctracer-amd-clang-papi-ompt-mpi-pthread-pdt-openmp-adios2

The **tau_cxx.sh** wrapper script will now wrap the *hipcc* compiler.
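
As a quick check (the source file name here is purely illustrative), a mixed C++/HIP code can now be compiled with the instrumenting wrapper:

.. code:: bash

   # Hypothetical example: compile a HIP-enabled C++ source through the TAU wrapper
   tau_cxx.sh -o my_app my_app.cpp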

Summit
~~~~~~

While the above instructions are sufficient for building Chimbuko on Summit, it is advantageous to make use of the pre-existing modules for many of the dependencies. For convenience we provide a Spack **environment** which can be used to install Chimbuko in a self-contained environment using various system libraries. To install, first download the Chimbuko and Mochi repositories:

.. code:: bash

   git clone https://github.com/mochi-hpc/mochi-spack-packages.git
   git clone https://github.com/CODARcode/PerformanceAnalysis.git

Copy the file :code:`spack/environments/summit.yaml` from the PerformanceAnalysis git repository to a convenient location and edit the paths in the :code:`repos` section to point to the paths at which you downloaded the repositories:

.. code:: yaml

   repos:
   - /autofs/nccs-svm1_home1/ckelly/install/mochi-spack-packages
   - /autofs/nccs-svm1_home1/ckelly/src/AD/PerformanceAnalysis/spack/repo/chimbuko

This environment uses the :code:`gcc/9.1.0` and :code:`cuda/11.1.0` modules, which must be loaded prior to installation and running:

.. code:: bash

   module load gcc/9.1.0 cuda/11.2.0

Then simply create a new environment and install:

.. code:: bash

   spack env create my_chimbuko_env summit.yaml
   spack env activate my_chimbuko_env
   spack install

Once installed, simply

.. code:: bash

   spack env activate my_chimbuko_env
   spack load tau chimbuko-performance-analysis chimbuko-visualization2

after loading the modules above.


.. _ADIOS2: https://github.com/ornladios/ADIOS2

sphinx/source/install_usage/run_chimbuko.rst

Lines changed: 66 additions & 5 deletions
@@ -175,16 +175,16 @@ which can be used as follows:
Running on Slurm-based systems
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this section we provide specifics on launching on machines that use the Slurm task scheduler.

To control the explicit placement of the ranks we will use the :code:`--nodelist` (:code:`-w`) Slurm option to specify the nodes associated with a resource set, the :code:`--nodes` (:code:`-N`) option to specify the number of nodes, and the :code:`--overlap` option to allow the AD and application resource sets to coexist on the same node. These options are documented `here <https://slurm.schedmd.com/srun.html>`_.

The :code:`--nodelist` option requires the range of full hostnames of the nodes to be provided. For Crusher/Frontier and Spock we provide perl scripts in the appropriately named subdirectories `here <https://github.com/CODARcode/PerformanceAnalysis/blob/ckelly_develop/scripts>`_ . These scripts parse the **SLURM_JOB_NODELIST** environment variable and generate the nodelists for the services and the application; they differ only in the node naming convention of each machine. To use:

.. code:: bash

   service_node=$(path_to_script/get_nodes.pl HEAD)
   body_nodelist=$(path_to_script/get_nodes.pl BODY)

We can now set the various :code:`<LAUNCH ..>` commands in the section above:

@@ -203,6 +203,30 @@ Where

Note that we have assigned 1 core to each rank of the AD, and so :code:`${n_mpi_ranks_per_node} * (${n_cores_per_rank_main} + 1)` should not exceed 64, the number of available cores.

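As a purely illustrative example of this constraint:

.. code:: bash

   # Hypothetical values: 8 application ranks per node, each assigned 7 cores
   n_mpi_ranks_per_node=8
   n_cores_per_rank_main=7
   # 8 * (7 + 1) = 64, which exactly fills the 64 available cores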

Running with the CXI network provider on Frontier/Crusher
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Frontier/Crusher and other machines with the Cray HPE Slingshot network support an optimized communications provider, **cxi**. Using it requires a few extra steps when running Chimbuko, in order to allow the Mochi (provenance database) components to communicate between processes launched under different calls to *srun* (i.e. between our services and clients).

First, add the following Slurm options to your batch script header section:

.. code:: bash

   #SBATCH --network=single_node_vni,job_vni

Then, in *chimbuko_config.sh*, set the following options (in addition to any other optional arguments):

.. code:: bash

   provdb_engine="cxi"
   provdb_extra_args="-db_mercury_auth_key 0:0"
   commit_extra_args="-provdb_mercury_auth_key 0:0" #add this variable if it doesn't yet exist in the setup script
   pserver_extra_args="-provdb_mercury_auth_key 0:0"
   ad_extra_args="-provdb_mercury_auth_key 0:0"

Alternatively, if Chimbuko's services and online AD components are launched together using the new, experimental launch procedure (see below), it is only necessary to set the *provdb_engine* option.

Scaling to large job sizes
^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -284,6 +308,43 @@ Online analysis of a non-MPI application with a non-MPI installation of Chimbuko

In the context of a non-MPI application, instances of the application must still be associated with an index within Chimbuko that allows for their discrimination. This proceeds much as in the previous section, but with a catch: by default Chimbuko assumes that the instance index passed in by the **-rank <rank>** option matches the rank index reflected by the trace data and the ADIOS trace filename produced by Tau. However for a non-MPI application, Tau assigns rank 0 to **all instances**. In order to communicate this to Chimbuko a second command line option must be used: **-override_rank 0**. Here the 0 tells Chimbuko that the input data is labeled as 0 in both the filename and the trace data. Chimbuko will then overwrite the rank index in the trace data to match that of its internal rank index to ensure that this new label is passed through the analysis. Note that the user must make sure that each application instance is assigned either a different **TAU_ADIOS2_PATH** or **TAU_ADIOS2_FILE_PREFIX** otherwise the trace data files will overwrite each other.

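As a minimal sketch (the AD module command and its options are placeholders), two non-MPI application instances could be launched as:

.. code:: bash

   #!/bin/bash
   # Hypothetical example: give each instance its own trace directory and Chimbuko rank index
   for i in 0 1; do
       export TAU_ADIOS2_PATH=$(pwd)/traces_${i}
       mkdir -p ${TAU_ADIOS2_PATH}
       <YOUR APPLICATION> <YOUR ARGUMENTS> &
       <AD MODULE> <AD OPTIONS> -rank ${i} -override_rank 0 &
   done
   wait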

Launching Chimbuko's components together through a single script (advanced, experimental)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to simplify the launch procedure we are developing a script that simultaneously instantiates the Chimbuko services and the AD clients. At present we support only the Slurm task manager, and this feature is experimental.

To use it, download the *PerformanceAnalysis* source. Then, in the run script, remove the two separate calls to *srun* and the lines associated with gathering the head and body node lists, and replace them with the following:

.. code:: bash

   tasks_per_node=<SETME>
   nodes=${SLURM_NNODES}
   app_nodes=$(( nodes - 1 ))
   app_tasks=$(( ${app_nodes} * ${tasks_per_node} ))
   stasks=$(( ${nodes} * ${tasks_per_node} ))

   srun -N ${nodes} -n ${stasks} --ntasks-per-node ${tasks_per_node} --overlap path_to_PerformanceAnalysis_source/scripts/launch/chimbuko.sh ${app_tasks} &

   # Wait until the server has started
   while [ ! -f chimbuko/vars/chimbuko_ad_cmdline.var ]; do sleep 1; done

where **tasks_per_node** is the number of application tasks per node that you will be launching. It is assumed that the total number of nodes in the allocation is one larger than the number of nodes on which the application is to be launched.
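
As a purely illustrative example of the arithmetic, with 4 allocated nodes and 8 tasks per node:

.. code:: bash

   # Hypothetical values: 4 nodes in the allocation, 8 application tasks per node
   tasks_per_node=8
   nodes=4                                             # value of ${SLURM_NNODES}
   app_nodes=$(( nodes - 1 ))                          # 3 nodes run the application
   app_tasks=$(( ${app_nodes} * ${tasks_per_node} ))   # 24 application tasks in total
   stasks=$(( ${nodes} * ${tasks_per_node} ))          # 32 tasks for the combined services+AD launch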

As the services are launched here on the *last* node, a simple call to srun for the application suffices to co-locate the application ranks with the AD instances:

.. code:: bash

   srun --overlap -N${app_nodes} --ntasks-per-node=${tasks_per_node} <YOUR APPLICATION> <YOUR ARGUMENTS>

The *chimbuko.sh* script has an optional argument **--core_bind** to bind the AD processes to specific cores, which can be used alongside Slurm's binding options to ensure that the AD instances run on resources separate from the application. The argument is a comma-separated list of core indices *per task on any given node*, with the per-task lists themselves separated by colons (:). For example, with **${tasks_per_node}=8**,

.. code:: bash

   bnd="60,61:62,63:28,29:30,31:44,45:46,47:12,13:14,15"
   srun -N ${nodes} -n ${stasks} --ntasks-per-node ${tasks_per_node} --overlap ${rundir}/chimbuko.sh ${app_tasks} --core_bind ${bnd} &

will bind the first AD process on a node to cores 60 and 61, the second to cores 62 and 63, and so on.
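
The application's own placement is then handled with Slurm's binding options; as an illustrative sketch (the core indices are arbitrary and must be chosen to avoid those given to the AD), one might use:

.. code:: bash

   # Hypothetical example: map the 8 application ranks per node to cores not used by the AD
   srun --overlap -N${app_nodes} --ntasks-per-node=${tasks_per_node} \
        --cpu-bind=map_cpu:0,2,16,18,32,34,48,50 <YOUR APPLICATION> <YOUR ARGUMENTS>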

.. _benchmark_suite:
