Installation on specific platforms

Installation on specific platforms#

The following describes installation details for various systems and platforms that SmartSim may be used on.

Customizing environment variables#

Various environment variables can be used to control the compilers and dependencies for SmartSim. These are particularly important to set before the smart build step to ensure that the Orchestrator and machine-learning backends are compiled with the desired compilation environment.

Note

The compilation environment that SmartSim is compiled with does not necessarily have to be compatible with the SmartRedis library and the simulation application that will be launched by SmartSim. To ensure that this works as intended however, please be sure to set the correct environment for the simulation using the RunSettings.

All of the following environment variables must be exported to ensure that they are used throughout the entire build process. Additionally at runtime, the environment in which the Orchestrator is launched must have the cuDNN and CUDA Toolkit libraries findable by the link loader (e.g. available in the LD_LIBRARY_PATH environment variable).

Compiler environment#

Unlike SmartRedis, we strongly encourage users to only use the GNU compiler chain to build the SmartSim dependencies. Notably, RedisAI has some coding conventions that prevent the use of Intel compiler chain. If a specific compiler should be used (e.g. the Cray Programming Environment wrappers), the following environment variables will control the C and C++ compilers:

CC: Path to the C compiler
CXX: Path the C++ compiler

GPU dependencies (non-root)#

The Nvidia installation instructions for CUDA Toolkit and cuDNN tend to be tailored for users with root access. For those on HPC platforms where root access is rare, manually downloading and installing these dependencies as a user is possible.

wget https://developer.download.nvidia.com/compute/cuda/11.4.4/local_installers/cuda_11.4.4_470.82.01_linux.run
chmod +x cuda_11.4.4_470.82.01_linux.run
./cuda_11.4.4_470.82.01_linux.run --toolkit  --silent --toolkitpath=/path/to/install/location/

For cuDNN, follow Nvidia’s instructions, and copy the cuDNN libraries to the lib64 directory at the CUDA Toolkit location specified above.

OLCF Frontier#

Known limitations#

We are continually working on getting all the features of SmartSim working on Frontier, however we do have some known limitations:

For now, only Torch models are supported. If you need Tensorflow or ONNX support please contact us
All SmartSim experiments must be run from Lustre, _not_ your home directory
The colocated database will fail without specifying custom_pinning. This is because the default pinning assumes that processor 0 is available, but the ‘low-noise’ default on Frontier reserves the processor on each NUMA node. Users should pass a list of processor ids to the custom_pinning argument that avoids the reserved processors
The Singularity-based tests are currently failing. We are investigating how to interact with Frontier’s configuration. Please contact us if this is interfering with your application

Please raise an issue in the SmartSim Github or contact the developers if the above issues are affecting your workflow or if you find any other problems.

One-time Setup#

To install the SmartRedis and SmartSim python packages on Frontier, please follow these instructions, being sure to set the following variables

export PROJECT_NAME=CHANGE_ME

Step 1: Create and activate a virtual environment for SmartSim:

module load PrgEnv-gnu miniforge3 rocm/6.1.3

export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
conda create -n smartsim python=3.11
source activate smartsim

Step 2: Build the SmartRedis C++ and Fortran libraries:

cd $SCRATCH
git clone https://github.com/CrayLabs/SmartRedis.git
cd SmartRedis
make lib-with-fortran
pip install .

Step 3: Install SmartSim in the conda environment:

cd $SCRATCH
pip install git+https://github.com/CrayLabs/SmartSim.git

Step 4: Build Redis, RedisAI, the backends, and all the Python packages:

smart build --device=rocm-6

Step 5: Check that SmartSim has been installed and built correctly:

# Optimizations for inference
export MIOPEN_USER_DB_PATH="/tmp/${USER}/my-miopen-cache"
export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH
rm -rf $MIOPEN_USER_DB_PATH
mkdir -p $MIOPEN_USER_DB_PATH

# Run the install validation utility
smart validate --device gpu

The following output indicates a successful install:

[SmartSim] INFO Verifying Tensor Transfer
[SmartSim] INFO Verifying Torch Backend
16:26:35 login SmartSim[557020:MainThread] INFO Success!

Post-installation#

Before running SmartSim, the environment should match the one used to build, and some variables should be set to optimize performance:

# Set these to the same values that were used for install
export PROJECT_NAME=CHANGE_ME

module load PrgEnv-gnu miniforge3 rocm/6.1.3
source activate smartsim

# Optimizations for inference
export MIOPEN_USER_DB_PATH="/tmp/${USER}/my-miopen-cache"
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
rm -rf ${MIOPEN_USER_DB_PATH}
mkdir -p ${MIOPEN_USER_DB_PATH}

Binding DBs to Slingshot#

Each Frontier node has four NICs, which also means users need to bind DBs to four network interfaces, hsn0, hsn1, hsn2, hsn3. Typically, orchestrators will need to be created in the following way:

exp = Experiment("my_exp", launcher="slurm")
orc = exp.create_database(db_nodes=3, interface=["hsn0","hsn1","hsn2","hsn3"], single_cmd=True)

NERSC Perlmutter#

One-time Setup#

To install SmartSim on Perlmutter, follow these steps:

Step 1: Create and activate a conda environment for SmartSim:

module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
conda create -n smartsim python=3.11
conda activate smartsim

Step 2: Build the SmartRedis C++ and Fortran libraries:

git clone https://github.com/CrayLabs/SmartRedis.git
cd SmartRedis
make lib-with-fortran
pip install .
cd ..

Step 3: Install SmartSim in the conda environment:

pip install git+https://github.com/CrayLabs/SmartSim.git

Step 4: Build Redis, RedisAI, the backends, and all the Python packages:

smart build --device=cuda-12

Step 5: Check that SmartSim has been installed and built correctly:

smart validate --device gpu

The following output indicates a successful install:

[SmartSim] INFO Verifying Tensor Transfer
[SmartSim] INFO Verifying Torch Backend
[SmartSim] INFO Verifying ONNX Backend
[SmartSim] INFO Verifying TensorFlow Backend
16:26:35 login SmartSim[557020:MainThread] INFO Success!

Post-installation#

After completing the above steps to install SmartSim in a conda environment, you can reload the conda environment by running the following commands:

module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
conda activate smartsim

HPE Cray supercomputers#

On certain HPE Cray machines, the SmartSim dependencies have been installed system-wide though specific paths and names might vary (please contact the team if these instructions do not work).

module use -a /lus/scratch/smartsim/local/modulefiles
module load cudatoolkit/11.8 cudnn git-lfs

module unload PrgEnv-cray PrgEnv-intel PrgEnv-gcc
module load PrgEnv-gnu
module switch gcc/11.2.0

export CRAYPE_LINK_TYPE=dynamic

This should provide all the dependencies needed to build the GPU backends for the ML bakcends. Users can thus proceed with their preferred way of installing SmartSim either from PyPI or from source.

Cheyenne at NCAR#

Since SmartSim does not currently support the Message Passing Toolkit (MPT), Cheyenne users of SmartSim will need to utilize OpenMPI.

The following module commands were utilized to run the examples:

$ module purge
$ module load ncarenv/1.3 gnu/8.3.0 ncarcompilers/0.5.0 netcdf/4.7.4 openmpi/4.0.5

With this environment loaded, users will need to build and install both SmartSim and SmartRedis through pip. Usually we recommend users installing or loading miniconda and using the pip that comes with that installation.

$ pip install smartsim
$ smart build --device cpu  #(Since Cheyenne does not have GPUs)

To make the SmartRedis library (C, C++, Fortran clients), follow these steps with the same environment loaded.

# clone SmartRedis and build
$ git clone https://github.com/SmartRedis.git smartredis
$ cd smartredis
$ make lib

Summit at OLCF#

Since SmartSim does not have a built PowerPC build, the build steps for an IBM system are slightly different than other systems.

Luckily for us, a conda channel with all relevant packages is maintained as part of the OpenCE initiative. Users can follow these instructions to get a working SmartSim build with PyTorch and TensorFlow for GPU on Summit. Note that SmartSim and SmartRedis will be downloaded to the working directory from which these instructions are executed.

Note that the available PyTorch version (1.10.2) does not match the one expected by RedisAI 1.2.7 (1.11): it is still compatible and should work, but please open an issue on SmartSim’s GitHub repo if you run into problems.

# setup Python and build environment
export ENV_NAME=smartsim-0.8.0
git clone https://github.com/CrayLabs/SmartRedis.git smartredis
git clone https://github.com/CrayLabs/SmartSim.git smartsim
conda config --prepend channels https://ftp.osuosl.org/pub/open-ce/1.6.1/
conda create --name $ENV_NAME -y  python=3.9 \
                                  git-lfs \
                                  cmake \
                                  make \
                                  cudnn=8.1.1_11.2 \
                                  cudatoolkit=11.2.2 \
                                  tensorflow=2.8.1 \
                                  libtensorflow \
                                  pytorch=1.10.2 \
                                  torchvision=0.11.3
conda activate $ENV_NAME
export CC=$(which gcc)
export CXX=$(which g++)
export LDFLAGS="$LDFLAGS -pthread"
export CUDNN_LIBRARY=/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/
export CUDNN_INCLUDE_DIR=/ccs/home/$USER/.conda/envs/$ENV_NAME/include/
module load cuda/11.4.2
export LD_LIBRARY_PATH=$CUDNN_LIBRARY:$LD_LIBRARY_PATH:/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/python3.9/site-packages/torch/lib
module load gcc/9.3.0
module unload xalt
# clone SmartRedis and build
pushd smartredis
make lib && pip install .
popd

# clone SmartSim and build
pushd smartsim
pip install .

# install PyTorch and TensorFlow backend for the Orchestrator database.
export Torch_DIR=/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/python3.9/site-packages/torch/share/cmake/Torch/
export CFLAGS="$CFLAGS -I/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/python3.9/site-packages/tensorflow/include"
export SMARTSIM_REDISAI=1.2.7
export Tensorflow_BUILD_DIR=/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/python3.9/site-packages/tensorflow/
smart build --device=gpu --torch_dir $Torch_DIR --libtensorflow_dir $Tensorflow_BUILD_DIR -v

# Show LD_LIBRARY_PATH for future reference
echo "SmartSim installation is complete, LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

When executing SmartSim, if you want to use the PyTorch and TensorFlow backends in the orchestrator, you will need to set up the same environment used at build time:

module load cuda/11.4.2
export CUDNN_LIBRARY=/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/
export LD_LIBRARY_PATH=/ccs/home/$USER/.conda/envs/smartsim/lib/python3.8/site-packages/torch/lib/:$LD_LIBRARY_PATH:$CUDNN_LIBRARY
module load gcc/9.3.0
module unload xalt

Site Installation#

Certain HPE customer machines have a site installation of SmartSim. This means that users can bypass the smart build step that builds the ML backends and the Redis binaries. Users on these platforms can install SmartSim from PyPI or from source with the following steps replacing COMPILER_VERSION and SMARTSIM_VERSION with the desired entries.

module use -a /lus/scratch/smartsim/local/modulefiles
module load cudatoolkit/11.8 cudnn smartsim-deps/COMPILER_VERSION/SMARTSIM_VERSION
pip install smartsim
smart build --skip-backends --device gpu [--onnx]

Installation on specific platforms

Contents

Installation on specific platforms#

Customizing environment variables#

Compiler environment#

CUDA-related#

GPU dependencies (non-root)#

OLCF Frontier#

Known limitations#

One-time Setup#

Post-installation#

Binding DBs to Slingshot#

NERSC Perlmutter#

One-time Setup#

Post-installation#

HPE Cray supercomputers#

Cheyenne at NCAR#

Summit at OLCF#

Site Installation#