Installation on specific platforms¶
The following describes installation details for various systems and platforms that SmartSim may be used on.
Customizing environment variables¶
Various environment variables can be used to control the compilers and
dependencies for SmartSim. These are particularly important to set before the
smart build step to ensure that the Orchestrator and machine-learning
backends are compiled with the desired compilation environment.
Note that the compilation environment used to build SmartSim does not
necessarily have to be compatible with the SmartRedis library and the
simulation application that will be launched by SmartSim. To ensure
that this works as intended, however, be sure to set the correct
environment for the simulation when configuring how it is launched.
All of the following environment variables must be exported to ensure that
they are used throughout the entire build process. Additionally at runtime, the
environment in which the Orchestrator is launched must have the cuDNN and CUDA
Toolkit libraries findable by the link loader (e.g. available in the
LD_LIBRARY_PATH environment variable).
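One way to check the runtime requirement above before launching the Orchestrator is to scan LD_LIBRARY_PATH for the CUDA and cuDNN shared libraries. The sketch below is illustrative only: the helper name `find_libs` and the library name prefixes `libcudart` and `libcudnn` are assumptions, not part of SmartSim, and the link loader may also find libraries through other mechanisms that this check does not cover.

```python
import os


def find_libs(prefixes=("libcudart", "libcudnn"), ld_library_path=None):
    """Map each library prefix to the first LD_LIBRARY_PATH directory holding it.

    A prefix maps to None when no listed directory contains a matching file.
    The helper name and the default prefixes are illustrative assumptions.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    found = {prefix: None for prefix in prefixes}
    # Skip empty entries and directories that do not exist.
    for directory in filter(None, ld_library_path.split(os.pathsep)):
        if not os.path.isdir(directory):
            continue
        for name in os.listdir(directory):
            for prefix in prefixes:
                if found[prefix] is None and name.startswith(prefix):
                    found[prefix] = directory
    return found
```

Calling `find_libs()` with no arguments reads the current LD_LIBRARY_PATH; any prefix that maps to `None` was not found in the listed directories.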
Unlike for SmartRedis, we strongly encourage users to use only the GNU compiler chain to build the SmartSim dependencies. Notably, RedisAI has some coding conventions that prevent the use of the Intel compiler chain. If a specific compiler should be used (e.g. the Cray Programming Environment wrappers), the following environment variables will control the C and C++ compilers:
CC: Path to the C compiler
CXX: Path to the C++ compiler
GPU dependencies (non-root)¶
The Nvidia installation instructions for CUDA Toolkit and cuDNN tend to be tailored for users with root access. For those on HPC platforms where root access is rare, manually downloading and installing these dependencies as a user is possible.
    wget https://developer.download.nvidia.com/compute/cuda/11.4.4/local_installers/cuda_11.4.4_470.82.01_linux.run
    chmod +x cuda_11.4.4_470.82.01_linux.run
    ./cuda_11.4.4_470.82.01_linux.run --toolkit --silent --toolkitpath=/path/to/install/location/
For cuDNN, follow Nvidia’s instructions, and copy the cuDNN libraries to the lib64 directory at the CUDA Toolkit location specified above.
Frontier is an AMD CPU/AMD GPU system.
As of 2023-07-06, users can use the following instructions. However, we anticipate that all the SmartSim dependencies will eventually be available system-wide via the modules system.
We are continually working on getting all the features of SmartSim working on Frontier; however, we do have some known limitations:
For now, only Torch models are supported. We are working to find a recipe to install TensorFlow with ROCm support from scratch.
The colocated database will fail without specifying custom_pinning. This is because the default pinning assumes that processor 0 is available, but the ‘low-noise’ default on Frontier reserves the processor on each NUMA node. Users should pass a list of processor ids to the custom_pinning argument that avoids the reserved processors.
The Singularity-based tests are currently failing. We are investigating how to interact with Frontier’s configuration. Please contact us if this is interfering with your application.
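To illustrate the pinning limitation above, the following sketch builds a list of processor ids that skips reserved ones. The reserved ids used here (the first processor of each group of eight on a 64-processor node) are an assumption for illustration; check the actual reserved set on the system before relying on it. The helper `unreserved_cpus` is hypothetical, and the commented-out call only indicates where the resulting list would be passed as custom_pinning.

```python
def unreserved_cpus(total_cpus, reserved, count):
    """Return the first `count` ids in range(total_cpus) that are not reserved."""
    reserved = set(reserved)
    available = [cpu for cpu in range(total_cpus) if cpu not in reserved]
    if count > len(available):
        raise ValueError(
            f"requested {count} cpus, only {len(available)} are unreserved"
        )
    return available[:count]


# Assumption: 64 processors per node, with the first of every 8 reserved.
reserved = range(0, 64, 8)
pinning = unreserved_cpus(total_cpus=64, reserved=reserved, count=4)
print(pinning)  # [1, 2, 3, 4]

# The list would then be handed to the colocation call, for example:
# model.colocate_db_tcp(db_cpus=len(pinning), custom_pinning=pinning)
# (method name assumed here; consult the SmartSim API documentation)
```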
Please raise an issue on the SmartSim GitHub or contact the developers if the above issues are affecting your workflow or if you find any other problems.
To install the SmartRedis and SmartSim Python packages on Frontier, please follow these instructions, being sure to set the following variables:
    export PROJECT_NAME=CHANGE_ME
    export VENV_NAME=CHANGE_ME
Then continue with the install:
    module load PrgEnv-gnu-amd git-lfs cmake cray-python
    module unload xalt amd-mixed
    module load rocm/4.5.2
    export CC=gcc
    export CXX=g++
    export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
    export VENV_HOME=$SCRATCH/$VENV_NAME/
    python3 -m venv $VENV_HOME
    source $VENV_HOME/bin/activate
    pip install torch==1.11.0+rocm4.5.2 torchvision==0.12.0+rocm4.5.2 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/rocm4.5.2
    cd $SCRATCH
    git clone https://github.com/CrayLabs/SmartRedis.git
    cd SmartRedis
    make lib-with-fortran
    pip install .
    # Download SmartSim and site-specific files
    cd $SCRATCH
    git clone https://github.com/CrayLabs/site-deployments.git
    git clone https://github.com/CrayLabs/SmartSim.git
    cd SmartSim
    pip install -e .[dev]
Next, to finish the compilation, we need to manually modify one of the auxiliary CMake files that comes packaged with Torch:
    export TORCH_CMAKE_DIR=$(python -c 'import torch;print(torch.utils.cmake_prefix_path)')
    # Manual step: modify all references to the 'rocm' directory to rocm-4.5.2
    vim $TORCH_CMAKE_DIR/Caffe2/Caffe2Targets.cmake
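The manual edit above can also be scripted. The sketch below assumes that the references to change appear as the path component `/rocm/` (for example `/opt/rocm/lib/...`), which may not hold for every Torch build, so inspect the file before and after. The function names `pin_rocm_version` and `patch_file` are hypothetical.

```python
import re
from pathlib import Path


def pin_rocm_version(text, version="4.5.2"):
    """Rewrite unversioned '/rocm/' path components to '/rocm-<version>/'.

    Paths that are already versioned, such as '/opt/rocm-4.5.2/', are untouched.
    """
    return re.sub(r"/rocm(?=/)", f"/rocm-{version}", text)


def patch_file(path, version="4.5.2"):
    """Apply pin_rocm_version to a file in place, keeping a '.bak' copy."""
    path = Path(path)
    original = path.read_text()
    path.with_name(path.name + ".bak").write_text(original)
    path.write_text(pin_rocm_version(original, version))


sample = "/opt/rocm/lib/libamdhip64.so;/opt/rocm-4.5.2/hip/lib"
print(pin_rocm_version(sample))
# /opt/rocm-4.5.2/lib/libamdhip64.so;/opt/rocm-4.5.2/hip/lib
```

With TORCH_CMAKE_DIR exported as above, `patch_file(os.path.join(os.environ["TORCH_CMAKE_DIR"], "Caffe2", "Caffe2Targets.cmake"))` would apply the change after `import os`; the `.bak` copy allows the edit to be reverted.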
Finally, build Redis (or KeyDB for a more performant solution), RedisAI, and the machine-learning backends using:
    KEYDB_FLAG="" # set this to --keydb if desired
    smart build --device gpu --torch_dir $TORCH_CMAKE_DIR --no_tf -v $KEYDB_FLAG
Set up environment¶
Before running SmartSim, the environment should match the one used to build, and some variables should be set to work around some ROCm PyTorch issues:
    # Set these to the same values that were used for install
    export PROJECT_NAME=CHANGE_ME
    export VENV_NAME=CHANGE_ME

    module load PrgEnv-gnu-amd git-lfs cmake cray-python
    module unload xalt amd-mixed
    module load rocm/4.5.2
    export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
    export MIOPEN_USER_DB_PATH=/tmp/miopendb/
    export MIOPEN_SYSTEM_DB_PATH=$MIOPEN_USER_DB_PATH
    mkdir -p $MIOPEN_USER_DB_PATH
    export MIOPEN_DISABLE_CACHE=1
    export VENV_HOME=$SCRATCH/$VENV_NAME/
    source $VENV_HOME/bin/activate
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$VENV_HOME/lib/python3.9/site-packages/torch/lib
Binding DBs to Slingshot¶
Each Frontier node has four NICs, which also means users need to bind
DBs to four network interfaces: hsn0, hsn1, hsn2, and hsn3.
Typically, orchestrators will need to be created in the following way:
    from smartsim import Experiment
    exp = Experiment("my_exp", launcher="slurm")
    orc = exp.create_database(db_nodes=3, interface=["hsn0", "hsn1", "hsn2", "hsn3"], single_cmd=True)
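Before creating the orchestrator, it can help to confirm that the interfaces exist on the node. A minimal sketch using only the Python standard library follows; the helper name `missing_interfaces` is hypothetical, and the check is only meaningful when run on a compute node, since login nodes may expose different interfaces.

```python
import socket


def missing_interfaces(wanted, available=None):
    """Return the names in `wanted` that are not network interfaces on this host."""
    if available is None:
        # socket.if_nameindex() yields (index, name) pairs for local interfaces.
        available = [name for _, name in socket.if_nameindex()]
    available = set(available)
    return [name for name in wanted if name not in available]


interfaces = ["hsn0", "hsn1", "hsn2", "hsn3"]
missing = missing_interfaces(interfaces)
if missing:
    print("Not found on this host:", ", ".join(missing))
```

Passing `available` explicitly makes the function easy to exercise without access to the target machine.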
The same environment set to run SmartSim must be set to run tests. The environment variables needed to run the test suite are the following:
    export SMARTSIM_TEST_ACCOUNT=PROJECT_NAME # Change this to the project name set above
    export SMARTSIM_TEST_LAUNCHER=slurm
    export SMARTSIM_TEST_DEVICE=gpu
    export SMARTSIM_TEST_PORT=6789
    export SMARTSIM_TEST_INTERFACE="hsn0,hsn1,hsn2,hsn3"
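A small pre-flight check can catch unset variables, or placeholders left unchanged, before the test suite is started. This sketch is not part of SmartSim; the function name `check_test_env` and the choice of placeholder strings to flag are assumptions.

```python
import os

REQUIRED_VARS = (
    "SMARTSIM_TEST_ACCOUNT",
    "SMARTSIM_TEST_LAUNCHER",
    "SMARTSIM_TEST_DEVICE",
    "SMARTSIM_TEST_PORT",
    "SMARTSIM_TEST_INTERFACE",
)
# Placeholder values from the instructions that should have been replaced.
PLACEHOLDERS = ("CHANGE_ME", "PROJECT_NAME")


def check_test_env(env=None):
    """Return a list of human-readable problems; an empty list means all set."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif value in PLACEHOLDERS:
            problems.append(f"{name} still holds the placeholder {value!r}")
    return problems


for problem in check_test_env():
    print(problem)
```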
HPE Cray supercomputers¶
On certain HPE Cray machines, the SmartSim dependencies have been installed system-wide, though specific paths and names might vary (please contact the team if these instructions do not work).
    module use -a /lus/scratch/smartsim/local/modulefiles
    module load cudatoolkit/11.8 cudnn git-lfs
    module unload PrgEnv-cray PrgEnv-intel PrgEnv-gcc
    module load PrgEnv-gnu
    module switch gcc/11.2.0
    export CRAYPE_LINK_TYPE=dynamic
Cheyenne at NCAR¶
Since SmartSim does not currently support the Message Passing Toolkit (MPT), Cheyenne users of SmartSim will need to utilize OpenMPI.
The following module commands were utilized to run the examples:
    $ module purge
    $ module load ncarenv/1.3 gnu/8.3.0 ncarcompilers/0.5.0 netcdf/4.7.4 openmpi/4.0.5
With this environment loaded, users will need to build and install both SmartSim and SmartRedis through pip. We usually recommend that users install or load Miniconda and use the pip that comes with that installation.
    $ pip install smartsim
    $ smart build --device cpu  # Cheyenne does not have GPUs
To make the SmartRedis library (C, C++, Fortran clients), follow these steps with the same environment loaded.
    # clone SmartRedis and build
    $ git clone https://github.com/CrayLabs/SmartRedis.git smartredis
    $ cd smartredis
    $ make lib
Summit at OLCF¶
Since SmartSim does not provide a prebuilt package for PowerPC, the build steps for an IBM system are slightly different from those for other systems.
Luckily for us, a conda channel with all relevant packages is maintained as part of the OpenCE initiative. Users can follow these instructions to get a working SmartSim build with PyTorch and TensorFlow for GPU on Summit. Note that SmartSim and SmartRedis will be downloaded to the working directory from which these instructions are executed.
    # setup Python and build environment
    export ENV_NAME=smartsim-0.5.1
    git clone https://github.com/CrayLabs/SmartRedis.git smartredis
    git clone https://github.com/CrayLabs/SmartSim.git smartsim
    conda config --prepend channels https://ftp.osuosl.org/pub/open-ce/1.4.1/
    conda create --name $ENV_NAME -y python=3.9 \
        git-lfs \
        cmake \
        make \
        cudnn=8.1.1_11.2 \
        cudatoolkit=11.2.2 \
        tensorflow=2.6.2 \
        libtensorflow=2.6.2 \
        pytorch=1.9.0 \
        torchvision=0.10.0
    conda activate $ENV_NAME
    export CC=$(which gcc)
    export CXX=$(which g++)
    export LDFLAGS="$LDFLAGS -pthread"
    export CUDNN_LIBRARY=/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/
    export CUDNN_INCLUDE_DIR=/ccs/home/$USER/.conda/envs/$ENV_NAME/include/
    module load cuda/11.4.2
    export LD_LIBRARY_PATH=$CUDNN_LIBRARY:$LD_LIBRARY_PATH:/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/python3.9/site-packages/torch/lib
    module load gcc/9.3.0
    module unload xalt
    # clone SmartRedis and build
    pushd smartredis
    make lib && pip install .
    popd
    # clone SmartSim and build
    pushd smartsim
    pip install .
    # install PyTorch and TensorFlow backend for the Orchestrator database.
    export Torch_DIR=/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/python3.9/site-packages/torch/share/cmake/Torch/
    export CFLAGS="$CFLAGS -I/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/python3.9/site-packages/tensorflow/include"
    export SMARTSIM_REDISAI=1.2.5
    export Tensorflow_BUILD_DIR=/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/python3.9/site-packages/tensorflow/
    smart build --device=gpu --torch_dir $Torch_DIR --libtensorflow_dir $Tensorflow_BUILD_DIR -v
    # Show LD_LIBRARY_PATH for future reference
    echo "SmartSim installation is complete, LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
When executing SmartSim, if you want to use the PyTorch and TensorFlow backends in the orchestrator, you will need to set up the same environment used at build time:
    # Set ENV_NAME to the same value that was used at build time
    export ENV_NAME=smartsim-0.5.1
    module load cuda/11.4.2
    export CUDNN_LIBRARY=/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/
    export LD_LIBRARY_PATH=/ccs/home/$USER/.conda/envs/$ENV_NAME/lib/python3.9/site-packages/torch/lib/:$LD_LIBRARY_PATH:$CUDNN_LIBRARY
    module load gcc/9.3.0
    module unload xalt
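Hard-coding the Python version in the torch/lib path makes it easy for the runtime environment to drift from the build environment. As a hedged alternative, the path can be derived from the active interpreter. The helper name `torch_lib_dir` is hypothetical, and the sketch assumes Torch is installed in the interpreter's default site-packages directory (the `purelib` location reported by `sysconfig`).

```python
import os
import sysconfig


def torch_lib_dir(site_packages=None):
    """Return the expected torch/lib directory for the active Python environment."""
    if site_packages is None:
        # 'purelib' is the site-packages directory of the running interpreter.
        site_packages = sysconfig.get_paths()["purelib"]
    return os.path.join(site_packages, "torch", "lib")


# Print the path so it can be added to LD_LIBRARY_PATH in the shell.
print(torch_lib_dir())
```

Run with the build environment activated, the printed path can be used in place of the hard-coded site-packages path when exporting LD_LIBRARY_PATH.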
Certain HPE customer machines have a site installation of SmartSim. This means
that users can bypass the smart build step that builds the ML backends and
the Redis binaries. Users on these platforms can install SmartSim from PyPI or
from source with the following steps, replacing COMPILER_VERSION and
SMARTSIM_VERSION with the desired entries.
    module use -a /lus/scratch/smartsim/local/modulefiles
    module load cudatoolkit/11.8 cudnn smartsim-deps/COMPILER_VERSION/SMARTSIM_VERSION
    pip install smartsim[ml]
    smart build --only_python_packages --device gpu [--onnx]