Ray Integration

Starting a Ray Cluster with SmartSim

Before we can start up a cluster, we first import the relevant modules. We will also define some global variables for clarity and ease of use:

  1. NUM_NODES is the number of Ray nodes we will deploy; the first one will be the head node. We will run one node on each host.

  2. CPUS_PER_WORKER is the number of CPUs to be used by each worker in the cluster.

  3. LAUNCHER is the workload manager that our SmartSim experiment and Ray cluster will use.
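As a concrete starting point, the three globals above might be set as follows (the values and the launcher name are illustrative assumptions to be adapted to your system, not prescribed by SmartSim):

```python
# Illustrative values (assumptions, not from the tutorial):
NUM_NODES = 3        # one Ray node per host; the first node is the head node
CPUS_PER_WORKER = 4  # CPUs each Ray worker in the cluster may use
LAUNCHER = "slurm"   # workload manager, e.g. "slurm", "pbs", or "local"
```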

import numpy as np
import os
import ray
from ray import tune
import ray.util

from smartsim import Experiment
from smartsim.exp.ray import RayCluster


Now, we instantiate a SmartSim experiment with the name "ray-cluster", with which we will spin up the Ray cluster. Doing so creates a ray-cluster directory (relative to the path from which we are executing this notebook). The output files generated by the experiment will be located in that directory.

Next, we will instantiate a RayCluster to set up the cluster. We are limiting the number of CPUs each Ray node can use to CPUS_PER_WORKER; if we wanted to let Ray use all available CPUs, it would suffice not to pass ray_args. Notice that the cluster will be password-protected (the password, generated internally, will be shared with the worker nodes).

If the hosts are attached to multiple interfaces (e.g. ib, eth0, …), we can specify the one the Ray nodes should bind to by setting the interface argument; it is recommended to always choose the interface offering the best performance. On a Cray XC, for example, this will be ipogif0.

Note that this approach only works with ray>=1.6. For previous versions, you have to add password=None to the RayCluster constructor.

exp = Experiment("ray-cluster", launcher=LAUNCHER)
cluster = RayCluster(
    name="ray-cluster",
    ray_args={"num-cpus": CPUS_PER_WORKER},
    num_nodes=NUM_NODES,
    launcher=LAUNCHER,
)

We now generate the needed directories. If an experiment with the same name already exists, this call will fail to avoid overwriting existing results. If we want to overwrite, we can simply pass overwrite=True to exp.generate().

exp.generate(cluster, overwrite=True)

Now we are ready to start the cluster!

exp.start(cluster, block=False, summary=False)

Connect to the Ray Cluster

Now we can just connect to our running server.

ctx = ray.init(f"ray://{cluster.get_head_address()}:10001")

We can check that all resources are set properly.

        "This cluster consists of\n"
        f"{len(ray.nodes())} nodes in total\n"
        f"{ray.cluster_resources()['CPU']} CPU resources in total\n"
        f"and the head node is running at {cluster.get_head_address()}"

We can run a Ray Tune example to check that everything is working.

    stop={"episode_reward_max": 200},
        "framework": "torch",
        "env": "CartPole-v0",
        "num_gpus": 0,
        "lr": tune.grid_search(np.linspace (0.001, 0.01, 50).tolist()),
        "log_level": "ERROR",
    local_dir=os.path.join(exp.exp_path, "ray_log"),

While the Ray job is running, we can connect to the Ray dashboard to monitor the evolution of the experiment. If Ray is running on a compute node of a remote system, we need to set up an SSH tunnel to forward the port on which the dashboard is published to our local system. For example, if the head address (printed in the cell above) is <head_ip_address>, and the system name is <remote_system_name>, we can establish a tunnel to the dashboard by opening a terminal on the local system and entering:

ssh -L 8265:<head_ip_address>:8265 <remote_system_name>

Then, from a browser on the local system, we can go to the address http://localhost:8265 to see the dashboard.

There are two things to know if something does not work:

  1. We are using 8265 as the port, which is the default dashboard port. If that port is not free, we can bind the dashboard to another port, e.g. PORT_NUMBER (by adding "dashboard-port": str(PORT_NUMBER) to ray_args when creating the cluster), and change the command above accordingly.

  2. If the port forwarding fails, it is possible that the interface is not reachable. In that case, you can add "dashboard-address": "" to ray_args when creating the cluster to bind the dashboard to all interfaces, or select a visible address if you know one. You can then use the node name (or its public IP) to establish the tunnel by entering (on the local terminal):

     ssh -L 8265:<node_name_or_public_IP>:8265 <remote_system_name>

     Please refer to your system guide to find out how to get the name and the address of a node.
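Both workarounds amount to extending ray_args before the cluster is created. A minimal sketch (the port number and CPU count are placeholder assumptions):

```python
CPUS_PER_WORKER = 4  # placeholder; defined earlier in the tutorial
PORT_NUMBER = 8270   # any free port, if the default 8265 is taken

# These entries are forwarded to the `ray start` command line
ray_args = {
    "num-cpus": CPUS_PER_WORKER,
    "dashboard-port": str(PORT_NUMBER),  # workaround 1: non-default port
    "dashboard-address": "",             # workaround 2: bind to all interfaces
}
```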

Stop Cluster and Release Resources

When we are finished with the cluster and ready to deallocate resources, we must first shut down the Ray runtime, followed by disconnecting the context.

ray.shutdown()
ctx.disconnect()

Now that all is gracefully stopped, we can stop the job on the allocation.

exp.stop(cluster)