Getting Started¶
In this notebook, we will walk through the most basic functionalities of SmartSim:
- Creating and Running Models
- Creating and Running Ensembles
- Running and Communicating with the Orchestrator
- Ensembles using SmartRedis
1.1 Running Models¶
Experiments are how users define workflows in SmartSim. The Experiment is used to create Model instances, which represent applications, scripts, or, more generally, any program. An experiment can start and stop a Model and monitor its execution.

We begin by importing the modules we need: Experiment and RunSettings.

RunSettings help parameterize how a Model should be executed, given the system and the available computational resources. There are many types of RunSettings in SmartSim. The base RunSettings class defines parameters for running locally, meaning on a laptop, workstation, or single compute node.
[1]:
import os
from smartsim import Experiment
from smartsim.settings import RunSettings
Throughout this notebook, we will incrementally build an Experiment. We start from the simplest case: a single Model instance.

Our first Model will simply print hello, using the shell command echo.
[2]:
# Init Experiment and specify to launch locally
exp = Experiment(name="getting-started", launcher="local")
# create our simple model
settings = RunSettings(exe="echo", exe_args="hello")
M1 = exp.create_model(name="tutorial-model", run_settings=settings)
Once the Model has been created by the Experiment, it can be started.

By setting summary=True, we see a summary of the experiment printed before it is launched. The summary stays on screen for 10 seconds and is useful as a last check; if we set summary=False, the experiment is launched immediately.

We also explicitly set block=True (even though it is the default), so that Experiment.start waits until the last Model has finished before returning: it acts like a job monitor, letting us know whether processes run, complete, or fail.
[3]:
exp.start(M1, block=True, summary=True)
=== LAUNCH SUMMARY ===
Experiment: getting-started
Experiment Path: /Users/spartee/Dropbox/Cray/smartsim/tutorials/01_getting_started/getting-started
Launching with: local
# of Ensembles: 0
# of Models: 1
Database: no
=== MODELS ===
tutorial-model
Model Parameters:
{}
Model Run Settings:
Executable: /bin/echo
Executable arguments: ['hello']
19:30:47 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model(49432): Completed
The model has completed. Let’s look at the content of the current working directory.
[5]:
os.listdir('.')
outputfile = './tutorial-model.out'
errorfile = './tutorial-model.err'
print("Content of tutorial-model.out:")
with open(outputfile, 'r') as fin:
    print(fin.read())
print("Content of tutorial-model.err:")
with open(errorfile, 'r') as fin:
    print(fin.read())
Content of tutorial-model.out:
hello
Content of tutorial-model.err:
We can see that two files, tutorial-model.out and tutorial-model.err, have been created. The .out file contains the output generated by tutorial-model, and the .err file would contain any error messages it generated. Since there were no errors, the .err file is empty.
Now let’s run two different Model instances at the same time. This is just as easy as running one Model, and takes the same steps. This time, we will skip the summary. For each Model, we create a RunSettings object: it is recommended to always create a separate RunSettings object for each Model.
[6]:
run_settings_1 = RunSettings("sleep", "3")
run_settings_2 = RunSettings("sleep", "5")
model_1 = exp.create_model("tutorial-model-1", run_settings_1)
model_2 = exp.create_model("tutorial-model-2", run_settings_2)
exp.start(model_1, model_2)
19:32:18 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model-1(50330): Completed
19:32:19 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model-2(50337): Running
19:32:20 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model-2(50337): Completed
For users of parallel applications, launch binaries can also be specified in RunSettings. For example, if mpirun is installed on the system, we can run a model through it by specifying it as run_command in RunSettings. Since mpirun takes arguments (e.g. to define how many processes will be run), we pass them by defining run_args in RunSettings.

Please note that to run this you need to have OpenMPI installed.
[8]:
openmpi_settings = RunSettings("echo",
                               "hello world!",
                               run_command="mpirun",
                               run_args={"-np": 2})  # note: for the base ``RunSettings``, run_args are passed literally
ompi_model = exp.create_model("tutorial-model-mpirun", openmpi_settings)
exp.start(ompi_model, summary=True)
=== LAUNCH SUMMARY ===
Experiment: getting-started
Experiment Path: /Users/spartee/Dropbox/Cray/smartsim/tutorials/01_getting_started/getting-started
Launching with: local
# of Ensembles: 0
# of Models: 1
Database: no
=== MODELS ===
tutorial-model-mpirun
Model Parameters:
{}
Model Run Settings:
Executable: /bin/echo
Executable arguments: ['hello', 'world!']
Run Command: mpirun
Run arguments: {'-np': 2}
19:35:47 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model-mpirun(52447): Completed
This time, since we passed -np 2 to mpirun, we should find the line hello world! twice in the output file.
[10]:
outputfile = './tutorial-model-mpirun.out'
errorfile = './tutorial-model-mpirun.err'
print("Content of tutorial-model-mpirun.out:")
with open(outputfile, 'r') as fin:
    print(fin.read())
Content of tutorial-model-mpirun.out:
hello world!
hello world!
1.2 Running Ensembles¶
In the previous example, the two Model instances were created separately. There is a more convenient way of doing this: Ensembles. Ensembles are groups of Model instances that can be treated as a single reference. We start by specifying RunSettings, just as we did for our Models.
[11]:
ens_settings = RunSettings(exe="sleep", exe_args="3")
Then, instead of creating each Model as we did before, we use create_ensemble. Let’s assume we want to run the same model four times in parallel: we pass the replicas=4 argument and simply start the Ensemble.
[12]:
ensemble = exp.create_ensemble("ensemble-replica", replicas=4, run_settings=ens_settings)
exp.start(ensemble, summary=True)
=== LAUNCH SUMMARY ===
Experiment: getting-started
Experiment Path: /Users/spartee/Dropbox/Cray/smartsim/tutorials/01_getting_started/getting-started
Launching with: local
# of Ensembles: 1
# of Models: 0
Database: no
=== ENSEMBLES ===
ensemble-replica
# of models in ensemble: 4
Launching as batch: False
Run Settings:
Executable: /bin/sleep
Executable arguments: ['3']
19:39:43 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_0(54811): Completed
19:39:43 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_2(54813): Completed
19:39:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_1(54812): Completed
19:39:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_3(54814): Completed
19:39:45 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_1(54812): Completed
19:39:45 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_3(54814): Completed
From the output, we see that four copies of our Model, named ensemble-replica_0, ensemble-replica_1, and so on, were run. In each output file, we will see that the same output was generated.

Now let’s imagine that we don’t want to run the same model four times, but rather variations of it. One way of doing this would be to define four models and start them through the Experiment.

For a few simple Models, this would be fine, but what if we needed to run a large number of models that differ only in some parameter? Defining and adding each one separately would be tedious. For such cases, we rely on a parameterized Ensemble of models.
Our goal is to run

python output_my_parameter.py

with multiple parameter values. Clearly, we could pass the parameters as arguments, but in some cases this might not be possible (e.g. if the parameters were stored in a file, or if the executable did not accept them from the command line).
[15]:
rs = RunSettings(exe="python", exe_args="output_my_parameter.py")
Then, we define the parameters we are going to set:
- tutorial_name
- tutorial_parameter

In the original file output_my_parameter.py, which acts as a template, they occur as ;tutorial_name; and ;tutorial_parameter;. The semi-colons are used to perform a regexp substitution with the desired values. The semi-colon in this case is called a tag, and it can be changed.
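For reference, the tagged line in the template would look something like the sketch below. This is only a plausible version consistent with the output we will see later, not necessarily the exact tutorial file:

# output_my_parameter.py (sketch of the template; the actual tutorial file may differ)
# The ;...; tags are replaced with concrete values by Experiment.generate()
print("Hello, my name is ;tutorial_name; "
      "and my parameter is ;tutorial_parameter;")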
We pass the parameters to Experiment.create_ensemble, along with the argument perm_strategy="all_perm". This argument means that we want all possible permutations of the given parameters, which are stored in the argument params. We have two options for each parameter, thus our ensemble will run 4 instances of the same model, each using a different copy of output_my_parameter.py created by calling Experiment.generate(). We attach the template file to the Ensemble instance, generate the configured Python files, and run the experiment.
[16]:
params = {
    "tutorial_name": ["Ellie", "John"],
    "tutorial_parameter": [2, 11]
}
ensemble = exp.create_ensemble("ensemble", params=params, run_settings=rs, perm_strategy="all_perm")
# to_configure specifies that the attached files should be read and their tags replaced
config_file = "./output_my_parameter.py"
ensemble.attach_generator_files(to_configure=config_file)
exp.generate(ensemble, overwrite=True)
exp.start(ensemble)
19:48:20 C02YN0J3JG5M SmartSim[49287] INFO ensemble_0(60008): Completed
19:48:20 C02YN0J3JG5M SmartSim[49287] INFO ensemble_1(60009): Completed
19:48:20 C02YN0J3JG5M SmartSim[49287] INFO ensemble_2(60010): Completed
19:48:21 C02YN0J3JG5M SmartSim[49287] INFO ensemble_3(60012): Completed
19:48:22 C02YN0J3JG5M SmartSim[49287] INFO ensemble_3(60012): Completed
We can see from the output that four instances of our model were run, each named like the Ensemble, with a numeric suffix at the end: ensemble_0, ensemble_1, and so on. Each ensemble member generated its own output files, which are stored in getting-started/ensemble/ensemble_0, getting-started/ensemble/ensemble_1, and so on, as the call to Experiment.generate() creates isolated output directories for each Model in the ensemble.
[18]:
for ensemble_id in range(4):
    outputfile = 'getting-started/ensemble/ensemble_' + str(ensemble_id) + "/ensemble_" + str(ensemble_id) + ".out"
    print(f"Content of {outputfile}:")
    with open(outputfile, 'r') as fin:
        print(fin.read())
Content of getting-started/ensemble/ensemble_0/ensemble_0.out:
Hello, my name is Ellie and my parameter is 2
Content of getting-started/ensemble/ensemble_1/ensemble_1.out:
Hello, my name is Ellie and my parameter is 11
Content of getting-started/ensemble/ensemble_2/ensemble_2.out:
Hello, my name is John and my parameter is 2
Content of getting-started/ensemble/ensemble_3/ensemble_3.out:
Hello, my name is John and my parameter is 11
That’s it! All possible permutations of the input parameters were used to execute the experiment. Sometimes, the parameter space is too large to be explored exhaustively. In that case, we can use a different permutation strategy, e.g. random. For example, if we only want two random combinations drawn from our parameter space, we can run the following code, where we specify n_models=2 and perm_strategy="random".
[19]:
params = {
    "tutorial_name": ["Ellie", "John"],
    "tutorial_parameter": [2, 11]
}
ensemble = exp.create_ensemble("ensemble", params=params, run_settings=rs, perm_strategy="random", n_models=2)
config_file = "./output_my_parameter.py"
ensemble.attach_generator_files(to_configure=config_file)
exp.generate(ensemble, overwrite=True)
exp.start(ensemble)
19:51:29 C02YN0J3JG5M SmartSim[49287] INFO Working in previously created experiment
19:51:34 C02YN0J3JG5M SmartSim[49287] INFO ensemble_0(62039): Completed
19:51:34 C02YN0J3JG5M SmartSim[49287] INFO ensemble_1(62040): Completed
Another possible permutation strategy is stepped, where parameter values are paired index by index instead of fully crossed, as illustrated below. It is also possible to pass a function, which must generate the combinations of parameters starting from the params dictionary; please refer to the documentation to learn more about this.
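To make the difference concrete, here is a plain-Python illustration (not SmartSim internals) of what the all_perm and stepped strategies produce for our parameter lists:

from itertools import product

names = ["Ellie", "John"]
values = [2, 11]

# "all_perm": full cross product of the parameter lists -> 4 models
print(list(product(names, values)))
# [('Ellie', 2), ('Ellie', 11), ('John', 2), ('John', 11)]

# "stepped": values paired index by index -> 2 models
print(list(zip(names, values)))
# [('Ellie', 2), ('John', 11)]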
It is also possible to use a different delimiter for the parameter regexp. For example, if instead of ; we want to use @, we can set it as tag in generate. We have to use a different version of the parameterized file, named output_my_parameter_new_tag.py.
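Assuming the new template mirrors the original one, its tagged line would look something like this (again a sketch, not necessarily the exact file):

# output_my_parameter_new_tag.py (sketch): same template, @ tags instead of ;
print("Hello, my name is @tutorial_name@ "
      "and my parameter is @tutorial_parameter@")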
[20]:
rs = RunSettings(exe="python", exe_args="output_my_parameter_new_tag.py")
params = {
    "tutorial_name": ["Ellie", "John"],
    "tutorial_parameter": [2, 11]
}
ensemble = exp.create_ensemble("ensemble_new_tag", params=params, run_settings=rs, perm_strategy="all_perm")
config_file = "./output_my_parameter_new_tag.py"
ensemble.attach_generator_files(to_configure=config_file)
exp.generate(ensemble, overwrite=True, tag='@')
exp.start(ensemble)
19:52:39 C02YN0J3JG5M SmartSim[49287] INFO Working in previously created experiment
19:52:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_0(62747): Completed
19:52:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_1(62748): Completed
19:52:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_2(62749): Completed
19:52:45 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_3(62750): Completed
19:52:46 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_3(62750): Completed
Finally, we can see all the jobs we have executed by calling Experiment.summary().
[21]:
exp.summary()
[21]:
|    | Name | Entity-Type | JobID | RunID | Time | Status | Returncode |
|---|---|---|---|---|---|---|---|
0 | tutorial-model | Model | 49432 | 0 | 2.001588 | Completed | 0 |
1 | tutorial-model-1 | Model | 50330 | 0 | 4.217252 | Completed | 0 |
2 | tutorial-model-2 | Model | 50337 | 0 | 6.010539 | Completed | 0 |
3 | tutorial-model-mpirun | Model | 52447 | 0 | 2.004867 | Completed | 0 |
4 | ensemble-replica_0 | Model | 54811 | 0 | 4.628920 | Completed | 0 |
5 | ensemble-replica_2 | Model | 54813 | 0 | 4.216618 | Completed | 0 |
6 | ensemble-replica_1 | Model | 54812 | 0 | 6.428902 | Completed | 0 |
7 | ensemble-replica_3 | Model | 54814 | 0 | 6.017785 | Completed | 0 |
8 | ensemble_2 | Model | 60010 | 0 | 4.218734 | Completed | 0 |
9 | ensemble_3 | Model | 60012 | 0 | 6.013866 | Completed | 0 |
10 | ensemble_0 | Model | 60008 | 0 | 4.631225 | Completed | 0 |
11 | ensemble_0 | Model | 62039 | 1 | 4.216191 | Completed | 0 |
12 | ensemble_1 | Model | 60009 | 0 | 4.426308 | Completed | 0 |
13 | ensemble_1 | Model | 62040 | 1 | 4.011659 | Completed | 0 |
14 | ensemble_new_tag_0 | Model | 62747 | 0 | 4.634087 | Completed | 0 |
15 | ensemble_new_tag_1 | Model | 62748 | 0 | 4.428509 | Completed | 0 |
16 | ensemble_new_tag_2 | Model | 62749 | 0 | 4.219598 | Completed | 0 |
17 | ensemble_new_tag_3 | Model | 62750 | 0 | 6.015937 | Completed | 0 |
1.3 Running and Communicating with the Orchestrator¶
In this section, we will see how to use SmartRedis clients to interact with an in-memory database launched by SmartSim, called the Orchestrator. We start by importing the SmartRedis Client and the Orchestrator from SmartSim.
[22]:
from smartredis import Client
from smartsim.database import Orchestrator
import numpy as np
REDIS_PORT=6899
We start the Orchestrator. Since we are setting launcher="local" in the Experiment, the Orchestrator will run as a single DB instance.
[23]:
exp = Experiment("tutorial-smartredis", launcher="local")
# create and start a database
orc = Orchestrator(port=REDIS_PORT)
exp.generate(orc)
exp.start(orc, block=False)
Now that the Orchestrator is running, we can use SmartRedis to store NumPy tensors on the Redis DB and get them back. This is done using the SmartRedis Client. First, we set up a connection to the DB.
[24]:
client = Client(address='127.0.0.1:'+str(REDIS_PORT), cluster=False)
Then, we can use the DB to put and retrieve tensors. We need to assign a unique key to each tensor (or object) we store on the DB.
[25]:
send_tensor = np.ones((4,3,3))
client.put_tensor("tutorial_tensor_1", send_tensor)
receive_tensor = client.get_tensor("tutorial_tensor_1")
print('Receive tensor:\n\n', receive_tensor)
Receive tensor:
[[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]]
With the SmartRedis Client, it is possible to store and run a PyTorch neural network directly on the DB node. We first create a one-layer PyTorch convolutional neural network and save it as a jit-traced, serialized object.
[26]:
import torch
import torch.nn as nn
# taken from https://pytorch.org/docs/master/generated/torch.jit.trace.html
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(1, 1, 3)

    def forward(self, x):
        return self.conv(x)
net = Net()
example_forward_input = torch.rand(1, 1, 3, 3)
module = torch.jit.trace(net, example_forward_input)
# Save the traced model to a file
torch.jit.save(module, "./torch_cnn.pt")
Now we send the model to the database. Again, we assign it a unique key, tutorial-cnn, which we will use to refer to the model when using the Client.
[27]:
# Set the model in the Redis database from the file
client.set_model_from_file("tutorial-cnn", "./torch_cnn.pt", "TORCH", "CPU")
Now we create a random tensor, store it on the DB, and use it as input to the CNN we just sent. The Orchestrator will run the neural network and store the output with the key we specify. Using that key, we can retrieve the tensor.
[28]:
# Put a tensor in the database as a test input
data = torch.rand(1, 1, 3, 3).numpy()
client.put_tensor("torch_cnn_input", data)
# Run model and retrieve the output
client.run_model("tutorial-cnn", inputs=["torch_cnn_input"], outputs=["torch_cnn_output"])
out_data = client.get_tensor("torch_cnn_output")
Notice that we could have defined the model as an object (without storing it on disk) and sent it to the DB using set_model instead of set_model_from_file. We can do the same thing for any Python function. For example, let’s define a simple function that takes a NumPy tensor as input.
[29]:
def max_of_tensor(array):
    """Sample torchscript script that returns the
    highest element in an array.
    """
    # return the highest element
    return array.max(1)[0]
sample_array_1 = np.array([np.arange(9.)])
print(sample_array_1)
print("Max:")
print(max_of_tensor(sample_array_1))
[[0. 1. 2. 3. 4. 5. 6. 7. 8.]]
Max:
8.0
Now let’s store this function on the DB, assigning it the key max-of-tensor:
[30]:
client.set_function("max-of-tensor", max_of_tensor)
Now we perform the same sample computation on the DB.
[31]:
client.put_tensor("script-data-1", sample_array_1)
client.run_script(
    "max-of-tensor",    # key of our script
    "max_of_tensor",    # name of the function to be called
    ["script-data-1"],  # keys of the input tensors
    ["script-output"],  # keys to store the output tensors under
)
out = client.get_tensor("script-output")
print(out)
[8.]
And, as expected, we obtain the same result we obtained when we ran the function locally. To clean up, we need to tear down the DB. We do this by stopping the Orchestrator.
[32]:
exp.stop(orc)
19:59:29 C02YN0J3JG5M SmartSim[49287] INFO Stopping model orchestrator_0 with job name orchestrator_0-CACWEL8F89TK
1.4 Ensembles using SmartRedis¶
In Section 1.2 we used Ensembles. What would happen if Models which are part of an Ensemble tried to put their tensors on the DB using SmartRedis? Unless we used unique keys across the running programs, several tensors (or objects) would have the same key, and this key collision would result in unexpected behavior. In other words, if in the source code of one program a tensor with key tensor1 was put on the DB, then each replica of the program would put a tensor with the key tensor1. SmartSim and SmartRedis avoid key collisions by prepending entity-unique prefixes to the keys.

Let’s start by setting up the experiment with the Orchestrator.
[33]:
exp = Experiment("tutorial-smartredis-ensemble", launcher="local")
# create and start a database
orc = Orchestrator(port=REDIS_PORT)
exp.generate(orc)
exp.start(orc, block=False)
Now let’s add two replicas of the same Model: a simple producer, which puts a tensor on the DB. Its code is in producer.py.
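For reference, a minimal producer could look like the sketch below. This is an assumption about the tutorial file, not its verbatim contents; in particular, the key name tutorial_tensor is hypothetical. SmartSim’s key prefixing will make each replica’s keys unique.

# producer.py (sketch, not necessarily the tutorial's exact file)
import argparse
import numpy as np
from smartredis import Client

parser = argparse.ArgumentParser()
parser.add_argument("--redis-port", type=int)
args = parser.parse_args()

# connect to the local Orchestrator
client = Client(address="127.0.0.1:" + str(args.redis_port), cluster=False)
# store a random tensor; SmartSim prepends this model's prefix to the key
client.put_tensor("tutorial_tensor", np.random.rand(1, 1, 3, 3))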
[34]:
rs_prod = RunSettings("python", "producer.py --redis-port "+str(REDIS_PORT))
ensemble = exp.create_ensemble(name="producer",
                               replicas=2,
                               run_settings=rs_prod)
We add a consumer, which will just retrieve the tensors put by the two producers and check that they are what it expects.
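A matching consumer could be sketched as follows (again an assumption, reusing the hypothetical key from the producer sketch): it switches its data source to each producer’s prefix and reads the tensor stored under it.

# consumer.py (sketch, not necessarily the tutorial's exact file)
import argparse
from smartredis import Client

parser = argparse.ArgumentParser()
parser.add_argument("--redis-port", type=int)
args = parser.parse_args()

client = Client(address="127.0.0.1:" + str(args.redis_port), cluster=False)
for producer in ("producer_0", "producer_1"):
    # read keys written by this producer (requires register_incoming_entity,
    # shown in the next cell)
    client.set_data_source(producer)
    client.poll_tensor("tutorial_tensor", 100, 100)  # wait until the tensor exists
    tensor = client.get_tensor("tutorial_tensor")
    print("Tensor for {} is: {}".format(producer, tensor))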
[35]:
rs_consumer = RunSettings("python", "consumer.py --redis-port "+str(REDIS_PORT))
consumer = exp.create_model("consumer", run_settings=rs_consumer)
We need to register incoming entities, i.e. entities whose prefixes will have to be known by other entities. When we start the Experiment, environment variables will be set to let all entities know which incoming entities are present.
[36]:
consumer.register_incoming_entity(ensemble[0])
consumer.register_incoming_entity(ensemble[1])
Finally, we attach the files to the entities, generate them, and run!
[37]:
ensemble.attach_generator_files(to_copy=['producer.py'])
consumer.attach_generator_files(to_copy=['consumer.py'])
exp.generate(ensemble, overwrite=True)
exp.generate(consumer, overwrite=True)
# start the models
exp.start(ensemble, consumer, summary=True)
20:02:07 C02YN0J3JG5M SmartSim[49287] INFO Working in previously created experiment
20:02:07 C02YN0J3JG5M SmartSim[49287] INFO Working in previously created experiment
=== LAUNCH SUMMARY ===
Experiment: tutorial-smartredis-ensemble
Experiment Path: /Users/spartee/Dropbox/Cray/smartsim/tutorials/01_getting_started/tutorial-smartredis-ensemble
Launching with: local
# of Ensembles: 1
# of Models: 1
Database: no
=== ENSEMBLES ===
producer
# of models in ensemble: 2
Launching as batch: False
Run Settings:
Executable: /Users/spartee/.virtualenvs/smartsim/bin/python
Executable arguments: ['producer.py', '--redis-port', '6899']
=== MODELS ===
consumer
Model Parameters:
{}
Model Run Settings:
Executable: /Users/spartee/.virtualenvs/smartsim/bin/python
Executable arguments: ['consumer.py', '--redis-port', '6899']
20:02:22 C02YN0J3JG5M SmartSim[49287] INFO producer_0(68575): Completed
20:02:22 C02YN0J3JG5M SmartSim[49287] INFO producer_1(68576): Completed
20:02:23 C02YN0J3JG5M SmartSim[49287] INFO consumer(68577): Completed
20:02:24 C02YN0J3JG5M SmartSim[49287] INFO consumer(68577): Completed
The producers stored random NumPy tensors, and by looking at the consumer’s output, we can see that it was able to retrieve both of them from the DB.
[38]:
outputfile = './tutorial-smartredis-ensemble/consumer/consumer.out'
with open(outputfile, 'r') as fin:
print(fin.read())
Tensor for producer_0 is: [[[[0.72651781 0.94967021 0.7009509 ]
[0.12356079 0.10970366 0.17820585]
[0.98406475 0.91311928 0.70532184]]]]
Tensor for producer_1 is: [[[[0.99084866 0.56835187 0.19604226]
[0.08345202 0.82443378 0.50058923]
[0.03786348 0.64053919 0.6278744 ]]]]
As usual, let’s shut down the DB by stopping the Orchestrator.
[39]:
exp.stop(orc)
20:02:45 C02YN0J3JG5M SmartSim[49287] INFO Stopping model orchestrator_0 with job name orchestrator_0-CACWHRLXI83C