Getting Started¶
In this notebook, we will walk through the most basic functionalities of SmartSim:
- Creating and Running Models
- Creating and Running Ensembles
- Running and Communicating with the Orchestrator
- Ensembles using SmartRedis
1.1 Running Models¶
Experiments are how users define workflows in SmartSim. The Experiment is used to create Model instances, which represent applications, scripts, or, more generally, any program. An experiment can start and stop a Model and monitor its execution.

We begin by importing the modules we need: Experiment and RunSettings.

RunSettings help parameterize how a Model should be executed, given the system and the available computational resources. There are many types of RunSettings in SmartSim. The base RunSettings class defines parameters for running locally, meaning on a laptop, workstation, or single compute node.
[1]:
import os
from smartsim import Experiment
from smartsim.settings import RunSettings
Throughout this notebook, we will incrementally build an Experiment. We start from the simplest case: a single Model instance.

Our first Model will simply print hello, using the shell command echo.
[2]:
# Init Experiment and specify to launch locally
exp = Experiment(name="getting-started", launcher="local")
# create our simple model
settings = RunSettings(exe="echo", exe_args="hello")
M1 = exp.create_model(name="tutorial-model", run_settings=settings)
Once the Model has been created by the Experiment, it can be started.

By setting summary=True, we see a summary of the experiment printed before it is launched. The summary stays on screen for 10 seconds and is useful as a last check; if we set summary=False, the experiment is launched immediately.

We also explicitly set block=True (even though it is the default), so that Experiment.start waits until the last Model has finished before returning: it acts like a job monitor, letting us know whether processes run, complete, or fail.
[3]:
exp.start(M1, block=True, summary=True)
=== LAUNCH SUMMARY ===
Experiment: getting-started
Experiment Path: /Users/spartee/Dropbox/Cray/smartsim/tutorials/01_getting_started/getting-started
Launching with: local
# of Ensembles: 0
# of Models: 1
Database: no
=== MODELS ===
tutorial-model
Model Parameters:
{}
Model Run Settings:
Executable: /bin/echo
Executable arguments: ['hello']
19:30:47 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model(49432): Completed
The model has completed. Let’s look at the content of the current working directory.
[5]:
os.listdir('.')
outputfile = './tutorial-model.out'
errorfile = './tutorial-model.err'
print("Content of tutorial-model.out:")
with open(outputfile, 'r') as fin:
    print(fin.read())
print("Content of tutorial-model.err:")
with open(errorfile, 'r') as fin:
    print(fin.read())
Content of tutorial-model.out:
hello
Content of tutorial-model.err:
We can see that two files, tutorial-model.out and tutorial-model.err, have been created. The .out file contains the output generated by tutorial-model, and the .err file would contain any error messages it generated. Since there were no errors, the .err file is empty.
Now let’s run two different Model instances at the same time. This is just as easy as running one Model, and takes the same steps. This time, we will skip the summary. For each Model, we create a RunSettings object: it is recommended to always create a separate RunSettings object for each Model.
[6]:
run_settings_1 = RunSettings("sleep", "3")
run_settings_2 = RunSettings("sleep", "5")
model_1 = exp.create_model("tutorial-model-1", run_settings_1)
model_2 = exp.create_model("tutorial-model-2", run_settings_2)
exp.start(model_1, model_2)
19:32:18 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model-1(50330): Completed
19:32:19 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model-2(50337): Running
19:32:20 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model-2(50337): Completed
For users of parallel applications, launch binaries can also be specified in RunSettings. For example, if mpirun is installed on the system, we can run a model through it by specifying it as run_command in RunSettings. Since mpirun takes arguments (e.g. to define how many processes will be run), we pass them by defining run_args in RunSettings.

Please note that to run this you need to have OpenMPI installed.
[8]:
openmpi_settings = RunSettings("echo",
                               "hello world!",
                               run_command="mpirun",
                               run_args={"-np": 2})  # note: for the base ``RunSettings``, run_args are passed literally
ompi_model = exp.create_model("tutorial-model-mpirun", openmpi_settings)
exp.start(ompi_model, summary=True)
=== LAUNCH SUMMARY ===
Experiment: getting-started
Experiment Path: /Users/spartee/Dropbox/Cray/smartsim/tutorials/01_getting_started/getting-started
Launching with: local
# of Ensembles: 0
# of Models: 1
Database: no
=== MODELS ===
tutorial-model-mpirun
Model Parameters:
{}
Model Run Settings:
Executable: /bin/echo
Executable arguments: ['hello', 'world!']
Run Command: mpirun
Run arguments: {'-np': 2}
19:35:47 C02YN0J3JG5M SmartSim[49287] INFO tutorial-model-mpirun(52447): Completed
This time, since we passed -np 2 to mpirun, we should find the line hello world! twice in the output file.
[10]:
outputfile = './tutorial-model-mpirun.out'
errorfile = './tutorial-model-mpirun.err'
print("Content of tutorial-model-mpirun.out:")
with open(outputfile, 'r') as fin:
    print(fin.read())
Content of tutorial-model-mpirun.out:
hello world!
hello world!
1.2 Running Ensembles¶
In the previous example, the two Model instances were created separately. There is a more convenient way of doing this: Ensembles. Ensembles are groups of Model instances that can be treated as a single reference. We start by specifying RunSettings, just as we did for our Models.
[11]:
ens_settings = RunSettings(exe="sleep", exe_args="3")
Then, instead of creating each Model as we did before, we use create_ensemble. Let’s assume we want to run the same model four times in parallel: we pass the replicas=4 argument and simply start the Ensemble.
[12]:
ensemble = exp.create_ensemble("ensemble-replica", replicas=4, run_settings=ens_settings)
exp.start(ensemble, summary=True)
=== LAUNCH SUMMARY ===
Experiment: getting-started
Experiment Path: /Users/spartee/Dropbox/Cray/smartsim/tutorials/01_getting_started/getting-started
Launching with: local
# of Ensembles: 1
# of Models: 0
Database: no
=== ENSEMBLES ===
ensemble-replica
# of models in ensemble: 4
Launching as batch: False
Run Settings:
Executable: /bin/sleep
Executable arguments: ['3']
19:39:43 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_0(54811): Completed
19:39:43 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_2(54813): Completed
19:39:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_1(54812): Completed
19:39:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_3(54814): Completed
19:39:45 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_1(54812): Completed
19:39:45 C02YN0J3JG5M SmartSim[49287] INFO ensemble-replica_3(54814): Completed
From the output, we see that four copies of our Model, named ensemble-replica_0, ensemble-replica_1, and so on, were run. In each output file, we will see that the same output was generated.

Now let’s imagine that we don’t want to run the same model four times, but rather variations of it. One way of doing this would be to define four models and start them through the Experiment.

For a few simple Models, this would be fine, but what if we needed to run a large number of models that differ only in some parameter? Defining and adding each one separately would be tedious. For such cases, we rely on a parameterized Ensemble of models.
Our goal is to run

python output_my_parameter.py

with multiple parameter values. Clearly, we could pass the parameters as arguments, but in some cases this might not be possible (e.g. if the parameters were stored in a file, or if the executable did not accept them from the command line).
[15]:
rs = RunSettings(exe="python", exe_args="output_my_parameter.py")
Then, we define the parameters we are going to set:
- tutorial_name
- tutorial_parameter

In the original file output_my_parameter.py, which acts as a template, they occur as ;tutorial_name; and ;tutorial_parameter;. The semi-colons are used to perform a regexp substitution with the desired values. The semi-colon in this case is called a tag, and it can be changed.
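For reference, the tagged line in the template would look something like the sketch below. This is only a plausible version consistent with the output we will see later, not necessarily the exact tutorial file:

# output_my_parameter.py (sketch of the template; the actual tutorial file may differ)
# The ;...; tags are replaced with concrete values by Experiment.generate()
print("Hello, my name is ;tutorial_name; "
      "and my parameter is ;tutorial_parameter;")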
We pass the parameters to Experiment.create_ensemble, along with the argument perm_strategy="all_perm". This argument means that we want all possible permutations of the given parameters, which are stored in the argument params. We have two options for each parameter, thus our ensemble will run 4 instances of the same model, each using a different copy of output_my_parameter.py created by calling Experiment.generate(). We attach the template file to the Ensemble instance, generate the configured Python files, and run the experiment.
[16]:
params = {
    "tutorial_name": ["Ellie", "John"],
    "tutorial_parameter": [2, 11]
}
ensemble = exp.create_ensemble("ensemble", params=params, run_settings=rs, perm_strategy="all_perm")
# to_configure specifies that the attached files should be read and their tags replaced
config_file = "./output_my_parameter.py"
ensemble.attach_generator_files(to_configure=config_file)
exp.generate(ensemble, overwrite=True)
exp.start(ensemble)
19:48:20 C02YN0J3JG5M SmartSim[49287] INFO ensemble_0(60008): Completed
19:48:20 C02YN0J3JG5M SmartSim[49287] INFO ensemble_1(60009): Completed
19:48:20 C02YN0J3JG5M SmartSim[49287] INFO ensemble_2(60010): Completed
19:48:21 C02YN0J3JG5M SmartSim[49287] INFO ensemble_3(60012): Completed
19:48:22 C02YN0J3JG5M SmartSim[49287] INFO ensemble_3(60012): Completed
We can see from the output that four instances of our model were run, each named like the Ensemble, with a numeric suffix at the end: ensemble_0, ensemble_1, and so on. Each ensemble member generated its own output files, which are stored in getting-started/ensemble/ensemble_0, getting-started/ensemble/ensemble_1, and so on, as the call to Experiment.generate() creates isolated output directories for each Model in the ensemble.
[18]:
for ensemble_id in range(4):
    outputfile = 'getting-started/ensemble/ensemble_' + str(ensemble_id) + "/ensemble_" + str(ensemble_id) + ".out"
    print(f"Content of {outputfile}:")
    with open(outputfile, 'r') as fin:
        print(fin.read())
Content of getting-started/ensemble/ensemble_0/ensemble_0.out:
Hello, my name is Ellie and my parameter is 2
Content of getting-started/ensemble/ensemble_1/ensemble_1.out:
Hello, my name is Ellie and my parameter is 11
Content of getting-started/ensemble/ensemble_2/ensemble_2.out:
Hello, my name is John and my parameter is 2
Content of getting-started/ensemble/ensemble_3/ensemble_3.out:
Hello, my name is John and my parameter is 11
That’s it! All possible permutations of the input parameters were used to execute the experiment. Sometimes, the parameter space is too large to be explored exhaustively. In that case, we can use a different permutation strategy, e.g. random. For example, if we only want two random combinations drawn from our parameter space, we can run the following code, where we specify n_models=2 and perm_strategy="random".
[19]:
params = {
    "tutorial_name": ["Ellie", "John"],
    "tutorial_parameter": [2, 11]
}
ensemble = exp.create_ensemble("ensemble", params=params, run_settings=rs, perm_strategy="random", n_models=2)
config_file = "./output_my_parameter.py"
ensemble.attach_generator_files(to_configure=config_file)
exp.generate(ensemble, overwrite=True)
exp.start(ensemble)
19:51:29 C02YN0J3JG5M SmartSim[49287] INFO Working in previously created experiment
19:51:34 C02YN0J3JG5M SmartSim[49287] INFO ensemble_0(62039): Completed
19:51:34 C02YN0J3JG5M SmartSim[49287] INFO ensemble_1(62040): Completed
Another possible permutation strategy is stepped, where parameter values are paired index by index instead of fully crossed, as illustrated below. It is also possible to pass a function, which must generate the combinations of parameters starting from the params dictionary; please refer to the documentation to learn more about this.
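To make the difference concrete, here is a plain-Python illustration (not SmartSim internals) of what the all_perm and stepped strategies produce for our parameter lists:

from itertools import product

names = ["Ellie", "John"]
values = [2, 11]

# "all_perm": full cross product of the parameter lists -> 4 models
print(list(product(names, values)))
# [('Ellie', 2), ('Ellie', 11), ('John', 2), ('John', 11)]

# "stepped": values paired index by index -> 2 models
print(list(zip(names, values)))
# [('Ellie', 2), ('John', 11)]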
It is also possible to use a different delimiter for the parameter regexp. For example, if instead of ; we want to use @, we can set it as tag in generate. We have to use a different version of the parameterized file, named output_my_parameter_new_tag.py.
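Assuming the new template mirrors the original one, its tagged line would look something like this (again a sketch, not necessarily the exact file):

# output_my_parameter_new_tag.py (sketch): same template, @ tags instead of ;
print("Hello, my name is @tutorial_name@ "
      "and my parameter is @tutorial_parameter@")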
[20]:
rs = RunSettings(exe="python", exe_args="output_my_parameter_new_tag.py")
params = {
    "tutorial_name": ["Ellie", "John"],
    "tutorial_parameter": [2, 11]
}
ensemble = exp.create_ensemble("ensemble_new_tag", params=params, run_settings=rs, perm_strategy="all_perm")
config_file = "./output_my_parameter_new_tag.py"
ensemble.attach_generator_files(to_configure=config_file)
exp.generate(ensemble, overwrite=True, tag='@')
exp.start(ensemble)
19:52:39 C02YN0J3JG5M SmartSim[49287] INFO Working in previously created experiment
19:52:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_0(62747): Completed
19:52:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_1(62748): Completed
19:52:44 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_2(62749): Completed
19:52:45 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_3(62750): Completed
19:52:46 C02YN0J3JG5M SmartSim[49287] INFO ensemble_new_tag_3(62750): Completed
Finally, we can see all the jobs we have executed by calling Experiment.summary().
[21]:
exp.summary()
[21]:
|    | Name | Entity-Type | JobID | RunID | Time | Status | Returncode |
|---|---|---|---|---|---|---|---|
0 | tutorial-model | Model | 49432 | 0 | 2.001588 | Completed | 0 |
1 | tutorial-model-1 | Model | 50330 | 0 | 4.217252 | Completed | 0 |
2 | tutorial-model-2 | Model | 50337 | 0 | 6.010539 | Completed | 0 |
3 | tutorial-model-mpirun | Model | 52447 | 0 | 2.004867 | Completed | 0 |
4 | ensemble-replica_0 | Model | 54811 | 0 | 4.628920 | Completed | 0 |
5 | ensemble-replica_2 | Model | 54813 | 0 | 4.216618 | Completed | 0 |
6 | ensemble-replica_1 | Model | 54812 | 0 | 6.428902 | Completed | 0 |
7 | ensemble-replica_3 | Model | 54814 | 0 | 6.017785 | Completed | 0 |
8 | ensemble_2 | Model | 60010 | 0 | 4.218734 | Completed | 0 |
9 | ensemble_3 | Model | 60012 | 0 | 6.013866 | Completed | 0 |
10 | ensemble_0 | Model | 60008 | 0 | 4.631225 | Completed | 0 |
11 | ensemble_0 | Model | 62039 | 1 | 4.216191 | Completed | 0 |
12 | ensemble_1 | Model | 60009 | 0 | 4.426308 | Completed | 0 |
13 | ensemble_1 | Model | 62040 | 1 | 4.011659 | Completed | 0 |
14 | ensemble_new_tag_0 | Model | 62747 | 0 | 4.634087 | Completed | 0 |
15 | ensemble_new_tag_1 | Model | 62748 | 0 | 4.428509 | Completed | 0 |
16 | ensemble_new_tag_2 | Model | 62749 | 0 | 4.219598 | Completed | 0 |
17 | ensemble_new_tag_3 | Model | 62750 | 0 | 6.015937 | Completed | 0 |
1.3 Running and Communicating with the Orchestrator¶
In this section, we will see how to use SmartRedis clients to interact with an in-memory database launched by SmartSim, called the Orchestrator. We start by importing the SmartRedis Client and the Orchestrator from SmartSim.
[22]:
from smartredis import Client
from smartsim.database import Orchestrator
import numpy as np
REDIS_PORT=6899
We start the Orchestrator. Since we are setting launcher="local" in the Experiment, the Orchestrator will run as a single DB instance.
[23]:
exp = Experiment("tutorial-smartredis", launcher="local")
# create and start a database
orc = Orchestrator(port=REDIS_PORT)
exp.generate(orc)
exp.start(orc, block=False)
Now that the Orchestrator is running, we can use SmartRedis to store NumPy tensors on the Redis DB and get them back. This is done using the SmartRedis Client. First, we set up a connection to the DB.
[24]:
client = Client(address='127.0.0.1:'+str(REDIS_PORT), cluster=False)
Then, we can use the DB to put and retrieve tensors. We need to assign a unique key to each tensor (or object) we store on the DB.
[25]:
send_tensor = np.ones((4,3,3))
client.put_tensor("tutorial_tensor_1", send_tensor)
receive_tensor = client.get_tensor("tutorial_tensor_1")
print('Receive tensor:\n\n', receive_tensor)
Receive tensor:
[[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]]
With the SmartRedis Client, it is possible to store and run a PyTorch neural network directly on the DB node. We first create a one-layer PyTorch convolutional neural network and save it as a jit-traced, serialized object.
[26]:
import torch
import torch.nn as nn
# taken from https://pytorch.org/docs/master/generated/torch.jit.trace.html
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(1, 1, 3)

    def forward(self, x):
        return self.conv(x)
net = Net()
example_forward_input = torch.rand(1, 1, 3, 3)
module = torch.jit.trace(net, example_forward_input)
# Save the traced model to a file
torch.jit.save(module, "./torch_cnn.pt")
Now we send the model to the database. Again, we assign it a unique key, tutorial-cnn, which we will use to refer to the model when using the Client.
[27]:
# Set the model in the Redis database from the file
client.set_model_from_file("tutorial-cnn", "./torch_cnn.pt", "TORCH", "CPU")
Now we create a random tensor, store it on the DB, and use it as input to the CNN we just sent. The Orchestrator will run the neural network and store the output with the key we specify. Using that key, we can retrieve the tensor.
[28]:
# Put a tensor in the database as a test input
data = torch.rand(1, 1, 3, 3).numpy()
client.put_tensor("torch_cnn_input", data)
# Run model and retrieve the output
client.run_model("tutorial-cnn", inputs=["torch_cnn_input"], outputs=["torch_cnn_output"])
out_data = client.get_tensor("torch_cnn_output")
Notice that we could have defined the model as an object (without storing it on disk) and sent it to the DB using set_model instead of set_model_from_file. We can do the same thing for any Python function. For example, let’s define a simple function that takes a NumPy tensor as input.
[29]:
def max_of_tensor(array):
    """Sample torchscript script that returns the
    highest element in an array.
    """
    # return the highest element
    return array.max(1)[0]
sample_array_1 = np.array([np.arange(9.)])
print(sample_array_1)
print("Max:")
print(max_of_tensor(sample_array_1))
[[0. 1. 2. 3. 4. 5. 6. 7. 8.]]
Max:
8.0
Now let’s store this function on the DB, assigning it the key max-of-tensor:
[30]:
client.set_function("max-of-tensor", max_of_tensor)
Now we perform the same sample computation on the DB.
[31]:
client.put_tensor("script-data-1", sample_array_1)
client.run_script(
    "max-of-tensor",    # key of our script
    "max_of_tensor",    # name of the function to be called
    ["script-data-1"],  # keys of the input tensors
    ["script-output"],  # keys to store the output tensors under
)
out = client.get_tensor("script-output")
print(out)
[8.]
And, as expected, we obtain the same result we obtained when we ran the function locally. To clean up, we need to tear down the DB. We do this by stopping the Orchestrator.
[32]:
exp.stop(orc)
19:59:29 C02YN0J3JG5M SmartSim[49287] INFO Stopping model orchestrator_0 with job name orchestrator_0-CACWEL8F89TK
1.4 Ensembles using SmartRedis¶
In Section 1.2 we used Ensembles. What would happen if Models which are part of an Ensemble tried to put their tensors on the DB using SmartRedis? Unless we used unique keys across the running programs, several tensors (or objects) would have the same key, and this key collision would result in unexpected behavior. In other words, if in the source code of one program a tensor with key tensor1 was put on the DB, then each replica of the program would put a tensor with the key tensor1. SmartSim and SmartRedis avoid key collisions by prepending entity-unique prefixes to the keys.

Let’s start by setting up the experiment with the Orchestrator.
[33]:
exp = Experiment("tutorial-smartredis-ensemble", launcher="local")
# create and start a database
orc = Orchestrator(port=REDIS_PORT)
exp.generate(orc)
exp.start(orc, block=False)
Now let’s add two replicas of the same Model: a simple producer, which puts a tensor on the DB. Its code is in producer.py.
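For reference, a minimal producer could look like the sketch below. This is an assumption about the tutorial file, not its verbatim contents; in particular, the key name tutorial_tensor is hypothetical. SmartSim’s key prefixing will make each replica’s keys unique.

# producer.py (sketch, not necessarily the tutorial's exact file)
import argparse
import numpy as np
from smartredis import Client

parser = argparse.ArgumentParser()
parser.add_argument("--redis-port", type=int)
args = parser.parse_args()

# connect to the local Orchestrator
client = Client(address="127.0.0.1:" + str(args.redis_port), cluster=False)
# store a random tensor; SmartSim prepends this model's prefix to the key
client.put_tensor("tutorial_tensor", np.random.rand(1, 1, 3, 3))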
[34]:
rs_prod = RunSettings("python", "producer.py --redis-port "+str(REDIS_PORT))
ensemble = exp.create_ensemble(name="producer",
                               replicas=2,
                               run_settings=rs_prod)
We add a consumer, which will just retrieve the tensors put by the two producers and check that they are what it expects.
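A matching consumer could be sketched as follows (again an assumption, reusing the hypothetical key from the producer sketch): it switches its data source to each producer’s prefix and reads the tensor stored under it.

# consumer.py (sketch, not necessarily the tutorial's exact file)
import argparse
from smartredis import Client

parser = argparse.ArgumentParser()
parser.add_argument("--redis-port", type=int)
args = parser.parse_args()

client = Client(address="127.0.0.1:" + str(args.redis_port), cluster=False)
for producer in ("producer_0", "producer_1"):
    # read keys written by this producer (requires register_incoming_entity,
    # shown in the next cell)
    client.set_data_source(producer)
    client.poll_tensor("tutorial_tensor", 100, 100)  # wait until the tensor exists
    tensor = client.get_tensor("tutorial_tensor")
    print("Tensor for {} is: {}".format(producer, tensor))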
[35]:
rs_consumer = RunSettings("python", "consumer.py --redis-port "+str(REDIS_PORT))
consumer = exp.create_model("consumer", run_settings=rs_consumer)
We need to register incoming entities, i.e. entities whose prefixes will have to be known by other entities. When we start the Experiment, environment variables will be set to let all entities know which incoming entities are present.
[36]:
consumer.register_incoming_entity(ensemble[0])
consumer.register_incoming_entity(ensemble[1])
Finally, we attach the files to the entities, generate them, and run!
[37]:
ensemble.attach_generator_files(to_copy=['producer.py'])
consumer.attach_generator_files(to_copy=['consumer.py'])
exp.generate(ensemble, overwrite=True)
exp.generate(consumer, overwrite=True)
# start the models
exp.start(ensemble, consumer, summary=True)
20:02:07 C02YN0J3JG5M SmartSim[49287] INFO Working in previously created experiment
20:02:07 C02YN0J3JG5M SmartSim[49287] INFO Working in previously created experiment
=== LAUNCH SUMMARY ===
Experiment: tutorial-smartredis-ensemble
Experiment Path: /Users/spartee/Dropbox/Cray/smartsim/tutorials/01_getting_started/tutorial-smartredis-ensemble
Launching with: local
# of Ensembles: 1
# of Models: 1
Database: no
=== ENSEMBLES ===
producer
# of models in ensemble: 2
Launching as batch: False
Run Settings:
Executable: /Users/spartee/.virtualenvs/smartsim/bin/python
Executable arguments: ['producer.py', '--redis-port', '6899']
=== MODELS ===
consumer
Model Parameters:
{}
Model Run Settings:
Executable: /Users/spartee/.virtualenvs/smartsim/bin/python
Executable arguments: ['consumer.py', '--redis-port', '6899']
20:02:22 C02YN0J3JG5M SmartSim[49287] INFO producer_0(68575): Completed
20:02:22 C02YN0J3JG5M SmartSim[49287] INFO producer_1(68576): Completed
20:02:23 C02YN0J3JG5M SmartSim[49287] INFO consumer(68577): Completed
20:02:24 C02YN0J3JG5M SmartSim[49287] INFO consumer(68577): Completed
The producers stored random NumPy tensors, and by looking at the consumer’s output, we can see that it was able to retrieve both of them from the DB.
[38]:
outputfile = './tutorial-smartredis-ensemble/consumer/consumer.out'
with open(outputfile, 'r') as fin:
print(fin.read())
Tensor for producer_0 is: [[[[0.72651781 0.94967021 0.7009509 ]
[0.12356079 0.10970366 0.17820585]
[0.98406475 0.91311928 0.70532184]]]]
Tensor for producer_1 is: [[[[0.99084866 0.56835187 0.19604226]
[0.08345202 0.82443378 0.50058923]
[0.03786348 0.64053919 0.6278744 ]]]]
As usual, let’s shut down the DB by stopping the Orchestrator.
[39]:
exp.stop(orc)
20:02:45 C02YN0J3JG5M SmartSim[49287] INFO Stopping model orchestrator_0 with job name orchestrator_0-CACWHRLXI83C