Infrastructure Setup

At Sage, we generally provision an EC2 Linux instance for a Challenge and use SynapseWorkflowOrchestrator on it to run CWL workflows. These workflows evaluate and score submissions (see the model-to-data-challenge-workflow GitHub repository for an example workflow). If Sage is responsible for the cloud compute services, please give a general estimate of the computing resources (memory, storage volume) needed. We can also help with the estimates if you are unsure.

What Can Affect the Computing Power

By default, up to ten submissions can be evaluated concurrently, though this number can be raised or lowered in the orchestrator's .env file. Generally, the more submissions you want to run concurrently, the more memory and storage the instance will need.

Example

Suppose a very large and/or complex submission file requires up to 10 GB of memory for evaluation. If at most four submissions should run at the same time, then an instance with at least 40 GB of memory is required (plus some extra memory for system processes), whereas ten concurrent submissions would require at least 100 GB.
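The arithmetic in this example can be captured in a few lines of Python. This is only a sketch for back-of-the-envelope sizing: the function name `required_memory_gb` and the `system_overhead_gb` parameter are illustrative assumptions, not part of the orchestrator.

```python
def required_memory_gb(per_submission_gb, max_concurrent, system_overhead_gb=0.0):
    """Rough lower bound on instance memory (hypothetical helper, not an orchestrator API).

    per_submission_gb: peak memory one submission may need during evaluation
    max_concurrent: number of submissions allowed to run at the same time
    system_overhead_gb: extra memory reserved for system processes (assumed value)
    """
    if per_submission_gb < 0 or system_overhead_gb < 0:
        raise ValueError("memory values must be non-negative")
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return per_submission_gb * max_concurrent + system_overhead_gb


if __name__ == "__main__":
    # The two cases from the example: 10 GB per submission, 4 or 10 at a time.
    print(required_memory_gb(10, 4))
    print(required_memory_gb(10, 10))
```

Passing a non-zero `system_overhead_gb` adds the "extra memory for system processes" mentioned above; how much to reserve depends on the instance and is left to the organizer.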

The required volume size depends on variables such as the size of the input files and the generated output files. If running a model-to-data challenge, Docker images should also be taken into account. On average, participants create Docker images of around 2-4 GB, though some have exceeded 10 GB. (When this happens, we encourage participants to revisit their Dockerfile and source code to ensure they are following best practices, as more than 10 GB is a bit high.)
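A similar rough estimate can be made for the volume. The sketch below assumes that the input data is stored once, that each concurrently running submission holds one Docker image plus its own output files, and that a fractional headroom is added on top. These assumptions, the function name, and all parameter names are illustrative only; actual disk usage also depends on how often old images and outputs are cleaned up.

```python
def required_volume_gb(input_gb, output_per_submission_gb, image_gb,
                       max_concurrent, headroom_fraction=0.2):
    """Rough volume estimate in GB (hypothetical helper, assumptions as stated above).

    input_gb: total size of the input files, stored once on the instance
    output_per_submission_gb: output generated by a single submission
    image_gb: size of one participant Docker image (2-4 GB is typical)
    max_concurrent: number of submissions evaluated at the same time
    headroom_fraction: extra space kept free, as a fraction of the subtotal (assumed)
    """
    if min(input_gb, output_per_submission_gb, image_gb, headroom_fraction) < 0:
        raise ValueError("sizes and headroom must be non-negative")
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    subtotal = input_gb + max_concurrent * (output_per_submission_gb + image_gb)
    return subtotal * (1 + headroom_fraction)


if __name__ == "__main__":
    # Example values only: 50 GB of inputs, 1 GB of output and a 4 GB image
    # per submission, four submissions at a time, 20% headroom.
    print(round(required_volume_gb(50, 1, 4, 4), 1))
```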

Sensitive Data

If data is sensitive and cannot leave the external site or data provider, please provide a remote server with (ideally) the following:

- Support for Docker and, if possible, docker-compose
  - If Docker is not allowed, then support for Singularity and Java 8 is a must
- SynapseWorkflowOrchestrator repository

If Sage is not allowed access to the server, then it is the external site’s responsibility to get the Orchestrator running in whatever environment is chosen. If Docker is not supported by the system, please let us know, as we do have workarounds (e.g. using Java to execute).

Typical Infrastructure Setup Steps

1. Create a workflow infrastructure GitHub repository for the Challenge. We have created two templates in Sage-Bionetworks-Challenges that you may use as a starting point.
2. Create the Challenge site on Synapse. This can easily be done with challengeutils:
3. On the live site, go to the CHALLENGE tab and create as many Evaluation Queues as needed (e.g. one per sub-challenge) by clicking on Challenge Tools > Create Evaluation Queue. By default, create-challenge will create an Evaluation Queue for writeups, which you will already see listed here.
4. While still on the live site, go to the FILES tab and create a new Folder called "Logs" by clicking on Files Tools > Add New Folder.
5. On the staging site, go to the FILES tab and create a new File by clicking on Files Tools > Upload or Link to a File > Link to URL.
6. Add an Annotation called ROOT_TEMPLATE to the File by clicking on Files Tools > Annotations > Edit. The "Value" will be the path to the workflow script, written as:
7. Create a cloud compute environment with the required memory and volume specifications. Once it spins up, log into the instance and clone the orchestrator:
8. While still on the instance, change directories to SynapseWorkflowOrchestrator/ and create a copy of the .envTemplate file as .env (or simply rename it to .env):
9. Open .env and enter values for the following property variables:
10. Return to the workflow infrastructure repository and clone it onto your local machine. Open the repo in your editor of choice and make the following edits to the scripts:
11. On the instance, change directories to SynapseWorkflowOrchestrator/ and kick-start the orchestrator with:
12. Go to the staging site and click on the TABLES tab. Create a new Submission View by clicking on Table Tools > Add Submission View. Under "Scope", add the Evaluation Queue(s) you are interested in monitoring (you may add more than one), then click Next. On the next screen, select which information to display, then hit Save. A Synapse table of the submissions and their metadata is now available for viewing and querying.
13. On the live site, go to the CHALLENGE tab and share the appropriate Evaluation Queues with the Participants team, giving them "Can submit" permissions.
14. Use the copyWiki command provided by synapseutils to copy over all pages from the staging site to the live site. When using copyWiki, it is important to also specify the destinationSubPageId parameter. This ID can be found in the URL of the live site, where it is the integer following .../wiki/, e.g.
15. On the instance, enter:

For a visual reference, a diagram of the orchestrator and its interactions with Synapse is provided below:

Workflow diagram (credit: Tom Yu)