Infrastructure Setup

At Sage, we generally provision an EC2 Linux instance for each Challenge and run the SynapseWorkflowOrchestrator on it to execute CWL workflows. These workflows evaluate and score submissions (see the model-to-data-challenge-workflow GitHub repository for an example workflow). If Sage is responsible for the cloud compute services, please give a general estimate of the computing power (memory, volume) needed. We can also help with the estimates if you are unsure.

What Can Affect the Computing Power

By default, up to ten submissions can be evaluated concurrently, though this number can be raised or lowered in the orchestrator's .env file. Generally, the more submissions you want to run concurrently, the more computing power the instance will need.

Example

Suppose a very large or complex submission file requires up to 10 GB of memory for evaluation. If at most four submissions should run at the same time, then an instance with at least 40 GB of memory will be required (plus some extra memory for system processes), whereas ten concurrent submissions would require at least 100 GB.
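The arithmetic in this example can be sketched as a small helper. This is only an illustration: the function name and the optional system_overhead_gb parameter are our own assumptions, not part of the orchestrator.

```python
def required_memory_gb(per_submission_gb, max_concurrent, system_overhead_gb=0):
    """Estimate the instance memory needed to evaluate submissions concurrently.

    per_submission_gb:  peak memory one evaluation may use
    max_concurrent:     number of submissions evaluated at the same time
    system_overhead_gb: extra headroom for system processes (hypothetical knob)
    """
    return per_submission_gb * max_concurrent + system_overhead_gb


print(required_memory_gb(10, 4))   # 40
print(required_memory_gb(10, 10))  # 100
```

Passing a non-zero system_overhead_gb is one way to account for the extra memory that system processes need on top of the evaluations themselves.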

The volume of the instance will depend on variables such as the size of the input files and the generated output files. If running a model-to-data challenge, Docker images should also be taken into account. On average, participants will create Docker images that are around 2-4 GB in size, though some have exceeded 10 GB. (When this happens, we encourage participants to revisit their Dockerfile and source code to ensure they are following best practices, as an image over 10 GB is unusually large.)

Sensitive Data

If data is sensitive and cannot leave the external site or data provider, please provide a remote server with (ideally) the following:

- Support for Docker and, if possible, docker-compose
  - If Docker is not allowed, then support for Singularity and Java 8 is a must
- SynapseWorkflowOrchestrator repository

If Sage is not allowed access to the server, then it is the external site’s responsibility to get the Orchestrator running in whatever environment is chosen. If Docker is not supported by the system, please let us know, as we do have workarounds (e.g. using Java to execute).

Typical Infrastructure Setup Steps

Create a workflow infrastructure GitHub repository for the Challenge. We have created two templates in Sage-Bionetworks-Challenges that you may use as a starting point. The READMEs outline what will need to be updated within the scripts, but we will return to this later in Step 10.
1. data-to-model-challenge-workflow (submission type: prediction files)
2. model-to-data-challenge-workflow (submission type: Docker images)
Create the Challenge site on Synapse. This can easily be done with challengeutils:
challengeutils create-challenge "challenge_name"
This command will create two Synapse Projects: one staging site and one live site. You may think of them as development and production, in that all edits must be done in the staging site, NOT live. Changes to the live site will instead be synced over with challengeutils' mirror-wiki (more on this under Update the Challenge).
Note: at first, the live site will be just one page that provides a general overview of the Challenge. There will also be a pre-register button that Synapse users can click on if they are interested in the upcoming Challenge.
For the initial deployment of the staging site to live, use synapseutils' copyWiki command, NOT mirror-wiki (more on this under Launch the Challenge).
create-challenge will also create four Synapse Teams for the Challenge: * Preregistrants, * Participants, * Organizers, and * Admin, where * is the Challenge name. Add users to the Organizers and Admin teams as needed.
On the live site, go to the CHALLENGE tab and create as many Evaluation Queues as needed (e.g. one per sub-challenge) by clicking on Challenge Tools > Create Evaluation Queue. By default, create-challenge will create an Evaluation Queue for writeups, which you will already see listed here.

Important: the 7-digit number in parentheses following each Evaluation Queue name is its evaluation ID.
You will need these IDs later for Step 9, so make note of them.
While still on the live site, go to the FILES tab and create a new Folder called "Logs" by clicking on Files Tools > Add New Folder.

Important: this will be where the participants' submission logs and prediction files are uploaded, so make note of its Synapse ID for later usage in Step 9.
On the staging site, go to the FILES tab and create a new File by clicking on Files Tools > Upload or Link to a File > Link to URL.

For "URL", enter the link address to the zipped download of the workflow infrastructure repository. You may get this address by going to the repository, clicking on Code, then right-clicking Download Zip > Copy Link Address.

Name the File whatever you like (we generally use "workflow"), then hit Save.
Important: this File will be what links the Evaluation Queue to the orchestrator, so make note of its Synapse ID for later usage in Step 9.
Add an Annotation to the File called ROOT_TEMPLATE by clicking on Files Tools > Annotations > Edit. The "Value" will be the path to the workflow script, written as:
{infrastructure workflow repo}-{branch}/path/to/workflow.cwl
For example, this is the path to workflow.cwl of the model-to-data template repo:
model-to-data-challenge-workflow-main/workflow.cwl
Important: the ROOT_TEMPLATE annotation is what the orchestrator uses to determine which file among the repo is the workflow script.
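To make the relationship between the zipped download link and the ROOT_TEMPLATE value concrete, here is a minimal sketch. It assumes the usual GitHub "Download ZIP" address pattern and a simple branch name without slashes; the helper names are hypothetical, and the address you copy from the repository page is always the authoritative one.

```python
def archive_url(org, repo, branch):
    """Assumed pattern of GitHub's 'Download ZIP' link for a branch."""
    return f"https://github.com/{org}/{repo}/archive/refs/heads/{branch}.zip"


def root_template(repo, branch, workflow_path="workflow.cwl"):
    """ROOT_TEMPLATE value: {repo}-{branch}/path/to/workflow.cwl."""
    return f"{repo}-{branch}/{workflow_path}"


repo, branch = "model-to-data-challenge-workflow", "main"
print(archive_url("Sage-Bionetworks-Challenges", repo, branch))
print(root_template(repo, branch))
# second line printed: model-to-data-challenge-workflow-main/workflow.cwl
```

The {repo}-{branch} prefix is the name of the top-level folder inside the zipped download, which is why the branch name appears in the annotation value.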
Create a cloud compute environment with the required memory and volume specifications. Once it spins up, log into the instance and clone the orchestrator:
git clone https://github.com/Sage-Bionetworks/SynapseWorkflowOrchestrator.git
While still on the instance, change directories to SynapseWorkflowOrchestrator/ and create a copy of the .envTemplate file as .env (or simply rename it to .env):
cd SynapseWorkflowOrchestrator/
cp .envTemplate .env
Open .env and enter values for the following property variables:
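As a rough sketch of how the IDs noted in the earlier steps might map onto .env entries, the snippet below renders two lines. The variable names WORKFLOW_OUTPUT_ROOT_ENTITY_ID and EVALUATION_TEMPLATES are assumptions recalled from the orchestrator's README, so please verify them against your copy of .envTemplate. All IDs shown are hypothetical placeholders, and the credential entries are deliberately left out.

```python
import json


def render_env_entries(logs_folder_id, queue_to_workflow):
    """Return .env lines for the Logs folder and the queue-to-workflow mapping.

    logs_folder_id:    Synapse ID of the "Logs" Folder on the live site
    queue_to_workflow: {evaluation ID: Synapse ID of the linked workflow File}
    Variable names are assumed; check them against .envTemplate.
    """
    templates = json.dumps(queue_to_workflow, separators=(",", ":"))
    return [
        f"WORKFLOW_OUTPUT_ROOT_ENTITY_ID={logs_folder_id}",
        f"EVALUATION_TEMPLATES={templates}",
    ]


# Hypothetical IDs, for illustration only.
lines = render_env_entries("syn00000001", {"9600001": "syn00000002"})
print("\n".join(lines))
# WORKFLOW_OUTPUT_ROOT_ENTITY_ID=syn00000001
# EVALUATION_TEMPLATES={"9600001":"syn00000002"}
```

Building the mapping with json.dumps avoids hand-typing the quotes and braces, which is an easy place to introduce a typo when several Evaluation Queues share one workflow File.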
Return to the workflow infrastructure repository and clone it onto your local machine. Open the repo in your editor of choice and make the following edits to the scripts:
On the instance, change directories to SynapseWorkflowOrchestrator/ and kick-start the orchestrator with:
docker-compose up -d
Go to the staging site and click on the TABLES tab. Create a new Submission View by clicking on Table Tools > Add Submission View. Under "Scope", add the Evaluation Queue(s) you are interested in monitoring (you may add more than one), then click Next. On the next screen, select which information to display, then hit Save. A Synapse table of the submissions and their metadata is now available for viewing and querying.
On the live site, go to the CHALLENGE tab and share the appropriate Evaluation Queues with the Participants team, giving them "Can submit" permissions.
Use the copyWiki command provided by synapseutils to copy over all pages from the staging site to the live site. When using copyWiki, it is important to also specify the destinationSubPageId parameter. This ID can be found in the URL of the live site, where it is the integer following .../wiki/.
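A minimal sketch of this step in Python might look like the following. The URL and Synapse IDs are hypothetical placeholders, and the helper names are our own; the copy itself uses synapseutils.copyWiki with its destinationSubPageId parameter and requires the synapseclient package plus stored Synapse credentials.

```python
import re


def wiki_subpage_id(wiki_url):
    """Extract the integer that follows .../wiki/ in a Synapse wiki URL."""
    match = re.search(r"/wiki/(\d+)", wiki_url)
    if match is None:
        raise ValueError(f"No wiki ID found in: {wiki_url}")
    return match.group(1)


def copy_staging_to_live(staging_id, live_id, live_wiki_url):
    """Copy all wiki pages from the staging Project to the live Project."""
    # Imported here so the URL helper above works without synapseclient installed.
    import synapseclient
    import synapseutils

    syn = synapseclient.Synapse()
    syn.login()  # assumes credentials are already configured locally
    return synapseutils.copyWiki(
        syn,
        staging_id,
        live_id,
        destinationSubPageId=wiki_subpage_id(live_wiki_url),
    )


# Hypothetical live-site URL, for illustration only.
print(wiki_subpage_id("https://www.synapse.org/#!Synapse:syn00000003/wiki/123456"))
# 123456
```

Calling copy_staging_to_live("syn00000004", "syn00000003", url) with your real staging ID, live ID, and live-site URL would then perform the initial deployment.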
On the instance, enter:

For a visual reference, a diagram of the orchestrator and its interactions with Synapse is provided below:

Workflow diagram (credit: Tom Yu)