Containers are a convenient way to bundle code and its dependencies into a rerunnable module that takes the form of a lightweight virtual machine. Although this technology is primarily intended for deploying and managing complex, multi-server systems, it also provides a useful solution for reproducible scientific computing.
Use Cases
Submission to Challenge
A DREAM challenge involves training predictors using sample data. The data are very large and/or so sensitive that the participants are not permitted to access the data directly. Instead each participant must submit a trainable predictor to the challenge. Their predictors are then trained by challenge organizers and the training output returned. The challenge allows participants to choose the language and tools to use in their predictors. The sequence is:
- Participant crafts a trainable model in the language of their choice.
- Participant bundles the model with all its dependencies into a runnable object.
- Participant pushes the object to a storage location, accessible to the challenge.
- Participant links the runnable object to the source used to build it along with the recipe for creating it, i.e. the object's provenance.
- Participant submits the object to the challenge.
- The challenge framework retrieves the object, links it to the withheld data and runs the training algorithm.
- The challenge framework runs the predictor on validation data, scores the output, and publishes the score.
- Optionally, the challenge framework returns the trained predictor to the participant.
- At the end of the challenge the participant makes the runnable object and provenance information available to the larger community.
Reproducible Analysis
A scientist produces a result using a computation-rich protocol. She wishes to make the result reproducible by others. The computation uses various libraries which may be revised over time. To ensure reproducibility she creates a 'snapshot' that includes the current versions of all libraries at the time the computation was done.
- Scientist creates a Synapse project.
- Scientist uploads data, source code, algorithm description, and recipe for creating rerunnable object. She makes this information public.
- Scientist bundles code into a rerunnable object and uses it to perform computation.
- Scientist pushes object to a storage location accessible to the public and links to the source information above.
- A second scientist wishes to review the work. When he reaches the computation stage he retrieves the rerunnable object, executes it and verifies the results match the ones claimed by the author.
Reusable Analysis
A data scientist creates a computational workflow for processing certain data, and she expects it to be broadly useful for processing other datasets. Setting it up can be complex due to the various required libraries.
- Scientist bundles workflow as a runnable object.
- Scientist documents how to run, i.e. how to feed inputs, what outputs to expect, what error modes can be encountered.
- Scientist shares runnable object and documentation, either with the public, or with a team charged with performing similar data processing.
- Other data scientists read about the workflow in a project, and reuse the rerunnable object, guided by the 'how to' instructions, to apply the workflow to their own data.
Nearly all of the above can be accomplished today using Synapse along with a Docker registry such as DockerHub. What is missing is the tight integration with Synapse, in particular, ACLs (aka "Sharing Settings") that apply to Docker repositories, allowing a Synapse user to control what other Synapse users and teams can access their content.
Additionally we need to be able to represent Docker repositories and commits in Submissions and provenance records (Activities).
Implementation
The Docker Registry can be freely deployed as a private service and can be configured to defer authorization to a separate service as depicted here:
https://github.com/docker/distribution/blob/master/docs/spec/auth/token.md
There are two points of integration with Synapse: (1) notification when a new repository is created, updated, etc., (2) request to authorize an operation on a repository. When a new Docker repository is created an object will be created in the repository services. This object will be related to an ACL which can be edited by the repository owner or other authorized Synapse users. When an authorization request comes in, the ACL is used to approve or deny the request.
Note: We can allow MULTIPLE Docker registries to delegate authorization to Synapse by ensuring Synapse includes the registry 'host' in the repository object.
The Docker Repository
Docker images are organized into repositories. A repository reference has three parts: registry host (CNAME or IP address with optional port), repository path, and tag, e.g.
index.docker.io:5000/username/reponame:v1
All fields are optional except 'reponame'. If the registry host is omitted then the docker client assumes it is a reference to DockerHub. Each repository has a series of commits, each identified by a digest (currently SHA256). Users may also define tags which name commits. These tags are mutable: they do not permanently reference a specific commit. When a repository reference omits a tag, the default tag "latest" is assumed.
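The three-part structure described above can be illustrated with a small parser. This is a minimal sketch, not the docker client's implementation: `RepoReference` and `parse_reference` are hypothetical names, the rule for recognizing a registry host (the first path component contains a '.' or ':' or equals "localhost") is an assumption modeled on how the docker client is commonly understood to behave, and digest references (`@sha256:...`) are not handled.

```python
from typing import NamedTuple, Optional

DEFAULT_TAG = "latest"


class RepoReference(NamedTuple):
    host: Optional[str]  # None means the client would fall back to DockerHub
    path: str
    tag: str


def parse_reference(ref: str) -> RepoReference:
    """Split 'host[:port]/path[:tag]' into its three parts (illustrative only)."""
    # A ':' after the last '/' separates the tag; a ':' before it belongs to host:port.
    last_slash = ref.rfind("/")
    colon = ref.rfind(":")
    if colon > last_slash:
        name, tag = ref[:colon], ref[colon + 1:]
    else:
        name, tag = ref, DEFAULT_TAG

    # Assumed heuristic: the first component is a registry host only if it
    # looks like one (contains '.' or ':', or is 'localhost').
    first, sep, rest = name.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        host, path = first, rest
    else:
        host, path = None, name

    if not path or not tag:
        raise ValueError(f"not a valid repository reference: {ref!r}")
    return RepoReference(host, path, tag)
```

For example, `parse_reference("index.docker.io:5000/username/reponame:v1")` yields the host `index.docker.io:5000`, the path `username/reponame` and the tag `v1`, while `parse_reference("reponame")` yields no host and the tag `latest`.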
Docker Terminology
Registry: A service providing access to a collection of repositories.
Repository: A series of commits/versions of a machine image.
Image: A single commit, containing the content of a machine.
Container: A running virtual machine, started from an image.
User Experience
docker build -t docker.synapse.org/syn1234567/myrepo .
docker login docker.synapse.org
username: mysynuser
password: xxxxxx
docker push docker.synapse.org/syn1234567/myrepo
The repository will now appear in Synapse under (Project or Folder) syn1234567. See https://app.moqups.com/bruce.hoff@sagebase.org/HY2x6MNWXo/edit/page/a406bb9f1
The repository cannot be moved or renamed.
Once shared with other users, they can:
docker pull docker.synapse.org/syn1234567/myrepo
docker tag docker.synapse.org/syn1234567/myrepo docker.synapse.org/syn9876543/someotherrepo
docker push docker.synapse.org/syn9876543/someotherrepo
The repository will now appear under syn9876543.
Schemas:
Docker Repository Schema:
DockerRepository extends Entity
- name (registryhost/repopath)
EntityValidator validates the format and that, for external/unmanaged repos, the registryhost doesn't violate the blacklist.
For managed repos, 'repopath' must start with the Synapse ID of a container (folder or project).
- isManaged: says whether this repository is managed by Synapse or is a reference to an external registry.
We have a 'white list' of registry hosts for which we answer authorization requests and a 'reserved' list of registry hosts for which external repositories can't be created. (This allows us to reserve address spaces for the future. E.g. the white list could contain docker.synapse.org:443 and the reserved list *.synapse.org.)
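The naming rules above could be checked along these lines. This is a sketch under stated assumptions, not the actual EntityValidator: `validate_repository_name`, `WHITE_LIST` and `RESERVED_PATTERNS` are hypothetical names, the list contents are the example values from the text, the Synapse ID pattern (`syn` followed by digits) is inferred from the IDs used in this document, and requiring a managed repository's host to be on the white list is an assumption.

```python
import re
from fnmatch import fnmatchcase

# Example values taken from the text; real lists would be configuration.
WHITE_LIST = {"docker.synapse.org:443"}   # hosts we answer authorization requests for
RESERVED_PATTERNS = ["*.synapse.org"]     # hosts on which external repos may not be created
SYNAPSE_ID = re.compile(r"^syn\d+$")      # assumed shape of a Synapse ID


def validate_repository_name(name: str, is_managed: bool) -> None:
    """Raise ValueError if 'registryhost/repopath' breaks the naming rules."""
    host, sep, repo_path = name.partition("/")
    if not (sep and host and repo_path):
        raise ValueError("name must have the form registryhost/repopath")
    host = host.lower()

    if is_managed:
        # Assumption: a managed repository must live on a white-listed host.
        if host not in WHITE_LIST:
            raise ValueError(f"{host} is not a managed registry host")
        parent_id = repo_path.split("/", 1)[0]
        if not SYNAPSE_ID.match(parent_id):
            raise ValueError("repopath must start with the Synapse ID of a project or folder")
    else:
        hostname = host.split(":", 1)[0]  # ignore the port when matching patterns
        if host in WHITE_LIST or any(fnmatchcase(hostname, p) for p in RESERVED_PATTERNS):
            raise ValueError(f"{host} is reserved; external repositories cannot use it")
```

With these example lists, `docker.synapse.org:443/syn1234567/myrepo` is accepted as a managed repository, while an external repository on any `*.synapse.org` host is rejected.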
DockerCommit:
Each Repository has a list of 'commits'. For external repos the user must provide them. For managed repos, commits are added based on 'push' events received by the listener.
- tag: e.g. "v1". Optional; must be unique within a Docker repository.
- digest: e.g. "SHA256:a68df63...". Required; must be unique within a Docker repository.
We will provide a service to retrieve a repository based on its hash.
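The uniqueness constraints on commits can be sketched as a small in-memory structure. `DockerCommit` here mirrors the two fields listed above; `CommitList` and its methods are hypothetical, and rejecting a duplicate tag (rather than moving the tag to the new digest) is only one possible reading of "must be unique". The digests in the usage note are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DockerCommit:
    digest: str                 # required
    tag: Optional[str] = None   # optional


@dataclass
class CommitList:
    """Commits of ONE repository, enforcing per-repository uniqueness."""
    _by_digest: Dict[str, DockerCommit] = field(default_factory=dict)
    _digest_by_tag: Dict[str, str] = field(default_factory=dict)

    def add(self, commit: DockerCommit) -> None:
        if not commit.digest:
            raise ValueError("digest is required")
        if commit.digest in self._by_digest:
            raise ValueError(f"digest already present: {commit.digest}")
        # Assumption: a tag already in use is rejected rather than re-pointed.
        if commit.tag is not None and commit.tag in self._digest_by_tag:
            raise ValueError(f"tag already present: {commit.tag}")
        self._by_digest[commit.digest] = commit
        if commit.tag is not None:
            self._digest_by_tag[commit.tag] = commit.digest

    def commits(self) -> List[DockerCommit]:
        return list(self._by_digest.values())
```

Adding `DockerCommit("sha256:aaa", "v1")` and then an untagged `DockerCommit("sha256:bbb")` succeeds; adding a second commit with tag `v1` or digest `sha256:aaa` raises `ValueError`.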
Services:
Create, Update and Delete entity: not allowed for entities which are 'managed', i.e. their host field matches that of a managed repo.
Access requirements can be applied to / inherited by repositories, as with Files.
Description | URI | Method | Request Parameters | Request Body | Response Body |
---|---|---|---|---|---|
Authorization Request | /bearerToken | GET | service, scope | -- | BearerToken |
Add a commit to an external repository. (Also changes modifiedBy, modifiedOn for the entity.) | /entity/{id}/dockerCommit | POST | -- | DockerCommit | -- |
Get the commits for a repository. | /entity/{id}/dockerCommit | GET | -- | -- | DockerCommitList |
Get the DockerRepository for a commit. | /entity/dockerDigest/{digest} | GET | -- | -- | DockerRepository |
Get Docker password for a Docker registry. (System will generate automatically.) | /dockerPassword | GET | registryHost | -- | Password |
Invalidate password for a Docker registry. | /dockerPassword | DELETE | registryHost | -- | -- |
org.sagebionetworks.repo.model.oauth.BearerToken: defined here: https://docs.docker.com/registry/spec/auth/jwt/
org.sagebionetworks.repo.model.docker.registry.Event: defined here: https://godoc.org/github.com/docker/distribution/notifications#Event
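The `scope` request parameter of the authorization request is defined by the Docker token authentication specification as `type:name:actions`, for example `repository:samalba/my-app:pull,push`. A service answering `/bearerToken` would need to take such a string apart before consulting an ACL. The following is a minimal sketch; `Scope` and `parse_scope` are hypothetical names, and only a single scope string is handled.

```python
from typing import NamedTuple, Tuple


class Scope(NamedTuple):
    resource_type: str        # e.g. "repository"
    name: str                 # e.g. "syn1234567/myrepo"
    actions: Tuple[str, ...]  # e.g. ("pull", "push")


def parse_scope(scope: str) -> Scope:
    """Parse 'type:name:actions' where actions is a comma-separated list."""
    resource_type, sep1, rest = scope.partition(":")
    # Split the actions off from the right so that a ':' inside the name survives.
    name, sep2, actions = rest.rpartition(":")
    if not (sep1 and sep2 and resource_type and name and actions):
        raise ValueError(f"malformed scope: {scope!r}")
    return Scope(resource_type, name, tuple(a for a in actions.split(",") if a))
```

For instance, `parse_scope("repository:syn1234567/myrepo:pull,push")` gives the resource type `repository`, the name `syn1234567/myrepo` and the actions `pull` and `push`.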
Details
Authorization Request
If the user making the request is the user specified in the repository reference, or a member of the team named in a team reference, the access is approved.
If the repository is not represented in Synapse then deny the request, else answer the authorization question using the ACL associated with the requested repository. (Note: We can leverage existing Governance mechanisms by requiring 'download' access level in order to 'pull' a repository.)
If the image being pulled is used in a Submission in an Evaluation queue for which the user has the appropriate access ('Score') then approve the pull request. (Ideally we would authorize only a pull of the specific commit which was used in the Submission, but the Docker auth request includes just the repo, not the commit (the SHA256).)
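The ACL and Submission rules above can be summarized as a decision function. This is a sketch with in-memory stand-ins, not Synapse code: `Repository`, `User`, `authorize` and the access-level names are hypothetical, the mapping of 'push' to an 'UPDATE' level is an assumption (the text only specifies 'download' for 'pull'), and the user/team reference rule from the first sentence is omitted because it depends on how references name users and teams.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Assumed mapping from Docker actions to Synapse access levels.
REQUIRED_ACCESS = {"pull": "DOWNLOAD", "push": "UPDATE"}


@dataclass
class Repository:
    name: str
    acl: Dict[str, Set[str]] = field(default_factory=dict)  # principal -> access levels
    submitted_to: Set[str] = field(default_factory=set)     # evaluation queues using this repo


@dataclass
class User:
    principals: Set[str]                                     # own id plus team ids
    scoring_queues: Set[str] = field(default_factory=set)    # queues with 'Score' access


def authorize(user: User, repo: Optional[Repository], action: str) -> bool:
    # A repository that is not represented in Synapse is always denied.
    if repo is None:
        return False
    # Answer from the ACL attached to the repository.
    needed = REQUIRED_ACCESS.get(action)
    if needed is not None:
        for principal in user.principals:
            if needed in repo.acl.get(principal, set()):
                return True
    # A pull is also approved for users who can score a queue the repo was submitted to.
    if action == "pull" and repo.submitted_to & user.scoring_queues:
        return True
    return False
```

Because the Docker auth request names only the repository, the Submission rule in this sketch grants a pull of the whole repository, which matches the limitation noted above.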
Registry Event
Synapse will create, update and delete its representation of a Docker repository in response to received registry events.
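A listener for registry events might look roughly like the sketch below. It assumes the notification envelope format of the Docker registry (an `events` list whose entries carry `action`, `target.repository`, `target.digest`, `target.mediaType`, an optional `target.tag`, and `request.host`); the in-memory `store`, the function name `apply_events`, the manifest filter, and the choice to drop a repository once its last commit is deleted are all assumptions made for illustration.

```python
from typing import Dict, Optional

# In-memory stand-in for Synapse: "host/repopath" -> {digest: tag or None}
Store = Dict[str, Dict[str, Optional[str]]]


def apply_events(envelope: dict, store: Store) -> None:
    """Create, update or delete repository representations from registry events."""
    for event in envelope.get("events", []):
        action = event.get("action")
        target = event.get("target", {})
        host = event.get("request", {}).get("host")
        repo_path = target.get("repository")
        digest = target.get("digest")
        if not (host and repo_path and digest):
            continue  # not enough information to identify a commit
        # Assumption: only manifest events describe commits; layer uploads are skipped.
        if "manifest" not in target.get("mediaType", ""):
            continue

        name = f"{host}/{repo_path}"
        if action == "push":
            # Creates the repository on first push, otherwise adds or updates a commit.
            store.setdefault(name, {})[digest] = target.get("tag")
        elif action == "delete":
            commits = store.get(name)
            if commits is not None:
                commits.pop(digest, None)
                if not commits:
                    del store[name]  # assumed: an empty repository is removed
        # Other actions (e.g. "pull") do not change the representation.
```

Each call processes one envelope, so the registry can batch several events into a single notification without changing the handler.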
Open questions:
- OK to have a Docker-Synapse password different from the user's Synapse password (or API key)?
- Is it OK for Docker repositories to have Folders as parents, or just Projects?