Containers are a convenient way to bundle up code and dependencies into a rerunnable module which takes the form of a lightweight virtual machine. Although this technology is primarily intended for deploying and managing complex, multi-server systems, it also provides a useful solution for reproducible scientific computing.
Use Cases
Submission to Challenge
A DREAM challenge involves training predictors using sample data. The data are very large and/or so sensitive that the participants are not permitted to access the data directly. Instead each participant must submit a trainable predictor to the challenge. Their predictors are then trained by challenge organizers and the training output returned. The challenge allows participants to choose the language and tools to use in their predictors. The sequence is:
- Participant crafts a trainable model in the language of their choice.
- Participant bundles the model with all its dependencies into a runnable object.
- Participant pushes the object to a storage location, accessible to the challenge.
- Participant links the runnable object to the source used to build it along with the recipe for creating it, i.e. the object's provenance.
- Participant submits the object to the challenge.
- The challenge framework retrieves the object, links it to the withheld data and runs the training algorithm.
- The challenge framework runs the predictor on validation data, scores the output, and publishes the score.
- Optionally, the challenge framework returns the trained predictor to the participant.
- At the end of the challenge the participant makes the runnable object and provenance information available to the larger community.
Reproducible Analysis
A scientist produces a result using a computation-rich protocol. She wishes to make the result reproducible by others. The computation uses various libraries which may be revised over time. To ensure reproducibility she creates a 'snapshot' that includes the current versions of all libraries at the time the computation was done.
- Scientist creates a Synapse project.
- Scientist uploads data, source code, algorithm description, and recipe for creating rerunnable object. She makes this information public.
- Scientist bundles code into a rerunnable object and uses it to perform computation.
- Scientist pushes object to a storage location accessible to the public and links to the source information above.
- A second scientist wishes to review the work. When he reaches the computation stage he retrieves the rerunnable object, executes it and verifies the results match the ones claimed by the author.
Reusable Analysis
A data scientist creates a computational workflow for processing certain data and she expects it to be broadly useful for processing other datasets. Setting it up can be complex due to the various required libraries.
- Scientist bundles workflow as a runnable object.
- Scientist documents how to run, i.e. how to feed inputs, what outputs to expect, what error modes can be encountered.
- Scientist shares runnable object and documentation, either with the public, or with a team charged with performing similar data processing.
- Other data scientists read about the workflow in a project, and reuse the rerunnable object, guided by the 'how to' instructions, to apply the workflow to their own data.
Nearly all of the above can be accomplished today using Synapse along with a Docker registry such as DockerHub. What is missing is the tight integration with Synapse, in particular, ACLs (aka "Sharing Settings") that apply to Docker repositories, allowing a Synapse user to control what other Synapse users and teams can access their content.
Additionally we need to be able to represent Docker repositories and commits in Submissions and provenance records (Activities).
Implementation
The Docker Registry can be freely deployed as a private service and can be configured to defer authorization to a separate service as depicted here:
https://github.com/docker/distribution/blob/master/docs/spec/auth/token.md
There are two points of integration with Synapse: (1) notification when a new repository is created, updated, etc., (2) request to authorize an operation on a repository. When a new Docker repository is created an object will be created in the repository services. This object will be related to an ACL which can be edited by the repository owner or other authorized Synapse users. When an authorization request comes in, the ACL is used to approve or deny the request.
There are three choices for representing repositories in Synapse: (1) as a variation of a file, (2) as a new kind of Entity, (3) as a new non-Entity object. The first two options have serious problems: If a repository is a kind of file then the semantics of a file as a document or stream of bytes breaks down. Clients need extra logic that says they cannot expect to do an 'HTTP GET' (for example) on such a file. Docker repositories cannot be entities because we cannot expose Create and Delete operations. The repository objects in Synapse must mirror those in the Docker registry. This means that objects can only be created and deleted as notifications about such events are received from the Docker registry.
Note: We can allow MULTIPLE Docker registries to delegate authorization to Synapse by ensuring Synapse includes the registry 'host' in the repository object.
The Docker Repository
Docker images are organized into repositories. A repository reference has three parts: registry host (CNAME or IP address with optional port), repository path, and tag, e.g.
index.docker.io:5000/username/reponame:v1
All fields are optional except 'reponame'. If the registry host is omitted then the docker client assumes it is a reference to DockerHub. Each repository has a series of commits, each identified by a digest (currently SHA256). Users may also define tags which name commits. These tags are mutable: they do not permanently reference a specific commit. When a repository reference omits a tag, the default tag "latest" is assumed.
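The defaulting rules above can be sketched as a small parser. This is a minimal illustration, not the docker client's actual implementation: `parse_reference`, `RepoReference` and `DEFAULT_REGISTRY` are hypothetical names, and the host test (the first path component counts as a registry host only if it contains a '.' or ':' or equals "localhost") follows the convention the docker client uses. Digest references of the form `repo@sha256:...` are not handled.

```python
from typing import NamedTuple

DEFAULT_REGISTRY = "index.docker.io"  # assumption: stands in for DockerHub
DEFAULT_TAG = "latest"


class RepoReference(NamedTuple):
    host: str
    path: str
    tag: str


def parse_reference(ref: str) -> RepoReference:
    """Split 'host/path:tag' into its parts, filling in the defaults."""
    first, slash, rest = ref.partition("/")
    # The leading component is a registry host only if it looks like one.
    if slash and ("." in first or ":" in first or first == "localhost"):
        host, remainder = first, rest
    else:
        host, remainder = DEFAULT_REGISTRY, ref
    # A tag follows the last ':' as long as no '/' comes after it.
    path, colon, tag = remainder.rpartition(":")
    if not colon or "/" in tag:
        path, tag = remainder, DEFAULT_TAG
    if not path or not tag:
        raise ValueError(f"invalid repository reference: {ref!r}")
    return RepoReference(host, path, tag)
```

For example, `parse_reference("username/reponame")` fills in both the default host and the default tag, while the fully qualified reference shown earlier is returned unchanged in three parts.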
Docker Terminology
Registry: A service providing access to a collection of repositories.
Repository: A series of commits/versions of a machine image.
Image: A single commit, containing the content of a machine.
Container: A running virtual machine, started from an image.
Docker Repository Schema:
DockerRepository extends Entity
- name (registryhost/reponame)
  - EntityValidator validates the format and checks that, for managed repos, the name starts with the parent entity id and that, for external repos, the registry host doesn't violate the blacklist
- isManaged: says whether this repository is managed by Synapse or is a reference to an external registry.
- list of tag/digest pairs
  - digests must be unique
  - tags are optional but must be unique
  - for managed repos the list can't be edited by the user
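The constraints on the tag/digest list can be sketched as follows. This is an illustration only: the Python class, its field names and the `set_commits_as_user` method are hypothetical and do not describe the Synapse object model. Name validation is left out because the exact name format is not fully specified here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Commit = Tuple[Optional[str], str]  # (tag or None, digest)


def validate_commits(commits: List[Commit]) -> None:
    """Enforce: digests unique; tags optional but unique when present."""
    digests = [digest for _, digest in commits]
    if len(digests) != len(set(digests)):
        raise ValueError("digests must be unique")
    tags = [tag for tag, _ in commits if tag is not None]
    if len(tags) != len(set(tags)):
        raise ValueError("tags must be unique")


@dataclass
class DockerRepository:
    name: str          # registryhost/reponame
    is_managed: bool   # True if Synapse manages the repository
    commits: List[Commit] = field(default_factory=list)

    def __post_init__(self) -> None:
        validate_commits(self.commits)

    def set_commits_as_user(self, commits: List[Commit]) -> None:
        """User-initiated edit: only permitted for external repositories."""
        if self.is_managed:
            raise PermissionError("the list of a managed repository is not user-editable")
        validate_commits(commits)
        self.commits = list(commits)
```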
We have a 'white list' of registry hosts for which we answer authorization requests and a 'reserved' list of registry hosts for which external repositories can't be created. (This allows us to reserve address spaces for the future. E.g. the white list could contain docker.synapse.org:443 and the reserved list *.synapse.org.)
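A minimal sketch of how the two lists might be consulted, using the example values from the paragraph above. The function names are hypothetical, and two details are assumptions rather than part of the design: white list entries are compared as exact host:port strings, and reserved patterns are matched against the host name with the port stripped.

```python
from fnmatch import fnmatchcase

WHITE_LIST = {"docker.synapse.org:443"}   # hosts we answer authorization requests for
RESERVED_PATTERNS = ["*.synapse.org"]     # hosts closed to external repositories


def answers_authorization_for(registry_host: str) -> bool:
    """True if Synapse should answer authorization requests for this host."""
    return registry_host.lower() in WHITE_LIST


def may_create_external_repo(registry_host: str) -> bool:
    """True if an external repository may reference this registry host."""
    # Assumption: the port is ignored when matching reserved patterns.
    hostname = registry_host.lower().partition(":")[0]
    return not any(fnmatchcase(hostname, pattern) for pattern in RESERVED_PATTERNS)
```

Note that a glob such as `*.synapse.org` does not match the bare domain `synapse.org`; if that host should be reserved as well, it would need its own entry.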
Services:
Create, Update and Delete entity: not allowed for entities which are 'managed', i.e. their host field matches that of a managed repo.
Access requirements can be applied to / inherited by repositories, as with Files.
Description | URI | Method | Request Parameters | Request Body | Response Body |
---|---|---|---|---|---|
Authorization Request | /bearerToken | GET | service, scope | -- | BearerToken |
Post a registry event. This will create or delete repository objects in Synapse. | /registryEvent | POST | -- | Event | -- |
org.sagebionetworks.repo.model.oauth.BearerToken is defined here: https://docs.docker.com/registry/spec/auth/jwt/
org.sagebionetworks.repo.model.docker.registry.Event is defined here: https://godoc.org/github.com/docker/distribution/notifications#Event
Details
Authorization Request
If the user making the request is the user specified in the repository reference, or a member of the team named in a team reference, the access is approved.
If the repository is not represented in Synapse then deny the request, else answer the authorization question using the ACL associated with the requested repository. (Note: We can leverage existing Governance mechanisms by requiring 'download' access level in order to 'pull' a repository.)
If the image being pulled is used in a Submission in an Evaluation queue for which the user has the appropriate access ('Score') then approve the pull request. (Ideally we would authorize only a pull of the specific commit which was used in the Submission, but the Docker auth request includes just the repo, not the commit (the SHA256).)
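The ACL and Submission rules above can be sketched as a single decision function. This is a simplified illustration under stated assumptions, not the Synapse implementation: the in-memory dictionaries stand in for the real ACL and Evaluation services, all function names are hypothetical, and the mapping of 'push' to an 'UPDATE' permission is an assumption (only the 'download' to 'pull' mapping is proposed above). The first rule, about user and team references, is omitted because how references name users and teams is not specified here. The `scope` string follows the `type:name:actions` form used by the Docker token authentication protocol, e.g. `repository:samalba/my-app:pull,push`.

```python
from typing import Dict, Set, Tuple

# Synapse permission required for each Docker action.
# 'pull' -> 'DOWNLOAD' follows the note above; 'push' -> 'UPDATE' is an assumption.
PERMISSION_FOR_ACTION = {"pull": "DOWNLOAD", "push": "UPDATE"}


def parse_scope(scope: str) -> Tuple[str, str, Set[str]]:
    """Split 'repository:name:pull,push' into (type, name, actions)."""
    resource_type, rest = scope.split(":", 1)
    name, actions = rest.rsplit(":", 1)
    return resource_type, name, set(actions.split(","))


def granted_actions(
    user: str,
    scope: str,
    acls: Dict[str, Dict[str, Set[str]]],   # repo name -> user -> permissions
    scorable_repos: Dict[str, Set[str]],    # user -> repos in Submissions they may score
) -> Set[str]:
    """Return the requested actions that are approved; empty means denied."""
    resource_type, name, requested = parse_scope(scope)
    # A repository not represented in Synapse is always denied.
    if resource_type != "repository" or name not in acls:
        return set()
    permissions = acls[name].get(user, set())
    granted = {a for a in requested if PERMISSION_FOR_ACTION.get(a) in permissions}
    # Scorers of an Evaluation queue may pull repositories used in its Submissions.
    if "pull" in requested and name in scorable_repos.get(user, set()):
        granted.add("pull")
    return granted
```

Returning the approved subset rather than a single yes/no mirrors how a bearer token lists the actions it grants; a caller that only needs approve or deny can compare the result with the requested set.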
Registry Event
Synapse will create, update and delete its representation of a Docker repository in response to received registry events.
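A rough sketch of how a single registry event might be applied, with a plain dictionary standing in for the Synapse representation. The field names used (`action`, `target.repository`, `target.digest`, `target.tag`, `target.mediaType`, `request.host`) are taken from the Docker registry notification Event format as we understand it and should be checked against the definition linked above. Treating only manifest pushes as commits, and removing a repository once its last commit is deleted, are assumptions made for this illustration.

```python
from typing import Any, Dict, Optional

# "host/path" -> {digest: tag or None}; a stand-in for the Synapse repository objects.
Repos = Dict[str, Dict[str, Optional[str]]]


def apply_event(repos: Repos, event: Dict[str, Any]) -> None:
    """Create, update or delete the mirrored repository for one event."""
    action = event.get("action")
    target = event.get("target", {})
    host = event.get("request", {}).get("host")
    path = target.get("repository")
    digest = target.get("digest")
    if not (host and path and digest):
        return  # not enough information to identify a commit
    name = f"{host}/{path}"  # the host is kept so several registries can coexist

    if action == "push":
        # Assumption: layer blobs also generate push events; keep manifests only.
        if "manifest" not in target.get("mediaType", ""):
            return
        commits = repos.setdefault(name, {})
        tag = target.get("tag")
        if tag is not None:
            # Tags are mutable: detach the tag from any commit that held it.
            for other, old_tag in list(commits.items()):
                if old_tag == tag:
                    commits[other] = None
        if tag is not None or digest not in commits:
            commits[digest] = tag
    elif action == "delete":
        commits = repos.get(name, {})
        commits.pop(digest, None)
        if name in repos and not commits:
            del repos[name]  # assumption: an empty repository is removed
    # Other actions, such as "pull", leave the mirror unchanged.
```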
Open questions:
- Should there be a Docker-Synapse password different from the user's Synapse password (or API key)?
- Is it OK for Docker repos to have Folders as parents, or just Projects?