Introduction
Please read the following before continuing:
This document covers the Synapse API changes that will be needed to accommodate both documents. This document will not cover any UI changes. Since the documents cover a broad range of requested changes, we will attempt to tackle the changes in phases based on both logical dependencies and priority.
Basics
A key part of any API is its conceptual object model. For example, one of the key features of Synapse is to act as a curated data repository of files/datasets. Therefore, the Synapse conceptual model includes Projects, Folders, Files, and collections of Files called Datasets. These are common objects in many other software systems, so our users tend to already be familiar with these basic concepts.
However, the conceptual model for the Synapse governance sub-system is slightly more esoteric. While a data consumer might be aware that some types of data require special conditions-for-use, they may not have thought deeply about how such conditions are modeled. In addition, all of the nomenclature of the Synapse governance sub-system was developed in-house. This means it is unlikely that new users will be familiar with these terms and their underlying concepts, even if they have experience with similar scientific data repositories. When the Synapse governance features were first developed many years ago, industry standardization was not as established as it is today. One of our goals for this effort will be to better align with these new standards.
The current governance conceptual model in Synapse is centered around Access Restrictions (AR). Currently there are two main types of ARs in Synapse:
Click-Wrap - This type is used when a user needs to agree to additional terms-of-use before accessing the data. This type is considered self-service, as a user is automatically approved if they agree to the terms of the AR.
Managed - This type is used for more restricted data. For this type, data consumers must submit a request for access and then be approved before downloading the data. Currently, only a member of the ACT is allowed to approve or reject these submissions.
For both AR types, only a member of the ACT can create or update the ARs.
One of the fields of an AR is the list of subjects that the AR applies to. A subject can be either an Entity or a Team, but for this document we will be focusing on Entity subjects. The subjects of an AR are governed by that AR. If a subject is a container, such as a Project or Folder, then all children under that hierarchy (recursively) are indirect subjects of that AR. It is important to note that ARs are completely independent of the Access Control Lists (ACL) that control the permissions of an Entity. Here are a few key distinctions:
ACL - Is managed by the owner of the file. The file owner can grant a user/team permission to download a file by adding that user/team to the file’s ACL with the permission “DOWNLOAD”.
AR - Is managed by the ACT to add additional conditions to a file. These conditions apply to any user that wishes to download a file.
In order for a user to download a file, they must have been granted the “DOWNLOAD” permission on that file’s ACL, AND meet all additional conditions for any AR that either directly or indirectly includes that file as a subject.
Note: A single file can be the subject of multiple ARs. When this occurs, the user must meet the terms of EACH AR before they will be permitted to download the file.
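The download rule above can be sketched as follows. This is a minimal illustration and not the Synapse implementation: the `Entity`, `AccessRestriction`, `applicable_ars`, and `can_download` names, and the in-memory data structures, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """Hypothetical stand-in for a Project, Folder, or File."""
    id: str
    parent: Optional["Entity"] = None
    # Principals granted "DOWNLOAD" on this entity's ACL (simplified).
    download_acl: set = field(default_factory=set)


@dataclass
class AccessRestriction:
    """Hypothetical stand-in for an AR."""
    id: str
    subject_ids: set  # IDs of the entities the AR is directly bound to
    approved_users: set = field(default_factory=set)


def applicable_ars(entity, all_ars):
    """Return the ARs that include the entity directly or through any ancestor."""
    lineage = set()
    node = entity
    while node is not None:
        lineage.add(node.id)
        node = node.parent
    return [ar for ar in all_ars if ar.subject_ids & lineage]


def can_download(user, file, all_ars):
    """Require the ACL grant AND approval on EACH applicable AR."""
    if user not in file.download_acl:
        return False
    return all(user in ar.approved_users for ar in applicable_ars(file, all_ars))
```

In this sketch, an AR bound to a Project applies to every file beneath it, and a user who is approved for only one of two applicable ARs is still denied.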
Standardization & Discovery
One of the challenges for consumers of governed data is finding datasets that are appropriate for their situation. For example, a data consumer working for a for-profit company might be restricted to data that is available for commercial use. Another example might be a Psychology researcher who wishes to exclude all datasets that are restricted to areas of research outside of Psychology. Ideally, the system for tagging data with restriction information would be standardized, not just in Synapse but across all data repositories. The Data Use Ontology (DUO) was developed to provide such a standard:
DUO allows to semantically tag datasets with restriction about their usage, making them discoverable automatically based on the authorization level of users, or intended usage.
Today, when a file is the subject of an AR in Synapse, there is no way to discover the file based on metadata of the condition. Unlike Entities, ARs do not currently have metadata such as annotations. Nor is there a way to create a file view with condition-based faceted navigation.
We would like to extend the current Synapse governance sub-system as follows:
Conform to the industry standard nomenclature for restriction metadata using DUO.
Provide services for data consumers to discover datasets based on restriction metadata.
Scalability
ACT Governance Permissions
As mentioned above, members of the ACT currently have global governance permissions across all projects in Synapse. The ACT is the only group of users with these governance level permissions. This means only members of the ACT can perform the following actions in Synapse:
Create or Update ARs
Set the subjects of ARs
Approve or Reject data access submissions
This also means the future growth of Synapse is limited by the resources available to ACT. In addition, some new projects require that community managers have more governance control over their data.
Non-homogeneous Projects
Another scalability issue stems from the fact that most projects are not made up of homogeneous files. For example, a single project might host files of many different types or from multiple geographical regions. Each type or region might have its own set of special conditions-for-use.
Today, community managers must set up folders within their project for each combination of file type/region. They must then coordinate with ACT to ensure all of the necessary ARs are bound to the appropriate folders. This creates a type of micromanagement for both members of the ACT and community managers.
However, it is unlikely that we can simply grant data providers the permission to assign ARs as they see fit, due to a conflict of interest. Funding agencies are interested in sharing the data they fund as broadly as is ethically possible to maximize the return-on-investment. The researchers that receive the funding are typically only interested in sharing their data after they have extracted all possible insights from it. It is common for funders to add sharing conditions to research they fund to encourage researchers to share. According to our governance team, it is typical for a researcher to select “true” for each question of the DUO questionnaire, in an effort to lock down their data. The assumption is that they can claim to have shared the data while at the same time ensuring it is nearly impossible for anyone to actually access the data.
One of the key roles for ACT is to ensure that the minimally appropriate level of restrictions is applied to a project. This role requires a deep understanding of data use rules/ethics, a commitment to open data, and an understanding of the types of data. Some projects will have an equivalent governing body that is able to fill this role, while other projects will continue to rely on our ACT. To continue to scale, ACT will need the following:
Delegation - For projects with a governing body equivalent to ACT, ACT needs to be able to delegate the restriction management to that governing body.
Set levels without micromanagement - For projects that do not have an equivalent governing body, ACT will need to continue to set the appropriate levels. However, we need to minimize the amount of micromanagement needed to apply the set levels to files within the project.
New Automatic Approval Type
Consider the following example: A file has the DUO attribute “Institution Specific Requirement” = “true” and “Institution” = “Stanford”. For this project, ACT determines that it is not sufficient for data consumers to simply “agree” that they are part of the institution. Instead, data consumers must provide proof that they are indeed part of that institution. To set up this type of AR today, an ACT member would need to create a managed AR. Potential data consumers would then need to submit documents that prove they are part of the institution. A member of the ACT would then need to manually process the submission in order to approve it.
GA4GH Passports and the Authentication and Authorization Infrastructure (AAI) were developed to automate this type of approval. The workflow for this type of system might look something like this:
A member of the ACT creates a new passport AR, configured to require an “access token” from Stanford’s governance services.
This AR is either manually or automatically bound to files as appropriate.
When a data consumer attempts to download the file, Synapse detects that the file is the subject of the AR, and that the data consumer is not yet approved.
Synapse would redirect the user to the appropriate governing services, resulting in an access token exchange and validation.
Upon successful token validation, Synapse would automatically set the user as approved for this AR, and the user would be able to download the file.
It might be possible to use this type of automatic approval for many cases that currently require “manual” approval.
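The workflow above can be sketched as a pair of functions. This is only an illustration under stated assumptions: `PassportAR`, `attempt_download`, and `complete_token_exchange` are hypothetical names, the issuer string is made up, and `validate_token` is a stub that stands in for real token verification, which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PassportAR:
    """Hypothetical passport-style AR that is approved by token validation."""
    id: str
    issuer: str  # identifies the external governance service (assumed field)
    validate_token: Callable[[str], bool]  # stub for real token verification
    approved_users: set = field(default_factory=set)


def attempt_download(user, ar):
    """Decide whether the user may download or must first be redirected."""
    if user in ar.approved_users:
        return "DOWNLOAD"
    return "REDIRECT:" + ar.issuer


def complete_token_exchange(user, ar, access_token):
    """Record an automatic approval only if the token validates."""
    if ar.validate_token(access_token):
        ar.approved_users.add(user)
        return True
    return False
```

No ACT member appears in the approval path here: the approval is recorded as a side effect of a successful validation, which is the property that distinguishes this type from a managed AR.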
Sources of Metadata
In the main governance story, metadata originates from multiple sources:
Patient Table - Keyed by patient ID, contains data about each patient in a study, such as age, diagnosis, and location.
Treatment Table - Table that maps biological samples to both patient IDs and treatments.
Assay Data - Raw assay data. Each type of assay will have its own type of data. Each row would be keyed by sample ID.
DUO questionnaire - The answers to this questionnaire are produced through a collaboration between ACT and a PI.
Restriction Data - One example of restriction data would be the date of a data use embargo.
Annotations - The annotations on data files.
Deep metadata - This would include metadata that is contained within data files.
To get the full picture of all of the metadata for a single file, one might join all of these sources of data into a single materialized view. Good normalization practices would require that all of this metadata remain in separate locations to avoid data duplication. This also ensures that an update to a single value would automatically propagate to all affected files. For example, if a data embargo date were to change, it should only need to be changed in a single location.
However, there is a downside to normalization. When a user inspects the Entity page for a single file, how can they see all of the related metadata?
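To make the normalization trade-off concrete, the sketch below joins a few of the sources listed above into one per-file view. It is an illustration only, not a proposed design: the table and column names are invented, it uses SQLite from the Python standard library, and it uses a plain SQL view rather than a true materialized view.

```python
import sqlite3


def build_demo_db():
    """Create normalized tables plus a view that joins them per file."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patient (patient_id TEXT PRIMARY KEY, age INTEGER, diagnosis TEXT);
        CREATE TABLE sample (sample_id TEXT PRIMARY KEY, patient_id TEXT, treatment TEXT);
        CREATE TABLE restriction (restriction_id TEXT PRIMARY KEY, embargo_date TEXT);
        CREATE TABLE file (file_id TEXT PRIMARY KEY, sample_id TEXT, restriction_id TEXT);

        -- One row per file, assembled from every source at query time.
        CREATE VIEW file_metadata AS
        SELECT f.file_id, p.age, p.diagnosis, s.treatment, r.embargo_date
        FROM file f
        JOIN sample s ON s.sample_id = f.sample_id
        JOIN patient p ON p.patient_id = s.patient_id
        JOIN restriction r ON r.restriction_id = f.restriction_id;

        INSERT INTO patient VALUES ('p1', 54, 'AD');
        INSERT INTO sample VALUES ('s1', 'p1', 'placebo');
        INSERT INTO restriction VALUES ('r1', '2025-01-01');
        INSERT INTO file VALUES ('syn1', 's1', 'r1');
        INSERT INTO file VALUES ('syn2', 's1', 'r1');
        """
    )
    return conn


def embargo_by_file(conn):
    """Map each file ID to the embargo date seen through the joined view."""
    rows = conn.execute(
        "SELECT file_id, embargo_date FROM file_metadata ORDER BY file_id"
    )
    return dict(rows)


def change_embargo(conn, restriction_id, new_date):
    """Update the embargo date in its single normalized location."""
    conn.execute(
        "UPDATE restriction SET embargo_date = ? WHERE restriction_id = ?",
        (new_date, restriction_id),
    )
```

Calling `change_embargo` once changes the date reported for every file that shares the restriction, because the date is stored in only one place and the view reads it on demand.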
Implementation Phases
There are a lot of areas that will need changes/improvements. Therefore, we will break up the implementation work into phases:
Phase One - Delegation of AR approval to non-ACT Synapse users.
Phase Two - TBD
Phase One
Goal
The goal of phase one is to expand the permissions of Synapse Governance systems to enable members of the ACT to grant non-ACT users permissions to both review and approve managed access for specific cases. Items that are considered out-of-scope for this phase:
Multiple Reviewers - With multiple reviewers for the same AR, there is a potential for various types of conflicts. We will likely need to add features to prevent or minimize such conflicts sometime in the future. For this phase we will assume that there will be a single reviewer.
Notification Improvements - The extensions document outlines multiple improvements to the notification system around approvals. For this phase we will simply be extending the existing notification system to include the new reviewer.
Delegation of AR creation/update - For this phase, we will only support the delegation of reviewing access submissions to managed ARs. This phase will not include support for the delegation of the creation/update of ARs.
Background
Currently, only members of the Access and Compliance Team (ACT) can review and approve data access requests. The current authorization system simply rejects any user that is not a member of the ACT when calling the following services:
GET /dataAccessSubmission/openSubmissions - Lists ARs that have open requests.
POST /accessRequirement/{requirementId}/submissions - Used to list the open requests for an AR.
PUT /dataAccessSubmission/{submissionId} - Approves or rejects a request.
DELETE /dataAccessSubmission/{submissionId} - Deletes a request.
POST /accessApproval - Used for both self-signed and managed ARs. For managed ARs, only ACT is allowed.
POST /accessApproval/group - Used exclusively by ACT to list approved users.
POST /accessApproval/notifications - Used exclusively by ACT to list notification messages that have been sent to recipients for particular ARs.
PUT /accessApproval/group/revoke - Used exclusively by ACT to revoke access to all approvals previously granted to a submitter on an AR.
DELETE /accessApproval - Used either by an individual to revoke their own access to an AR, or by ACT to revoke another user's access.
Proposed API Changes
It is clear that non-ACT users should not be granted global governance permissions in the same way as a member of the ACT. Instead, members of the ACT need a mechanism to limit the scope of the granted permissions to specific cases.
In Synapse we already use Access Control Lists (ACL) to allow one set of users to grant permissions to another set of users on a specific object. For example, suppose a project owner wants to allow another user to upload files to their project. To do this, the project owner would edit the ACL on that project and add an entry that grants the user permission to “create”. After saving the changes to the ACL, the other user would then be allowed to add files and folders to that specific project. In this example, the “scope” of the grant is the project.
For phase one, we want to provide a way for members of the ACT to grant non-ACT users permissions to both review and approve AR access requests. For this case, a natural place to start would be to use the AR as the scope of the grant. Specifically, this would involve extending the existing ACL system to allow ACLs to be placed on ARs.
Note: The main obstacle to using the AR as the permission scope is that it cannot be used as a mechanism to grant permissions to create ARs. However, for at least phase one, only ACT members will be creating ARs, so this seems like a reasonable limitation.
We would then need to change all submission related APIs to perform the following permission check using the AR as the scope:
If the caller is an Admin, then GRANT
If the caller is a member of the ACT, then GRANT
If the caller is anonymous, then DENY
If the caller has the “REVIEW” permission on the AR, then GRANT
All other cases, then DENY
Note: When a user calls any service that lists multiple submissions, only submissions with a GRANT on the corresponding AR will be listed. Submissions with a DENY will be excluded.
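The ordered check and the list filtering described above can be sketched as follows. This is a minimal illustration rather than the actual Synapse authorization code: `Caller`, `Decision`, `check_review_permission`, and `visible_submissions` are hypothetical names, and an AR's ACL is modeled as a plain dictionary that maps a principal ID to a set of permission strings.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    GRANT = "GRANT"
    DENY = "DENY"


@dataclass(frozen=True)
class Caller:
    """Hypothetical caller: the user's own ID plus the IDs of their teams."""
    principal_ids: frozenset
    is_admin: bool = False
    is_act_member: bool = False
    is_anonymous: bool = False


def check_review_permission(caller, ar_acl):
    """Apply the checks in the listed order; the first match wins."""
    if caller.is_admin:
        return Decision.GRANT
    if caller.is_act_member:
        return Decision.GRANT
    if caller.is_anonymous:
        return Decision.DENY
    for principal_id in caller.principal_ids:
        if "REVIEW" in ar_acl.get(principal_id, set()):
            return Decision.GRANT
    return Decision.DENY


def visible_submissions(caller, submissions, acl_by_ar):
    """Keep only the submissions whose AR yields a GRANT for the caller."""
    return [
        s for s in submissions
        # An AR without an ACL is treated as having an empty ACL.
        if check_review_permission(caller, acl_by_ar.get(s["ar_id"], {}))
        is Decision.GRANT
    ]
```

Because the anonymous check runs before the ACL lookup, an ACL entry for the anonymous principal can never grant “REVIEW”. An AR with no ACL remains visible to ACT members and admins only.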
This phase will involve the following technical tasks:
Creating/updating ACLs on ARs - This would be a new feature that would allow a member of the ACT to create an ACL on an existing AR. Only a member of the ACT would be allowed to create/update these ACLs. The ACT member would grant a non-ACT user permission to “REVIEW” submissions to this AR by updating this ACL. Note: ACT members will automatically retain the “REVIEW” permission for all ARs even if not explicitly listed in the ACL of that AR. Any AR that does not have an ACL will still be fully accessible by members of the ACT.
GET /dataAccessSubmission/openSubmissions - This service would need to be changed to list open submissions for any user that has been granted the “REVIEW” permission. Open question: should an ACT member also see listings for submissions to ARs that have granted non-ACT members “REVIEW”?
POST /accessRequirement/{requirementId}/submissions - This service would need to be changed similarly to the above.
PUT /dataAccessSubmission/{submissionId} - This service would need to be changed similarly to the above.
DELETE /dataAccessSubmission/{submissionId} - This service would need to be changed similarly to the above.