
API Changes to support: Extension of Data Access Management to Users outside of Sage ACT

 

Introduction

 

Please read the following before continuing:

This document covers the Synapse API changes that will be needed to accommodate both documents. This document will not cover any UI changes. Since the documents cover a broad range of requested changes, we will attempt to tackle the changes in phases based on both logical dependencies and priority.

 

Basics

 

A key part of any API is its conceptual object model. For example, one of the key features of Synapse is to act as a curated data repository of files/datasets. Therefore, the Synapse conceptual model includes Projects, Folders, Files, and collections of Files called Datasets. These are common objects in many other software systems, so our users tend to already be familiar with these basic concepts.

 

However, the conceptual model for the Synapse governance sub-system is slightly more esoteric. While a data consumer might be aware that some types of data require special conditions-for-use, they may not have thought deeply about how such conditions are modeled. In addition, all of the nomenclature of the Synapse governance sub-system was developed in-house. This means it is unlikely that new users will be familiar with these terms and their underlying concepts, even if they have experience with similar scientific data repositories. When the Synapse governance features were first developed many years ago, industry standardization was not as established as it is today. One of our goals for this effort will be to better align with these new standards.

 

The current governance conceptual model in Synapse is centered around: Access Restrictions (AR). Currently there are two main types of ARs in Synapse:

  • Click-Wrap - This type is used when a user needs to agree to additional terms-of-use before accessing the data. This type is considered self-service, as a user is automatically approved if they agree to the terms of the AR.

  • Managed - This type is used for more restricted data. For this type, data consumers must submit a request for access and then be approved before downloading the data. Currently, only a member of the ACT is allowed to approve or reject these submissions.

For both AR types, only a member of the ACT can create or update the ARs.

 

One of the fields of an AR is the list of subjects that the AR applies to. A subject can be either an Entity or a Team, but for this document we will be focusing on Entity subjects. The subjects of an AR are governed by that AR. If a subject is a container, such as a Project or Folder, then all children under that hierarchy (recursively) are indirect subjects of that AR. It is important to note that ARs are completely independent of Access Control Lists (ACL) that control the permissions of an Entity. Here are a few key distinctions:

  • ACL - Is managed by the owner of the file. The file owner can grant a user/team permission to download a file by adding that user/team to the file’s ACL with the permission “DOWNLOAD”.

  • AR - Is managed by the ACT to add additional conditions to a file. These conditions apply to any user that wishes to download a file.

In order for a user to download a file they must have been granted the “DOWNLOAD” permission on that file’s ACL, AND meet all additional conditions for any AR that either directly or indirectly includes that file as a subject.

Note: A single file can be the subject of multiple ARs. When this occurs, the user must meet the terms of EACH AR before they will be permitted to download the file.
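The download rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the Entity class, its fields, and the approvals mapping are hypothetical names invented for this sketch (they are not Synapse types), and the ACL is checked on the file itself rather than resolved through any inheritance.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """Hypothetical, simplified entity (Project, Folder, or File)."""
    id: str
    parent: Optional["Entity"] = None
    # Principals granted DOWNLOAD on this entity's ACL.
    download_acl: set = field(default_factory=set)
    # IDs of the ARs that list this entity directly as a subject.
    ar_ids: set = field(default_factory=set)


def applicable_ars(entity):
    """Collect ARs bound to the entity or to any container above it."""
    found = set()
    node = entity
    while node is not None:
        found |= node.ar_ids
        node = node.parent
    return found


def can_download(user, entity, approvals):
    """True only if the ACL grants DOWNLOAD AND every applicable AR is met."""
    if user not in entity.download_acl:
        return False
    return applicable_ars(entity) <= approvals.get(user, set())


# Sample hierarchy: AR 1 on the project, AR 2 on the folder.
project = Entity("syn100", ar_ids={1})
folder = Entity("syn200", parent=project, ar_ids={2})
data_file = Entity("syn300", parent=folder, download_acl={"alice", "bob"})
approvals = {"alice": {1, 2}, "bob": {1}}
```

In this sample, alice can download the file, bob cannot because he has not met AR 2, and any user missing from the ACL is denied regardless of AR approvals.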

 

Standardization & Discovery

 

One of the challenges for consumers of governed data is finding datasets that are appropriate for their situation. For example, a data consumer working for a for-profit company might be restricted to data that is available for commercial use. Another example might be a Psychology researcher who wishes to exclude all datasets that are restricted to areas of research outside of Psychology. It would also be helpful if the system for tagging data with restriction information were standardized, not just in Synapse but across all data repositories. The Data Use Ontology (DUO) was developed to provide such a standard:

DUO allows to semantically tag datasets with restriction about their usage, making them discoverable automatically based on the authorization level of users, or intended usage.

 

Today, when a file is the subject of an AR in Synapse, there is no way to discover the file based on metadata of the condition. ARs do not currently have metadata, like the annotations on Entities. Nor is there a way to create a file view with condition-based faceted navigation.

We would like to extend the current Synapse governance sub-system as follows:

  • Conform to the industry standard nomenclature for restriction metadata using DUO.

  • Provide services for data consumers to discover datasets based on restriction metadata

Scalability

ACT Governance Permissions

As mentioned above, members of the ACT currently have global governance permissions across all projects in Synapse. The ACT is the only group of users with these governance level permissions. This means only members of the ACT can perform the following actions in Synapse:

  • Create or Update ARs

  • Set the subjects of ARs

  • Approve or Reject data access submissions

This also means the future growth of Synapse is limited by the resources available to ACT. In addition, some new projects require that community managers have more governance control over their data.

Non-homogeneous Projects

Another scalability issue stems from the fact that most projects are not made up of homogeneous files. For example, a single project might host files of many different types or from multiple geographical regions. Each type or region might have its own set of special conditions-for-use.

 

Today, community managers must set up folders within their project for each combination of file type/region. They must then coordinate with ACT to ensure all of the necessary ARs are bound to the appropriate folders. This creates a type of micromanagement for both members of the ACT and community managers.

 

However, it is unlikely that we can simply grant data providers the permission to assign ARs as they see fit, due to conflicts of interest. Funding agencies are interested in sharing the data they fund as broadly as is ethically possible to maximize the return-on-investment. The researchers that receive the funding are typically only interested in sharing their data after they have extracted all possible insights from it. It is common for funders to add sharing conditions to research they fund to encourage researchers to share. According to our governance team, it is typical for a researcher to select “true” for each question of the DUO questionnaire, in an effort to lock down their data. The assumption is they can claim to have shared the data while at the same time ensuring it is nearly impossible for anyone to actually access the data.

 

One of the key roles for ACT is to ensure that the minimally appropriate level of restrictions is applied to a project. This role requires a deep understanding of data use rules/ethics, a commitment to open-data, and an understanding of the types of data. Some projects will have an equivalent governing body that is able to provide this role, while other projects will continue to rely on our ACT. To continue to scale, ACT will need the following:

  • Delegation - For projects with an equivalent governing body as ACT, ACT needs to be able to delegate the restriction management to that governing body.

  • Set levels without micromanagement - For projects that do not have an equivalent governing body, ACT will need to continue to set the appropriate levels. However, we need to minimize the amount of micromanagement needed to apply the set levels to files within the project.

  • New Automatic Approval Types - For some cases a user might gain approval via an independent third party. For such cases Synapse would need to communicate with the third party’s system to validate a user’s approval. We will cover this in more detail in the next section.

New Automatic Approval Type

Consider the following example: A file has the DUO attribute “Institution Specific Requirement” = “true” and “Institution” = “Stanford”. For this project, ACT determines that it is not sufficient for data consumers to simply “agree” that they are part of the institution. Instead, data consumers must provide proof that they are indeed part of that institution. To set up this type of AR today, an ACT member would need to create a managed AR. Potential data consumers would then need to submit documents that prove they are part of the institution. A member of the ACT would then need to manually process the submission in order to approve it.

 

GA4GH Passports and the GA4GH Authentication and Authorization Infrastructure (AAI) were developed to automate this type of approval. The workflow for this type of system might look something like this:

  1. A member of the ACT creates a new passport AR, configured to require an “access token” from Stanford’s governance services.

  2. This AR is either manually or automatically bound to files as appropriate.

  3. When a data consumer attempts to download the file, Synapse detects that the file is the subject of the AR, and that the data consumer is not yet approved.

  4. Synapse would redirect the user to the appropriate governing services, resulting in an access token exchange and validation.

  5. Upon successful token validation, Synapse would automatically set the user as approved for this AR, and the user would be able to download the file.

It might be possible to use this type of automatic approval for many cases that currently require “manual” approval.
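Steps 3 through 5 of the workflow above might be sketched as follows. This is a rough illustration only: PassportAR, validate_passport_token, and the issuer URL are hypothetical names invented here, the token is modeled as a dict of already-decoded claims, and the signature verification that a real GA4GH Passport integration would require is deliberately omitted.

```python
import time


def validate_passport_token(token, expected_issuer, now=None):
    """Stand-in for real token validation (signature checks are omitted)."""
    now = time.time() if now is None else now
    return token.get("iss") == expected_issuer and token.get("exp", 0) > now


class PassportAR:
    """Hypothetical passport-style AR that trusts a single external issuer."""

    def __init__(self, ar_id, issuer):
        self.ar_id = ar_id
        self.issuer = issuer
        self.approved_users = set()

    def try_approve(self, user, token, now=None):
        """Approve the user automatically, but only if the token validates."""
        if user in self.approved_users:
            return True
        if validate_passport_token(token, self.issuer, now):
            self.approved_users.add(user)
            return True
        return False


# Hypothetical issuer URL, used purely for illustration.
ISSUER = "https://governance.stanford.example"
passport_ar = PassportAR(ar_id=42, issuer=ISSUER)
```

Once try_approve succeeds, the user is recorded as approved for the AR, so later downloads would not need another token exchange.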

 

Sources of Metadata

 

In the main governance story, metadata is originating from multiple sources:

  • Patient Table - Keyed by patient ID, contains data about each patient in a study, such as age, diagnosis, and location.

  • Treatment Table - Table that maps biological samples to both patient IDs and treatments.

  • Assay Data - Raw assay data. Each type of assay will have its own type of data. Each row would be keyed by sample ID.

  • DUO questionnaire - The answers to this questionnaire are produced through a collaboration between ACT and a PI.

  • Restriction Data - One example of restriction data would be the date of a data use embargo.

  • Annotations - The annotations on data files.

  • Deep metadata - This would include metadata that is contained within data files.

To get the full picture of all of the metadata for a single file, one might join all of these sources of data into a single materialized view. Good normalization practices would require that all of this metadata remain in separate locations to avoid data duplication. This also ensures that an update to a single value would automatically propagate to all affected files. For example, if a data embargo date were to change, it should only be changed in a single location.

 

However, there is a downside to normalization. When a user inspects the Entity page for a single file, how can they see all of the related metadata?

Data Examples

The Patients table is data gathered by the clinician. It provides basic information about each patient involved in the study.

Patients

| patientId | birthDate | country |
|-----------|-----------|---------|
| 1         | 04/24/55  | USA     |
| 2         | 08/12/49  | Germany |
| 3         | 01/01/41  | USA     |
| 4         | 11/11/62  | Germany |
| ..        |           |         |

The Treatment table is also gathered by the clinician. It maps bio samples gathered before and after a treatment for each patient.

Treatment

| patientId | treatment | sampleId |
|-----------|-----------|----------|
| 4         | before    | 1        |
| 4         | after     | 2        |
| 2         | before    | 3        |
| 2         | after     | 4        |
| 1         | before    | 5        |
| 1         | after     | 6        |
| 3         | before    | 7        |
| 3         | after     | 8        |

The assay results are files uploaded to Synapse by the Lab Technicians. Each file represents the results of a single assay run on a batch of bio samples. When uploading each file, the lab technician will be able to annotate it with two annotations: sampleIds and assayType.

Note: The Lab technicians have no knowledge of patients or treatments.

Assay Files

| Synapse ID | sampleIds | assayType |
|------------|-----------|-----------|
| syn1       | 1,2,3,4   | genomic   |
| syn2       | 1,2,3,4   | imaging   |
| syn3       | 1,2,3,4   | clinical  |
| syn4       | 5,6,7,8   | genomic   |
| syn5       | 5,6,7,8   | imaging   |
| syn6       | 5,6,7,8   | clinical  |

According to the governance team’s evaluation of this project, all files should be under a click-wrap data embargo AR (AR.id = 1). In addition, any file with assayType = genomic and patient country = Germany should be under a click-wrap that informs the user that the data file cannot leave Germany (AR.id = 2). According to these rules, syn1 should be under AR.id = 2, while all files should be under AR.id = 1.

Add MaterializedView to show the join.
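Until the materialized view is added, the join can be approximated with a small Python sketch over the example tables. The dictionaries and function names below are hypothetical and exist only for this illustration. The sketch assumes the sampleIds for syn2 are 1,2,3,4, and it treats a file as German if any of its samples maps to a patient in Germany.

```python
# patientId -> country, from the Patients table.
patients = {1: "USA", 2: "Germany", 3: "USA", 4: "Germany"}

# sampleId -> patientId, from the Treatment table.
sample_to_patient = {1: 4, 2: 4, 3: 2, 4: 2, 5: 1, 6: 1, 7: 3, 8: 3}

# Synapse ID -> (sampleIds, assayType), from the Assay Files table.
assay_files = {
    "syn1": ([1, 2, 3, 4], "genomic"),
    "syn2": ([1, 2, 3, 4], "imaging"),  # assumed to be 1,2,3,4
    "syn3": ([1, 2, 3, 4], "clinical"),
    "syn4": ([5, 6, 7, 8], "genomic"),
    "syn5": ([5, 6, 7, 8], "imaging"),
    "syn6": ([5, 6, 7, 8], "clinical"),
}


def countries_for(sample_ids):
    """Join samples -> patients -> countries for one file."""
    return {patients[sample_to_patient[s]] for s in sample_ids}


def access_requirements(sample_ids, assay_type):
    """AR 1 applies to every file; AR 2 only to genomic data from Germany."""
    ar_ids = [1]
    if assay_type == "genomic" and "Germany" in countries_for(sample_ids):
        ar_ids.append(2)
    return ar_ids


ars_by_file = {
    syn_id: access_requirements(sample_ids, assay_type)
    for syn_id, (sample_ids, assay_type) in assay_files.items()
}
```

With the example data, only syn1 ends up under both ARs. syn4 is genomic, but its samples all map to patients in the USA, so it is only under AR 1.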

Implementation Phases

 

There are a lot of areas that will need changes/improvements. Therefore, we will break up the implementation work into phases:

  • Phase One - Delegation of AR approval to non-ACT Synapse users.

  • Phase Two - Make progress towards the integration of DUO metadata on individual files.

  • Phase Three - TBD.

 

Phase One

Goal

The goal of phase one is to expand the permissions of Synapse Governance systems to enable members of the ACT to grant non-ACT users permissions to both review and approve managed access for specific cases. Items that are considered out-of-scope for this phase:

  • Multiple Reviewers - With multiple reviewers for the same AR, there is a potential for various types of conflicts. We will likely need to add features to prevent or minimize such conflicts sometime in the future. For this phase we will assume that there will be a single reviewer.

  • Notification Improvements - The extensions document outlines multiple improvements to the notification system around approvals. For this phase we will simply be extending the existing notification system to include the new reviewer.

  • Delegation of AR creation/update - For this phase, we will only support the delegation of reviewing access submissions to managed ARs. This phase will not include support for the delegation of the creation/update of ARs.

Background

Currently, only members of the Access and Compliance Team (ACT) can review and approve data access requests. The current authorization system simply rejects any user that is not a member of the ACT when calling the following services:

Proposed API Changes

It is clear that non-ACT users should not be granted global governance permissions in the same way as a member of the ACT. Instead, members of the ACT need a mechanism to limit the scope of the granted permissions to specific cases.

In Synapse we already use Access Control Lists (ACL) to allow one set of users to grant permissions to another set of users on a specific object. For example, a project owner wants to allow another user permission to upload files to their project. To do this, the project owner would edit the ACL on that project and add an entry that would grant the user permission to “create”. After saving the changes to the ACL, the other user would then be allowed to add files and folders to that specific project. In this example, the “scope” of the grant is the project.

For phase one, we want to provide a way for members of the ACT to grant non-ACT users permissions to both review and approve AR access requests. For this case, a natural place to start would be to use the AR as the scope of the grant. Specifically, this would involve extending the existing ACL system to allow ACLs to be placed on ARs.

 

Note: The main obstacle to using the AR as the permission scope is that it cannot be used as a mechanism to grant permissions to create ARs. However, for at least phase one, only ACT members will be creating ARs, so this seems like a reasonable limitation.

 

We would then need to change all submission related APIs to perform the following permission check using the AR as the scope:

  • If the caller is an Admin, then GRANT

  • If the caller is a member of the ACT, then GRANT

  • If the caller is not validated, then DENY

  • If the caller is anonymous, then DENY

  • If the caller has the “REVIEW_SUBMISSIONS” permission on the AR, then GRANT

  • All other cases, then DENY

Note: When a user calls any service that lists multiple submissions, only submissions with a GRANT on the corresponding AR will be listed. Submissions with a DENY will be excluded.
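The permission check and the listing filter described above could look roughly like the following Python sketch. The rules are evaluated in the order listed, and the first match wins. The caller dictionary, its flag names, and the ACL representation (principal ID mapped to a set of permission names) are assumptions made for this illustration and do not reflect the actual Synapse data model.

```python
from enum import Enum


class Decision(Enum):
    GRANT = "GRANT"
    DENY = "DENY"


REVIEW_SUBMISSIONS = "REVIEW_SUBMISSIONS"


def check_review_permission(caller, ar_acl):
    """Apply the listed rules in order; the first matching rule decides."""
    if caller.get("is_admin"):
        return Decision.GRANT
    if caller.get("is_act_member"):
        return Decision.GRANT
    if not caller.get("is_validated"):
        return Decision.DENY
    if caller.get("is_anonymous"):
        return Decision.DENY
    if REVIEW_SUBMISSIONS in ar_acl.get(caller["id"], set()):
        return Decision.GRANT
    return Decision.DENY


def visible_submissions(caller, submissions, acls_by_ar):
    """Keep only the submissions whose AR yields a GRANT for this caller."""
    return [
        s for s in submissions
        if check_review_permission(caller, acls_by_ar.get(s["ar_id"], {}))
        is Decision.GRANT
    ]
```

Because the ACT check comes before the ACL lookup, an AR without any ACL remains fully accessible to ACT members, and a non-ACT reviewer only sees submissions for ARs that explicitly grant them the permission.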

 

This phase will involve the following technical tasks:

 

  • Creating/updating ACLs on ARs - This would be a new feature that would allow a member of the ACT to create an ACL on an existing AR. Only a member of the ACT would be allowed to create/update these ACLs. The ACT member would grant a non-ACT user the “REVIEW_SUBMISSIONS” permission on the AR by updating this ACL. Note: ACT members will automatically retain the “REVIEW_SUBMISSIONS” permission for all ARs even if not explicitly listed in the ACL of that AR. Any AR that does not have an ACL will still be fully accessible by members of the ACT.

  • GET /dataAccessSubmission/openSubmissions - This service would need to be changed to list open submissions for any user that has been granted the “REVIEW_SUBMISSIONS” permission. Should an ACT member also see listings for submissions to ARs that have granted non-ACT members “REVIEW_SUBMISSIONS”?

  • POST /accessRequirement/{requirementId}/submissions - This service would need to be changed similar to the above.

  • PUT /dataAccessSubmission/{submissionId} - This service would need to be changed similar to the above.

  • DELETE /dataAccessSubmission/{submissionId} - This service would need to be changed similar to the above.

Phase Two

Goal

The goal for phase two will be to expand the use of DUO metadata to individual data files. We would also like to provide a system to “automatically” associate ARs with files based on their annotation values.

Background

 

As mentioned above, the Data Use Ontology (DUO) was developed to help standardize the data use nomenclature of biomedical data. Unfortunately, DUO was defined using Web Ontology Language (OWL), in a manner that allows broad interpretation of the definitions. For example, DUO does not actually define the expected keys or their corresponding value types. However, DUO does define 26 categories of data restrictions, with some comments about how each might be used.

 

Since Synapse already has support for defining metadata using JSON-Schemas, it is worth exploring the possibility of translating DUO into a JSON-schema. The following code was used to first define each of the 26 data use categories as a JSON-schema, and then combine all of them into a single umbrella JSON-schema:

https://github.com/Sage-Bionetworks/Synapse-Repository-Services/pull/4643/files

All of the schemas were then loaded into a test environment to generate the following “validation schema”:

 

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "$id": "https://repo-prod.prod.sagebase.org/repo/v1/schema/type/registered/ebispot.duo-duo",
  "title": "Full DUO schema",
  "description": "...",
  "allOf": [
    {"$ref": "#/definitions/ebispot.duo-D0000004"},
    {"$ref": "#/definitions/ebispot.duo-D0000006"},
    {"$ref": "#/definitions/ebispot.duo-D0000007"},
    {"$ref": "#/definitions/ebispot.duo-D0000011"},
    {"$ref": "#/definitions/ebispot.duo-D0000012"},
    {"$ref": "#/definitions/ebispot.duo-D0000015"},
    {"$ref": "#/definitions/ebispot.duo-D0000016"},
    {"$ref": "#/definitions/ebispot.duo-D0000018"},
    {"$ref": "#/definitions/ebispot.duo-D0000019"},
    {"$ref": "#/definitions/ebispot.duo-D0000020"},
    {"$ref": "#/definitions/ebispot.duo-D0000021"},
    {"$ref": "#/definitions/ebispot.duo-D0000022"},
    {"$ref": "#/definitions/ebispot.duo-D0000024"},
    {"$ref": "#/definitions/ebispot.duo-D0000025"},
    {"$ref": "#/definitions/ebispot.duo-D0000026"},
    {"$ref": "#/definitions/ebispot.duo-D0000027"},
    {"$ref": "#/definitions/ebispot.duo-D0000028"},
    {"$ref": "#/definitions/ebispot.duo-D0000029"},
    {"$ref": "#/definitions/ebispot.duo-D0000042"},
    {"$ref": "#/definitions/ebispot.duo-D0000043"},
    {"$ref": "#/definitions/ebispot.duo-D0000044"},
    {"$ref": "#/definitions/ebispot.duo-D0000045"},
    {"$ref": "#/definitions/ebispot.duo-D0000046"}
  ],
  "definitions": {
    "ebispot.duo-D0000004": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"NRES": {
        "type": "boolean",
        "description": "This data use permission indicates there is no restriction on use."
      }},
      "title": "no restriction"
    },
    "ebispot.duo-D0000006": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"HMB": {
        "type": "boolean",
        "description": "This data use permission indicates that use is allowed for health/medical/biomedical purposes; does not include the study of population origins or ancestry."
      }},
      "title": "health or medical or biomedical research"
    },
    "ebispot.duo-D0000007": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"DS": {
        "type": "boolean",
        "description": "This data use permission indicates that use is allowed provided it is related to the specified disease."
      }},
      "title": "disease specific research",
      "if": {"properties": {"DS": {"const": true}}},
      "then": {"properties": {"DS_disease": {
        "type": "string",
        "description": "DUO recommends MONDO be used, to provide the basis for automated evaluation. For more information see https://github.com/EBISPOT/DUO/blob/master/MONDO_Overview.md",
        "enum": ["cancer", "alzheimer", "amnesia", "..."]
      }}}
    },
    "ebispot.duo-D0000011": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"POA": {
        "type": "boolean",
        "description": "This data use permission indicates that use of the data is limited to the study of population origins or ancestry."
      }},
      "title": "population origins or ancestry research only"
    },
    "ebispot.duo-D0000012": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"RS": {
        "type": "boolean",
        "description": "This data use modifier indicates that use is limited to studies of a certain research type."
      }},
      "title": "research specific restrictions",
      "if": {"properties": {"RS": {"const": true}}},
      "then": {"properties": {"RS_research_type": {
        "type": "string",
        "description": "...",
        "enum": ["cancer", "..."]
      }}}
    },
    "ebispot.duo-D0000015": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"NMDS": {
        "type": "boolean",
        "description": "This data use modifier indicates that use does not allow methods development research (e.g., development of software or algorithms)."
      }},
      "title": "no general methods research"
    },
    "ebispot.duo-D0000016": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"GSO": {
        "type": "boolean",
        "description": "This data use modifier indicates that use is limited to genetic studies only (i.e., studies that include genotype research alone or both genotype and phenotype research, but not phenotype research exclusively)"
      }},
      "title": "genetic studies only"
    },
    "ebispot.duo-D0000018": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"NPUNCU": {
        "type": "boolean",
        "description": "This data use modifier indicates that use of the data is limited to not-for-profit organizations and not-for-profit use, non-commercial use."
      }},
      "title": "not for profit, non commercial use only"
    },
    "ebispot.duo-D0000019": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"PUB": {
        "type": "boolean",
        "description": "This data use modifier indicates that requestor agrees to make results of studies using the data available to the larger scientific community."
      }},
      "title": "publication required"
    },
    "ebispot.duo-D0000020": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"COL": {
        "type": "boolean",
        "description": "This data use modifier indicates that the requestor must agree to collaboration with the primary study investigator(s)."
      }},
      "title": "collaboration required",
      "if": {"properties": {"COL": {"const": true}}},
      "then": {"properties": {"COL_PI": {
        "type": "string",
        "description": "This could be coupled with a string describing the primary study investigator(s)."
      }}}
    },
    "ebispot.duo-D0000021": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"IRB": {
        "type": "boolean",
        "description": "This data use modifier indicates that the requestor must provide documentation of local IRB/ERB approval."
      }},
      "title": "ethics approval required"
    },
    "ebispot.duo-D0000022": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"GS": {
        "type": "boolean",
        "description": "This data use modifier indicates that use is limited to within a specific geographic region."
      }},
      "title": "geographical restriction",
      "if": {"properties": {"GS": {"const": true}}},
      "then": {"properties": {"GS_location": {
        "type": "string",
        "description": "This should be coupled with an ontology term describing the geographical location the restriction applies to."
      }}}
    },
    "ebispot.duo-D0000024": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"MOR": {
        "type": "boolean",
        "description": "This data use modifier indicates that requestor agrees not to publish results of studies until a specific date."
      }},
      "title": "publication moratorium",
      "if": {"properties": {"MOR": {"const": true}}},
      "then": {"properties": {"MOR_date": {
        "type": "string",
        "description": "This should be coupled with a date specified as ISO8601",
        "format": "date"
      }}}
    },
    "ebispot.duo-D0000025": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"TS": {
        "type": "boolean",
        "description": "This data use modifier indicates that use is approved for a specific number of months."
      }},
      "title": "time limit on use",
      "if": {"properties": {"TS": {"const": true}}},
      "then": {"properties": {"TS_number_of_months": {
        "type": "integer",
        "description": "This should be coupled with an integer value indicating the number of months."
      }}}
    },
    "ebispot.duo-D0000026": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"US": {
        "type": "boolean",
        "description": "This data use modifier indicates that use is limited to use by approved users."
      }},
      "title": "user specific restriction"
    },
    "ebispot.duo-D0000027": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"PS": {
        "type": "boolean",
        "description": "This data use modifier indicates that use is limited to use within an approved project."
      }},
      "title": "project specific restriction",
      "if": {"properties": {"PS": {"const": true}}},
      "then": {"properties": {"PS_project": {
        "type": "string",
        "description": "???"
      }}}
    },
    "ebispot.duo-D0000028": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"IS": {
        "type": "boolean",
        "description": "This data use modifier indicates that use is limited to use within an approved institution."
      }},
      "title": "institution specific restriction",
      "if": {"properties": {"IS": {"const": true}}},
      "then": {"properties": {"IS_institution": {
        "type": "string",
        "description": "???"
      }}}
    },
    "ebispot.duo-D0000029": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"RTN": {
        "type": "boolean",
        "description": "This data use modifier indicates that the requestor must return derived/enriched data to the database/resource."
      }},
      "title": "return to database or resource"
    },
    "ebispot.duo-D0000042": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"GRU": {
        "type": "boolean",
        "description": "This data use permission indicates that use is allowed for general research use for any research purpose."
      }},
      "title": "general research use"
    },
    "ebispot.duo-D0000043": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"CC": {
        "type": "boolean",
        "description": "This data use modifier indicates that use is allowed for clinical use and care. Clinical Care is defined as Health care or services provided at home, in a healthcare facility or hospital. Data may be used for clinical decision making."
      }},
      "title": "clinical care use"
    },
    "ebispot.duo-D0000044": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"NPOA": {
        "type": "boolean",
        "description": "This data use modifier indicates use for purposes of population, origin, or ancestry research is prohibited."
      }},
      "title": "population origins or ancestry research prohibited"
    },
    "ebispot.duo-D0000045": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"NPU": {
        "type": "boolean",
        "description": "This data use modifier indicates that use of the data is limited to not-for-profit organizations."
      }},
      "title": "not for profit organisation use only"
    },
    "ebispot.duo-D0000046": {
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {"NCU": {
        "type": "boolean",
        "description": "This data use modifier indicates that use of the data is limited to not-for-profit use. This indicates that data can be used by commercial organisations for research purposes, but not commercial purposes."
      }},
      "title": "non-commercial use only"
    }
  }
}

DUO Schema

 

We chose to implement each category as a separate JSON-schema, each consisting of at least one boolean property, using the ontology:shorthand as the key. For example, D0000007 uses the key DS. In each case, the default value of each boolean is false. For some cases, when a value of “true” is provided, extra data is expected. For example, D0000024, which is labeled as publication moratorium with a key of MOR, includes an extra field to capture the moratorium date, MOR_date, when MOR is set to true.
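The conditional pattern described above can be illustrated with a small hand-written check for the publication moratorium category. This is not how Synapse validates annotations; it is a sketch of the intended rule, and the function name and error messages are invented here. Note one deliberate assumption: the sketch treats MOR_date as required whenever MOR is true, which matches the intent stated above but is stricter than an if/then block that only constrains the type of MOR_date.

```python
from datetime import date


def check_moratorium(annotations):
    """Return a list of problems with the MOR / MOR_date pair (empty if fine)."""
    errors = []
    mor = annotations.get("MOR", False)  # the boolean defaults to false
    if not isinstance(mor, bool):
        errors.append("MOR must be a boolean")
    elif mor:
        mor_date = annotations.get("MOR_date")
        if mor_date is None:
            errors.append("MOR_date is expected when MOR is true")
        elif not isinstance(mor_date, str):
            errors.append("MOR_date must be a string")
        else:
            try:
                date.fromisoformat(mor_date)  # ISO 8601 calendar date
            except ValueError:
                errors.append("MOR_date must be an ISO 8601 date")
    return errors
```

The same shape (a boolean driver plus a dependent field) applies to the other conditional categories, such as DS with DS_disease or GS with GS_location.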

 

Next we want to set up some representative ARs that would be associated with the example project from the governance stories document. Let’s assume that a member of ACT has created the following ARs for this project:

| access requirement ID | Name | Description | Expected Annotations |
|---|---|---|---|
| 1 | Cancer Research Requirement | Data under this AR can only be used for Cancer research. | RS = true; RS_research_type = cancer |
| 2 | Ethics Approval Required | Data under this AR can only be accessed if the user has IRB approval. | IRB = true |
| 3 | Publication Moratorium | Data under this AR is under a publication moratorium that prevents a downloader from publishing before this date. | MOR = true; MOR_date = 2022-05-20 |
| 4 | Germany Geographical Restriction | Data under this AR has a geographical restriction stating that the data cannot leave Germany. | GS = true; GS_location = Germany |

Access Requirements for the Project

Note: We will reference each of these ARs by their access requirement ID. The expected annotations column describes the DUO annotation key-value pairs that would be expected on any file under that AR.

 

The next step is to build a project-specific JSON schema that defines how DUO should be applied. It is expected that the DUO-specific elements and ARs in this schema would be the result of a collaboration between ACT and the Community Manager/Data Curator. The following JSON schema is an example of such a schema that could be bound to the project from the main governance narrative:

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "title": "Schema for Some Project",
  "$id": "some.project-main-1.3",
  "description": "This schema defines how DUO should be used with Some Project.",
  "allOf": [
    { "$ref": "org.sagebionetworks-repo.model.FileEntity-1.0.0" },
    { "$ref": "ebispot.duo-duo-1.0.1" },
    {
      "if": {
        "properties": {
          "patientLocation": { "const": "Germany" },
          "assayType": { "const": "genomic" }
        }
      },
      "then": {
        "properties": {
          "GS": {
            "title": "geographical restriction",
            "type": "boolean",
            "const": true
          },
          "GS_location": {
            "type": "string",
            "description": "This data cannot leave Germany",
            "const": "Germany"
          },
          "_accessRequirementIds": {
            "type": "array",
            "contains": { "const": 4 }
          }
        },
        "required": [ "GS_location" ]
      }
    },
    {
      "if": {
        "properties": {
          "patientLocation": { "const": "USA" },
          "assayType": { "const": "genomic" }
        }
      },
      "then": {
        "properties": {
          "sourceGeography": { "const": "US" },
          "jurisdiction": { "const": "HIPAA" },
          "dataLabel": { "const": "De-identified" }
        },
        "required": [ "sourceGeography", "jurisdiction", "dataLabel" ]
      }
    }
  ],
  "properties": {
    "assayType": {
      "description": "Identifies the type of data for this file.",
      "type": "string",
      "enum": [ "clinical", "assay", "imaging", "genomic" ]
    },
    "patientLocation": {
      "description": "The location of the patient associated with the data",
      "type": "string",
      "enum": [ "USA", "Germany" ]
    },
    "RS": {
      "title": "research specific restrictions",
      "type": "boolean",
      "const": true
    },
    "RS_research_type": {
      "title": "Restricted to cancer research",
      "type": "string",
      "const": "cancer"
    },
    "IRB": {
      "title": "ethics approval required",
      "type": "boolean",
      "const": true
    },
    "MOR": {
      "title": "publication moratorium",
      "type": "boolean",
      "const": true
    },
    "MOR_date": {
      "title": "publication moratorium date",
      "type": "string",
      "format": "date",
      "const": "2022-05-20"
    },
    "_accessRequirementIds": {
      "type": "array",
      "allOf": [
        { "contains": { "const": 1 } },
        { "contains": { "const": 2 } },
        { "contains": { "const": 3 } }
      ]
    }
  },
  "required": [
    "assayType", "patientLocation",
    "NRES", "HMB", "DS", "POA", "RS", "NMDS", "GSO", "NPUNCU", "PUB", "COL",
    "IRB", "GS", "MOR", "TS", "US", "PS", "IS", "RTN", "GRU", "CC", "NPOA",
    "NPU", "NCU"
  ]
}

DUO applied to a Project

 

Notice that line:8 indicates that this schema applies to FileEntities, while line:11 indicates that the schema “extends” the DUO schema. The first two properties of the schema, assayType (line:80) and patientLocation (line:90), are the drivers that determine which conditional properties must be applied to each file. Two if/then blocks (line:14 to 78) define the conditional properties to apply based on assayType and patientLocation. The properties RS, RS_research_type, IRB, MOR, and MOR_date are all unconditional properties with constant values that must be applied to every file in the project.

 

Synapse would be expected to use the "_accessRequirementIds" properties as guidance to “automatically” associate files with ARs according to the rules defined in the schema. Specifically, "const": 4 (line:39) indicates that AR ID 4 should be applied to any file with the annotations "assayType": "genomic" and "patientLocation": "Germany", while _accessRequirementIds (line:124 - 144) indicates that AR IDs 1, 2, and 3 should be applied to all files in the project unconditionally.

We will also add derived annotations with each of these keys and a value of true to each entity. This would allow a view designer to add these columns to a view, thus enabling end-users to query for files based on ARs.

Note: We will need to block users from adding or updating any annotations with the key "_accessRequirementIds".
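
The association rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Synapse's implementation: the function names are hypothetical, only "const" checks inside an "if" are understood, and the schema is a trimmed copy of the example above.

```python
AR_KEY = "_accessRequirementIds"

def if_holds(condition: dict, annotations: dict) -> bool:
    """Evaluate an "if" block; only "const" checks under "properties" are understood.
    As in JSON Schema, a property that is absent from the annotations passes."""
    return all(
        key not in annotations or annotations[key] == sub["const"]
        for key, sub in condition.get("properties", {}).items()
        if "const" in sub
    )

def contained_consts(sub: dict) -> list:
    """Collect every {"contains": {"const": N}} value, including those nested in "allOf"."""
    ids = []
    if "const" in sub.get("contains", {}):
        ids.append(sub["contains"]["const"])
    for part in sub.get("allOf", []):
        ids.extend(contained_consts(part))
    return ids

def access_requirement_ids(schema: dict, annotations: dict) -> list:
    """Return the AR IDs that the schema assigns to a file with the given annotations."""
    ids = set()

    def walk(node: dict) -> None:
        if AR_KEY in node.get("properties", {}):
            ids.update(contained_consts(node["properties"][AR_KEY]))
        for part in node.get("allOf", []):
            walk(part)
        # A "then" block is only reachable when its "if" holds.
        if "if" in node and "then" in node and if_holds(node["if"], annotations):
            walk(node["then"])

    walk(schema)
    return sorted(ids)

# Trimmed copy of the project schema: one conditional AR and three unconditional ARs.
SCHEMA = {
    "allOf": [
        {
            "if": {"properties": {"patientLocation": {"const": "Germany"},
                                  "assayType": {"const": "genomic"}}},
            "then": {"properties": {AR_KEY: {"type": "array", "contains": {"const": 4}}}},
        }
    ],
    "properties": {
        AR_KEY: {
            "type": "array",
            "allOf": [{"contains": {"const": 1}},
                      {"contains": {"const": 2}},
                      {"contains": {"const": 3}}],
        }
    },
}

print(access_requirement_ids(SCHEMA, {"assayType": "genomic", "patientLocation": "Germany"}))
print(access_requirement_ids(SCHEMA, {"assayType": "genomic", "patientLocation": "USA"}))
```

With these inputs the German genomic file picks up the conditional AR in addition to the three unconditional ones, while the USA file only receives the unconditional ones.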

 

The following JSON is an example of what “valid” properties could be for syn1 from the example above:

{
  "name": "GermanGenomic.data",
  "description": "Genomic data from patients in Germany",
  "id": "syn1",
  "etag": "some-etag",
  "createdOn": "2020-05-20T20:20:39+00:00",
  "modifiedOn": "2020-05-20T20:20:39+00:00",
  "createdBy": "123456789",
  "modifiedBy": "123456789",
  "parentId": "syn444",
  "versionLabel": "one",
  "versionComment": "leaving blank",
  "versionNumber": 1,
  "dataFileHandleId": "98765",
  "fileNameOverride": "",
  "concreteType": "org.sagebionetworks.repo.model.FileEntity",
  "assayType": "genomic",
  "patientLocation": "Germany",
  "NRES": false,
  "HMB": false,
  "DS": false,
  "POA": false,
  "RS": true,
  "RS_research_type": "cancer",
  "NMDS": false,
  "GSO": false,
  "NPUNCU": false,
  "PUB": false,
  "COL": false,
  "IRB": true,
  "GS": true,
  "GS_location": "Germany",
  "MOR": true,
  "MOR_date": "2022-05-20",
  "TS": false,
  "US": false,
  "PS": false,
  "IS": false,
  "RTN": false,
  "GRU": false,
  "CC": false,
  "NPOA": false,
  "NPU": false,
  "NCU": false,
  "_accessRequirementIds": [1, 2, 3, 4]
}

syn1.json

Since syn1 includes "assayType": "genomic" and "patientLocation": "Germany", it must include "GS": true and "GS_location": "Germany" according to the rules of the first if/then. Most of the properties on lines 19 to 44 are constants defined by this project's schema.

 

The following JSON is an example of what “valid” properties could be for syn4 from the example above:

syn4.json

Since syn4 has "assayType": "genomic" and "patientLocation": "USA", according to the if/then statements it must also include the following constant properties: "sourceGeography": "US", "jurisdiction": "HIPAA", "dataLabel": "De-identified". Note: For syn4 "GS": false because the patient location does not equal Germany.

 

In both of these examples (syn1 & syn4) we included annotations for "_accessRequirementIds". Both files include the IDs 1, 2, and 3, since those ARs are unconditional. syn1 has [1,2,3,4], indicating that it requires the three unconditional ARs (1-3) and the conditional AR 4.

Derived Annotations

The examples above for syn1 and syn4 indicate that all of the governance-specific metadata is derived from two sources:

  • Project-specific constants - For example, the “publication moratorium” properties ("MOR": true & "MOR_date": "2022-05-20") apply to all files in the project, as defined by the schema ("$id": "some.project-main")

  • User-provided properties - For example, the if/then blocks define additional properties based on the values of the user-provided "assayType" and "patientLocation"

 

Note: For this phase we are glossing over the fact that the value of patientLocation is transitive. Its value would be found by joining the patient table with the treatment table, and then joining with the sampleIds of each file. We will attempt to address this in a later phase.

 

Given that there are 30+ governance annotations for this project, and all of their values can be derived, it does not seem reasonable to ask any user to provide these annotation values. Instead, it would be better if Synapse could “automatically” provide these value-key-pairs.

 

One of the governance narratives includes a case where the "patientLocation" value for a given file was mistakenly given the wrong value. For example, let's assume that syn4 was incorrectly given "patientLocation": "USA", when the patient's location is actually Germany. Correcting this single value on syn4 would require five other governance annotations to change. It might not even be obvious to the user making the correction that these additional changes are needed. Instead, it would be better if Synapse could “automatically” re-derive the governance annotations.

 

It should be noted that a system that could automatically derive annotations would be useful for many external use cases. For example, one of the main JSON schema use cases involves setting annotations on files that are uploaded in bulk. For some of these use cases, a few key values provided by the uploader might be enough to automatically derive the rest of the value-key pairs.

 

For this discussion we are defining the following terms:

  • Actual Annotation - This is a value-key-pair that is provided by a user, not the system.

  • Derived Annotation - This is a value-key-pair that is automatically provided by the system using a combination of the JSON schema and the actual annotations.

Derived Annotations Algorithm

Given a valid JSON schema, and a JSON representation of actual annotation value-key-pairs as input, calculate the list of derived annotation value-key-pairs as output. The algorithm must meet the following requirements:

  1. Only actual annotations are to be considered. In other words, a derived annotation value-key-pair cannot be used to derive another value-key-pair.

  2. Only JSON schema properties that are defined to have a constant or default value will be considered as derived annotation candidates.

    1. If an actual annotation exists with the same key, the candidate will be eliminated. This means that derived annotations will never “correct” invalid actual annotations.

    2. If the candidate is in an unreachable logic branch, then it will be eliminated. For example, if the candidate resides in a "then" block that is unreachable because the corresponding "if" evaluates to “false”, then the candidate will be eliminated.

    3. Default values will only be used if there is no overriding “const” for the same key.

    4. Any candidate that is not eliminated will be added to the results as a derived annotation value-key-pair.
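
The requirements above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the planned implementation: the function names `matches` and `derive` are hypothetical, only "allOf", "if"/"then"/"else", "properties", "required", "const", "enum", and "default" are understood, and "$ref" resolution is left out. The schema is a trimmed version of the project example.

```python
def matches(condition: dict, actual: dict) -> bool:
    """Evaluate an "if" block against the ACTUAL annotations only (requirement 1).
    Supports "required" plus "const"/"enum" under "properties". As in JSON Schema,
    a property that is absent from the annotations passes its check."""
    if any(key not in actual for key in condition.get("required", [])):
        return False
    for key, sub in condition.get("properties", {}).items():
        if key not in actual:
            continue
        if "const" in sub and actual[key] != sub["const"]:
            return False
        if "enum" in sub and actual[key] not in sub["enum"]:
            return False
    return True

def derive(schema: dict, actual: dict) -> dict:
    """Return the derived annotation value-key-pairs for the given actual annotations."""
    consts, defaults = {}, {}

    def walk(node: dict) -> None:
        # Requirement 2: only properties with a const or default are candidates.
        for key, sub in node.get("properties", {}).items():
            if "const" in sub:
                consts.setdefault(key, sub["const"])
            elif "default" in sub:
                defaults.setdefault(key, sub["default"])
        for part in node.get("allOf", []):
            walk(part)
        # Requirement 2.2: only the reachable branch of an if/then/else is visited.
        if "if" in node:
            branch = "then" if matches(node["if"], actual) else "else"
            if branch in node:
                walk(node[branch])

    walk(schema)
    candidates = {**defaults, **consts}  # Requirement 2.3: const overrides default.
    # Requirement 2.1: never replace an actual annotation with the same key.
    return {key: value for key, value in candidates.items() if key not in actual}

SCHEMA = {
    "allOf": [
        {
            "if": {"properties": {"patientLocation": {"const": "Germany"},
                                  "assayType": {"const": "genomic"}}},
            "then": {"properties": {"GS": {"const": True},
                                    "GS_location": {"const": "Germany"}}},
        },
        {
            "if": {"properties": {"patientLocation": {"const": "USA"},
                                  "assayType": {"const": "genomic"}}},
            "then": {"properties": {"sourceGeography": {"const": "US"},
                                    "dataLabel": {"const": "De-identified"}}},
        },
    ],
    "properties": {
        "assayType": {"type": "string", "enum": ["clinical", "assay", "imaging", "genomic"]},
        "patientLocation": {"type": "string", "enum": ["USA", "Germany"]},
        "IRB": {"const": True},
        "MOR_date": {"const": "2022-05-20"},
    },
}

print(derive(SCHEMA, {"assayType": "genomic", "patientLocation": "Germany"}))
print(derive(SCHEMA, {"assayType": "genomic", "patientLocation": "USA"}))
```

Changing only "patientLocation" in the input changes which conditional pairs come back, which is the re-derivation behavior described for the syn4 correction case.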

 

Derived Annotations API

Derived annotations are to be considered “transient” data. This means they are subject to recalculation any time either the input JSON schema or the actual annotations change. This implies that derived data will not be migrated between stacks, but will instead be recalculated on each stack.

Derived annotations are to be considered separate from the actual annotations of an Entity. For example, an actual annotation is part of the persisted data of an Entity. While a derived annotation might be cached, it will not be part of the persisted data of an Entity.

 

JSON Schema Binding API Changes

Currently the API: PUT /entity/{id}/schema/binding is used to bind a JSON schema to an Entity. We propose extending the BindSchemaToEntityRequest object to include a new boolean property called “automaticallyIncludeDerivedAnnotations” with a default value of “false”. With this value set to “false”, Synapse will not attempt to calculate derived annotations for the Entities bound to this schema. However, when “automaticallyIncludeDerivedAnnotations=true”, Synapse will automatically calculate the derived annotations for the Entities bound to this schema. Note: This new property value will be persisted with the JSON Schema’s binding data.
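
As a rough illustration, a client could build the extended binding request as shown below. The helper name `build_bind_request` is hypothetical, and the existing body fields "entityId" and "schema$id" are assumed from the current BindSchemaToEntityRequest object; only "automaticallyIncludeDerivedAnnotations" is the property proposed here. The sketch builds the request but does not send it.

```python
import json

def build_bind_request(entity_id: str, schema_id: str, derive: bool = False):
    """Return (method, path, JSON body) for binding a schema to an Entity.
    The proposed flag defaults to False, matching the proposed server default."""
    body = {
        "entityId": entity_id,    # assumed existing field
        "schema$id": schema_id,   # assumed existing field
        "automaticallyIncludeDerivedAnnotations": derive,  # proposed new property
    }
    return "PUT", f"/entity/{entity_id}/schema/binding", json.dumps(body)

method, path, body = build_bind_request("syn444", "some.project-main-1.3", derive=True)
print(method, path)
print(body)
```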

 

Entity Services API Changes

Currently there are three APIs for getting the annotations of an Entity:

Each API returns the annotations of the given entity id/(version). In order to get the derived annotations of an Entity, we propose extending each of these APIs to include a new boolean parameter named “includeDerivedAnnotations” with a default value of “false”. When “includeDerivedAnnotations=true”, the results will include both the actual annotations and the derived annotations.

 

We will need to provide a service that would list the derived keys for a given entity id:

Response                    API                             Description

List<String> derivedKeys    GET /entity/{id}/derivedKeys    Get the derived keys for the given Entity ID
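
The intended behavior of “includeDerivedAnnotations” and of the derived-keys listing can be sketched as two small functions. Both function names are hypothetical and the annotation maps are plain dictionaries; this only illustrates how the two results relate, not how the services would be implemented.

```python
def get_annotations(actual: dict, derived: dict, include_derived: bool = False) -> dict:
    """Return the annotations a caller would see for one entity."""
    if not include_derived:
        return dict(actual)
    # Actual annotations win on a key clash, so derived values never
    # overwrite user-provided ones.
    return {**derived, **actual}

def derived_keys(actual: dict, derived: dict) -> list:
    """Return the keys whose values come from the schema rather than the user."""
    return sorted(key for key in derived if key not in actual)

actual = {"assayType": "genomic", "patientLocation": "Germany"}
derived = {"GS": True, "GS_location": "Germany"}

print(get_annotations(actual, derived))
print(get_annotations(actual, derived, include_derived=True))
print(derived_keys(actual, derived))
```

A client that receives the merged result can use the derived-keys list to tell which values are user-editable and which are controlled by the schema.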

 

 

Entity View API Changes

Currently, an EntityView is configured with a list of ColumnModels that define the schema of that view. Users will typically use the following asynchronous service to get the possible ColumnModels when setting the schema of their views: POST /column/view/scope/async/start. In order to create an EntityView that includes derived columns, we propose extending this API’s request object, ViewColumnModelRequest, to include a boolean parameter named “includeDerivedColumns” with a default value of “false”. When this parameter is set to “true”, the service will include derived columns as possible results. In this way, users will be able to configure their views to include derived columns.

 

Entity Manifest Changes

Currently, when a user downloads a FileEntity via the packaging option of their download list (POST /download/list/package/async/start), the DownloadListPackageRequest includes an option to include a manifest. When “includeManifest=true”, the package will include a CSV file containing all of the annotations for any FileEntity included in the download. We propose extending this manifest to automatically include all derived annotations.

 

AccessRequirement API Changes

New AccessRequirement Types

Currently, an AccessRequirement (AR) includes a list of “subjectIds” that defines which Entities (or Teams) the AR applies to. There are currently six types of ARs:

Currently all six ARs include a subjectIds list within the actual AR. Subjects are added to or removed from these ARs by updating the actual AR object using either the CREATE or UPDATE services. We will likely need to continue to maintain each of these AR types for the foreseeable future.

With this design, we are proposing a new system for assigning subjects to ARs. Rather than explicitly modifying the subjects of each AR, the new system will allow subjects to be “automatically” bound to ARs based on the new derived _accessRequirementIds annotations on Entities. We will likely need to apply this new system to three of the six AR types: SelfSignAccessRequirement, TermsOfUseAccessRequirement, & ManagedACTAccessRequirement. Rather than define multiple new AR types, we propose extending all ARs by adding the following property:

This new boolean, subjectsDefinedByAnnotations, will allow the subjects of an AR to be defined by either the ‘subjectIds’ list or the _accessRequirementIds annotation.

The GET /accessRequirement/{requirementId} API returns the full list of ‘subjectIds’ for existing ARs. This means that the entire subject list must fit in both client-side and server-side memory. Since existing ARs are managed by hand, it is reasonable to assume that the full list will be small enough to avoid memory problems. In fact, it is common for the ‘subjectIds’ to be container IDs (Projects & Folders), to minimize the micromanagement required to maintain an AR. As a result, a short ‘subjectIds’ list can restrict thousands of Entities, since a container can contain up to 40K children. This type of data compression is not likely to extend to new ARs with subjectsDefinedByAnnotations = true. While it will be possible to bind _accessRequirementIds annotations to containers, it is far more likely that these annotations will be bound to individual files. After all, the new derived annotations feature makes it easy to apply annotations to millions of entities with only a few lines of schema code. This means we must assume that the subjects of ARs with subjectsDefinedByAnnotations = true might not fit in memory. Therefore, we cannot return all of the subjects for such ARs in calls that GET the AR. However, since the subjects of such ARs are controlled by JSON schemas, it is not clear that listing the subjects will even be needed. If we find that we do need to provide all of the subjects of these new ARs, then we will need to add a new API that provides a paginated list of subjectIds to avoid out-of-memory problems.
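
If such a paginated API is added, a client could consume it one page at a time so that the full subject list never has to be held in memory. The sketch below is purely hypothetical: no such endpoint exists yet, and the response shape (a "subjectIds" list plus an optional "nextPageToken") and the `fetch_page` callable are assumptions made for illustration.

```python
from typing import Callable, Iterator, Optional

def iter_subject_ids(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[str]:
    """Yield subject IDs page by page.
    fetch_page(token) stands in for one call to the hypothetical paginated API."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("subjectIds", [])
        token = page.get("nextPageToken")
        if not token:
            break

# In-memory stand-in for the service, serving two IDs per page.
ALL_IDS = ["syn1", "syn2", "syn3", "syn4", "syn5"]

def fake_fetch(token: Optional[str]) -> dict:
    start = int(token) if token else 0
    end = start + 2
    page = {"subjectIds": ALL_IDS[start:end]}
    if end < len(ALL_IDS):
        page["nextPageToken"] = str(end)
    return page

for subject_id in iter_subject_ids(fake_fetch):
    print(subject_id)
```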

The _accessRequirementIds annotation Lifecycle

The above examples demonstrate the need for _accessRequirementIds annotations as derived annotations. We will be able to use these derived annotations to bind ARs to entities and to filter Entity data in views that include the _accessRequirementIds column. We do not currently have a use case for users to directly create or update _accessRequirementIds annotations on entities. Therefore, we will block all users from directly creating or updating _accessRequirementIds annotations.

Invalid Annotations

Currently, derived annotations are reevaluated for any type of Entity change event. This includes JSON schema binding change events and annotation change events. We will need to check the AR binding of an Entity each time the derived annotations are reevaluated. However, what happens if a change puts an entity into an invalid state? In such a case, we would not be able to determine what the correct derived annotations should be. By extension, we would not be able to determine the correct AR bindings of an invalid Entity. It seems wrong to allow users to download a file with no ARs simply because the file’s annotations are invalid. To prevent such cases, we will automatically add an invalid-metadata-access-restriction to any file that has invalid annotations and is bound to a JSON schema that includes _accessRequirementIds. This invalid-metadata AR will function similarly to the existing LockAccessRequirement.
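
The rule for when the invalid-metadata AR applies can be sketched as a simple predicate. The function names are hypothetical, and the validity flag is assumed to come from whatever process validates the entity against its bound schema.

```python
AR_KEY = "_accessRequirementIds"

def schema_mentions_ar_key(schema) -> bool:
    """Return True if any "properties" block in the schema, at any depth, defines the AR key."""
    if isinstance(schema, dict):
        properties = schema.get("properties", {})
        if isinstance(properties, dict) and AR_KEY in properties:
            return True
        return any(schema_mentions_ar_key(value) for value in schema.values())
    if isinstance(schema, list):
        return any(schema_mentions_ar_key(item) for item in schema)
    return False

def needs_invalid_metadata_lock(bound_schema, is_valid: bool) -> bool:
    """A file is locked only when it is bound to a schema that includes the AR key
    AND its annotations are currently invalid against that schema."""
    if bound_schema is None:
        return False
    return (not is_valid) and schema_mentions_ar_key(bound_schema)

governed = {"allOf": [{"then": {"properties": {AR_KEY: {"type": "array"}}}}]}
ungoverned = {"properties": {"assayType": {"type": "string"}}}

print(needs_invalid_metadata_lock(governed, is_valid=False))
print(needs_invalid_metadata_lock(governed, is_valid=True))
print(needs_invalid_metadata_lock(ungoverned, is_valid=False))
```

Files bound to schemas that never mention _accessRequirementIds are left alone even when invalid, so the lock only affects entities whose AR bindings depend on derived annotations.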