...

Code Block
languagejson
{
	"$schema": "http://json-schema.org/draft-07/schema",
	"title": "Schema for Some Project",
	"$id": "some.project-main-1.3",
	"description": "This schema defines how DUO should be used with Some Project.",
	"allOf": [
		{
			"$ref": "org.sagebionetworks-repo.model.FileEntity-1.0.0"
		},
		{
			"$ref": "ebispot.duo-duo-1.0.1"
		},
		{
			"if": {
				"properties": {
					"patientLocation": {
						"const": "Germany"
					},
					"assayType": {
						"const": "genomic"
					}
				}
			},
			"then": {
				"properties": {
					"GS": {
						"title": "geographical restriction",
						"type": "boolean",
						"const": true
					},
					"GS_location": {
						"type": "string",
						"description": "This data cannot leave Germany",
						"const": "Germany"
					},
					"_accessRequirementIds": {
						"type": "array",
						"contains": {
							"const": 4
						}
					}
				},
				"required": [
					"GS_location"
				]
			}
		},
		{
			"if": {
				"properties": {
					"patientLocation": {
						"const": "USA"
					},
					"assayType": {
						"const": "genomic"
					}
				}
			},
			"then": {
				"properties": {
					"sourceGeography": {
						"const": "US"
					},
					"jurisdiction": {
						"const": "HIPAA"
					},
					"dataLabel": {
						"const": "De-identified"
					}
				},
				"required": [
					"sourceGeography",
					"jurisdiction",
					"dataLabel"
				]
			}
		}
	],
	"properties": {
		"assayType": {
			"description": "Identifies the type of data for this file.",
			"type": "string",
			"enum": [
				"clinical",
				"assay",
				"imaging",
				"genomic"
			]
		},
		"patientLocation": {
			"description": "The location of the patient associated with the data",
			"type": "string",
			"enum": [
				"USA",
				"Germany"
			]
		},
		"RS": {
			"title": "research specific restrictions",
			"type": "boolean",
			"const": true
		},
		"RS_research_type": {
			"title": "Restricted to cancer research",
			"type": "string",
			"const": "cancer"
		},
		"IRB": {
			"title": "ethics approval required",
			"type": "boolean",
			"const": true
		},
		"MOR": {
			"title": "publication moratorium",
			"type": "boolean",
			"const": true
		},
		"MOR_date": {
			"title": "publication moratorium date",
			"type": "string",
			"format": "date",
			"const": "2022-05-20"
		},
		"_accessRequirementIds": {
			"type": "array",
			"allOf": [
				{
					"contains": {
						"const": 1
					}
				},
				{
					"contains": {
						"const": 2
					}
				},
				{
					"contains": {
						"const": 3
					}
				}
			]
		}
	},
	"required": [
		"assayType",
		"patientLocation",
		"NRES",
		"HMB",
		"DS",
		"POA",
		"RS",
		"NMDS",
		"GSO",
		"NPUNCU",
		"PUB",
		"COL",
		"IRB",
		"GS",
		"MOR",
		"TS",
		"US",
		"PS",
		"IS",
		"RTN",
		"GRU",
		"CC",
		"NPOA",
		"NPU",
		"NCU"
	]
}

...

Notice that line:8 indicates that this schema applies to FileEntities, while line:11 indicates that the schema “extends” the DUO schema. The first two properties of the schema, assayType (line:80) and patientLocation (line:90), are drivers that determine which conditional properties must be applied to each file. There are two if/then blocks (line:14 to 78) that define which conditional properties should be applied based on assayType and patientLocation. The properties RS, RS_research_type, IRB, MOR, and MOR_date are all unconditional properties, with constant values that must be applied to all files in the project.

...

Synapse would be expected to use the "_accessRequirementIds" properties for guidance to “automatically” associate files with ARs according to the rules defined in the schema. Specifically, "const": 4 (line:39) indicates that ARid:4 should be applied to any file with the annotations "assayType": "genomic" and "patientLocation": "Germany", while _accessRequirementIds (line:124 - 144) indicates that ARids 1, 2, and 3 should be applied to all files in the project unconditionally.
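
The rule described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names (condition_matches, collect_ids, derive_access_requirement_ids) are hypothetical, the schema is a trimmed copy of the example, and only the keywords used in the example (const, contains, allOf, if/then) are handled. It is not a general JSON Schema evaluator and does not describe how Synapse implements this.

```python
"""Illustrative sketch only: derive AR IDs from the example schema's rules."""

ACCESS_KEY = "_accessRequirementIds"


def condition_matches(if_schema, annotations):
    """Return True when every 'const' in the 'if' block equals the annotation.

    Assumption: a missing annotation counts as a non-match here, which is
    stricter than JSON Schema (an absent property passes 'properties').
    """
    for key, rule in if_schema.get("properties", {}).items():
        if "const" in rule and annotations.get(key) != rule["const"]:
            return False
    return True


def collect_ids(properties):
    """Collect the 'contains.const' values declared for the AR ID property."""
    rule = properties.get(ACCESS_KEY, {})
    # The example uses either a single 'contains' or an 'allOf' list of them.
    parts = [rule] + rule.get("allOf", [])
    return [p["contains"]["const"] for p in parts if "contains" in p]


def derive_access_requirement_ids(schema, annotations):
    """Return the sorted AR IDs that the schema rules imply for one file."""
    ids = set(collect_ids(schema.get("properties", {})))  # unconditional IDs
    for sub in schema.get("allOf", []):
        if "if" in sub and condition_matches(sub["if"], annotations):
            ids.update(collect_ids(sub.get("then", {}).get("properties", {})))
    return sorted(ids)


schema = {
    "allOf": [
        {
            "if": {"properties": {"patientLocation": {"const": "Germany"},
                                  "assayType": {"const": "genomic"}}},
            "then": {"properties": {ACCESS_KEY: {"type": "array",
                                                 "contains": {"const": 4}}}},
        }
    ],
    "properties": {
        ACCESS_KEY: {"type": "array",
                     "allOf": [{"contains": {"const": 1}},
                               {"contains": {"const": 2}},
                               {"contains": {"const": 3}}]},
    },
}

german = {"assayType": "genomic", "patientLocation": "Germany"}
american = {"assayType": "genomic", "patientLocation": "USA"}
print(derive_access_requirement_ids(schema, german))    # [1, 2, 3, 4]
print(derive_access_requirement_ids(schema, american))  # [1, 2, 3]
```

The German genomic file picks up the conditional AR 4 in addition to the unconditional ARs 1-3, while the USA genomic file gets only ARs 1-3.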

...

Note: We will need to block users from adding or updating any annotations with the key "_accessRequirementIds".

The following JSON is an example of what “valid” properties could be for syn1 from the example above:

Code Block
languagejson
{
	"name": "GermanGenomic.data",
	"description": "Genomic data from patients in Germany",
	"id": "syn1",
	"etag": "some-etag",
	"createdOn": "2020-05-20T20:20:39+00:00",
	"modifiedOn": "2020-05-20T20:20:39+00:00",
	"createdBy": "123456789",
	"modifiedBy": "123456789",
	"parentId": "syn444",
	"versionLabel": "one",
	"versionComment": "leaving blank",
	"versionNumber": 1,
	"dataFileHandleId": "98765",
	"fileNameOverride": "",
	"concreteType": "org.sagebionetworks.repo.model.FileEntity",
	"assayType": "genomic",
	"patientLocation": "Germany",
	"NRES": false,
	"HMB": false,
	"DS": false,
	"POA": false,
	"RS": true,
	"RS_research_type": "cancer",
	"NMDS": false,
	"GSO": false,
	"NPUNCU": false,
	"PUB": false,
	"COL": false,
	"IRB": true,
	"GS": true,
	"GS_location": "Germany",
	"MOR": true,
	"MOR_date": "2022-05-20",
	"TS": false,
	"US": false,
	"PS": false,
	"IS": false,
	"RTN": false,
	"GRU": false,
	"CC": false,
	"NPOA": false,
	"NPU": false,
	"NCU": false,
	"_accessRequirementIds": [1,2,3,4]
}

syn1.json

Since syn1 includes "assayType": "genomic" and "patientLocation": "Germany", it must include "GS": true and "GS_location": "Germany" according to the rules of the first if/then. Most of the properties between lines 19 and 44 are constants based on this project's schema.

...

Code Block
languagejson
{
	"name": "USGenomic.data",
	"description": "Genomic data from patients in USA",
	"id": "syn4",
	"etag": "some-etag",
	"createdOn": "2020-05-20T20:20:39+00:00",
	"modifiedOn": "2020-05-20T20:20:39+00:00",
	"createdBy": "123456789",
	"modifiedBy": "123456789",
	"parentId": "syn444",
	"versionLabel": "one",
	"versionComment": "leaving blank",
	"versionNumber": 1,
	"dataFileHandleId": "98765",
	"fileNameOverride": "",
	"concreteType": "org.sagebionetworks.repo.model.FileEntity",
	"assayType": "genomic",
	"patientLocation": "USA",
	"NRES": false,
	"HMB": false,
	"DS": false,
	"POA": false,
	"RS": true,
	"RS_research_type": "cancer",
	"NMDS": false,
	"GSO": false,
	"NPUNCU": false,
	"PUB": false,
	"COL": false,
	"IRB": true,
	"GS": false,
	"MOR": true,
	"MOR_date": "2022-05-20",
	"TS": false,
	"US": false,
	"PS": false,
	"IS": false,
	"RTN": false,
	"GRU": false,
	"CC": false,
	"NPOA": false,
	"NPU": false,
	"NCU": false,
	"sourceGeography":"US",
	"jurisdiction": "HIPAA",
	"dataLabel":"De-identified",
	"_accessRequirementIds": [1,2,3]
}

syn4.json

Since syn4 has "assayType": "genomic" and "patientLocation": "USA", according to the if/then statements it must also include the following constant properties: "sourceGeography":"US", "jurisdiction": "HIPAA", "dataLabel":"De-identified". Note: For syn4 "GS": false because the patient location does not equal Germany.

...

In both of these examples (syn1 & syn4) we included annotations for "_accessRequirementIds". Both files include AR IDs 1-3, since they are unconditional. Syn1 has [1,2,3,4], indicating the three unconditional ARs (1-3) and the conditional AR 4. Syn4 has [1,2,3], since the conditional AR 4 does not apply to it.

Derived Annotations

The examples above for both syn1 and syn4 indicate that all of the governance-specific metadata is derived from two sources:

...

Currently, when a user downloads a FileEntity via the packaging option of their download list (POST /download/list/package/async/start), the DownloadListPackageRequest includes an option to include a manifest. When “includeManifest=true”, the package will include a CSV file containing all of the annotations for any FileEntity included in the download. We propose extending this manifest to automatically include all derived annotations.
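
As a rough illustration of the proposed manifest extension, the sketch below merges user-supplied and derived annotations into one CSV row per file. Everything here is an assumption made for illustration: the function name write_manifest, the shape of the input dictionaries, the column order, and the choice to serialize list values as JSON are all hypothetical, since the actual manifest layout is not specified in this document.

```python
import csv
import io
import json


def write_manifest(files, out):
    """Write one CSV row per file, combining user and derived annotations.

    Each entry in 'files' is a hypothetical dict with 'id', 'annotations'
    (user-supplied) and 'derived' (schema-derived) keys.
    """
    rows = []
    for f in files:
        row = {"id": f["id"]}
        row.update(f.get("annotations", {}))
        # Assumption: derived values win if a key appears in both sources.
        row.update(f.get("derived", {}))
        rows.append(row)

    # Use the union of keys, in first-seen order, as the CSV header.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    for row in rows:
        # Serialize lists (such as AR IDs) as JSON so they fit in one cell.
        writer.writerow({k: json.dumps(v) if isinstance(v, list) else v
                         for k, v in row.items()})


buffer = io.StringIO()
write_manifest(
    [
        {"id": "syn1",
         "annotations": {"assayType": "genomic", "patientLocation": "Germany"},
         "derived": {"GS": True, "_accessRequirementIds": [1, 2, 3, 4]}},
        {"id": "syn4",
         "annotations": {"assayType": "genomic", "patientLocation": "USA"},
         "derived": {"jurisdiction": "HIPAA",
                     "_accessRequirementIds": [1, 2, 3]}},
    ],
    buffer,
)
print(buffer.getvalue())
```

Columns that only apply to some files (such as GS or jurisdiction here) are left empty for the others.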

AccessRequirement API Changes

New AccessRequirement Types

Currently, AccessRequirements (ARs) include a list of “subjectIds” that define which Entities (or Teams) the AR applies to. With the proposed changes, the subjectIds would no longer be provided as part of the AR. Instead, Synapse would “automatically” determine which files are associated with an AR based on the JSON schema of the project. This would most likely be a new AR type, so we can maintain backwards compatibility for existing ARs.

There are currently six types of ARs:

Currently all six ARs include a subjectIds list within the actual AR. Subjects are added to or removed from these ARs by updating the actual AR object using either the CREATE or UPDATE services. We will likely need to continue to maintain each of these AR types for the foreseeable future.

With this design, we are proposing a new system for assigning subjects to ARs. Rather than explicitly modifying the subjects of each AR, the new system will allow subjects to be “automatically” bound to ARs based on the new derived _accessRequirementIds annotations on Entities. We will likely need to apply this new system to three of the six AR types: SelfSignAccessRequirement, TermsOfUseAccessRequirement, & ManagedACTAccessRequirement. Rather than define multiple new AR types, we propose extending all ARs by adding the following property:

Code Block
languagejson
	"properties": {
...
		"subjectsDefinedByAnnotations": {
			"type": "boolean",
			"description": "Defaults to 'false'.  When 'true', the subjects controlled by this AR are defined by the '_accessRequirementIds' annotations on individual entities.  This property is mutually exclusive with 'subjectIds'.  If this is set to 'true' then 'subjectIds' must be excluded."
		},
...
		"subjectIds": {
			"type": "array",
			"description": "The IDs of the items controlled by this Access Requirement.  Required when creating or updating.",
			"transient": true,
			"items": {
        		"type":"object",
				"$ref":"org.sagebionetworks.repo.model.RestrictableObjectDescriptor"
			}
		}
...
	}

This new boolean allows the subjects of an AR to be defined by either the ‘subjectIds’ list or the _accessRequirementIds annotation.
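
The mutual-exclusivity rule stated in the property description could be checked along the following lines. This is a minimal sketch, assuming the AR is handled as a plain dictionary; the function name validate_access_requirement and the error message are hypothetical, and the sketch deliberately covers only this one rule.

```python
def validate_access_requirement(ar):
    """Raise ValueError when both subject mechanisms are used together.

    'ar' is a hypothetical dict representation of an AccessRequirement.
    """
    # The property defaults to false when it is absent.
    by_annotations = ar.get("subjectsDefinedByAnnotations", False)

    # When subjects come from annotations, 'subjectIds' must be excluded.
    if by_annotations and "subjectIds" in ar:
        raise ValueError(
            "'subjectIds' must be excluded when "
            "'subjectsDefinedByAnnotations' is true"
        )
    return ar


# An AR whose subjects are defined by annotations carries no subjectIds.
validate_access_requirement({"subjectsDefinedByAnnotations": True})

# An existing-style AR keeps its explicit subject list.
validate_access_requirement({"subjectIds": [{"id": "syn444", "type": "ENTITY"}]})

# Combining both mechanisms is rejected.
try:
    validate_access_requirement(
        {"subjectsDefinedByAnnotations": True, "subjectIds": []}
    )
except ValueError as error:
    print(error)
```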

The GET /accessRequirement/{requirementId} API returns the full list of ‘subjectIds’ for existing ARs. This means that the entire subject list must fit in both client-side and server-side memory. Considering that existing ARs are managed by hand, it is reasonable to assume that the full list will be small enough to prevent memory problems. In fact, it is common for the ‘subjectIds’ to be container IDs (Projects & Folders), to minimize the micromanagement required to maintain an AR. As a result, a short ‘subjectIds’ list can restrict thousands of Entities, since a container can contain up to 40K children.

This type of data compression is not likely to extend to new ARs with subjectsDefinedByAnnotations = true. While it will be possible to bind _accessRequirementIds annotations to containers, it is far more likely that these annotations will be bound to individual files. After all, the new derived annotations feature makes it easy to apply annotations to millions of entities with only a few lines of schema code. This means we must assume that the subjects of ARs with subjectsDefinedByAnnotations = true might not fit in memory. Therefore, we cannot return all of the subjects for such ARs for calls that GET the AR. However, since the subjects of such ARs are controlled by JSON schemas, it is not clear that listing the subjects will even be needed. If we find that we do need to provide all of the subjects of these new ARs, then we will need to add a new API that provides a paginated list of subjectIds to avoid out-of-memory problems.
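
If a paginated listing does turn out to be needed, it might behave roughly like the sketch below. No such API exists in this proposal yet, so every name here (list_subjects_page, iterate_subjects, the "page" and "nextPageToken" fields, the offset-based token) is a hypothetical choice made only to illustrate the idea of returning subjects one bounded page at a time.

```python
def list_subjects_page(subject_ids, next_page_token=None, page_size=2):
    """Return one page of subject IDs plus a token for the following page.

    'subject_ids' stands in for a database query; a real service would
    fetch only the requested slice rather than hold every ID in memory.
    """
    offset = int(next_page_token) if next_page_token is not None else 0
    page = subject_ids[offset:offset + page_size]
    end = offset + len(page)
    # A token is returned only while more results remain.
    token = str(end) if end < len(subject_ids) else None
    return {"page": page, "nextPageToken": token}


def iterate_subjects(subject_ids, page_size=2):
    """Yield every subject ID by following nextPageToken until it is None."""
    token = None
    while True:
        result = list_subjects_page(subject_ids, token, page_size)
        yield from result["page"]
        token = result["nextPageToken"]
        if token is None:
            break


subjects = ["syn1", "syn2", "syn3", "syn4", "syn5"]
print(list_subjects_page(subjects))
print(list(iterate_subjects(subjects)))
```

With this shape, neither the client nor the server ever needs to hold more than one page of subjects at a time.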

The _accessRequirementIds annotation Lifecycle

The above examples demonstrate the need for _accessRequirementIds annotations as derived annotations. We will be able to use these derived annotations to bind ARs to entities and to filter Entity data in views that include the _accessRequirementIds column. We do not currently have a use case for users to directly create or update _accessRequirementIds annotations on entities. Therefore, we will block all users from directly creating or updating _accessRequirementIds annotations.
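
The blocking rule could be enforced with a check like the one sketched here. It assumes that user-submitted annotations arrive as a dictionary keyed by annotation name; the function name check_user_annotations and the error message are hypothetical.

```python
RESERVED_KEY = "_accessRequirementIds"


def check_user_annotations(annotations):
    """Reject user-supplied annotations that use the reserved derived key."""
    if RESERVED_KEY in annotations:
        raise ValueError(
            f"'{RESERVED_KEY}' is a derived annotation and cannot be "
            "created or updated directly"
        )
    return annotations


# Ordinary annotations pass through unchanged.
check_user_annotations({"assayType": "genomic", "patientLocation": "USA"})

# A direct attempt to set the derived annotation is rejected.
try:
    check_user_annotations({"_accessRequirementIds": [4]})
except ValueError as error:
    print(error)
```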

Invalid Annotations

Currently, derived annotations are reevaluated for any type of Entity change event. This includes JSON schema binding change events and annotation change events. We will need to check the AR binding of an Entity each time the derived annotations are reevaluated. However, what happens if a change puts an entity into an invalid state? In such a case, we would not be able to determine what the correct derived annotations should be. By extension, we would not be able to determine the correct AR bindings of an invalid Entity. It seems wrong to allow users to download a file with no ARs simply because the file’s annotations are invalid. To prevent this case, we will automatically add an invalid-metadata-access-restriction to any file that has invalid annotations and is bound to a JSON schema that includes _accessRequirementIds. This invalid-metadata AR will function similarly to the existing LockAccessRequirement.
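
The decision described above can be sketched as a small function. This is an illustration only: the function name effective_access_requirements, its parameters, and the use of a string marker for the invalid-metadata restriction are assumptions, not part of the proposed API.

```python
INVALID_METADATA_AR = "invalid-metadata-access-restriction"


def effective_access_requirements(derived_ids, is_valid, schema_defines_ar_ids):
    """Return the ARs to enforce on a file after derived annotations are reevaluated.

    derived_ids: AR IDs from the derived _accessRequirementIds annotation.
    is_valid: whether the file's annotations validate against its schema.
    schema_defines_ar_ids: whether the bound schema includes _accessRequirementIds.
    """
    if not is_valid and schema_defines_ar_ids:
        # The correct AR bindings cannot be determined, so lock the file
        # instead of letting it fall through with no restrictions.
        return [INVALID_METADATA_AR]
    return list(derived_ids)


print(effective_access_requirements([1, 2, 3, 4], True, True))
print(effective_access_requirements([], False, True))
print(effective_access_requirements([], False, False))
```

A valid file keeps its derived ARs, an invalid file under a schema that defines _accessRequirementIds receives only the lock-style restriction, and an invalid file whose schema never mentions _accessRequirementIds is left unchanged.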