...
The goal for phase two will be to expand the use of DUO metadata to individual data files. We would also like to provide a system to “automatically” associate ARs with files based on their annotation values.
Background
As mentioned above, the Data Use Ontology (DUO) was developed to help standardize the data use nomenclature of biomedical data. Unfortunately, DUO was defined using the Web Ontology Language (OWL) in a manner that allows broad interpretation of the definitions. For example, DUO does not actually define the expected keys or their corresponding value types. However, DUO does define 26 categories or data restrictions, with some comments about how each might be used.
...
We chose to implement each category as a separate JSON schema, each consisting of at least one boolean property, using the ontology:shorthand as the key. For example, D0000007 uses the key DS. In each case, the default value of the boolean is false. In some cases, when a value of “true” is provided, extra data is expected. For example, D0000024, which is labeled “publication moratorium” and has a key of MOR, includes an extra field, MOR_date, to capture the moratorium date when MOR is set to true.
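As a rough illustration of the MOR case, the following Python sketch expresses one category schema as a dictionary and checks a set of annotations by hand. The schema layout (in particular the if/then used to require MOR_date) and the helper check_mor are assumptions made for illustration; they are not the published DUO schemas.

```python
from datetime import date

# Hypothetical schema for the "publication moratorium" category.
MOR_SCHEMA = {
    "properties": {
        "MOR": {"title": "publication moratorium", "type": "boolean", "default": False},
        "MOR_date": {"title": "publication moratorium date", "type": "string", "format": "date"},
    },
    # Assumed mechanism: MOR_date becomes required only when MOR is true.
    "if": {"properties": {"MOR": {"const": True}}, "required": ["MOR"]},
    "then": {"required": ["MOR_date"]},
}


def check_mor(annotations):
    """Return a list of problems found in the MOR-related annotations."""
    problems = []
    # A missing MOR falls back to the schema default (false).
    mor = annotations.get("MOR", MOR_SCHEMA["properties"]["MOR"]["default"])
    if not isinstance(mor, bool):
        problems.append("MOR must be a boolean")
    elif mor:
        raw = annotations.get("MOR_date")
        if raw is None:
            problems.append("MOR_date is required when MOR is true")
        else:
            try:
                date.fromisoformat(raw)
            except (TypeError, ValueError):
                problems.append("MOR_date must be an ISO date (YYYY-MM-DD)")
    return problems
```

An empty list means the annotations satisfy this one category. A real deployment would more likely hand the schema to a general JSON schema validator than hand-code each category.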
...
...
Next we want to set up some representative ARs that would be associated with the example project from the governance stories document. Let’s assume that a member of ACT has created the following ARs for this project:
access requirement ID | Name | Description | Expected Annotations |
---|---|---|---|
1 | Cancer Research Requirement | Data under this AR can only be used for cancer research. | RS = true; RS_research_type = Cancer |
2 | Ethics Approval Required | Data under this AR can only be accessed if the user has IRB approval. | IRB = true |
3 | Publication Moratorium | Data under this AR is under a publication moratorium that prohibits anyone who downloads it from publishing before this date. | MOR = true; MOR_date = 2022-05-20 |
4 | Germany Geographical Restriction | Data under this AR has a geographical restriction stating that the data cannot leave Germany. | GS = true; GS_location = Germany |
Access Requirements for the Project
Note: We will reference each of these ARs by its access requirement ID. The expected annotations column lists the DUO annotation key-value pairs expected on any file that falls under that AR.
The next step is to build a project-specific JSON schema that defines how DUO should be applied. It is expected that the DUO-specific elements and ARs in this schema would be the result of a collaboration between ACT and the Community Manager/Data Curator. The following JSON schema is an example of such a schema that could be bound to the project from the main governance narrative:
```json
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "title": "Schema for Some Project",
  "$id": "some.project-main-1.3",
  "description": "This schema defines how DUO should be used with Some Project.",
  "allOf": [
    {
      "$ref": "org.sagebionetworks-repo.model.FileEntity-1.0.0"
    },
    {
      "$ref": "ebispot.duo-duo-1.0.1"
    },
    {
      "if": {
        "properties": {
          "patientLocation": {
            "const": "Germany"
          },
          "assayType": {
            "const": "genomic"
          }
        }
      },
      "then": {
        "properties": {
          "GS": {
            "title": "geographical restriction",
            "type": "boolean",
            "const": true
          },
          "GS_location": {
            "type": "string",
            "description": "This data cannot leave Germany",
            "const": "Germany"
          },
          "_accessRequirementIds":{ "const": "4" }
        },
        "required": [
          "GS_location"
        ]
      }
    },
    {
      "if": {
        "properties": {
          "patientLocation": {
            "const": "USA"
          },
          "assayType": {
            "const": "genomic"
          }
        }
      },
      "then": {
        "properties": {
          "sourceGeography": {
            "const": "US"
          },
          "jurisdiction": {
            "const": "HIPAA"
          },
          "dataLabel": {
            "const": "De-identified"
          }
        },
        "required": [
          "sourceGeography",
          "jurisdiction",
          "dataLabel"
        ]
      }
    }
  ],
  "properties": {
    "assayType": {
      "description": "Identifies the type of data for this file.",
      "type": "string",
      "enum": [
        "clinical",
        "assay",
        "imaging",
        "genomic"
      ]
    },
    "patientLocation": {
      "description": "The location of the patient associated with the data",
      "type": "string",
      "enum": [
        "USA",
        "Germany"
      ]
    },
    "RS": {
      "title": "research specific restrictions",
      "type": "boolean",
      "const": true
    },
    "RS_research_type": {
      "title": "Restricted to cancer research",
      "type": "string",
      "const": "cancer"
    },
    "IRB": {
      "title": "ethics approval required",
      "type": "boolean",
      "const": true
    },
    "MOR": {
      "title": "publication moratorium",
      "type": "boolean",
      "const": true
    },
    "MOR_date": {
      "title": "publication moratorium date",
      "type": "string",
      "format": "date",
      "const": "2022-05-20"
    },
    "_accessRequirementIds":{ "const": "1,2,3" }
  },
  "required": [
    "assayType", "patientLocation", "NRES", "HMB", "DS", "POA", "RS",
    "NMDS", "GSO", "NPUNCU", "PUB", "COL", "IRB", "GS", "MOR", "TS",
    "US", "PS", "IS", "RTN", "GRU", "CC", "NPOA", "NPU", "NCU"
  ]
}
```
DUO applied to a Project
...
Notice that line:8 indicates that this schema applies to FileEntities, while line:11 indicates that the schema “extends” the DUO schema. The first two properties of the schema, assayType (line:75) and patientLocation (line:85), are drivers that determine which conditional properties must be applied to each file. There are two if/then blocks (line:13 to 73) that define which conditional properties should be applied based on assayType and patientLocation. The properties RS, RS_research_type, IRB, MOR, and MOR_date are all unconditional properties with constant values that must be applied to all files in the project.
Synapse would be expected to use these "_accessRequirementIds" properties as guidance to “automatically” associate files with ARs according to the rules defined in the schema. Specifically, "_accessRequirementIds":[ "4" ] (line:36) indicates that ARid:4 should be applied to any file with the annotations "assayType": "genomic" and "patientLocation": "Germany", while "_accessRequirementIds":[ "1","2","3" ] (line:119) indicates that ARids 1, 2, and 3 should be applied to all files in the project unconditionally.
The following JSON is an example of what “valid” properties could be for syn1 from the example above:
...
Since syn4 has "assayType": "genomic" and "patientLocation": "USA", according to the if/then statements it must also have the following constant properties: "sourceGeography": "US", "jurisdiction": "HIPAA", "dataLabel": "De-identified". Note: For syn4, "GS": false because the patient location does not equal Germany.
In both of these examples (syn1 & syn4) we did not include an annotation for "_accessRequirementIds" even though that was part of the schema. This is because "_accessRequirementIds" is a special key that helps Synapse determine how to associate ARs; it should not be stored as an actual annotation key-value pair. In fact, Synapse should reject any attempt to set an annotation with the special key "_accessRequirementIds".
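A minimal sketch of that rejection step is shown below. The function name and the error type are hypothetical; the only detail taken from this document is the reserved key itself.

```python
# Keys that are meaningful in the schema but must never be stored as annotations.
RESERVED_KEYS = frozenset({"_accessRequirementIds"})


def reject_reserved_keys(annotations):
    """Raise ValueError if the update tries to set a reserved key; otherwise pass it through."""
    bad = sorted(RESERVED_KEYS.intersection(annotations))
    if bad:
        raise ValueError(f"Reserved annotation key(s) not allowed: {', '.join(bad)}")
    return annotations
```

Running this check before the annotations are written would keep the reserved key out of storage regardless of which API the update arrives through.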
Derived Annotations
The examples above for both syn1 and syn4 indicate that all of the governance-specific metadata is derived from two sources:
...
It should be noted that a system that could automatically derive annotations would be useful for many external use cases. For example, one of the main JSON schema use cases involves setting annotations on files that are uploaded in bulk. For some of these use cases, a few key values provided by the uploader might be enough to automatically derive the rest of the key-value pairs.
...
Only actual annotations are to be considered. In other words, a derived annotation key-value pair cannot be used to derive another key-value pair.
Only JSON schema properties that are defined to have a constant value (for example: "const": "cancer") or a default value will be considered as derived annotation candidates.
If an actual annotation exists with the same key, the candidate will be eliminated. This means that derived annotations will never “correct” invalid actual annotations.
If the candidate is in an unreachable logic branch, then it will be eliminated. For example, if the candidate resides in a "then" block that is unreachable because the corresponding "if" evaluates to “false”, then the candidate will be eliminated.
Default values will only be used if there is no overriding “const” for the same key.
Any candidate that is not eliminated will be added to the results as a derived annotation key-value pair.
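The rules above can be sketched in Python as follows. This is a simplified illustration, not a proposed implementation: the function name is hypothetical, only "const" comparisons inside an "if" block are evaluated, a missing actual annotation makes the branch unreachable, and "$ref" resolution is ignored.

```python
def derive_annotations(schema, actual):
    """Return the derived key-value pairs for a file with the given actual annotations."""
    consts, defaults = {}, {}

    def gather(properties):
        # Only properties with a const or a default are candidates.
        for key, rule in properties.items():
            if "const" in rule:
                consts[key] = rule["const"]
            elif "default" in rule:
                defaults.setdefault(key, rule["default"])

    gather(schema.get("properties", {}))
    for branch in schema.get("allOf", []):
        condition = branch.get("if")
        if condition is None:
            continue
        # Conditions are evaluated against actual annotations only,
        # so a derived value can never trigger another derivation.
        reachable = all(
            actual.get(key) == rule["const"]
            for key, rule in condition.get("properties", {}).items()
            if "const" in rule
        )
        if reachable:
            gather(branch.get("then", {}).get("properties", {}))

    # A const overrides a default for the same key.
    candidates = {**defaults, **consts}
    # Candidates never replace ("correct") an actual annotation.
    return {key: value for key, value in candidates.items() if key not in actual}


# Trimmed-down example schema, assumed for illustration.
EXAMPLE_SCHEMA = {
    "allOf": [
        {
            "if": {"properties": {"patientLocation": {"const": "Germany"},
                                  "assayType": {"const": "genomic"}}},
            "then": {"properties": {"GS": {"const": True},
                                    "GS_location": {"const": "Germany"}}},
        },
        {
            "if": {"properties": {"patientLocation": {"const": "USA"},
                                  "assayType": {"const": "genomic"}}},
            "then": {"properties": {"sourceGeography": {"const": "US"}}},
        },
    ],
    "properties": {
        "assayType": {"type": "string"},
        "patientLocation": {"type": "string"},
        "GS": {"type": "boolean", "default": False},
        "RS": {"type": "boolean", "const": True},
        "RS_research_type": {"type": "string", "const": "cancer"},
    },
}
```

For a genomic file from the USA, this sketch yields GS = false from the default, while a genomic file from Germany gets GS = true because the reachable "const" overrides the default.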
...
Each API returns the annotations of the given entity id/(version). In order to get the derived annotations of an Entity, we propose extending each of these APIs to include a new boolean parameter named “includeDerivedAnnotations” with a default value of “false”. When “includeDerivedAnnotations=true”, the results will include both the actual annotations and the derived annotations.
Note: Each service currently returns the Entity’s “etag” in addition to the annotations. When a user wishes to update the annotations of an entity, they must include the provided “etag” with the update request. However, when “includeDerivedAnnotations=true” each service will not return the “etag”. This is done to prevent the user from accidentally updating the annotations of an Entity with the transient derived annotations.
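The etag behavior described in this note could look roughly like the sketch below. The response field names ("id", "annotations", "etag") and the function name are assumptions for illustration only and are not taken from the existing services.

```python
def build_annotations_response(entity_id, etag, actual, derived, include_derived=False):
    """Assemble an annotations response, omitting the etag when derived values are mixed in."""
    response = {"id": entity_id, "annotations": dict(actual)}
    if include_derived:
        # Derived values fill gaps only; they never replace an actual annotation.
        for key, value in derived.items():
            response["annotations"].setdefault(key, value)
        # No etag is returned, so this payload cannot be sent back as an update.
    else:
        response["etag"] = etag
    return response
```

Because an update request must carry the etag, a response built with include_derived=True cannot be replayed as an update, which is the safeguard the note describes.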
...
We will need to provide a service that would list the derived keys for a given entity id:
Response | API | Description |
---|---|---|
List<String> derivedKeys | GET /entity/{id}/derivedKeys | Get the derived keys for the given Entity ID |
Entity View API Changes
Currently, an EntityView is configured with a list of ColumnModels that define the schema of that view. Users will typically use the following asynchronous service to get the possible ColumnModels when setting the schema of their views: POST /column/view/scope/async/start. In order to create an EntityView that includes derived columns, we propose extending this API’s request object, ViewColumnModelRequest, to include a boolean parameter named “includeDerivedColumns” with a default value of “false”. When this parameter is set to “true”, the service will include derived columns as possible results. In this way, users will be able to configure their views to include derived columns.
...
Currently, AccessRequirements (ARs) include a list of “subjectIds” that define which Entities (or Teams) the AR applies to. With the proposed changes, the subjectIds would no longer be provided as part of the AR. Instead, Synapse would “automatically” determine which files are associated with an AR based on the JSON schema of the project. This would most likely be a new AR type, so we can maintain backwards compatibility for existing ARs.