
Data Passports

Synapse can host controlled data and has governance features allowing the Sage Governance team (or their delegates) to approve access to said data. GA4GH Data Passports allow approval to be given by an authority which is external to Synapse.

In the GA4GH Data Passport parlance a “Broker” is a system that issues information about researchers (data passports) that can be used to grant access to data. This would be a governance system external to Synapse. A Claim Clearinghouse is a system that receives data passports and makes decisions about whether to allow access to data. Synapse would be considered such a clearinghouse. The terminology is defined here.

GA4GH Data Passports build on OpenID Connect (OIDC), which is itself an extension of OAuth 2.0. It’s worth reviewing how a user logs in to Synapse using an OIDC identity provider (IdP) like Google:

The flow starts when the user indicates to Synapse that they want to log in with an external IdP. Synapse redirects the browser (the “user agent”) to the IdP, which, after authenticating the user, returns an authorization code. This is forwarded to Synapse, which uses its so-called client credentials to exchange the authorization code for an access token, an id token and (optionally) a refresh token.

The inclusion of the id token is the fundamental extension to OAuth 2.0 by OpenID Connect: in addition to authorizing access (via the access token), the IdP returns information it has about the user. The id token is a JSON Web Token (JWT), so it has a JSON payload, i.e. a key-value map. The keys are “claims” about the user, like “family_name” or “email”, and the values are the user data.

If the IdP is a so-called “Broker” then it can return a GA4GH data passport. The claim name is “passport_jwt_v11” and the value is another, embedded JWT which has a claim “ga4gh_passport_v1”. The value of this claim is an array of GA4GH “visas”, described further below. An example from NIH RAS is here.
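To make the claim structure concrete, here is a minimal Python sketch of pulling the visa array out of an id token. The function names are hypothetical and are not part of Synapse. The sketch deliberately skips signature verification, which a real clearinghouse would have to perform against the Broker's published keys before trusting any claim. It handles both the embedded “passport_jwt_v11” form and a top-level “ga4gh_passport_v1” claim.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the JSON payload of a JWT WITHOUT verifying its signature.

    Illustration only: production code must verify the signature first.
    """
    # A JWT is three base64url segments: header.payload.signature
    payload_segment = token.split(".")[1]
    # base64url segments omit padding; restore it before decoding
    padded = payload_segment + "=" * (-len(payload_segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def extract_visas(id_token: str) -> list:
    """Return the list of visa JWTs carried by an id token, or [] if none."""
    claims = decode_jwt_payload(id_token)
    # Some brokers embed the passport as a nested JWT under "passport_jwt_v11"
    if "passport_jwt_v11" in claims:
        claims = decode_jwt_payload(claims["passport_jwt_v11"])
    return claims.get("ga4gh_passport_v1", [])
```

Each element of the returned list is itself a signed JWT (a visa) and would be decoded and verified in the same way.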

The OIDC specification provides for defining an expiration for user information. That is, the IdP can indicate that the recipient of an id token should only consider the user information valid for a limited time. Once the information has expired, a new id token should be obtained. An OIDC IdP has a “/userInfo” endpoint to which an access token can be passed. The result is a new collection of user info, returned either as a JWT or as a JSON object.

Note that the refresh of the user info can be done without the user’s involvement, so long as a valid access token is available.

Access tokens can expire, and OAuth 2.0 provides for a ‘refresh token’ which can be exchanged (along with client credentials) for a new access token. The IdP may rotate the refresh token and also return a new, fresh id token. Again, this can be done without the user’s involvement, so long as a valid refresh token is available.
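As a rough illustration of the refresh exchange, the sketch below builds (but does not send) the HTTP request for the OAuth 2.0 refresh-token grant. It assumes the IdP accepts client credentials via HTTP Basic authentication. The function name and all argument values are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_refresh_request(token_endpoint: str, client_id: str,
                          client_secret: str, refresh_token: str) -> urllib.request.Request:
    """Build the POST request that exchanges a refresh token for new tokens."""
    # Form-encoded body of the OAuth 2.0 refresh-token grant
    body = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    }).encode()
    # Client credentials sent as HTTP Basic auth (an assumption about the IdP)
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        token_endpoint,
        data=body,
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` would return a JSON token response. If that response contains a new refresh token, it should replace the stored one.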

As mentioned above the id token or user info returned by a broker contains a passport in the form of a claim named “ga4gh_passport_v1”. The format, given here, is:

{ "ga4gh_passport_v1": [ "<eyJhbGciOiJI...aaa>", "<eyJhbGciOiJI...bbb>" ] }

The passport is an array of visas, each encoded as a signed JWT. There are multiple types of visas. The variety of types is open ended but the GA4GH spec' defines some particular types here. It’s important to note that while some visas are assertions about the researcher (like AffiliationAndRole, ResearcherStatus) others refer to specific data sets, like ControlledAccessGrants. The reason to have these different types is illuminated by this overview article on GA4GH Data Passports, which differentiates between “registered access” and “controlled access” models. Regarding registered access:

Registered access models are a type of role-based access to datasets.

while for controlled access:

In the DAC review phase, the DAC must verify the identity of the data user and determine if the proposed research is within the bounds of the permitted use(s) of the dataset. If approved, the data user and their institution must agree to the terms of use of the repository’s data through a data use or processing agreement. In the data use phase, the data user gains access to the dataset(s).

We may then conceive of different sorts of passport-linked access requirements in Synapse. One type would grant access to data if a researcher is indicated to have a certain status by a trusted Broker. Another type would grant access only if a visa provided by a trusted Broker indicates that the user has access to a certain (controlled) data set. We would expect that in the visa the data set would be referred to by its ID in the namespace known to the Broker, as opposed to its Synapse ID. Therefore the Synapse access requirement would have to include the former ID in order to be able to evaluate the user’s visas.

An example of a ResearcherStatus visa taken from here is:

"ga4gh_visa_v1": { "type": "ResearcherStatus", "asserted": 1549680000, "value": "https://doi.org/10.1038/s41431-018-0219-y", "source": "https://grid.ac/institutes/grid.240952.8", "by": "so" }

The Synapse access requirement would, at a minimum, be configured with the URI seen in the ‘value’ field. An example of a ControlledAccessGrants visa (from the same source) is:

... "ga4gh_visa_v1": { "type": "ControlledAccessGrants", "asserted": 1549632872, "value": "https://example-institute.org/datasets/710", "source": "https://grid.ac/institutes/grid.0000.0a", "by": "dac" }

Again, the Synapse access requirement would, at a minimum, be configured with the data set ID seen in the ‘value’ field.
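The evaluation described above might be sketched as follows. This is an assumption about how such an access requirement could be modeled, not an existing Synapse class: `PassportAccessRequirement` and `is_satisfied` are hypothetical names, and the visas are assumed to have been decoded and signature-verified already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassportAccessRequirement:
    """Hypothetical passport-linked access requirement."""
    visa_type: str              # e.g. "ResearcherStatus" or "ControlledAccessGrants"
    value: str                  # URI or Broker-side data set ID to match
    trusted_issuers: frozenset  # "iss" values of Brokers trusted for this requirement


def is_satisfied(requirement: PassportAccessRequirement, visas: list) -> bool:
    """Return True if any visa payload meets the requirement.

    `visas` is a list of decoded visa JWT payloads (dicts).
    """
    for visa in visas:
        body = visa.get("ga4gh_visa_v1", {})
        if (visa.get("iss") in requirement.trusted_issuers
                and body.get("type") == requirement.visa_type
                and body.get("value") == requirement.value):
            return True
    return False
```

For a ControlledAccessGrants requirement, `value` holds the data set ID in the Broker's namespace (such as the “https://example-institute.org/datasets/710” value in the example visa), while the mapping to a Synapse entity lives on the Synapse side.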

 

Implementation Considerations

In many places, Synapse needs to rapidly answer the question of which of one or more entities a user is authorized to download. The determination reflects access requirements placed on the entities and uses corresponding access approvals, stored in the Synapse database, to answer the question. Moreover, there are use cases in which a headless user agent (e.g., a batch data processing job) seeks to download data on behalf of a user. In such cases the user agent can’t be redirected to a data passport broker to retrieve user info via the OAuth flow. We should therefore adopt a model in which, when a user first authenticates to a broker, their user info, access token and refresh token are captured in Synapse so that Synapse can maintain up-to-date visa information, which can be used to answer authorization questions without a user’s involvement.

Passport Expiration

The token response includes an 'expires_in' field (defined by OAuth 2.0 and inherited by OIDC). This is the time, in seconds, until the provided access token expires. The client can use this to decide when to use the refresh token to get a new access token. Note that doing so may also update the refresh token.

ID tokens, being JWTs, have an 'exp' timestamp, which is the epoch time after which the user information should no longer be considered valid. A passport should not be respected beyond this time limit.

Visas have an "asserted" field which is the timestamp (in epoch seconds) when an authority asserted what the visa claims.

The GA4GH spec' suggests that clients use this to decide whether to respect a visa. If this timestamp is not used, then the minimal check is that of the JWT "exp" timestamp.
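A minimal sketch of such a check is shown below. The function name and the `max_assertion_age` parameter are hypothetical. The policy of rejecting assertions older than a configured age is one possible way to use "asserted", not something the source prescribes.

```python
import time


def visa_is_current(visa: dict, now: float = None,
                    max_assertion_age: float = None) -> bool:
    """Decide whether a decoded visa payload should still be respected.

    `visa` is the decoded visa JWT payload. `max_assertion_age` (seconds) is
    an optional, clearinghouse-chosen limit on the age of the assertion.
    """
    now = time.time() if now is None else now
    # Minimal check: the JWT "exp" timestamp must not have passed
    if now >= visa["exp"]:
        return False
    # Optional stricter check based on when the authority made the assertion
    if max_assertion_age is not None:
        asserted = visa["ga4gh_visa_v1"]["asserted"]
        if now - asserted > max_assertion_age:
            return False
    return True
```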

Synapse could periodically examine the expiration of the access token (derived from 'expires_in') and the 'exp' timestamp of the id token it currently holds for each user. If the access token is close to expiration, it could update both the access and id tokens. If the access token is not close to expiration but the id token's 'exp' is, then it could update just the id token. When an id token is updated the corresponding user's access approvals would be updated (created or deleted) accordingly.
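The periodic check could look roughly like the following sketch. All names and the five-minute default margin are assumptions for illustration. Because 'expires_in' is relative to the time the token response was received, the sketch assumes Synapse records that receipt time and converts it to an absolute expiry.

```python
from enum import Enum


class RefreshAction(Enum):
    NONE = "none"
    REFRESH_TOKENS = "refresh_tokens"      # use the refresh token: new access and id tokens
    REFRESH_ID_TOKEN = "refresh_id_token"  # access token still valid: update user info only


def access_token_expiry(received_at: float, expires_in: float) -> float:
    """Convert the relative 'expires_in' into an absolute epoch time."""
    return received_at + expires_in


def choose_refresh_action(access_expires_at: float, id_token_exp: float,
                          now: float, margin: float = 300.0) -> RefreshAction:
    """Pick the refresh step for one user's stored tokens."""
    # The access token takes priority: without it, user info cannot be re-fetched
    if access_expires_at - now <= margin:
        return RefreshAction.REFRESH_TOKENS
    if id_token_exp - now <= margin:
        return RefreshAction.REFRESH_ID_TOKEN
    return RefreshAction.NONE
```

After either refresh action, the user's visas would be re-evaluated and the corresponding access approvals created or deleted.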

 

Client Provided Passports

It seems some in the GA4GH community view passports as being provided to the Clearinghouse by the client, rather than the Clearinghouse retrieving them from Brokers as proposed above. See:

https://docs.google.com/document/d/1amOGLwAbKkMSU6up_dHGhsuUF9SVFlllO0qjQSJYuVI/edit#heading=h.hh644d5mw17b

In the diagram shown in “Approach 2”, the client (the blue column in the sequence diagram) will receive a userInfo object from the RAS server with a subject ID “paired” to that client by RAS. The Auth server, upon receiving a passport containing that subject ID, will not be able to resolve it against the subject IDs it received from RAS. It will not “know” which of its own users the passport represents. The question then is what is required by security compliance standards (HIPAA, FISMA) and Sage Governance with respect to tracking the identity of users who download controlled data.