Sage Portal as a Repository Aggregator
Today Sage can create custom data dissemination portals using Synapse Repository Services as the back-end. A new need (first seen in the AMP-ALS Portal project) is to aggregate content from multiple data repositories so that researchers can browse, search, query, and select data for analysis across the multiple platforms. Additionally, the portal should (1) let the user enjoy single sign-on and (2) display user-specific state, most importantly whether the current user is approved to access each data item. Here is an example of portal functionality which is intentionally simplified but contains sufficient complexity to illustrate the platform integration challenge:
A user visits the portal and logs in. They see a table of available files, as shown below. The table lists the files (paginated) along with various attributes (which can be used as filters) and also shows whether the user has access to each file. The user queries for all files with “Attribute A”=“Value-1”. The result is more than a page long, and they click through the multiple pages. Along the way they might see this:
File | Attribute A | Attribute B | User Has Access |
File Syn101 | Value-1 | Value-B1 | YES |
File Syn102 | Value-1 | Value-B2 | YES |
File DAP301 | Value-1 | Value-B3 | NO |
File DAP302 | Value-1 | Value-B4 | YES |
File Syn103 | Value-1 | Value-B5 | NO |
(where we’ve added color to emphasize the source of each piece of data). Note that the rightmost column, “User Has Access”, is private information, displayed only to the authenticated user. Where the user lacks access there is a link to the host system, where the user can request access.
A particularly sophisticated feature of the Synapse portals is the “Download Cart”. This feature keeps track of all the items selected for download and, importantly, whether access approval is required for each item. For example, the portal can issue a file statistics request for the user’s download cart and receive a FilesStatisticsResponse, including information like numberOfFilesRequiringAction, “These are files that require some action on the user's part in order gain download access.” The following screen shot is from the AMP-ALS Design Figma:
Another type of request is the ActionRequiredRequest, which returns the specific actions required to access the files in the download cart. One type of required action is to fulfill an access requirement. Similar information is contained in the response to a table view query, i.e., “the list of actions required for any file in the query [result]”, and can also be retrieved using the /restrictionInformation service for a single file, or the POST /restrictionInformation/batch service for a batch of files.
In existing portals the Synapse back-end can compute this information because it possesses all the required information:
Index of available file data;
Download cart (a subset of the indexed files);
Access requirements, applied to folders and/or files;
Access approvals, each related to an access requirement and a user
However, in the AMP-ALS portal the Synapse back end won’t be the source of information about users' access approval. Instead it has to retrieve this information from remote data repository(ies).
The portal needs to know whether the user’s account in the remote repository is already connected to their Synapse account. The following screen shot is from the AMP-ALS Design Figma:
For accessing the RDCA-DAP/Aridhia APIs, Aridhia has proposed a “token exchange service” which would allow the AMP-ALS portal to obtain an Aridhia access token based on the user’s AMP-ALS/Synapse identity token. They further propose to match the Synapse token to a user’s RDCA-DAP account based on email address.
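As a rough illustration of what such an exchange request could carry, the sketch below builds a form-encoded body using the parameter names from RFC 8693 (OAuth 2.0 Token Exchange). Whether Aridhia's service follows RFC 8693 is an assumption, as are the function name and the `audience` value; nothing is sent over the network.

```python
from urllib.parse import urlencode

# URNs defined by RFC 8693. Using them here ASSUMES the proposed
# service follows that standard, which the proposal does not state.
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ID_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:id_token"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_token_exchange_body(synapse_id_token: str, audience: str) -> str:
    """Return the form-encoded body of an RFC 8693-style exchange request.

    `audience` is a hypothetical identifier for the RDCA-DAP API.
    """
    if not synapse_id_token:
        raise ValueError("a Synapse identity token is required")
    return urlencode({
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": synapse_id_token,
        "subject_token_type": ID_TOKEN_TYPE,
        "requested_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,
    })
```

The body would be POSTed to the exchange endpoint with content type `application/x-www-form-urlencoded`; the endpoint URL and the shape of the response remain to be defined with Aridhia.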
Proposed Design
In the proposed design, metadata are aggregated from multiple portals into Synapse File Views (with external files referencing resources stored in non-Synapse repositories). Therefore most of the existing Portal components remain unchanged. The novel requirement is to retrieve the access approval state for data items whose access approval is managed externally to Synapse.
We propose adding a new type of “external” access requirement (AR) to Synapse. It would contain, at a minimum, two identifiers: one for the linked platform containing the data and another for the equivalent access requirement in that platform, of which the Synapse access requirement is a mirror. Sage’s governance team would craft the AR and apply it to a set of Synapse external files, one for each data item to be accessed in the remote system. The external files would be annotated with metadata from the partner system and added to the file view(s) which power the portal.
With the external AR in place, Synapse can then retrieve the approval information from RDCA-DAP to complete requests like the ActionRequiredRequest for the download cart. The workflow would be:
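A minimal sketch of the data shape this implies, assuming nothing beyond the two identifiers named above. The class name, field names, and the `apply_to_files` helper are hypothetical and not part of the Synapse model today.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ExternalAccessRequirement:
    """Hypothetical shape of the proposed "external" AR."""
    ar_id: int                    # Synapse-side ID of this AR
    platform_id: str              # the linked platform, e.g. "RDCA-DAP"
    external_requirement_id: str  # the equivalent requirement in that platform


def apply_to_files(
    ar: ExternalAccessRequirement, file_ids: Iterable[str]
) -> Dict[str, ExternalAccessRequirement]:
    """Map each external file to the AR that governs it."""
    return {file_id: ar for file_id in file_ids}
```

Keeping the platform identifier on the AR, rather than on each file, would let the back end decide which remote repository to contact from the AR alone.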
The new actions required would include (1) the action of linking the user’s external account and (2) the action of obtaining access approval in the external system. Note that the /restrictionInformation and /restrictionInformation/batch services are synchronous today, so they would have to be modified to allow for potentially time-consuming external communication.
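The two new actions can be sketched as a simple decision per file in the download cart. This is an illustration only: the enum, the function, and its inputs are hypothetical, and it assumes that account linking must come before any approval lookup.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List, Set


class Action(Enum):
    LINK_EXTERNAL_ACCOUNT = "link_external_account"
    OBTAIN_EXTERNAL_APPROVAL = "obtain_external_approval"


def actions_required(
    cart: List[str],
    file_requirement: Dict[str, str],  # file ID -> external requirement ID
    account_linked: bool,
    approved_requirements: Set[str],   # requirement IDs approved remotely
) -> Dict[Action, int]:
    """Count the files in the cart that need each external action."""
    counts: Counter = Counter()
    for file_id in cart:
        requirement = file_requirement.get(file_id)
        if requirement is None:
            continue  # not governed by an external AR
        if not account_linked:
            # Approval state cannot be looked up until accounts are linked.
            counts[Action.LINK_EXTERNAL_ACCOUNT] += 1
        elif requirement not in approved_requirements:
            counts[Action.OBTAIN_EXTERNAL_APPROVAL] += 1
    return dict(counts)
```

The result mirrors the {action, count} pairing that the existing actions-required response uses for Synapse-native requirements.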
Variations of this basic idea to consider are:
The token exchange could return a GA4GH Passport with the access approval information embedded in Visa format. If so, then no subsequent API requests need to be made to gather that information. (But note that to do this we would have to address the issue of the user having a very large number of access approvals in the linked system.)
The token exchange could take as input a refresh token, obtained by explicitly linking the user’s RDCA-DAP account to their Synapse account, i.e. Synapse being an OIDC client of RDCA-DAP.
RDCA-DAP could be configured to be an OIDC client of Synapse. During token exchange it can match the Synapse ID token to the user’s RDCA-DAP account based on the “subject” claim rather than by email (the subject claim being more reliable, as Synapse users can change their email address).
The request to the token exchange service could be made by the Portal rather than by the Synapse back end, and the resulting token passed along in requests like the ActionRequiredRequest. In the case of multiple integrated repositories, this puts the burden on the Portal to determine which repositories to contact. This burden could be mitigated by putting into the actions required part of the response all the details necessary for the Portal to act.
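For the Passport variation above, the sketch below shows how approvals might be read from Visas. It relies on the GA4GH claim names `ga4gh_passport_v1` and `ga4gh_visa_v1` and the `ControlledAccessGrants` visa type; that RDCA-DAP would issue Visas in this form is an assumption. Signature and expiry checks are deliberately omitted, so this is not suitable for production use, and `make_unsigned_jwt` exists only to build sample input.

```python
import base64
import json
from typing import Set


def decode_jwt_payload(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying its signature (sketch only)."""
    payload = token.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def approved_values(passport_claims: dict) -> Set[str]:
    """Collect the `value` of every ControlledAccessGrants visa."""
    approved = set()
    for visa_jwt in passport_claims.get("ga4gh_passport_v1", []):
        visa = decode_jwt_payload(visa_jwt).get("ga4gh_visa_v1", {})
        if visa.get("type") == "ControlledAccessGrants":
            approved.add(visa["value"])
    return approved


def make_unsigned_jwt(claims: dict) -> str:
    """Build a fake, unsigned token for demonstration purposes only."""
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    return "header." + body.rstrip(b"=").decode() + ".signature"
```

Each visa `value` would then need to be matched against the external requirement identifier stored on the Synapse AR; the matching convention is still to be agreed.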
Other Functionality
While the problem of integrating data access approval into data navigation is the most complex integration problem, there are a few other integration points:
Determining whether an account is linked
This could be done using the proposed token exchange service. If the service returns no token, then the two accounts have not been linked. If we implement an OIDC integration, then this would be done differently.
Data access request initiation
Aridhia has proposed this be done through their own UI, by presenting their data access request portal in an IFrame within the AMP-ALS portal. A similar solution could be achieved by redirecting the user to RDCA-DAP in a separate browser tab. An alternative is for the AMP-ALS Portal to present its own data access request form and then forward the content to RDCA-DAP through its APIs. The exchange might look like:
File download
File download by the Portal could be enabled by the proposed token exchange service:
An alternative would be to perform the token exchange in the Synapse backend and return a presigned URL to the Portal.
Work Items
The work to perform this integration:
Back end
Determine how to get access approval information from the remote repository of interest (or negotiate a new interface). API docs: https://fair.dap.c-path.org/api/docs
Define a new type of access requirement, e.g. as proposed here. It must be possible to ‘match up’ access requirements with access approval information from RDCA-DAP.
Extend Synapse’s file statistics services (and any other services based on access approval status) to use access approval information retrieved from RDCA-DAP.
Front end
Present DAR UI
Determine if user has RDCA-DAP account.
Download files from RDCA-DAP, using their token exchange service.
Governance UI for creating external access requirement
Rendering the new actions required
Variation: New Aggregating Service
A variation on the design to consider is one which introduces a new back-end service to aggregate information from Synapse and RDCA-DAP. The design discussed above is based on the Portal using Synapse as its back-end, and Synapse acting as a client to RDCA-DAP’s REST APIs:
In the alternative, the Portal interacts with a new back-end, which in turn connects to both Synapse and RDCA-DAP. Synapse would not interact with RDCA-DAP at all.
The new Server would be an identity federator. Through the Portal it would link the user’s Synapse and RDCA-DAP accounts. When Portal users log in, they would log in to the Server, not to either of the linked systems (though the Server could optionally delegate login to either or both Synapse and RDCA-DAP). The Server would be a proxy for all requests issued by the Portal. The flow for a request to Synapse would be:
Note that when the Portal logs in, it receives an access token (AT) for the Service (not for either linked data repository). This token is labelled “AT-Service”. (The prerequisite account linking is not shown in this diagram.) When the Portal requires a resource which is held by Synapse, it sends a request to the Service, which recognizes the request as being relevant to Synapse, obtains a Synapse access token (“AT-SYN”), issues the request, and returns the result. In short, the Service acts as a proxy to the Synapse REST APIs. The Service could similarly interact with RDCA-DAP to retrieve resources held there, and perhaps issue data access requests.
Certain Portal requests require more complex business logic in the Service. An example is the Synapse Action Required request for the Download list. The response is a list of {Action, count}, where the Action describes what the user needs to do to obtain access, and the count is the number of files requiring this action. For actions requiring approval in RDCA-DAP, the Server would issue the appropriate request to that system and then update the result to indicate to the Portal whether the action is still outstanding or has been satisfied.
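One way the Server might annotate the list is sketched below. The dictionary keys (`externalRequirementId`, `satisfied`) and the `external_check` callback are hypothetical, and the sketch assumes that Synapse lists only outstanding actions, so Synapse-native entries stay unsatisfied.

```python
from typing import Callable, Dict, List


def annotate_actions(
    actions: List[Dict],
    external_check: Callable[[str], bool],
) -> List[Dict]:
    """Mark each {action, count} entry as satisfied or still outstanding.

    `external_check` stands in for a request to RDCA-DAP and returns
    True when the user already holds the approval.
    """
    result = []
    for entry in actions:
        annotated = dict(entry)  # leave the proxied response untouched
        external_id = entry["action"].get("externalRequirementId")
        if external_id is None:
            annotated["satisfied"] = False  # native action, still outstanding
        else:
            annotated["satisfied"] = external_check(external_id)
        result.append(annotated)
    return result
```

Whether satisfied entries are flagged, as here, or removed from the list altogether is a design choice for the Portal contract.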
A second example is the Synapse FileStatistics request for the Download List. The response has the fields:
totalNumberOfFiles
numberOfFilesAvailableForDownload
numberOfFilesAvailableForDownloadAndEligibleForPackaging
numberOfFilesRequiringAction
sumOfFileSizesAvailableForDownload
Since some of these values depend on information from RDCA-DAP, the request can’t be handled by simply proxying the service. Rather, the Server would have to use the Action Required service (and possibly others) to compute the aggregated result before returning it to the Portal.
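A sketch of that aggregation, once the RDCA-DAP approval state has been merged into each cart entry. The `CartFile` type and its fields are hypothetical, and the sketch assumes every file is either available for download or requires action.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CartFile:
    file_id: str
    size_bytes: int
    requires_action: bool  # after merging in the RDCA-DAP approval state
    packageable: bool      # eligible for packaging (hypothetical flag)


def file_statistics(files: List[CartFile]) -> Dict[str, int]:
    """Compute the response fields listed above for a download cart."""
    available = [f for f in files if not f.requires_action]
    return {
        "totalNumberOfFiles": len(files),
        "numberOfFilesAvailableForDownload": len(available),
        "numberOfFilesAvailableForDownloadAndEligibleForPackaging":
            sum(1 for f in available if f.packageable),
        "numberOfFilesRequiringAction": len(files) - len(available),
        "sumOfFileSizesAvailableForDownload":
            sum(f.size_bytes for f in available),
    }
```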
A third example is the Synapse Eligible Files request used by the Portal to list the files it can download, a subset of the files added to the Download List. Since Synapse lacks information about whether RDCA-DAP files can be downloaded, the Service would have to compute this list by retrieving the download cart contents and determining eligibility incorporating access approval information from RDCA-DAP. Note the challenges in subselecting data from a paginated list, while maintaining a consistent page size.
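The pagination challenge can be illustrated with a loop that keeps reading upstream pages until a full page of eligible files has been collected. This is a sketch under assumptions: `fetch_page(offset, limit)` and `is_eligible` are hypothetical stand-ins for the download cart listing and the merged approval check, and the cursor is a plain upstream offset.

```python
from typing import Callable, List, Optional, Tuple


def eligible_page(
    fetch_page: Callable[[int, int], List[str]],
    is_eligible: Callable[[str], bool],
    start_offset: int,
    page_size: int,
) -> Tuple[List[str], Optional[int]]:
    """Return (eligible files, next upstream offset).

    The next offset is None once the upstream list is exhausted.
    """
    items: List[str] = []
    offset = start_offset
    while len(items) < page_size:
        batch = fetch_page(offset, page_size)
        if not batch:
            return items, None  # upstream exhausted
        for index, file_id in enumerate(batch):
            if is_eligible(file_id):
                items.append(file_id)
                if len(items) == page_size:
                    # Resume just after the last file consumed.
                    return items, offset + index + 1
        offset += len(batch)
    return items, offset
```

Returning the upstream offset rather than a page number keeps pages a consistent size, at the cost of a variable number of upstream requests per Portal page.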
A fourth example is the Restriction Information Batch service used by the Portal. As with the earlier examples, the Service must query RDCA-DAP for information about the files in the request that RDCA-DAP holds. The challenge of building the service is lessened by the fact that it is not paginated. (The request is limited to 50 files.) However, it would likely be replaced with an asynchronous service to reflect the potential delays in relaying requests to multiple systems. The response should also reflect the data contributor bypass feature of Synapse, which allows data contributors to download data without going through a Governance process.
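A sketch of how the Service might fan a batch out to both systems concurrently and merge the answers in request order. The fetcher signatures, the `make_stub` stand-in, and the result keys are illustrative assumptions; the data contributor bypass is not modelled.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

MAX_BATCH = 50  # request limit noted above

Fetcher = Callable[[List[str]], Awaitable[Dict[str, dict]]]


async def restriction_info_batch(
    file_ids: List[str],
    fetch_synapse: Fetcher,
    fetch_external: Fetcher,
    is_external: Callable[[str], bool],
) -> List[dict]:
    """Query both systems at once and return results in request order."""
    if len(file_ids) > MAX_BATCH:
        raise ValueError(f"at most {MAX_BATCH} files per request")
    native = [f for f in file_ids if not is_external(f)]
    external = [f for f in file_ids if is_external(f)]
    native_info, external_info = await asyncio.gather(
        fetch_synapse(native), fetch_external(external)
    )
    merged = {**native_info, **external_info}
    return [merged[f] for f in file_ids]


def make_stub(has_access: bool) -> Fetcher:
    """Stand-in for a real HTTP client call (illustration only)."""
    async def fetch(file_ids: List[str]) -> Dict[str, dict]:
        await asyncio.sleep(0)  # placeholder for network latency
        return {
            f: {"objectId": f, "hasUnmetAccessRequirement": not has_access}
            for f in file_ids
        }
    return fetch
```

A call such as `asyncio.run(restriction_info_batch(ids, make_stub(True), make_stub(False), lambda f: f.startswith("DAP")))` exercises the merge without any network access.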
A fifth example is the action required column incorporated in the file view response as invoked in the QueryBundleRequest. As in the Action Required request proxy service, the Service would have to fill in the Action column for external files, noting the challenge in managing a paginated response.
Considerations of introducing a new service
The advantage of this approach is that it creates a clean separation of concerns, in which Synapse doesn’t have to make any requests to RDCA-DAP. While at a high level the approach would also appear to create a path to disconnecting the Portal from Synapse as a back-end, fully disconnecting them would require reproducing a large number of services on which the Portal depends today, which would take several engineer-years of effort. Even the initial implementation would require, at a minimum, rewriting the five services described above.