
Data Repository Service (DRS)

Introduction

The Data Repository Service (DRS) API provides a generic interface to data repositories so that data consumers, including workflow systems, can access data objects in a single, standard way regardless of where they are stored and how they are managed. The primary function of DRS is to map a logical ID to a means of physically retrieving the data represented by that ID.

DRS gives data producers a standard way to make their data available to data consumers while supporting the control needs of the former and the access needs of the latter. It must also be interoperable, so that anyone who builds access tools and systems can be confident they will work with all published data, and anyone who publishes data can be confident it will work with all such tools.

Use Cases

A data producer has stored data on the Synapse platform, and a data consumer uses some other platform to consume data. Without a common interface that provides access across platforms, the consumer has no way to consume the data held in Synapse.

With DRS, the data producer uploads data to Synapse using the existing Synapse APIs, and the data consumer uses the DRS APIs to consume it. The common DRS standard makes data access platform independent.

User Type

Data Producer: anyone who is authorized to upload data.

Data Consumer: anyone who is authorized to download data.

DRS URIs

Hostname-based DRS URIs should be used because they are simple: they contain only the DRS host name and the DRS ID, and they can be converted directly into a fetchable URL by a simple rule. The ID is always percent-encoded so that special characters do not interfere with subsequent DRS endpoint calls.

drs://<hostname>/<id>

<hostname> = repo-prod.prod.sagebase.org

<id> = syn32042766.1 (synapse ID plus version)

e.g. drs://repo-prod.prod.sagebase.org/syn32042766.1

The client starts from a DRS URI in the standard syntax:

drs://repo-prod.prod.sagebase.org/syn32042766.1

which the workflow system converts into the following HTTPS request to the DRS server:

GET https://repo-prod.prod.sagebase.org/ga4gh/drs/v1/objects/syn32042766.1
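The conversion rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the code of any particular workflow system; the function name drs_uri_to_url is hypothetical.

```python
from urllib.parse import quote, urlsplit

# Standard DRS path prefix, as used in the request shown above.
DRS_OBJECTS_PATH = "/ga4gh/drs/v1/objects/"


def drs_uri_to_url(drs_uri: str) -> str:
    """Convert a hostname-based DRS URI into the HTTPS URL to GET."""
    parts = urlsplit(drs_uri)
    if parts.scheme != "drs" or not parts.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parts.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS URI has no object ID: {drs_uri!r}")
    # Keep '%' safe so an ID that is already percent-encoded is not encoded twice.
    return f"https://{parts.netloc}{DRS_OBJECTS_PATH}{quote(object_id, safe='%')}"


print(drs_uri_to_url("drs://repo-prod.prod.sagebase.org/syn32042766.1"))
```

For the URI above this yields the request URL shown in this section.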

Prod hostname: repo-prod.prod.sagebase.org

Staging hostname: repo-staging.staging.sagebase.org

Datatypes

DRS v1 supports two types of content:

  • Blob is a file: a DRS blob is a FileEntity and is represented by a DrsObject without a contents array.

  • Bundle is a dataset: a DRS bundle is a Dataset and is represented by a DrsObject with a contents array.
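A client can tell the two types apart by looking for the contents array. The following is a minimal sketch, assuming the DrsObject has already been parsed from JSON into a Python dict; the helper name is_bundle is hypothetical.

```python
def is_bundle(drs_object: dict) -> bool:
    """Return True if the DrsObject carries a contents array (a bundle/dataset)."""
    # Assumption: a blob either omits "contents" or sets it to null.
    return isinstance(drs_object.get("contents"), list)


blob = {"id": "syn31538774.3", "access_methods": [{"type": "https"}]}
bundle = {"id": "syn32132349.1", "contents": [{"id": "syn31538774.3"}]}
print(is_bundle(blob), is_bundle(bundle))
```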

Schema

DRS Object Json:

{ "id":"string", "name":"string", "self_uri":"drs://repo-prod.prod.sagebase.org/syn32042766.1", "size":0, "created_time":"2019-08-24T14:15:22Z", "updated_time":"2019-08-24T14:15:22Z", "version":"string", "mime_type":"application/json", "checksums":[ { } ], "access_methods":[ { } ], "contents":[ { } ], "description":"string" }

Checksums json:

{ "checksum": "string", "type": "md5" }

Access method json:

{ "type": "https", "access_id": "string" }

Access url json:

Contents json:

DRS Object Attribute Description

id

  • Blob (File): The DRS id is the Synapse file ID plus version, which makes it immutable (e.g. syn32132536.1), or a file handle ID prefixed with the string “fh” (e.g. fh56789345).

  • Bundle (Dataset): The DRS id is the Synapse dataset ID plus version, which makes it immutable (e.g. syn32132349.1).

name

  • Blob (File): Name of the file, e.g. Test3.pages

  • Bundle (Dataset): Name of the dataset, e.g. Test Dataset

self_uri

  • Blob (File): A DRS URI, as defined in the DRS documentation, that tells clients how to access this object, e.g. drs://repo-prod.prod.sagebase.org/syn32132536.1

  • Bundle (Dataset): A DRS URI, as defined in the DRS documentation, that tells clients how to access this object, e.g. drs://repo-prod.prod.sagebase.org/syn32132349.1

size

  • Blob (File): File size in bytes, e.g. 85.7 kB is 85700.

  • Bundle (Dataset): The cumulative size, in bytes, of the files the dataset contains. If the user has access to the dataset, the size is the sum of the sizes of all files in the dataset.

created_time

  • Blob (File): Timestamp of file creation.

  • Bundle (Dataset): Timestamp of dataset creation.

updated_time

  • Blob (File): Timestamp of the last file update.

  • Bundle (Dataset): Timestamp of the last dataset update.

version

  • Blob (File): A string representing the version, e.g. 3

  • Bundle (Dataset): A string representing the version, e.g. 1

mime_type

  • Blob (File): FileHandle.contentType

  • Bundle (Dataset): Has no mime type.

checksums

  • Blob (File): FileHandle.contentMd5, e.g. d269b370219876bb6ace9a1ce190d730

  • Bundle (Dataset): The checksum is computed over a sorted concatenation of the checksums of the bundle's top-level contained objects (not recursive, names not included). The list of checksums is sorted alphabetically (by hex code) before concatenation, and a further checksum is computed over the concatenated value. For example, if a dataset contains two files, file1 and file2, the checksum of the bundle is md5( concat( sort( md5file1, md5file2 ) ) ). If the user has access to the dataset, the calculation includes the md5 of all files in the dataset.

access_methods

  • Blob (File): The access method provides the access_id, and its type is https.

  • Bundle (Dataset): Has no access_methods.

contents

  • Blob (File): Has no contents.

  • Bundle (Dataset): List of the objects inside the bundle. If the user has access to the dataset, contents lists all files in the dataset, irrespective of access to each individual file.

description

  • Blob (File): Description of the file.

  • Bundle (Dataset): Description of the dataset.
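The bundle checksum rule described for the checksums attribute can be sketched as follows. This is an illustration under stated assumptions rather than the Synapse implementation: it assumes the hex strings are concatenated without a separator and that the final md5 is taken over the ASCII bytes of that string. The function name bundle_checksum is hypothetical.

```python
import hashlib
from typing import Iterable


def bundle_checksum(file_md5s: Iterable[str]) -> str:
    """md5( concat( sort( md5 of each top-level file ) ) ) as a hex string."""
    # Sort the hex strings alphabetically so the result does not depend
    # on the order in which the files are listed.
    ordered = sorted(file_md5s)
    # Assumption: no separator between checksums, hashed as ASCII bytes.
    concatenated = "".join(ordered)
    return hashlib.md5(concatenated.encode("ascii")).hexdigest()


md5_file1 = "d269b370219876bb6ace9a1ce190d730"
md5_file2 = "b15bd58c8f0946b636545d8309bf0f27"
print(bundle_checksum([md5_file1, md5_file2]))
```

Because the list is sorted first, passing the two file checksums in either order gives the same bundle checksum.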

checksums attribute description

  • checksum: The hex-string encoded checksum for the data, e.g. b15bd58c8f0946b636545d8309bf0f27

  • type: The digest method used to create the checksum, e.g. md5

access_method attribute description

  • type: Type of the access method, e.g. https

  • access_id: Built by joining the FileHandleAssociationType, the Synapse ID with version, and the file handle ID with underscores, i.e. FileHandleAssociationType_syn_id_filehandle_id, where syn_id is e.g. syn123.1. Example: FileEntity_syn123.1_56789345. If the DRS object is retrieved with a file handle ID, the access_id is the file handle ID prefixed with the string “fh”, e.g. fh56789345.
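The two access_id forms can be illustrated with a short sketch. The function names are hypothetical and the code only mirrors the format described here; it is not taken from the Synapse code base.

```python
def build_access_id(association_type: str, syn_id: str, file_handle_id: str) -> str:
    """Join association type, versioned Synapse ID and file handle ID with '_'."""
    return f"{association_type}_{syn_id}_{file_handle_id}"


def build_file_handle_access_id(file_handle_id: str) -> str:
    """Access ID used when the object is addressed by file handle ID."""
    return f"fh{file_handle_id}"


print(build_access_id("FileEntity", "syn123.1", "56789345"))
print(build_file_handle_access_id("56789345"))
```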

 

contents attribute description

  • name: A name declared by the bundle author that must be used when materializing this object, overriding any name directly associated with the object itself. The name must be unique within the containing bundle. The string is made up of uppercase and lowercase letters, decimal digits, hyphens, periods, and underscores, e.g. syn32132536.1, since the Synapse ID is unique.

  • id: A DRS identifier of a DrsObject, e.g. syn32132536.1

  • drs_uri: A list of full DRS identifier URI paths that may be used to obtain the object. These URIs may be external to this DRS instance, e.g. drs://repo-prod.prod.sagebase.org/syn32132536.1

Note

Nesting of bundles (a dataset containing a dataset) is not supported.

Endpoints

1. Get information about a DrsObject

This endpoint returns information about a DrsObject, which can be a file or a dataset, as in the examples below. The DrsObject is fetched by its DRS ID: the Synapse ID plus version, which makes it immutable, or the file handle ID prefixed with the string “fh” (e.g. fh123).

https://{serverURL}/ga4gh/drs/v1/objects/{object_id}

HTTP method: GET

Path parameters:

object_id: The DRS object ID, i.e. the Synapse ID plus version, which makes it immutable, or the file handle ID prefixed with the string “fh” (e.g. fh123).
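The two accepted object_id forms can be distinguished with a small parser. This is a sketch only: it assumes Synapse IDs look like "syn" followed by digits, versions are integers, and file handle IDs are digits, as in the examples in this document. All names in the block are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ParsedObjectId(NamedTuple):
    kind: str                       # "entity" or "file_handle"
    synapse_id: Optional[str]       # e.g. "syn32132349", None for file handles
    version: Optional[int]          # e.g. 1, None for file handles
    file_handle_id: Optional[str]   # e.g. "56789345", None for entities


_ENTITY_RE = re.compile(r"^(syn\d+)\.(\d+)$")
_FILE_HANDLE_RE = re.compile(r"^fh(\d+)$")


def parse_object_id(object_id: str) -> ParsedObjectId:
    """Split a DRS object_id into its Synapse ID/version or file handle ID."""
    entity = _ENTITY_RE.match(object_id)
    if entity:
        return ParsedObjectId("entity", entity.group(1), int(entity.group(2)), None)
    file_handle = _FILE_HANDLE_RE.match(object_id)
    if file_handle:
        return ParsedObjectId("file_handle", None, None, file_handle.group(1))
    # An ID without a version (e.g. "syn123") is rejected: the version is
    # what makes the DRS ID immutable.
    raise ValueError(f"unsupported object_id: {object_id!r}")


print(parse_object_id("syn32132349.1"))
print(parse_object_id("fh123"))
```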

Authorization:

Bearer authentication should be handled at the controller level, as is done for all other APIs.

Bundle (dataset) Example:

Dataset syn32132349 has been created and contains two files, syn31538774.3 and syn32132536.1.

Request url: https://repo-prod.prod.sagebase.org/ga4gh/drs/v1/objects/syn32132349.1

Query parameters:

expand: If false and the object_id refers to a bundle, the ContentsObject array contains only those objects directly contained in the bundle.

If true and the object_id refers to a bundle, a response with HTTP status code 400 and the message “nesting of bundle is not supported” is returned.

If the object_id refers to a blob, the query parameter is ignored.
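The three cases for expand can be summarized in a short sketch. This is not the actual controller code; the exception class BadRequest and the function check_expand are hypothetical names used only to illustrate the rule.

```python
class BadRequest(Exception):
    """Hypothetical error type that the service would map to HTTP 400."""
    status_code = 400


def check_expand(is_bundle: bool, expand: bool) -> None:
    """Apply the expand rules: reject expand=true for bundles, ignore it for blobs."""
    if not is_bundle:
        return  # blob: the query parameter is ignored
    if expand:
        raise BadRequest("nesting of bundle is not supported")
    # bundle with expand=false: only directly contained objects are listed


check_expand(is_bundle=False, expand=True)   # ignored for blobs
check_expand(is_bundle=True, expand=False)   # allowed
try:
    check_expand(is_bundle=True, expand=True)
except BadRequest as err:
    print(err.status_code, err)
```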

 

 

RESPONSE BODY SCHEMA: application/json

RESPONSE CODE: 200

 

Blob (file) example with Synapse ID as Object ID:

Request Url: https://repo-prod.prod.sagebase.org/ga4gh/drs/v1/objects/syn31538774.3

RESPONSE BODY SCHEMA: application/json

Blob (file) example with file handle ID as Object ID:

Request Url: https://repo-prod.prod.sagebase.org/ga4gh/drs/v1/objects/fh56789345

RESPONSE BODY SCHEMA: application/json

HTTP Responses

  • 200: The DrsObject was found successfully. Schema: DrsObject

  • 400: The request is malformed. Schema: Error

  • 401: The request is unauthorized. Schema: Error

  • 403: The requester is not authorized to perform this action. Schema: Error

  • 404: The requested DrsObject wasn’t found. Schema: Error

  • 500: An unexpected error occurred. Schema: Error

 

2. Get a URL for fetching bytes

This endpoint returns the actual URL of the blob (for example, in an S3 bucket or Google Cloud) from which the file can be downloaded.

https://{serverURL}/ga4gh/drs/v1/objects/{object_id}/access/{access_id}

HTTP method: GET

Path parameters:

object_id: The DRS object ID, i.e. the Synapse ID plus version, which makes it immutable, or the file handle ID prefixed with the string “fh” (e.g. fh123).

access_id: The access ID taken from the access_methods list of the DRS object.

Authorization:

Bearer authentication should be handled at the controller level, as is done for all other APIs.

Blob (file) example with Synapse ID as Object ID:

https://repo-prod.prod.sagebase.org/ga4gh/drs/v1/objects/syn31538774.3/access/FileEntity_syn31538774.3_56789345

REQUEST BODY SCHEMA: None

RESPONSE BODY SCHEMA: application/json

The presigned URL is returned to the user, and the file can be downloaded directly from that URL without any further authentication, because the presigned URL includes tokens, which expire over time.

Blob (file) example with file handle ID as Object ID:

https://repo-prod.prod.sagebase.org/ga4gh/drs/v1/objects/fh56789345/access/fh56789345

REQUEST BODY SCHEMA: None

RESPONSE BODY SCHEMA: application/json

The presigned URL is returned to the user, and the file can be downloaded directly from that URL without any further authentication, because the presigned URL includes tokens, which expire over time.
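Taken together, the two endpoints give a two-step download flow: fetch the DrsObject, then exchange one of its access_id values for a presigned URL. The sketch below illustrates this from the client side. It assumes the access URL response carries the presigned URL in a field named url, as in the GA4GH DRS AccessURL schema; that field is not shown in this document. The names resolve_download_url and make_fetcher are hypothetical.

```python
import json
import urllib.request
from typing import Callable

BASE_URL = "https://repo-prod.prod.sagebase.org/ga4gh/drs/v1"


def make_fetcher(token: str) -> Callable[[str], dict]:
    """Build a function that GETs a URL with bearer auth and parses the JSON body."""
    def fetch_json(url: str) -> dict:
        request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_json


def resolve_download_url(object_id: str, fetch_json: Callable[[str], dict],
                         base_url: str = BASE_URL) -> str:
    """Step 1: get the DrsObject. Step 2: get the presigned URL for its access_id."""
    drs_object = fetch_json(f"{base_url}/objects/{object_id}")
    access_methods = drs_object.get("access_methods") or []
    if not access_methods:
        # Bundles (datasets) have no access methods; fetch their files individually.
        raise ValueError(f"{object_id} has no access methods")
    access_id = access_methods[0]["access_id"]
    access = fetch_json(f"{base_url}/objects/{object_id}/access/{access_id}")
    return access["url"]  # assumption: field name from the GA4GH AccessURL schema
```

A caller would pass make_fetcher(token) as fetch_json; injecting the fetch function keeps the flow easy to exercise without network access.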

HTTP Responses

  • 200: The access URL was found successfully. Schema: Access url

  • 400: The request is malformed. Schema: Error

  • 401: The request is unauthorized. Schema: Error

  • 403: The requester is not authorized to perform this action. Schema: Error

  • 404: The requested DrsObject wasn’t found. Schema: Error

  • 500: An unexpected error occurred. Schema: Error

3. Get information about DRS service

The GA4GH Service Registry API specification allows information about GA4GH-compliant web services, including DRS services, to be aggregated into registries and made available via a standard API. The following considerations should be followed when registering DRS services within a service registry.

  • The DRS service attributes returned by /service-info (i.e. id, name, description, etc.) should have the same values as the registry entry for that service.

  • The value of the type object's artifact property should be drs (i.e. the same as it appears in service-info)

  • Each entry in a Service Registry must have a url, indicating the base URL to the web service. For DRS services, the registered url must include everything up to the standardized /ga4gh/drs/v1 path.

 

https://{serverURL}/ga4gh/drs/v1/service-info

HTTP method: GET

Path parameters : None

Authorization : None

Example url: https://repo-prod.prod.sagebase.org/ga4gh/drs/v1/service-info

REQUEST BODY SCHEMA: None

RESPONSE BODY SCHEMA: application/json

Attribute description :

  • id: Unique ID of this service. Reverse domain name notation is recommended, though not required. The identifier should attempt to be globally unique so it can be used in downstream aggregator services, e.g. Service Registry.

  • name: Name of this service. Should be human readable.

  • type: Type of a GA4GH service.

  • description: Description of the service. Should be human readable and provide information about the service.

  • organization: Organization providing the service.

  • contactUrl: URL of the contact for the provider of this service, e.g. a link to a contact form (RFC 3986 format), or an email (RFC 2368 format).

  • documentationUrl: URL of the documentation of this service (RFC 3986 format). This should help someone learn how to use your service, including any specifics required to access data, e.g. authentication.

  • createdAt: Timestamp describing when the service was first deployed and available (RFC 3339 format).

  • updatedAt: Timestamp describing when the service was last updated (RFC 3339 format).

  • environment: Environment the service is running in. Use this to distinguish between production, development and testing/staging deployments. Suggested values are prod, test, dev, staging. However, this is advised and not enforced.

  • version: Version of the release in which the DRS API is delivered.

  • url: DRS service base URL for the provider of the service.
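A consumer or registry could check a service-info response against the attributes listed above. The following is a minimal sketch; the function name and every sample value are hypothetical and do not reflect the real Synapse response.

```python
# Attribute names taken from the list above.
SERVICE_INFO_FIELDS = (
    "id", "name", "type", "description", "organization", "contactUrl",
    "documentationUrl", "createdAt", "updatedAt", "environment", "version", "url",
)


def missing_service_info_fields(service_info: dict) -> list:
    """Return the documented attributes that are absent from a service-info response."""
    return [field for field in SERVICE_INFO_FIELDS if field not in service_info]


# Hypothetical, partial response used only to show the check.
sample = {"id": "org.example.drs", "name": "Example DRS", "environment": "prod"}
print(missing_service_info_fields(sample))
```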

 

Error Response

If a request fails, the API should return an error response containing an error message and a status code.
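An error body of this kind might look as follows. The field names msg and status_code are an assumption taken from the GA4GH DRS Error schema; this document does not spell them out, and the helper name error_body is hypothetical.

```python
import json


def error_body(status_code: int, message: str) -> str:
    """Serialize an error response with a message and a status code."""
    # Assumption: field names follow the GA4GH DRS Error schema.
    return json.dumps({"msg": message, "status_code": status_code})


print(error_body(404, "The requested DrsObject wasn't found."))
```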