DOI (Digital Object Identifier)

Background

Refer to /wiki/spaces/SYNCOM/pages/33914913 for the context driving the project.

DOI Basics

A DOI is a persistent identifier that can locate resources on the internet. Examples: doi:10.1038/nature.2012.10240, doi:10.1126/science.1100533, doi:10.6084/m9.figshare.153827, and 10.5240/1489-49A2-3956-4B2D-FE16-5. The first two are Nature and Science publications, respectively. The third is a figshare data set. The last one is a movie. Note that the movie DOI points to an XML document about the movie rather than to an HTML web page. The primary use case of a DOI is locating its digital content on the internet, i.e. mapping a URI to a URL.

Refer to the handbook for the restrictions on DOI format. In short, the length is unlimited, any printable Unicode character is allowed, and the granularity is open. We need to pay attention to URL encoding when a DOI is embedded in a URL, but there is nothing too surprising there.
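As a minimal sketch of the URL-encoding concern, the snippet below turns a DOI into a resolver URL using Python's standard library. The function name doi_to_resolver_url is hypothetical, and the choice to leave '/' unescaped is an assumption made for readability; the handbook remains the authority on which characters must be encoded.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver listed under External Links


def doi_to_resolver_url(doi: str) -> str:
    """Strip an optional 'doi:' prefix and percent-encode the DOI name."""
    name = doi[4:] if doi.lower().startswith("doi:") else doi
    # Assumption: keep '/' literal so the prefix/suffix separator stays
    # readable; other reserved characters (space, '#', '?', ...) are encoded.
    return RESOLVER + quote(name, safe="/")


if __name__ == "__main__":
    print(doi_to_resolver_url("doi:10.1038/nature.2012.10240"))
```

DOIs made only of letters, digits, dots and hyphens pass through unchanged, so the encoding only matters for suffixes containing characters such as spaces or '#'.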

Design

Format

See the examples of DOIs above. Within the registered namespace, we are free to choose the DOIs. We want each DOI to carry all the necessary information and no more. The proposed format is:

doi:10.12345/sage.synapse.<id>.<ver>

As shown, DOIs are published as permanent links, so they should be optimized for humans and search engines. For example, sage.synapse.144362.7 indicates version 7 of the Sage Synapse object syn144362.
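The format above can be sketched as a pair of helpers that build and parse such DOIs. This is only an illustration: the names make_doi and parse_doi are hypothetical, and 10.12345 is the placeholder prefix from the format line, not a registered prefix.

```python
import re

DOI_PREFIX = "10.12345"  # placeholder prefix from the format line above

# Suffix layout: sage.synapse.<id>.<ver>, with numeric id and version.
_SUFFIX = re.compile(r"^sage\.synapse\.([0-9]+)\.([0-9]+)$")


def make_doi(entity_id: str, version: int) -> str:
    """Build a DOI from a Synapse ID such as 'syn144362' and a version."""
    number = entity_id[3:] if entity_id.startswith("syn") else entity_id
    if re.fullmatch(r"[0-9]+", number) is None:
        raise ValueError(f"not a Synapse ID: {entity_id!r}")
    return f"doi:{DOI_PREFIX}/sage.synapse.{number}.{version}"


def parse_doi(doi: str) -> tuple:
    """Return (entity_id, version) for a DOI in the proposed format."""
    name = doi[4:] if doi.startswith("doi:") else doi
    prefix, _, suffix = name.partition("/")
    match = _SUFFIX.match(suffix)
    if prefix != DOI_PREFIX or match is None:
        raise ValueError(f"not a Synapse DOI: {doi!r}")
    return "syn" + match.group(1), int(match.group(2))


if __name__ == "__main__":
    print(make_doi("syn144362", 7))
    print(parse_doi("doi:10.12345/sage.synapse.144362.7"))
```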

Metadata

The metadata will include URLs to the corresponding Synapse entity. More than one URL can be used, but all should point to the same entity. One question is whether to point the DOI to the Synapse web client or to the backend repo service. Here are the possibilities:

1. Point the DOI to the web client URL. This type of DOI is good for human users, but it prevents the R and Python clients from using the DOI directly.
2. Alternatively, point the DOI to the repo URL. The entity is fetched as a JSON object. This type of DOI favors the analytical clients, but human users of the Synapse website will find it less appealing.
3. Provide two separate sets of DOIs (doing both 1 and 2). Each client picks the proper DOI to use.

If we choose the web URLs as the DOI redirect, the R/Python clients need to rewrite the URLs to repo service URLs. Now imagine the R/Python client calls dx.doi.org to resolve a DOI. The server returns a 3XX response with a Synapse web URL (e.g. http://synapse.sagebase.org/#Synapse:syn12345.2). The R/Python client then rewrites the web client URL to the corresponding repo service URL (e.g. http://repo.sagebase.org/entity/syn12345/version/2) and proceeds to call the repo service to get the entity. The rewrite is the additional logic the R/Python clients need in order to use the Synapse DOIs. If we combine the idea with projects SWC-422 and SWC-423, the web and repo URLs will assume the same format, and only the server part needs to be swapped for the rewrite.
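The rewrite step described above could look roughly like the following in a Python client. The function name web_to_repo_url is hypothetical, the two URL shapes are taken from the examples in the preceding paragraph, and the handling of a web URL without a version is an assumption.

```python
import re
from urllib.parse import urlsplit

REPO_BASE = "http://repo.sagebase.org"  # repo host from the example above

# Fragment of the web client URL, e.g. "Synapse:syn12345.2".
_FRAGMENT = re.compile(r"^Synapse:(syn[0-9]+)(?:\.([0-9]+))?$")


def web_to_repo_url(web_url: str) -> str:
    """Rewrite a Synapse web client URL into a repo service URL."""
    match = _FRAGMENT.match(urlsplit(web_url).fragment)
    if match is None:
        raise ValueError(f"not a Synapse web URL: {web_url!r}")
    entity_id, version = match.groups()
    url = f"{REPO_BASE}/entity/{entity_id}"
    if version is not None:
        # Assumption: an unversioned web URL maps to the bare entity URL.
        url += f"/version/{version}"
    return url


if __name__ == "__main__":
    print(web_to_repo_url("http://synapse.sagebase.org/#Synapse:syn12345.2"))
```

If the web and repo URLs later share one format, this function would shrink to swapping the host name.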

Another question is what else to include in the metadata. The EZID demo lists the author (who), the title (what), and the timestamp (when). In my testing, several fields were indeed required for generating DataCite DOIs. See the DataCite schemas. The most recent schema requires Creator, Title, Publisher, and PublicationYear in addition to the DOI itself. Again, we need to pay attention to the escaping and encoding specs when supplying the data. Here is the proposed mapping of the required fields.

Creator – Entity CREATED_BY

Title – Entity NAME

Publisher – Sage Bionetworks

PublicationYear – The year in which this DOI is created
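The proposed mapping can be sketched as a small function. The entity is modeled here as a plain dict with hypothetical keys createdBy and name standing in for the CREATED_BY and NAME fields; the escaping required by the DOI service is not shown.

```python
from datetime import datetime, timezone

PUBLISHER = "Sage Bionetworks"


def required_metadata(entity: dict, now: datetime = None) -> dict:
    """Map an entity to the four required DataCite fields."""
    now = now or datetime.now(timezone.utc)
    return {
        "Creator": entity["createdBy"],   # hypothetical key for CREATED_BY
        "Title": entity["name"],          # hypothetical key for NAME
        "Publisher": PUBLISHER,
        "PublicationYear": str(now.year),  # year the DOI is created
    }


if __name__ == "__main__":
    print(required_metadata({"createdBy": "jdoe", "name": "Example data"}))
```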

Granularity

Which entity types: projects or data? Public or private? Which version?

We let users explicitly request DOIs rather than auto-generating a DOI for each entity version. Creating a DOI for a version of a leaf entity also means that 1) that version of the entity is immutable and undeletable, and 2) that version of the entity is public and cannot be changed to private.

To achieve 1) and 2), we could "copy" the entity and its revision to a new entity, make Synapse the owner, give it public READ access, and create a DOI on this read-only entity. However, this would be too confusing for users.

Alternatively, add a "published" flag that locks down a version (it cannot be deleted, cannot be made private, and cannot be promoted).
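To illustrate the "published" flag, here is a rough in-memory model of a locked-down version. The class and method names are hypothetical and do not correspond to existing Synapse code; only the delete and make-private guards are shown, and promotion would get the same check.

```python
from dataclasses import dataclass


class PublishedVersionError(Exception):
    """Raised when an operation would alter a published version."""


@dataclass
class EntityVersion:
    entity_id: str
    version: int
    public: bool = False
    published: bool = False
    deleted: bool = False

    def publish(self) -> None:
        # Publishing implies public READ access and locks the version.
        self.public = True
        self.published = True

    def _require_unpublished(self, action: str) -> None:
        if self.published:
            raise PublishedVersionError(
                f"cannot {action} published version "
                f"{self.entity_id}.{self.version}")

    def make_private(self) -> None:
        self._require_unpublished("make private")
        self.public = False

    def delete(self) -> None:
        self._require_unpublished("delete")
        self.deleted = True
```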

Asynchronous Client to the DOI Service

Because we are calling an external service of unknown reliability, the calls will certainly be made asynchronously. How do we handle failures? Should we record failed DOIs in a database table?
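One possible shape for the asynchronous client is sketched below: registrations run on a thread pool, each is retried a few times, and DOIs that still fail are collected so they can be recorded. The register callable, the retry count, and the in-memory dict standing in for a database table are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def register_all(dois, register, max_attempts=3, max_workers=4):
    """Register DOIs in the background.

    `register` is a hypothetical callable that contacts the DOI service
    and raises on failure. Returns (succeeded, failed), where `failed`
    maps each failed DOI to its last error message.
    """
    def attempt(doi):
        last_error = None
        for _ in range(max_attempts):
            try:
                register(doi)
                return doi, None
            except Exception as exc:  # any service error counts as a failure
                last_error = str(exc)
        return doi, last_error

    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input DOIs.
        for doi, error in pool.map(attempt, dois):
            if error is None:
                succeeded.append(doi)
            else:
                failed[doi] = error  # stand-in for a row in a failures table
    return succeeded, failed
```

A later job could re-read the recorded failures and retry them, for example after the service's weekly maintenance window.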

Data Migration

Involve or avoid?

External Links

Wiki http://en.wikipedia.org/wiki/Digital_object_identifier

DOI http://www.doi.org/

DOI Handbook http://www.doi.org/hb.html

DOI Resolver http://dx.doi.org/

Short DOI http://shortdoi.org/

DOI Crossref http://www.crossref.org/

DataCite http://www.datacite.org/

Name-to-Thing EZID http://n2t.net/ezid/

EZID Documentation http://n2t.net/ezid/home/documentation

EZID Service Guidelines http://www.cdlib.org/services/uc3/docs/EZIDServiceGuidelines.pdf (Note that there is a one-hour service window each week.)

EZID Service Status Blog http://ezidstatus.wordpress.com/