Background
For the context driving the project, refer to this page (/wiki/spaces/SYNCOM/pages/33914913).
DOI Basics
A DOI is a persistent identifier that can locate resources on the internet. Examples: doi:10.1038/nature.2012.10240, doi:10.1126/science.1100533, doi:10.6084/m9.figshare.153827, 10.5240/1489-49A2-3956-4B2D-FE16-5. The first two are Nature and Science publications respectively. The third is a figshare data set. The last one is a movie. Note that, instead of an HTML web page, the movie DOI points to an XML document about the movie. Given a DOI, locating its digital content on the internet (i.e. mapping a URI to a URL) is the primary use case of DOI.
Refer to the handbook for the restrictions on DOI format. Basically, the length is unlimited, any printable Unicode character is allowed, and granularity is open. We need to pay attention to URL encoding, but there is nothing too surprising there.
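Because any printable Unicode character may appear in a DOI, a resolver link should be percent-encoded before use. The following is a minimal sketch, assuming the dx.doi.org resolver listed under External Links; the function name `resolver_url` and the second sample DOI are hypothetical and only serve the illustration.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver listed under External Links


def resolver_url(doi: str) -> str:
    """Build a resolver link for a DOI, percent-encoding unsafe characters."""
    # Drop the optional "doi:" scheme prefix.
    name = doi[4:] if doi.lower().startswith("doi:") else doi
    # Keep "/" readable (it separates prefix and suffix); encode the rest.
    return RESOLVER + quote(name, safe="/")


print(resolver_url("doi:10.7303/syn1720822.1"))
print(resolver_url("10.1000/a b#c"))  # hypothetical DOI with characters that need encoding
```

Non-ASCII characters are encoded as UTF-8 by `quote`, which is its default behaviour.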
Design
Format
See the examples of DOIs above. Within the registered name space, we are free to choose the DOIs. But we want the DOI to carry all the necessary information and no more.
Code Block |
---|
doi:10.7303/syn<id>.<ver> |
As shown, DOIs are published as permanent links, so they should be optimized for humans and search engines. The above format was chosen based on the results of a survey:
Code Block |
---|
doi:sage10.7303/synapse.<id>.<ver> |
...
doi:10.7303/sage.synapse.144362 8.3% 1 |
...
Granularity
What entity type? Public or private? Which version? A simple, straightforward solution is to apply a DOI to each and every entity version. This is because the official DOI model does not enforce any restrictions on the granularity. We could and should create a DOI for each identifiable entity. Another reason is that, if we choose EZID as the DOI API provider, it is a fixed price for 1 million DOIs per year. As long as we stay within the 1 million limit (which we currently do), it does not matter how many DOIs we use.
We need to find out the price model for more than 1 million DOIs. We are very close to this limit.
...
0.0% 0
doi:10.7303/syn144362 58.3% 7
doi:10.7303/sage.syn144362 8.3% 1
doi:10.7303/synapse.syn144362 25.0% 3 |
Real examples:
DOI | DOI Resolver Link | Synapse Target URL |
---|---|---|
doi:10.7303/syn1720822 | http://dx.doi.org/10.7303/syn1720822 | https://www.synapse.org/#!Synapse:syn1720822 |
doi:10.7303/syn1720822.1 | http://dx.doi.org/10.7303/syn1720822.1 | https://www.synapse.org/#!Synapse:syn1720822/version/1 |
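The mapping in the table can be sketched in a few lines. This is only an illustration of the chosen format `doi:10.7303/syn<id>.<ver>`, assuming the version part is optional as in the first row; the names `parse_doi`, `resolver_link`, and `target_url` are hypothetical, not part of any Synapse API.

```python
import re
from typing import NamedTuple, Optional

# doi:10.7303/syn<id> with an optional .<ver> suffix
DOI_PATTERN = re.compile(r"^doi:10\.7303/syn(\d+)(?:\.(\d+))?$")


class SynapseDoi(NamedTuple):
    entity_id: str          # e.g. "syn1720822"
    version: Optional[int]  # None when the DOI names no version


def parse_doi(doi: str) -> SynapseDoi:
    """Split a Synapse DOI into entity ID and optional version."""
    match = DOI_PATTERN.match(doi)
    if match is None:
        raise ValueError(f"not a Synapse DOI: {doi!r}")
    version = int(match.group(2)) if match.group(2) else None
    return SynapseDoi("syn" + match.group(1), version)


def resolver_link(doi: str) -> str:
    """DOI resolver link, as in the second column of the table."""
    parse_doi(doi)  # validate the format first
    return "http://dx.doi.org/" + doi[len("doi:"):]


def target_url(doi: str) -> str:
    """Synapse web client URL, as in the third column of the table."""
    parsed = parse_doi(doi)
    url = "https://www.synapse.org/#!Synapse:" + parsed.entity_id
    if parsed.version is not None:
        url += f"/version/{parsed.version}"
    return url


print(resolver_link("doi:10.7303/syn1720822.1"))
print(target_url("doi:10.7303/syn1720822.1"))
```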
Metadata
The metadata will include URLs to the corresponding Synapse entity. More than one URL can be used, but all should point to the same entity. One question is whether to point the DOI to the Synapse web client or to the backend repo service. Here are the possibilities:
- Point the DOI to the web client URL. This type of DOI is good for human users. But it prevents the R and Python clients from using the DOI.
- Alternatively, we can point the DOI to the repo URL. The entity is fetched as a JSON object. This type of DOI favors the programmable analytical clients. But the human users of the Synapse website will find it less appealing.
- Provide two separate sets of DOIs (doing both of the above). Each client will pick the proper DOI to use.
If we choose the web URLs as the DOI redirect, the R/Python clients need to rewrite the URLs to repo service URLs. Now imagine the R/Python client calls dx.doi.org to resolve a DOI. The server returns a 3XX response with a Synapse web URL (e.g. http://synapse.sagebase.org/#Synapse:syn12345.2). The R/Python client then rewrites the web client URL to the corresponding repo service URL (e.g. http://repo.sagebase.org/entity/syn12345/version/2) and proceeds to call the repo service to get the entity. The rewrite is the additional logic the R/Python clients need in order to use the Synapse DOIs. If we combine the idea with projects SWC-422 and SWC-423, the web and repo URLs will assume the same format, and only the server part needs to be swapped for the rewrite.
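The client-side rewrite described above might look like the following sketch. It uses only the two example URLs from the text; the function name `web_to_repo` is hypothetical, and the handling of a URL without a version is an assumption.

```python
import re
from urllib.parse import urlsplit

WEB_HOST = "synapse.sagebase.org"       # web client host from the example above
REPO_BASE = "http://repo.sagebase.org"  # repo service base from the example above

# Fragment of the web URL: Synapse:syn<id> with an optional .<ver>
FRAGMENT = re.compile(r"^Synapse:(syn\d+)(?:\.(\d+))?$")


def web_to_repo(web_url: str) -> str:
    """Rewrite a Synapse web client URL to the matching repo service URL."""
    parts = urlsplit(web_url)
    if parts.hostname != WEB_HOST:
        raise ValueError(f"not a Synapse web URL: {web_url!r}")
    match = FRAGMENT.match(parts.fragment)
    if match is None:
        raise ValueError(f"unrecognized entity reference: {parts.fragment!r}")
    entity_id, version = match.groups()
    url = f"{REPO_BASE}/entity/{entity_id}"
    if version is not None:  # assumption: unversioned URLs map to /entity/<id>
        url += f"/version/{version}"
    return url


print(web_to_repo("http://synapse.sagebase.org/#Synapse:syn12345.2"))
```

Note that the fragment is never sent to a server, so this rewrite can only happen on the client, using the `Location` value of the 3XX response.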
Chris, "As an alternative, the client could get redirected from the DOI service to the synapse web app. The clients specify an 'application/json' accept header. Based on that, the web app could redirect to the repo service. Sounds more cumbersome, but maybe there's some advantage?"
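The server-side alternative in the quote can be sketched as a pure function that picks the redirect target from the Accept header. This is an illustration only: the function name is hypothetical, the URLs are the example URLs from the text, and q-value preferences in the header are deliberately ignored.

```python
def redirect_target(accept_header: str, entity_id: str, version: int) -> str:
    """Choose where to redirect based on the client's Accept header."""
    # Strip parameters such as ";q=0.9" and compare bare media types.
    media_types = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    if "application/json" in media_types:
        # Programmatic clients (R/Python) go to the repo service.
        return f"http://repo.sagebase.org/entity/{entity_id}/version/{version}"
    # Everything else (browsers) goes to the web client.
    return f"http://synapse.sagebase.org/#Synapse:{entity_id}.{version}"


print(redirect_target("application/json", "syn12345", 2))
print(redirect_target("text/html,application/xhtml+xml", "syn12345", 2))
```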
Another question is what else to include in the metadata. The EZID demo lists the author (who), the title (what), and the timestamp (when). Are these necessary? Do we need to include them? What are the use cases for those additional fields? What is the minimal metadata? In my testing, several fields were indeed required for generating DataCite DOIs; see the DataCite schemas. The most recent schema requires Creator, Title, Publisher, and PublicationYear besides the DOI. Again, we need to pay attention to the escaping and encoding specs when supplying the data. Here is the proposed mapping of the required fields.
Creator: Entity CREATED_BY
Title: Entity NAME
Publisher: Sage Bionetworks
PublicationYear: The timestamp this DOI is being created
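The proposed mapping could be assembled as follows. This is a sketch under assumptions: the element names (`_target`, `datacite.creator`, and so on) and the ANVL-style percent-escaping follow my reading of the EZID API and should be verified against the EZID documentation; the entity is modelled as a plain dict keyed by the field names used in the mapping above, and all function names are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import Dict, Optional


def anvl_escape(value: str) -> str:
    """Percent-escape the characters that are special in an ANVL body."""
    return re.sub(r"[%:\r\n]", lambda m: "%%%02X" % ord(m.group(0)), value)


def doi_metadata(entity: Dict[str, str], target_url: str,
                 now: Optional[datetime] = None) -> Dict[str, str]:
    """Map an entity to the required DataCite fields."""
    now = now or datetime.now(timezone.utc)
    return {
        "_target": target_url,                        # where the DOI redirects (assumed name)
        "datacite.creator": entity["CREATED_BY"],     # Creator
        "datacite.title": entity["NAME"],             # Title
        "datacite.publisher": "Sage Bionetworks",     # Publisher
        "datacite.publicationyear": str(now.year),    # year the DOI is created
    }


def to_anvl(metadata: Dict[str, str]) -> str:
    """Serialize metadata as 'name: value' lines with escaping applied."""
    return "\n".join(
        f"{anvl_escape(name)}: {anvl_escape(value)}"
        for name, value in metadata.items()
    )


example = doi_metadata(
    {"CREATED_BY": "Jane Doe", "NAME": "Study: 100% done"},  # hypothetical entity
    "https://www.synapse.org/#!Synapse:syn1720822/version/1",
)
print(to_anvl(example))
```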
Granularity
Ideally, we let users explicitly ask to create DOIs (vs. automatically generating a DOI for each entity version). Creating a DOI for a version of a leaf entity also means 1) the entity of the particular version is immutable/undeletable and 2) the entity of the particular version is public and cannot be changed to private. This is in a sense publishing the entity. As of now, the publishing aspect of it is still unclear. For possible approaches, see this page. We are still looking for a reasonable implementation that is natural to users and at the same time not too intrusive to the current architecture.
The current model is to let users assume the ultimate responsibility for DOIs. That is, users call Synapse to create DOIs for versions of entities, but Synapse does not enforce that the entities are public and immutable. It is the users' responsibility to make sure that is the case. In the future, when the specs of publishing are clear, we may add an additional API that both publishes the entities and creates the DOIs in one step. It's worth mentioning that figshare automatically mints DOIs for every object. There is a separate figshare action to publish something.
Asynchronous Client to the DOI Service
As we are calling an external service with unknown reliability, this will certainly be done asynchronously. According to Joan, "it takes something less than 5 seconds to register a DOI" and 5 minutes for the DOI to propagate through the Handle system. My test against the mint-DOI operation shows client-side latencies between 3 and 4 seconds, with an average of 3.4 seconds. There is no noticeable difference between calling from my local machine and calling from an east-1 EC2 instance.
DataCite Metadata Store also has DOI APIs. I suspect it is what EZID ultimately calls. The servers of mds.datacite.org are EC2 instances in the EU WEST region. A ping to the servers takes about 200 ms.
How do we handle failures? Record failed DOIs in a database table?
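One possible shape for the asynchronous call with failure recording is sketched below. Everything here is hypothetical: the `mint` callable stands in for the real call to the DOI service, the in-memory `status` dict stands in for a database table, and the status names are placeholders.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict

# Stand-in for a database table: DOI -> status string.
status: Dict[str, str] = {}


def mint_async(executor: ThreadPoolExecutor, doi: str,
               mint: Callable[[str], None]) -> Future:
    """Submit a mint call and record its outcome instead of raising."""
    status[doi] = "IN_PROCESS"

    def task() -> None:
        try:
            mint(doi)  # the slow external call (3 to 4 seconds in the test above)
            status[doi] = "CREATED"
        except Exception as exc:
            # Keep the failure so it can be inspected or retried later.
            status[doi] = f"ERROR: {exc}"

    return executor.submit(task)


def fake_mint(doi: str) -> None:
    """Hypothetical stand-in for the external service; fails for version 2."""
    if doi.endswith(".2"):
        raise RuntimeError("service unavailable")


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        mint_async(pool, doi, fake_mint)
        for doi in ("doi:10.7303/syn12345.1", "doi:10.7303/syn12345.2")
    ]
    for future in futures:
        future.result()  # wait for completion; errors are already recorded

print(status)
```

With the failures recorded, a periodic job could pick up the `ERROR` rows and retry them, which also covers the weekly EZID service window.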
Data Migration
Involve or avoid?
External Links
DOI
- Wiki http://en.wikipedia.org/wiki/Digital_object_identifier
- DOI http://www.doi.org/
- DOI Handbook http://www.doi.org/hb.html
- DOI Resolver http://dx.doi.org/
- Short DOI http://shortdoi.org/
- DOI Crossref http://www.crossref.org/
EZID
- Name-to-Thing
...
- EZID http://n2t.net/ezid/
- EZID Documentation http://n2t.net/ezid/home/documentation
- EZID Service Guidelines http://www.cdlib.org/services/uc3/docs/EZIDServiceGuidelines.pdf (Note there is one hour service window each week.)
- EZID Service Status Blog http://ezidstatus.wordpress.com/
DataCite
- DataCite http://www.datacite.org/
- DataCite API doc https://mds.datacite.org/static/apidoc
- DataCite MDS source code https://github.com/datacite/mds
- DataCite test environments http://test.datacite.org/
- DataCite infrastructure http://www.bl.uk/aboutus/stratpolprog/digi/datasets/workshoparchive/EdZukowski_DataCiteInfrastructure_May2012.pdf
- DataCite statistics http://stats.datacite.org/