Background
For the context driving the project, refer to this page (/wiki/spaces/SYNCOM/pages/33914913)
DOI Basics
A doi is a persistent identifier used to locate a resource on the internet. Examples: doi:10.1038/nature.2012.10240, doi:10.1126/science.1100533, doi:10.6084/m9.figshare.153827, and 10.5240/1489-49A2-3956-4B2D-FE16-5. The first two are Nature and Science publications, respectively. The third is a figshare data set. The last one is a movie. Note that the movie doi points to an XML document about the movie rather than to an HTML web page. The primary use case of DOI is locating the digital content on the internet for a given doi (i.e. mapping a URI to a URL).
Refer to the handbook for the restrictions on the doi format: the length is unlimited, any printable Unicode character is allowed, and granularity is open. Pay attention to URL encoding (but nothing surprising there).
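To illustrate the URL-encoding point, here is a minimal Python sketch that turns a doi into a resolver URL. It assumes the resolver listed under External Links (http://dx.doi.org/) and that the "/" between prefix and suffix should stay literal. The helper name resolver_url and the second doi in the usage note are hypothetical, chosen only to show escaping.

```python
from urllib.parse import quote

# Resolver base URL taken from the External Links section.
RESOLVER = "http://dx.doi.org/"


def resolver_url(doi: str) -> str:
    """Build a resolver URL for a doi, percent-encoding unsafe characters.

    An optional leading "doi:" label is dropped, because it is not part of
    the name that the resolver expects in the URL path.
    """
    name = doi[4:] if doi.lower().startswith("doi:") else doi
    # Keep "/" literal; characters such as "#", "<" and ">" are escaped.
    return RESOLVER + quote(name, safe="/")


if __name__ == "__main__":
    print(resolver_url("doi:10.1038/nature.2012.10240"))
    print(resolver_url("10.1000/a#b<c>"))  # hypothetical doi with unsafe characters
```

The first call leaves the doi unchanged because it contains only URL-safe characters. The second shows that "#", "<" and ">" must be escaped, otherwise a browser would treat "#" as the start of a fragment.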
Design
Format
We want the doi to have all the necessary information but no more. Also see the examples of doi above.
doi:sage.synapse.<id>.<ver>
For example, doi:sage.synapse.144362.1 refers to version 1 of syn144362. The version number is included in the doi so that each version will have its own doi. This way, we avoid the problem of choosing a version for a doi if the doi's granularity only reaches the entity level.
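The proposed format is simple enough to build and parse mechanically. The following Python sketch shows one way to do so. The function names and the SynapseDoi type are hypothetical, and it assumes that Synapse ids are numeric after the "syn" prefix and that version numbers start at 1.

```python
import re
from typing import NamedTuple


class SynapseDoi(NamedTuple):
    entity_id: int  # numeric part of the Synapse id, e.g. 144362 for syn144362
    version: int


# Matches the proposed format doi:sage.synapse.<id>.<ver>
_PATTERN = re.compile(r"^doi:sage\.synapse\.([0-9]+)\.([0-9]+)$")


def format_doi(entity_id: str, version: int) -> str:
    """Build a doi string from a Synapse id ("syn144362" or "144362") and a version."""
    numeric = entity_id[3:] if entity_id.lower().startswith("syn") else entity_id
    if re.fullmatch(r"[0-9]+", numeric) is None:
        raise ValueError(f"not a Synapse id: {entity_id!r}")
    if version < 1:  # assumption: versions are numbered from 1
        raise ValueError(f"invalid version: {version}")
    return f"doi:sage.synapse.{numeric}.{version}"


def parse_doi(doi: str) -> SynapseDoi:
    """Split a doi in the proposed format back into entity id and version."""
    match = _PATTERN.match(doi)
    if match is None:
        raise ValueError(f"not a Synapse doi: {doi!r}")
    return SynapseDoi(int(match.group(1)), int(match.group(2)))


if __name__ == "__main__":
    doi = format_doi("syn144362", 1)
    print(doi)
    print(parse_doi(doi))
```

Because the version is part of the string, parsing always yields a single, unambiguous entity version.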
Granularity
Which entity types? Public or private? Which versions? A simple, straightforward solution is to assign a doi to each and every entity version. The official DOI model does not enforce any restrictions on granularity, so we could and should create a doi for each identifiable entity. Another reason is that, if we choose EZID as the DOI API provider, the price is fixed for up to 1 million dois per year. Within the 1 million limit (which we currently are), it does not matter how many dois we use.
We need to find out the price model for more than 1 million dois. We are very close to this limit.
What about the new File/Wiki objects?
Metadata
The metadata will include URLs to the corresponding Synapse entity. More than one URL can be used, but all should point to the same entity. One question is whether to point the doi to the Synapse web client or to the backend repo service. Here are the possibilities:
- Point the doi to the web client URL. This type of doi suits human users, but it prevents the R and Python clients from using the doi.
- Alternatively, point the doi to the repo URL. The entity is fetched as a JSON object. This type of doi favors programmatic clients, but human users of the Synapse website will find it less appealing.
- Provide two separate sets of dois (doing both of the above). Each client will pick the proper doi to use.
Another question is what else to include in the metadata. The EZID demo lists the author (who), the title (what), and the timestamp (when). Do we need to include them? What are the use cases for those additional fields? What is the minimal metadata?
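To make the question of minimal metadata concrete, here is a Python sketch of a record in which only the target URL is required and the who/what/when fields are optional. It serializes the record as line-oriented "key: value" text. The field names (_target, datacite.creator, datacite.title, datacite.publicationyear) and the escaping rule are assumptions to verify against the EZID documentation, and the example.org URL is a placeholder rather than a real Synapse address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DoiMetadata:
    target_url: str                # where the doi resolves to (web client or repo URL)
    creator: Optional[str] = None  # who
    title: Optional[str] = None    # what
    year: Optional[int] = None     # when


def _escape(value: str) -> str:
    """Percent-encode characters that would break a one-line "key: value" record."""
    return value.replace("%", "%25").replace("\n", "%0A").replace("\r", "%0D")


def to_key_value_text(meta: DoiMetadata) -> str:
    """Serialize the metadata, skipping optional fields that are not set."""
    fields = [
        ("_target", meta.target_url),  # assumed field names, see lead-in
        ("datacite.creator", meta.creator),
        ("datacite.title", meta.title),
        ("datacite.publicationyear", None if meta.year is None else str(meta.year)),
    ]
    return "\n".join(f"{key}: {_escape(value)}" for key, value in fields if value is not None)


if __name__ == "__main__":
    minimal = DoiMetadata(target_url="https://example.org/entity/syn144362/version/1")
    print(to_key_value_text(minimal))
```

With only the target set, the record is a single line, which is the smallest metadata that still lets the doi resolve. Supporting both web client and repo URLs would mean either two records (two dois) or an extra field, depending on which option above is chosen.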
Asynchronous Client to the DOI Service
Because we are calling an external service with unknown reliability, the call will certainly be made asynchronously. How do we handle failures? Record failed dois in a database table?
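One possible answer to the failure-handling question is sketched below in Python. Registrations run in a worker pool, each doi is retried a bounded number of times, and dois that still fail are collected with their last error. The register callable stands in for the real DOI service call, and the returned dictionary stands in for the proposed database table. Both are assumptions, as are the retry count and pool size.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def register_all(
    dois: Iterable[str],
    register: Callable[[str], None],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Register dois on worker threads; return {doi: last error} for failures."""
    doi_list = list(dois)

    def attempt(doi: str) -> Optional[str]:
        last_error = ""
        for _ in range(max_attempts):
            try:
                register(doi)  # hypothetical call to the external DOI service
                return None    # success
            except Exception as exc:  # the external service may fail in many ways
                last_error = str(exc)
        return last_error

    failed: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves input order, so results line up with doi_list.
        for doi, error in zip(doi_list, pool.map(attempt, doi_list)):
            if error is not None:
                failed[doi] = error  # in a real system: insert a row for later retry
    return failed


if __name__ == "__main__":
    def flaky(doi: str) -> None:
        if doi.endswith(".2"):
            raise RuntimeError("service unavailable")

    print(register_all(["doi:sage.synapse.1.1", "doi:sage.synapse.1.2"], flaky))
```

This sketch retries immediately and blocks until the whole batch finishes. A production client would more likely back off between attempts and let a background job re-process the recorded failures, for example after the provider's weekly service window.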
Data Migration
Involve or avoid?
External Links
Wikipedia http://en.wikipedia.org/wiki/Digital_object_identifier
DOI Handbook http://www.doi.org/hb.html
DOI Resolver http://dx.doi.org/
Short DOI http://shortdoi.org/
DOI Crossref http://www.crossref.org/
DataCite http://www.datacite.org/
Name-to-Thing EZID http://n2t.net/ezid/
EZID Documentation http://n2t.net/ezid/home/documentation
EZID Service Guidelines http://www.cdlib.org/services/uc3/docs/EZIDServiceGuidelines.pdf (Note that there is a one-hour service window each week.)