Synapse IDs, GUIDs and URLs

A Synapse user should be able to publish a Synapse URL in a paper without fear of the URL breaking one day. This requirement implies that Synapse URLs must be both immutable and globally unique.

Currently, a Synapse URL has the following form:

https://<host>/<prefix>/<api-version>/<type>/<id>

For example:

https://staging-reposervice.elasticbeanstalk.com/repo/v1/dataset/12

In this example:

host=staging-reposervice.elasticbeanstalk.com
prefix=repo
api-version=v1
type=dataset
id=12
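
The breakdown above can be sketched in a few lines of Python. This is only an illustration: the `SynapseUrl` type and the `parse_synapse_url` function are hypothetical names introduced here, not part of the Synapse code base.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class SynapseUrl(NamedTuple):
    """Hypothetical container for the five URL components."""
    host: str
    prefix: str
    api_version: str
    entity_type: str
    entity_id: int


def parse_synapse_url(url: str) -> SynapseUrl:
    """Split a Synapse URL into host, prefix, api-version, type and id."""
    parts = urlsplit(url)
    # Drop empty segments produced by the leading slash.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 4:
        raise ValueError(
            f"expected /<prefix>/<api-version>/<type>/<id>, got {parts.path!r}"
        )
    prefix, api_version, entity_type, raw_id = segments
    return SynapseUrl(parts.netloc, prefix, api_version, entity_type, int(raw_id))


parsed = parse_synapse_url(
    "https://staging-reposervice.elasticbeanstalk.com/repo/v1/dataset/12"
)
print(parsed)
```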

This URL would be globally unique and immutable if the host is globally unique (which it is) and the ID is both immutable and unique within the domain.  Currently our IDs are immutable and unique, but only within a single database schema.  This is a side effect of using MySQL-issued IDs that are produced from a hidden database sequence. In other words, if we were to migrate the above dataset from one schema to another, we could not guarantee that the ID would not change, because the ID might already be in use in the new schema. Therefore, our current URLs do not meet the immutability test.
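
The migration problem can be shown with a toy model. The `SchemaSequence` class below is a hypothetical stand-in for a schema whose IDs come from its own hidden sequence; it is not how MySQL is implemented, only a sketch of why two independent sequences collide.

```python
import itertools


class SchemaSequence:
    """Hypothetical stand-in for one schema with its own ID sequence."""

    def __init__(self, name):
        self.name = name
        self._counter = itertools.count(1)  # each schema starts at 1
        self.rows = {}

    def insert(self, label):
        """Store a row and return the ID issued by this schema's sequence."""
        new_id = next(self._counter)
        self.rows[new_id] = label
        return new_id


old_schema = SchemaSequence("old")
new_schema = SchemaSequence("new")

# The old schema issues IDs 1..11, then the published dataset gets ID 12.
for i in range(11):
    old_schema.insert(f"old-entity-{i}")
dataset_id = old_schema.insert("published dataset")

# The new schema has independently issued IDs 1..20 to unrelated entities.
for i in range(20):
    new_schema.insert(f"new-entity-{i}")

# Migrating the dataset cannot keep its ID: the new schema already uses it.
id_taken = dataset_id in new_schema.rows
print(dataset_id, id_taken)
```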

To resolve this issue we need a way to issue IDs that are unique within our domain. There are at least two basic ways to achieve this:

  1. Use a true GUID for the ID
  2. Use a dedicated service for issuing IDs to all instances of Synapse within the domain. The service would need to guarantee that it never issues the same ID twice in a distributed environment.

The problem with the first option is that real GUIDs are very long and unwieldy strings. In addition, we would still need a smaller, numeric ID for primary keys, foreign keys, joins and filters to meet database performance requirements. Comparing long strings is just too computationally expensive for the most basic database operations.
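
To make the size difference concrete, the following sketch compares a randomly generated GUID with the numeric ID from the example URL. It uses Python's standard `uuid` module purely for illustration; the variable names are assumptions introduced here.

```python
import uuid

guid = uuid.uuid4()        # a random (version 4) GUID
guid_text = str(guid)      # canonical 8-4-4-4-12 hexadecimal form
numeric_id = 12            # the ID from the example URL

# The textual GUID is 36 characters and 16 bytes in raw form.
print(len(guid_text), len(guid.bytes))

# The numeric ID fits easily in a signed 64-bit integer (8 bytes).
fits_in_long = numeric_id.bit_length() <= 63
print(fits_in_long)

# The same dataset URL, once with a GUID and once with the numeric ID.
base = "https://staging-reposervice.elasticbeanstalk.com/repo/v1/dataset/"
print(base + guid_text)
print(base + str(numeric_id))
```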

The problem with the second option is that we would need to maintain a separate ID generating service. However, this option means the IDs do not need to be globally unique.  They only need to be unique within the Sage Bionetworks domain.  This is easy to achieve if the ID generating service never issues the same ID twice.  These IDs can also be numeric, so they can double as the Entity primary key.  This option would require the smallest change to the repository service.

The ID generating service could be as simple as an independent and dedicated MySQL database that issues IDs from a sequence to all instances of Synapse.  On the back-end we can even keep using Longs for Entity IDs until we get close to 9,223,372,036,854,775,807 entities (Long.MAX_VALUE).
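
A minimal sketch of such a service is shown below. It assumes a single table whose only job is to hand out increasing IDs, and it uses Python's built-in `sqlite3` as a stand-in for the dedicated MySQL database so the example is self-contained. The `IdGenerator` class and the `issued_id` table are hypothetical names, not an existing Synapse component.

```python
import sqlite3

# Largest value of a signed 64-bit integer (Java's Long.MAX_VALUE).
LONG_MAX = 2**63 - 1


class IdGenerator:
    """Hypothetical ID service backed by one dedicated database table."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        # AUTOINCREMENT ensures an ID is never reused, even after deletes.
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS issued_id "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT)"
        )
        self._db.commit()

    def next_id(self):
        """Issue the next ID; each call commits its own transaction."""
        with self._db:
            cursor = self._db.execute("INSERT INTO issued_id DEFAULT VALUES")
            return cursor.lastrowid


generator = IdGenerator()
ids = [generator.next_id() for _ in range(5)]
print(ids)
print(LONG_MAX)
```

Every Synapse instance would request its IDs from this one source instead of its local schema, which is what keeps the IDs unique across the domain.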

While initially this would mean we have a single point of failure, we can always add a more complex and robust ID generating system in the future.