...

The hierarchy technique is a poor fit for data categories that are orthogonal to each other. For example, consider a project that contains data about college students, where the categories of interest are GPA, Major, and Age. In this example, each category is completely independent of the others, so any hierarchy built from them is arbitrary. A ranking by Age, then Major, then GPA might work well if the data consumer wants to find all data for students over 40 years old. However, if the data consumer wants to find data for all students with a low GPA, that same hierarchy becomes a hindrance, because the matching files are scattered across every Age and Major branch.
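The trade-off above can be sketched in a few lines of Python. This is only an illustration, not Synapse code: the student records, the band thresholds, and the helper names (`build_hierarchy`, `find_by_age_band`, `find_low_gpa`) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical student records; the fields mirror the example categories.
STUDENTS = [
    {"file": "s1.csv", "age": 45, "major": "Biology", "gpa": 3.6},
    {"file": "s2.csv", "age": 19, "major": "Physics", "gpa": 1.9},
    {"file": "s3.csv", "age": 42, "major": "Physics", "gpa": 2.1},
    {"file": "s4.csv", "age": 20, "major": "Biology", "gpa": 3.2},
]


def age_band(age):
    # Assumed threshold, taken from the "over 40" example in the text.
    return "over_40" if age > 40 else "40_and_under"


def gpa_band(gpa):
    # Assumed threshold for "low GPA"; the text does not define one.
    return "low_gpa" if gpa < 2.5 else "other_gpa"


def build_hierarchy(students):
    """Nest files as age band -> major -> GPA band (an arbitrary ranking)."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for s in students:
        tree[age_band(s["age"])][s["major"]][gpa_band(s["gpa"])].append(s["file"])
    return tree


def find_by_age_band(tree, band):
    """Top-level category: a single branch holds every match."""
    majors = tree.get(band, {})
    return sorted(f for bands in majors.values() for files in bands.values() for f in files)


def find_low_gpa(tree):
    """Lowest-level category: every age/major branch has to be visited."""
    visited = 0
    matches = []
    for majors in tree.values():
        for bands in majors.values():
            visited += 1
            matches.extend(bands.get("low_gpa", []))
    return sorted(matches), visited
```

With this ranking, the age query reads one top-level branch, while the GPA query has to walk every age/major folder to collect its matches. Reordering the levels would only move the problem to a different category.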

File Naming Scheme

...

This technique involves writing a human-readable description for each file. A well-written, complete file description can be very valuable to data consumers. This is also the only organization technique that works well with search-engine technologies (e.g., Google). When it is executed well, data consumers can find anything they need by formulating a simple text search. The main limitation of this technique is finding people who are willing and able to write a useful description for each file.
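As a rough illustration of why description quality matters, the sketch below runs a naive keyword search over file descriptions. It is far simpler than a real search engine, and the file names, descriptions, and the `search` helper are hypothetical.

```python
# Hypothetical file descriptions; "c.csv" has a description of little use.
DESCRIPTIONS = {
    "a.csv": "GPA by semester for Biology majors over 40 years old.",
    "b.csv": "Enrollment counts for Physics majors, grouped by age.",
    "c.csv": "data file",
}


def search(descriptions, query):
    """Return files whose description contains every query term (case-insensitive)."""
    terms = query.lower().split()
    return sorted(
        name
        for name, text in descriptions.items()
        if all(term in text.lower() for term in terms)
    )
```

Here `search(DESCRIPTIONS, "biology gpa")` finds `a.csv`, but no reasonable query surfaces `c.csv`, whatever it actually contains. The technique is only as good as the descriptions that providers are willing to write.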

...

Tagging involves adding one or more short descriptive strings (tags) to a data file. Unlike annotations, tags are a form of unstructured classification: no categories are defined, and values are added to each file without regard to predefined categories. This lack of structure is both a strength and a weakness for data providers and consumers. Because there is no structure, data providers can add tags at their own discretion. However, there are also no guidelines to help providers add tags of value, so the value of tags can be inconsistent across data providers and over time. To discover data of interest, data consumers only need to provide one or more tag values, which is simpler than building a filter from key/value pairs as annotation queries require. However, it is not possible to find data by category, since no categories were defined. Because it is difficult to add valuable tags consistently, it can also be difficult to find data of interest.
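The difference between tags and key/value annotations can be sketched as follows. This is a minimal illustration under assumed data, not the Synapse API: the files, the annotation keys (`major`, `gpaBand`, `quality`), and the helper functions are hypothetical.

```python
# Hypothetical files, each carrying free-form tags and key/value annotations.
# In "b.csv" the tag "low" refers to data quality, not GPA.
FILES = {
    "a.csv": {
        "tags": {"biology", "low"},
        "annotations": {"major": "Biology", "gpaBand": "low"},
    },
    "b.csv": {
        "tags": {"physics", "low"},
        "annotations": {"major": "Physics", "gpaBand": "high", "quality": "low"},
    },
}


def find_by_tags(files, *tags):
    """Match files that carry every requested tag; no category is involved."""
    wanted = set(tags)
    return sorted(name for name, f in files.items() if wanted <= f["tags"])


def find_by_annotations(files, **criteria):
    """Match files whose annotations equal every requested key/value pair."""
    return sorted(
        name
        for name, f in files.items()
        if all(f["annotations"].get(key) == value for key, value in criteria.items())
    )
```

A tag search for `"low"` returns both files, because the tag does not say which category it belongs to. The annotation filter `gpaBand="low"` returns only `a.csv`, at the cost of the consumer having to know the key.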

Conclusions

Currently, Synapse does not have services that let project organizers formally define a project's organization. For example, there are no services to define a project's annotation “schema” or file hierarchy. Instead, project organizers maintain a project's organization by first attempting to communicate with data providers (wikis, emails, etc.). When this does not work, organizers rely on tools and features to help find and correct erroneous data. The current set of tools and features is lacking:

  • Unlike the table query language, the annotation query language is not derived from SQL.
  • Filtering by hierarchy is limited to projectId and benefactorId.
  • Annotations query system is not scaling with Synapse.
  • A single annotations query can adversely affect all Synapse services by overloading the main database.
  • While it is possible to find incorrect annotations, applying fixes in bulk is not possible. In fact, users have developed at least three systems to try to solve this issue from the client side.

 

For more information, see: History of Controlled Vocabularies in Synapse

Synapse supports all of the above organization techniques with the exception of tagging. Each time we considered adding a tagging system to Synapse, we concluded that most scientific data has natural data categories that both project organizers and data consumers wish to use. This likely also explains why most mature projects adopt some level of project organization using annotations. Annotations strike a nice balance between maintainability and usefulness. We believe that improving the annotations system in Synapse will provide the greatest benefit to project organizers, data providers, and data consumers.

There are several major limitations to the annotations system currently supported in Synapse.

  • There is no mechanism for project organizers to define the data categories within Synapse.  Therefore, there is no annotations guide for data providers or data consumers.
  • Data consumers must write SQL-like queries to utilize annotations within a given scope. There are no UI tools that help data consumers discover data using annotations.

These limitations are also reflected in the driving use cases for this feature set: Use Cases.

The following features are proposed to aid in project organization:

...