Project Organization
Introduction
A project is created for the purpose of gathering and sharing data for a specific topic. The project organizers are a group of users that collaborate to define how data should be organized in the project. Organizers use a combination of features such as folder hierarchy, file naming conventions, and annotations on files to organize the data in the project. Their goal is to make it easy for data consumers to find data of interest.
Data providers are another group of users that add data to the project (they may or may not overlap with project organizers and data consumers). Ideally, data providers would always follow a project's organization. In practice, some data will be uploaded without adhering to it. Here are some examples of why data is not organized as expected:
- The project's organization is not fully understood.
- Data does not fit neatly into the organization (further organization is required).
- Disagreement about how data should be organized.
- Bugs in automated upload scripts or human error for manual uploads.
- Lack of time or resources needed to follow the organization.
Even when data is organized as expected, it may still be difficult for data consumers to find data of interest. The challenges faced by data consumers depend on the organization techniques used for each project. In the next section we discuss the benefits and limitations of each organization technique.
Organization Techniques
Hierarchy
Organization by hierarchy involves defining a folder hierarchy to contain the data files. Often a folder is named for the category of files it contains. The hierarchy technique works well when the data to be categorized is naturally hierarchical. A classic example is the taxonomic ranking of organisms (Kingdom, Phylum, Class, Order, Family, Genus, Species), where each organism belongs to only one Kingdom, Phylum, Class, etc. However, the hierarchy technique dictates how data consumers are expected to navigate the data. For example, to find data about an organism in a taxonomic ranking hierarchy, the consumer must know the organism's full ranking.
The hierarchy technique is a poor fit for data categories that are orthogonal to each other. For example, consider a project that contains data about college students where the categories of interest are GPA, Major, and Age. In this example the categories are completely independent of one another, so any hierarchy built from them is arbitrary. A ranking by Age, then Major, then GPA might work well if the data consumer wants to find all data for students over 40 years old. However, if the data consumer wants to find data for all students with a low GPA, the same hierarchy becomes a hindrance.
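The college student example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested dictionary stands in for a folder tree, and the folder names, file names, and helper functions are hypothetical rather than anything taken from a real project.

```python
# Hypothetical folder hierarchy: age band -> major -> GPA band -> file names.
hierarchy = {
    "over_40": {
        "biology": {"gpa_low": ["s1.csv"], "gpa_high": ["s2.csv"]},
        "physics": {"gpa_low": [], "gpa_high": ["s3.csv"]},
    },
    "under_40": {
        "biology": {"gpa_low": ["s4.csv"], "gpa_high": []},
        "physics": {"gpa_low": ["s5.csv"], "gpa_high": ["s6.csv"]},
    },
}


def files_under(node):
    """Collect every file below a folder (a dict) or in a leaf (a list)."""
    if isinstance(node, list):
        return list(node)
    found = []
    for child in node.values():
        found.extend(files_under(child))
    return found


def files_for_gpa(tree, gpa_band):
    """Gather one GPA band by visiting every age and major folder."""
    found = []
    for majors in tree.values():
        for gpa_bands in majors.values():
            found.extend(gpa_bands.get(gpa_band, []))
    return found


# The question the hierarchy was built for: open a single top-level folder.
over_40 = files_under(hierarchy["over_40"])

# The orthogonal question: every branch of the tree has to be visited.
low_gpa = files_for_gpa(hierarchy, "gpa_low")
```

The first lookup touches one folder, while the second has to walk the entire tree and must know in advance at which level the GPA folders sit.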
File Naming Scheme
This organization technique involves naming each file following some descriptive naming scheme. This technique is often used in conjunction with the hierarchy technique. In fact, the naming scheme often mirrors some type of hierarchy. The file naming scheme has all of the same benefits and limitations as the hierarchy technique.
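A minimal sketch of why a naming scheme behaves like a hierarchy, assuming a hypothetical scheme of the form ageband_major_gpaband_studentid.csv (the field names and the underscore separator are illustrative assumptions, not a documented convention):

```python
from pathlib import PurePosixPath

# Hypothetical naming scheme: <age_band>_<major>_<gpa_band>_<student_id>.<ext>
FIELDS = ("age_band", "major", "gpa_band", "student_id")


def parse_name(file_name):
    """Split a file name into its scheme fields, or return None if it does not fit."""
    parts = PurePosixPath(file_name).stem.split("_")
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def name_to_path(file_name):
    """Show that the name encodes the same ordering as a folder path."""
    parsed = parse_name(file_name)
    if parsed is None:
        return None
    folders = [parsed["age_band"], parsed["major"], parsed["gpa_band"]]
    return "/".join(folders + [file_name])


parsed = parse_name("over40_biology_lowgpa_s1.csv")
path = name_to_path("over40_biology_lowgpa_s1.csv")
unparsed = parse_name("notes.csv")  # a file that ignores the scheme
```

The name carries the same fixed ordering of categories as a folder path would, and a file that does not follow the scheme simply cannot be interpreted.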
File Description
This technique involves writing a human-readable description for each file. A file description that is well written and complete can be valuable to data consumers. This is also the only organization technique that works well with search engine technologies (e.g., Google). When executed well, data consumers can find the data they want by formulating a simple text search. The main limitation of this technique is finding people who are willing and able to write a useful description for each file.
Provenance
Provenance provided for a data file will typically include links that describe where the data originated and links to the scripts used to process or transform the data. These links form a network that data consumers can navigate to help find data of interest. To make the most of these provenance networks, users need rich graphical UI tools and services that can query the networks using graph query languages. The current provenance tools in Synapse only allow users to follow links upstream to where the data originated. This limits the data consumer's ability to use provenance to find data of interest.
It is also challenging to get data providers to include provenance with the data they provide. It is even more difficult for a third party to add provenance to data after the fact.
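The difference between following links upstream and querying the network in other directions can be sketched as follows. This is a toy model, not the Synapse provenance API: the dictionary of links, the file names, and the helper functions are all hypothetical.

```python
from collections import deque

# Hypothetical provenance links: each item maps to the items it was derived from.
derived_from = {
    "normalized.csv": ["raw.csv", "normalize.py"],
    "summary.csv": ["normalized.csv", "summarize.py"],
    "plot.png": ["summary.csv"],
}


def reachable(start, links):
    """Breadth-first walk returning everything reachable from start via links."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def invert(links):
    """Reverse every link so the network can be walked downstream."""
    inverted = {}
    for item, sources in links.items():
        for source in sources:
            inverted.setdefault(source, []).append(item)
    return inverted


# Upstream: where did summary.csv come from?
upstream = reachable("summary.csv", derived_from)

# Downstream: what was produced from raw.csv? Needs the reversed links.
downstream = reachable("raw.csv", invert(derived_from))
```

The upstream walk only needs the links stored with each file. The downstream question requires an index over the whole network, which is the kind of query service described above.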
Annotations
An annotations system is a form of structured data classification. With annotations, structure is imposed by first defining the data categories of interest. Each file is then assigned a value for each category. Annotations work best for data discovery when the categories are well defined and understood by both the data providers and data consumers. With such a system, data consumers can formulate queries to find data of interest using the data categories.
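As a rough sketch of how such queries work, the snippet below filters files by category values. It assumes a hypothetical in-memory list of files with an annotations dictionary each; the category names reuse the college student example, and the find function is illustrative rather than a Synapse call.

```python
# Hypothetical files, each annotated with a value per category.
files = [
    {"name": "s1.csv", "annotations": {"age": 45, "major": "biology", "gpa": 2.1}},
    {"name": "s2.csv", "annotations": {"age": 22, "major": "physics", "gpa": 3.8}},
    {"name": "s3.csv", "annotations": {"age": 41, "major": "physics", "gpa": 3.5}},
    {"name": "s4.csv", "annotations": {"age": 19, "major": "biology", "gpa": 1.9}},
    {"name": "s5.csv", "annotations": {"major": "physics"}},  # incompletely annotated
]


def find(files, **predicates):
    """Return names of files whose annotations satisfy every predicate.

    Each keyword is a category name and each value is a function that
    tests the annotation value. Files missing a category never match.
    """
    matches = []
    for entry in files:
        annotations = entry["annotations"]
        if all(
            key in annotations and test(annotations[key])
            for key, test in predicates.items()
        ):
            matches.append(entry["name"])
    return matches


# Any category can be queried on its own; no ordering is imposed.
low_gpa = find(files, gpa=lambda value: value < 2.5)
older_physics = find(
    files,
    age=lambda value: value > 40,
    major=lambda value: value == "physics",
)
```

Unlike a folder hierarchy, no category takes precedence over another. The incompletely annotated file also shows the dependence on data providers: s5.csv cannot be found by age or GPA.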
Tagging
Tagging involves adding one or more short descriptive strings (or tags) to a data file. Unlike annotations, tags are a form of unstructured classification since no categories are defined. Instead, values are added to each file without considering predefined categories. The lack of structure is both a strength and a weakness for data providers and consumers. Since there is no structure, data providers can add tags at their own discretion. However, there are no guidelines to help providers add tags of value, so the quality of tags can be inconsistent across data providers and over time. To discover data of interest, data consumers only need to provide one or more tag values. This is simpler than building a filter with key/value pairs as annotation queries require. However, it is not possible to find data by category since no categories were defined. Because it is difficult to add valuable tags consistently, it can be difficult to find data of interest.
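A small sketch of tag lookup and its main failure mode, assuming a hypothetical mapping from file names to free-form tag sets (the tags and the find_by_tags helper are invented for illustration):

```python
# Hypothetical tags added by different providers with no shared guidelines.
tags_by_file = {
    "s1.csv": {"biology", "low-gpa"},
    "s4.csv": {"bio", "lowGPA"},  # same meaning, different spelling
    "s5.csv": {"physics", "low-gpa"},
}


def find_by_tags(tags_by_file, *wanted):
    """Return the files that carry every requested tag, sorted by name."""
    wanted_set = set(wanted)
    return sorted(
        name for name, tags in tags_by_file.items() if wanted_set <= tags
    )


# The query is just a list of values; no keys or categories are involved.
low_gpa = find_by_tags(tags_by_file, "low-gpa")
low_gpa_biology = find_by_tags(tags_by_file, "low-gpa", "biology")
```

The query is simpler than a key/value filter, but s4.csv is missed by both searches because its provider chose different spellings for the same ideas.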
Conclusions
Currently, Synapse supports all of the above organization techniques with the exception of tagging. Each time we considered adding a tagging system to Synapse, we concluded that most scientific data has natural data categories that both project organizers and data consumers wish to utilize. This likely also explains why most mature projects adopt some level of project organization using annotations. Annotations strike a good balance between maintainability and usefulness. We believe that improving the annotations system in Synapse will provide the greatest benefit to project organizers, data providers, and data consumers.
There are several major limitations to the annotations system currently supported in Synapse.
- There is no mechanism for project organizers to define the data categories within Synapse. Therefore, there is no annotations guide for data providers or data consumers.
- Data consumers must write SQL-like queries to utilize annotations within a given scope. There are no UI tools that help data consumers discover data using annotations.
These limitations are also reflected in the driving use cases for this feature set: Use Cases.
The following features are proposed to aid in project organization: