This is the data model implied by the wireframe Platform UI found at http://pixeltheoryinc.com/clients/sage/platform/ circa Dec. 2010.
The source file for the image is attached; the tool for editing it can be freely downloaded from http://www.umlet.com/
Notes / Assumptions:
The data model depicted here may well span multiple data stores, including (1) a core data store to track projects, datasets, revisions, etc., (2) an authentication/authorization system to track users and groups, (3) a document collaboration and discussion thread system (e.g. Google Apps / Google Groups), and (4) a file-based system for molecular-profiling data.
"Posted" means uploaded to the Platform. "Release" is synonymous with "Post".
"Published" (for something posted) means to be made available to all Platform users.
"Downloadability" is a Dataset attribute separate from whether it's "Published" (though in practice a Dataset is Posted, then Published before it's made downloadable).
A Dataset is associated with a Project in one of two ways: (1) it is accessible by a Group created for a project team, or (2) it is used in a Workstream owned by the Project. The latter is the definition of a Project "using" a dataset.
Access control is on Projects, Datasets (incl. Dataset layers, Analysis Results), and Scripts/Algorithms.
Versioning is on Data Layers (including Analysis Results, Networks, Gene Lists, Workstream Steps) and Scripts/Algorithms. A Dataset inherits the versions of its layers, e.g. when a layer is revised, the version of the Dataset is (conceptually) incremented; see the sketch after these notes.
"Following" applies to Projects, Workstreams, and Datasets (i.e. users can follow each of these).
Commenting can be done on Datasets, Dataset Layers, and Scripts/Algorithms.
Study size of a Dataset is the size of the union of sample sets for its Data Layers.
A Dataset has a single "study area / disease" attribute.
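A minimal sketch (Python, with hypothetical class and field names that are not taken from the wireframes) of how a layer revision, its posted date, and the version number might roll up to the owning Dataset, per the versioning note above:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class DataLayer:
    name: str                                    # e.g. "expression - liver", "clinical"
    version: int = 1
    posted: datetime = field(default_factory=datetime.utcnow)

    def revise(self) -> None:
        """Any change to a layer bumps its version and its posted date."""
        self.version += 1
        self.posted = datetime.utcnow()

@dataclass
class Dataset:
    name: str
    layers: List[DataLayer] = field(default_factory=list)
    version: int = 1                             # conceptually bumped whenever any layer changes

    def revise_layer(self, layer: DataLayer) -> None:
        layer.revise()
        self.version += 1                        # the Dataset "inherits" the layer revision

    @property
    def posted(self) -> datetime:
        """Dataset 'posted' rolls up from its layers (latest layer posted date)."""
        return max(layer.posted for layer in self.layers)
```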
Answered Questions:
- are datasets associated with projects? if so, is the association optional?
Yes and yes.
- can a dataset be associated with multiple projects?
Yes. The idea is that there is a global listing of datasets, which may be public or restricted to certain groups / individuals. When you browse the datasets tab you are browsing this global, Sage-curated and approved library of datasets.
We have also heard the need for project teams to upload their own datasets, which would be limited in scope to the project. The idea is to give users the ability to quickly start working with data without being gated by curation. Users could then "publish" the data from the project to the global library, which requires the same curation process as any other data (although hopefully the project teams get it into reasonable shape to begin with).
(Note the "DatasetMyers" screen says "Projects Using this Dataset" suggesting there's a "uses" relationship bet. projects and ds's.)
- what does "suggest a dataset" mean?
Send an email to the Sage curation team about a dataset you think we should acquire and support.
- do species and tissue type go in the dataset or in the layer?
Surprisingly, this is a bit tricky. Species has always been the same in every case and should go in the dataset. In the vast majority of cases, tissue is the same for all layers, and users want the ability to browse, sort, and filter datasets by that tissue type. However, some datasets contain data from multiple tissues, e.g. different brain regions in the Harvard Brain data, or both fat and liver tissue in a diabetes study. In these cases the genetics is the same for all the tissues, but I've been thinking of this as multiple layers of expression data, with a layer per tissue type. This seems right, as users might want only the data from a specific tissue, and any analysis would probably treat the data as separate objects. Another example is cancer genetics, where you have genetic data on tumor and adjacent normal tissue to look for mutations in the cancer. Again, this seems like two layers from an analysis / data access point of view.
However, we don't allow browsing / sorting / filtering of layers, and doing so seems wrong to me. When browsing / sorting / filtering, I think you'd want to treat the dataset as having all the tissue types of all its layers. So, a dataset with layers from liver and fat would show up as having two values in the tissue type attribute, and you'd find the dataset if you filtered on either value. In other words, in some cases the attributes of a dataset are the union of the attributes of all its individual layers.
Note that tissue doesn't really apply to some layers at all, e.g. many clinical variables are measurements on the whole organism, not any particular tissue. I think this is still consistent with the above if we just ignore layers that don't contribute to the dataset-level values.
I think this works for many categorical variables, e.g. Platform might also work the same way.
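A minimal sketch of the roll-up described above, assuming a hypothetical tissue_type value per layer: the dataset's tissue types are the union of its layers' values, layers without a tissue contribute nothing, and a filter matches if any layer has the requested value:

```python
from typing import Iterable, Optional, Set

def dataset_tissue_types(layer_tissues: Iterable[Optional[str]]) -> Set[str]:
    """Union of the layers' tissue types; layers without a tissue
    (e.g. organism-level clinical variables) are simply ignored."""
    return {t for t in layer_tissues if t is not None}

# A dataset with a liver layer, a fat layer, and a clinical layer (no tissue)
# is browsable/filterable under both "liver" and "fat":
assert dataset_tissue_types(["liver", "fat", None]) == {"liver", "fat"}
```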
- same question for StudySize
For study size I've been using the total number of individuals at the dataset level. Not all layers would necessarily have data on all the individuals, so they might have lower numbers. This is actually pretty important information about the data: one user wanted to see a Venn diagram of how many samples existed for every combination of layers (see the sketch below)!
- that is, is it correct to say that a dataset specifies the subjects, tissue types, etc., and the layers represent the assays (GE, GT, sequencing)? Or can different layers have different subjects, tissue types, phenotypes and numbers of samples?
See above for most of this. The different layers shouldn't have data on completely different sets of individuals; otherwise they would just be separate datasets. However, some dataset layers may be incomplete for some individuals.
- may be multiple diseases, species, tissue types, platforms
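A minimal sketch of the study-size roll-up and of the per-combination sample counts behind the Venn diagram request, using hypothetical layer names and sample IDs:

```python
from itertools import combinations
from typing import Dict, Set, Tuple

def study_size(layer_samples: Dict[str, Set[str]]) -> int:
    """Study size of the dataset = size of the union of its layers' sample sets."""
    return len(set().union(*layer_samples.values())) if layer_samples else 0

def samples_per_combination(layer_samples: Dict[str, Set[str]]) -> Dict[Tuple[str, ...], int]:
    """How many samples are shared by every combination of layers (Venn diagram counts)."""
    counts: Dict[Tuple[str, ...], int] = {}
    names = list(layer_samples)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            counts[combo] = len(set.intersection(*(layer_samples[n] for n in combo)))
    return counts

layers = {
    "genetics":   {"s1", "s2", "s3", "s4"},
    "expression": {"s1", "s2", "s3"},
    "clinical":   {"s2", "s3", "s4", "s5"},
}
assert study_size(layers) == 5                                           # union of all sample sets
assert samples_per_combination(layers)[("expression", "clinical")] == 2  # s2 and s3 overlap
```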
- what does it mean for a dataset to be "posted"?
Aren't different layers posted at different times?
"Posted" I think is "last modified". Different layers might be released at different times, I think this would be the exception for raw data, but in short term at least it will be common for the QC layers to come in later. I think every time a dataset is modified, it's posted date changes and version number increases.
- Is the dataset "description" (seen in the Datasets screen) the same as the dataset "Overview" (see in the DatasetMyers screen)?
No, the text on the datasets screen would be some text describing the types of data Sage is hosting.
- should 'posted' and 'curated' time stamps be associated w/ Datasets or with dataset layers?
Again, probably the layers, with information "rolled up" to the dataset.
- Is dataset "download availability" = (release date != null)?
I think so... not exactly sure of the question here.
- Re "Release Notes: 3", what does it mean to "release", can a DS be released more than
once? Does it mean another sample has been added to the data set, that the DS has
been updated, or something else?
Probably it most commonly means a new layer was added (especially a QC layer); new samples are also possible. Any change at all to the dataset needs to be versioned: since a goal is to be able to reproduce analyses, we have to know exactly what data was available when a particular analysis was run, even if the dataset changes later (see the sketch below).
Practically, I don't see a large number of versions of a dataset occurring. This will be an infrequent event, and many datasets may only have a v1 release, but I think we still need to code for it.
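A minimal sketch, with hypothetical field names, of the provenance an analysis result would need to carry so that a run can be reproduced even after the dataset is revised:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class AnalysisProvenance:
    """Pins an analysis run to the exact versions of its inputs."""
    dataset_id: str
    dataset_version: int        # the dataset version the analysis actually ran against
    script_id: str
    script_version: int
    run_at: datetime

# An analysis run against version 2 of a dataset stays pinned to version 2,
# even after a later QC layer bumps the dataset to version 3.
record = AnalysisProvenance("dsMyers", 2, "coexpression_script", 1, datetime.utcnow())
```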
- What is a "contributor" -- Someone who uploads data? someone who analyses the data?
Is it something more restrictive, like the PI of the experiment who generated the data?
What should be the granularity of 'contribution' -- by the sample, the layer, the
data set, some sort of ds revision, ...?
The contributor is the person who provided us the data, most likely the PI of the lab that generated it. It's the person we thank profusely for his contribution, and encourage users of the data to cite when they publish work that uses the data. The data is actually sent to us by some grad student or post doc in the contributor's lab.
In the examples I've seen there has been one contributor per dataset, though I guess multiple contributors could be possible.
- In the DatasetMyers screen, what does "Modified" mean? Is it the latest date something changed in the project? If so, what are the things whose modifications need to be tracked? Do they also need to be versioned?
Something changed in the curated data set layer. We're not in a project context here.
Open Questions:
- what does "suggest a project" mean?
- Is it OK that a project doesn't have the following, but rather inherits it from its data sets?
diseases / study areas
# of followers
last activity
- what are the 'activities' that need to be tracked, e.g.
- add a data set to a project
- add a layer to a dataset that's added to a project
- add a sample to a dataset (layer?) that's added to a project
- grant access to a document to a group associated with a project
- send an email to a group associated with a project (probably too minor...)
- what objects are to be 'versioned'?
- how is 'last modified' defined for a workstream?
- Is every data set and layer in a project in some workstream, or can a project have datasets/layers that aren't in any workstream?
(My guess is the latter since a project can have new, unused data, and perhaps 'scratch' analyses.)
- Are workstream steps 'predefined' or are they created during analysis? E.g. do you have blank steps in a workstream while you're waiting for an analysis to be continued?
- The workstream steps seem to include the operations as well as the derived data layers. Is this right?
Isn't there a missing step between "Network" and "Gene Lists" that says how the gene lists were selected from the network?
- Can a workstream branch or is it purely linear?
- Is normalized data considered a distinct layer from the raw data it uses as input?
- It seems to me that a "Workstream step" is a specific kind of "Analysis Result". Is that right?
- What are the valid "statuses" for an algorithm, besides "Unpublished"?
- what does it mean to "Remove" Network1 or GeneList1?
- "Submit a network": Would there have to be a parent dataset (existing or created upon the network's submission)?
- Networks page: The "Project" of a network could be the Project owning the workstream whose step created the network or, if the network was not created as a workstream step, a comma-delimited list of all Projects using the dataset that is the parent of the network.
- Should the Platform have "Scripts" as a generalization of "algorithm", or are all scripts that can process data considered algorithms?
- What objects (if any, so far) need versioning?
- What are the types of access? E.g. is there 'read-only' access to a resource for some groups and read-write for others? Is there a separate permission to 'publish'?
- What are the possible statuses for a dataset?