Data Model

OBSOLETE

This is the data model implied by the wire frame Platform UI found at http://pixeltheoryinc.com/clients/sage/platform/
circa Dec. 2010
The source file for the image is attached, and the tool for editing it can be freely downloaded: http://www.umlet.com/

Notes / Assumptions:

The data model depicted here may well span multiple data stores, including (1) a core data store to track projects, data sets, revisions, etc., (2) an authentication/authorization system to track users and groups, (3) a document collaboration and discussion thread system (e.g. Google Apps / Google Groups), and (4) a file-based system for molecular-profiling data.

"Posted" means uploaded to the Platform.

"Published" (for something posted) means to be made available to all Platform users. "Release" is synonymous with "Publish".

"Downloadability" is synonymous with read-access.

A Dataset is associated with a Project in one of two ways: (1) it is accessible by a Group created for a project team, or (2) it is used in a Workstream owned by the Project. The latter is the definition of a Project "using" a dataset.
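
As a rough sketch of the two association paths just described (the class and field names below are illustrative assumptions, not part of the wire frames or of any existing schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Workstream:
    name: str
    dataset_ids: Set[str] = field(default_factory=set)  # datasets used as inputs


@dataclass
class Project:
    name: str
    team_group: str  # hypothetical: name of the Group created for the project team
    workstreams: List[Workstream] = field(default_factory=list)


def uses(project: Project, dataset_id: str) -> bool:
    """A Project "uses" a Dataset if one of its Workstreams uses it."""
    return any(dataset_id in ws.dataset_ids for ws in project.workstreams)


def is_associated(project: Project, dataset_id: str,
                  group_access: Dict[str, Set[str]]) -> bool:
    """True if the team's Group can access the dataset, or the Project uses it."""
    accessible = dataset_id in group_access.get(project.team_group, set())
    return accessible or uses(project, dataset_id)
```

Here `group_access` stands in for whatever the authorization system would report (group name to accessible dataset ids); how that lookup is really done is left open.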

Access control is on Projects, Datasets (incl. Dataset layers, Analysis Results), and Scripts. (Note, there is no such thing as a 'private' resource. Scripts and data are always available at least to the project team and later to the public.)
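
A minimal sketch of the "no private resources" rule, assuming a resource is readable by its project team until it is published and by everyone afterwards. The names are hypothetical, and finer-grained restrictions (e.g. datasets limited to certain groups or individuals) are deliberately left out:

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Resource:
    name: str
    team_group: str          # hypothetical: Group of the owning project team
    published: bool = False  # "Published" = available to all Platform users


def can_read(resource: Resource, user_groups: Set[str]) -> bool:
    """Read access ("downloadability"): team members always, everyone once published."""
    return resource.published or resource.team_group in user_groups
```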

Versioning is on Data Layers (including Analysis Results, Networks, Gene Lists, *maybe* (TBD) Workstream Steps) and Scripts. A Dataset inherits the versions of its layers, e.g. when a layer is revised the version of the Dataset is (conceptually) incremented. (Note, a likely design decision is to delegate document versioning to Google Apps and version control systems.)
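
One way to picture a Dataset inheriting versions from its layers, as a sketch only: the counter scheme below is an assumption, and the real design may delegate versioning elsewhere, as noted above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Dataset:
    name: str
    version: int = 1
    # layer name -> current version of that layer
    layer_versions: Dict[str, int] = field(default_factory=dict)

    def revise_layer(self, layer: str) -> None:
        """Revising a layer bumps that layer's version and the Dataset's version."""
        self.layer_versions[layer] += 1
        self.version += 1
```

For example, a dataset at version 1 with an expression layer and a genotype layer moves to version 2 when the expression layer is revised, while the genotype layer stays at its own version 1.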

"Following" applies to Projects, Workstreams, and Datasets as detailed in this page: http://pixeltheoryinc.com/clients/sage/platform/Profile.html

Commenting can be done on Datasets, Dataset Layers, Scripts, Projects, Workstreams.

Study size of a Dataset is the number of subjects involved in the study. (Note this is not equivalent to the number of samples, which may be more.)

A Dataset may have a single "study area / disease" attribute. (This is TBD.)

Answered Questions:

- are data-sets associated with projects? if so, is the association optional?
Yes and yes.

- Can a dataset be associated with multiple projects?
Yes. Idea is there is a global listing of datasets which may be public, or restricted to certain groups / individuals. When you browse the datasets tab you are browsing this global, Sage-curated and approved library of datasets.
We have also heard the need for project teams to upload their own data sets, which would be limited in scope to the project. Idea is to give users the ability to quickly start working with data without being gated by curation. Users could then "publish" the data from the project to the global library, which requires the same curation process as any other data (although hopefully the project teams get it in reasonable shape to begin with.)

(Note the "DatasetMyers" screen says "Projects Using this Dataset", suggesting there's a "uses" relationship between projects and datasets.)

- what does "suggest a dataset" mean?
Send an email to the Sage curation team about a dataset you think we should acquire and support.

- do species and tissuetype go in the dataset or in the layer?
Surprisingly a bit tricky. Species has always been the same in every case, and should go in the dataset. In the vast majority of cases, tissue is the same for all layers, and users want the ability to browse, sort, and filter datasets by that tissue type. However, some data sets contain data from multiple tissues, e.g. different brain regions in the Harvard Brain data, or both fat and liver tissue in a diabetes study. In these cases the genetics is the same for all the tissues, but I've been thinking of this as multiple layers of expression data, with a layer per tissue type. This seems right as users might want only the data from a specific tissue, and any analysis would probably treat the data as separate objects. Another example is cancer genetics, where you have genetic data on tumor and adjacent normal tissue to look for mutations in the cancer. Again, this seems like two layers from an analysis / data access point of view.
However, we don't allow browsing / sorting / filtering of layers, and doing so seems wrong to me. When browsing / sorting / filtering I think you'd want to treat the dataset as having all the tissue types of all its layers. So, a dataset with layers from liver and fat would show up as having two values in the tissue type attribute, and you'd find the dataset if you filtered on either value. I.e., in some cases the attributes of a dataset are the union of the attributes on all the individual layers.
Note that tissue doesn't really apply to some layers at all, e.g. many clinical variables are measurements on the whole organism, not any particular tissue. I think this is still consistent with the above if we just ignore layers that don't contribute to the data set level values.
I think this works for many categorical variables, e.g. Platform might also work the same way.
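
The roll-up described in this answer could look roughly like the following. `Layer`, `tissue_type` and the function names are illustrative assumptions; layers where tissue does not apply are simply ignored:

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Layer:
    name: str
    tissue_type: Optional[str] = None  # None when tissue doesn't apply (e.g. clinical data)


def dataset_tissue_types(layers: List[Layer]) -> Set[str]:
    """Dataset-level tissue types: the union of the values on its layers."""
    return {layer.tissue_type for layer in layers if layer.tissue_type is not None}


def matches_tissue_filter(layers: List[Layer], tissue: str) -> bool:
    """A dataset is found if the filter value matches any of its layers."""
    return tissue in dataset_tissue_types(layers)
```

The same pattern would presumably carry over to other categorical attributes such as Platform.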

- Same question for StudySize.
For study size I've been using the total number of individuals at the dataset level. Not all layers would necessarily have data on all the individuals, so they might have lower numbers. This is actually pretty important information about the data. One user wanted to see a Venn Diagram of how many samples existed for every combination of layers!
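
A sketch of how the dataset-level study size and the per-combination counts behind such a Venn diagram might be computed, assuming each layer can report the set of individuals it has data for (the input shape and function names are assumptions). Note that the counts are plain intersections, i.e. individuals present in every layer of a combination, not the exclusive regions of the diagram:

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set


def study_size(layer_subjects: Dict[str, Set[str]]) -> int:
    """Total number of distinct individuals across all layers."""
    return len(set().union(*layer_subjects.values()))


def overlap_counts(layer_subjects: Dict[str, Set[str]]) -> Dict[FrozenSet[str], int]:
    """For every non-empty combination of layers, count the individuals
    that have data in all layers of that combination."""
    counts: Dict[FrozenSet[str], int] = {}
    names = sorted(layer_subjects)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            shared = set.intersection(*(layer_subjects[n] for n in combo))
            counts[frozenset(combo)] = len(shared)
    return counts
```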

- That is, is it correct to say that a dataset specifies the subjects, tissue types, etc., and the layers represent the assays (GE, GT, sequencing)? Or can different layers have different subjects, tissue types, phenotypes and numbers of samples?
See above for most of this. The different layers shouldn't have data on completely different sets of individuals, otherwise they would just be separate datasets. However, some dataset layers may be incomplete for some individuals.
- may be multiple diseases, species, tissue types, platforms
- what does it mean for a dataset to be "posted"?
Aren't different layers posted at different times?
"Posted" I think is "last modified". Different layers might be released at different times. I think this would be the exception for raw data, but in the short term at least it will be common for the QC layers to come in later. I think every time a dataset is modified, its posted date changes and its version number increases.

- Is the dataset "description" (seen in the Datasets screen) the same as the dataset "Overview" (seen in the DatasetMyers screen)?
No, the text on the datasets screen would be some text describing the types of data Sage is hosting.

- should 'posted' and 'curated' time stamps be associated w/ Datasets or with dataset layers?
Again, probably the layers, with information "rolled up" to the dataset.

- Is dataset "download availability" = (release date != null)?
I think so... not exactly sure of the question here.
- Re "Release Notes: 3", what does it mean to "release"? Can a DS be released more than once? Does it mean another sample has been added to the data set, that the DS has been updated, or something else?
Probably most commonly means a new layer added (especially a QC layer), new samples also possible. Any change at all to the dataset needs to be versioned since a goal is to be able to reproduce analysis, we have to know exactly what data was available when a particular analysis was run, even if the data set changes later.
Practically, I don't see a large number of versions of a dataset occurring. This will be an infrequent event and many datasets may only have a v1 release, but I think we still need to code for it.
- What is a "contributor" -- someone who uploads data? Someone who analyses the data? Is it something more restrictive, like the PI of the experiment who generated the data? What should be the granularity of 'contribution' -- by the sample, the layer, the data set, some sort of ds revision, ...?
The contributor is the person who provided us the data, most likely the PI of the lab that generated it. It's the person we thank profusely for his contribution, and encourage users of the data to cite when they publish work that uses the data. The data is actually sent to us by some grad student or post doc in the contributor's lab.
I've been seeing one contributor for the dataset in the examples I've seen, though I guess multiple contributors could be possible.
- In the DatasetMyers screen, what does "Modified" mean? Is it the latest date something changed in the project? If so, what are the things whose modifications need to be tracked? Do they also need to be versioned?
Something changed in the curated data set layer. We're not in a project context here.
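
The reproducibility requirement from the "release" answer above (any change to a dataset is versioned, so an analysis can point at exactly the data it ran on) might be sketched as an append-only history. All names here are hypothetical, and the actual storage mechanism is undecided:

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DatasetVersion:
    dataset: str
    version: int
    layers: Tuple[str, ...]  # the layers available in this release


class DatasetHistory:
    """Append-only list of releases; old versions are never modified."""

    def __init__(self, dataset: str, layers: Iterable[str]) -> None:
        self._versions: List[DatasetVersion] = [DatasetVersion(dataset, 1, tuple(layers))]

    @property
    def latest(self) -> DatasetVersion:
        return self._versions[-1]

    def release(self, layers: Iterable[str]) -> DatasetVersion:
        """Record a new release, e.g. after a QC layer is added."""
        new = DatasetVersion(self.latest.dataset, self.latest.version + 1, tuple(layers))
        self._versions.append(new)
        return new

    def get(self, version: int) -> DatasetVersion:
        """Look up the exact release an earlier analysis was run against."""
        return self._versions[version - 1]
```

An analysis would store the version number it ran against; a later release of the dataset then leaves that earlier version retrievable unchanged.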

Open Questions:

- are the attributes for a dataset well defined (species, tissue type, ...) or should the list be open ended?

- what does "suggest a project" mean? ("Send Stephen an email"?)

- Is it OK that a project doesn't have the following, but rather inherits it from its data sets?
diseases / study areas
# of followers
last activity

- what are the 'activities' that need to be tracked in the Project History, e.g.
- create a workstream in a project
- add a layer to a dataset that's used in a project
- add a sample to a dataset (layer?) that's used in a project
- grant access to a document to a Project's group
- grant access to a dataset to a Project's group
- send an email to a group associated with a project (probably too minor...)

- are there any objects to be 'versioned' besides data sets, layers, and scripts/algorithms?

- If the start data for a workstream changes late in a workstream's progress, is the change reflected in an update to the current workstream (like changes propagating through an xls sheet) or rather in a new workstream (i.e. the workstream is tied to a particular revision of each input data layer)?

- Are workstreams 'versioned'? If so, what constitutes a revision: a change in algorithm, a change in data, both, ...?

- can you Follow something that's not versioned? What events are there besides revisions?

- can document versioning be delegated to the document collaboration system (Google Apps)?

- how is 'last modified' defined for a workstream?

- In the workstream example, is "Created" really defined for an analysis (e.g. "Correlation Network Analysis") or just for the analysis output ("Network")?

- Is every data set and layer in a project in some workstream, or can a project have data sets and data layers that aren't in any workstream?
(My guess is the latter since a project can have new, unused data, and perhaps 'scratch' analyses.)

- Are workstream steps 'predefined' or are they created during analysis? E.g. do you have blank steps in a workstream while you're waiting for an analysis to be continued?

- The workstream steps seem to include the operations as well as the derived data layers. Is this right?
Isn't there a missing step between "Network" and "Gene Lists" that says how the gene lists were selected from the network?

- Can a workstream branch or is it purely linear?

- Is normalized data considered a distinct layer from the raw data it uses as input?

- It seems to me that a "Workstream step" is a specific kind of "Analysis Result". Is that right?

- What are the valid "statuses" for an algorithm, besides "Unpublished"?

- What can be published besides data sets, data layers, and scripts/algorithms?

- what does it mean to "Remove" Network1 or GeneList1?

- "Submit a network": Would there have to be a parent dataset (existing or created upon the network's submission)?

- Networks page: The "Project" of a network could be the Project owning the workstream whose step created the network or, if the network was not created as a workstream step, then a comma delimited list of all Projects using the dataset which is the parent of the network.
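
If the rule proposed in this item were adopted, it could be expressed along these lines. The function and parameter names are made up for illustration, and the question itself remains open:

```python
from typing import List, Optional


def network_project_label(workstream_project: Optional[str],
                          parent_dataset_projects: List[str]) -> str:
    """Text for the "Project" column on the Networks page.

    workstream_project: Project owning the workstream whose step created the
        network, or None if the network was not created as a workstream step.
    parent_dataset_projects: Projects using the network's parent dataset.
    """
    if workstream_project is not None:
        return workstream_project
    return ", ".join(parent_dataset_projects)
```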

- Should the Platform have "Scripts" as a generalization of "algorithm" or are all scripts that can process data considered algorithms?

- What are the types of access to resources by users, e.g. is there 'read-only' access to a resource for some groups and read-write for others? Is there a separate permission to 'publish'?

- What are the possible statuses for a dataset?