This is the data model implied by the wireframe Platform UI found at http://pixeltheoryinc.com/clients/sage/platform/
circa Dec. 2010.
The source file for the image is attached, and the tool for editing it can be freely downloaded: http://www.umlet.com/



 

Notes / Assumptions:

The data model depicted here may well span multiple data stores, including (1) a core data store to track projects, data sets, revisions, etc., (2) an authentication/authorization system to track users and groups, (3) a document collaboration and discussion thread system (e.g. Google Apps / Google Groups), and (4) a file-based system for molecular-profiling data.

"Posted" means uploaded to the Platform.  "Published" (for something posted) means to be made available to all Platform users.

A Dataset is associated with a Project in one of two ways: (1) it is accessible by a Group created for a project team, or (2) it is used in a Workflow owned by the Project.  The latter is the definition of a Project "using" a dataset.
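
The two association routes might be sketched as follows. This is only an illustrative Python sketch: the class and field names (`Group`, `Workflow`, `Project`, `accessible_dataset_ids`, `dataset_ids`) are assumptions made for the example and do not come from the wireframe or from any existing schema.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    # IDs of datasets this group may access (hypothetical field)
    accessible_dataset_ids: set = field(default_factory=set)


@dataclass
class Workflow:
    name: str
    # IDs of datasets consumed by this workflow (hypothetical field)
    dataset_ids: set = field(default_factory=set)


@dataclass
class Project:
    name: str
    team: Group
    workflows: list = field(default_factory=list)

    def uses(self, dataset_id: str) -> bool:
        """A Project "uses" a dataset only if one of its Workflows does."""
        return any(dataset_id in w.dataset_ids for w in self.workflows)

    def is_associated_with(self, dataset_id: str) -> bool:
        """Associated via team-group access (1) or via workflow use (2)."""
        return (
            dataset_id in self.team.accessible_dataset_ids
            or self.uses(dataset_id)
        )
```

Note that under this reading a dataset can be associated with a Project (through the team Group) without the Project "using" it.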

...

Versioning is on Data Layers (including Analysis Results, Networks, Gene Lists, Workstream Steps) and Scripts/Algorithms.  A Dataset inherits the versions of its layers, e.g. when a layer is revised the version of the Dataset is (conceptually) incremented.
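
One way to read "inherits the versions of its layers" is that the Dataset version is derived from its layers rather than stored separately. The sketch below assumes exactly that; the `Layer` and `Dataset` classes, the `revise` method, and the counting rule are hypothetical, and the sketch covers only layer revisions (not, for example, adding a new layer).

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    name: str
    version: int = 1  # every layer starts at version 1 (assumption)

    def revise(self) -> None:
        """Record a revision of this layer."""
        self.version += 1


@dataclass
class Dataset:
    name: str
    layers: list = field(default_factory=list)

    @property
    def version(self) -> int:
        # Derived, not stored: 1 plus the total number of layer revisions,
        # so revising any layer conceptually increments the dataset version.
        return 1 + sum(layer.version - 1 for layer in self.layers)
```

For example, a dataset with two unrevised layers is at version 1; revising one of them once moves the dataset to version 2.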

The following applies to Projects, Workstreams, and Datasets.

...

that is, is it correct to say that a dataset specifies the
 subjects, tissue types, etc., and the layers represent
 the assays (GE, GT, sequencing)?  Or can different
 layers have different subjects, tissue types, phenotypes
 and numbers of samples?
See above for most of this.  The different layers shouldn't have data on completely different sets of individuals, otherwise they would just be separate datasets.  However, some dataset layers may be incomplete for some individuals.
- may be multiple diseases, species, tissue types, platforms
- what does it mean for a dataset to be "posted"?
 Aren't different layers posted at different times?
"Posted", I think, is "last modified".  Different layers might be released at different times.  I think this would be the exception for raw data, but in the short term at least it will be common for the QC layers to come in later.  I think every time a dataset is modified, its posted date changes and its version number increases.

- Is the dataset "description" (seen in the Datasets screen) the same as the
dataset "Overview" (seen in the DatasetMyers screen)?
No, the text on the datasets screen would be some text describing the types of data Sage is hosting.

- What does "Data type: Clinical phenotypes" mean as associated with a data *set*?
Not sure what you're asking about here.
- should 'posted' and 'curated' time stamps be associated w/ Datasets or
 with dataset layers?
Again, probably the layers, with information "rolled up" to the dataset.
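
A rough sketch of such a roll-up, assuming that "rolled up" means the dataset reports the most recent date found on any of its layers (that rule, and the `Layer` and `Dataset` names and fields, are assumptions for illustration only):

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Layer:
    name: str
    posted: Optional[date] = None   # None = not yet posted
    curated: Optional[date] = None  # None = not yet curated


@dataclass
class Dataset:
    layers: list = field(default_factory=list)

    def _latest(self, attr: str) -> Optional[date]:
        # Collect the non-empty dates for this attribute across all layers.
        dates = [
            getattr(layer, attr)
            for layer in self.layers
            if getattr(layer, attr) is not None
        ]
        return max(dates) if dates else None

    @property
    def posted(self) -> Optional[date]:
        """Most recent 'posted' date among the layers, if any."""
        return self._latest("posted")

    @property
    def curated(self) -> Optional[date]:
        """Most recent 'curated' date among the layers, if any."""
        return self._latest("curated")
```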

- Is dataset "download availability" = (release date != null)?
I think so... not exactly sure of the question here.
- Re "Release Notes: 3", what does it mean to "release", can a DS be released more than
once?  Does it mean another sample has been added to the data set, that the DS has
been updated, or something else?
Probably most commonly it means a new layer was added (especially a QC layer); new samples are also possible.  Any change at all to the dataset needs to be versioned: since a goal is to be able to reproduce analyses, we have to know exactly what data was available when a particular analysis was run, even if the data set changes later.
Practically, I don't see a large number of versions of a dataset occurring.  This will be an infrequent event, and many datasets may only have a v1 release, but I think we still need to code for it.
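
To make the reproducibility requirement concrete, here is a minimal sketch in which each release is an immutable snapshot and an analysis run records the dataset version it used. All names here (`Release`, `Dataset.release`, `Dataset.at`, `AnalysisRun`) are hypothetical and chosen only for illustration; the actual design is still open.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Release:
    version: int
    # ((layer_name, layer_version), ...) as they stood at release time
    layer_versions: tuple
    notes: str = ""


@dataclass
class Dataset:
    name: str
    releases: list = field(default_factory=list)

    def release(self, layer_versions: dict, notes: str = "") -> Release:
        """Freeze the current layer versions as the next numbered release."""
        rel = Release(
            version=len(self.releases) + 1,
            layer_versions=tuple(sorted(layer_versions.items())),
            notes=notes,
        )
        self.releases.append(rel)
        return rel

    def at(self, version: int) -> Release:
        """Look up the snapshot for a given release number (1-based)."""
        return self.releases[version - 1]


@dataclass(frozen=True)
class AnalysisRun:
    dataset_name: str
    dataset_version: int  # pinned at run time, never updated afterwards
```

Because a run stores the version number rather than a reference to "the current dataset", a later release (say, one that adds a QC layer) does not change what `at(run.dataset_version)` returns for an earlier run.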
- What is a "contributor" -- Someone who uploads data? Someone who analyzes the data?
Is it something more restrictive, like the PI of the experiment who generated the data?
What should be the granularity of 'contribution' -- by the sample, the layer, the
data set, some sort of ds revision, ...?
The contributor is the person who provided us the data, most likely the PI of the lab that generated it.   It's the person we thank profusely for his contribution, and encourage users of the data to cite when they publish work that uses the data.  The data is actually sent to us by some grad student or post doc in the contributor's lab.
The examples I've seen have one contributor per dataset, though I guess multiple contributors could be possible.
- In the DatasetMyers screen, what does "Modified" mean?  Is it the latest date
something changed in the project?  If so, what are the things whose modifications
need to be tracked?  Do they also need to be versioned?
Something changed in the curated data set layer.  We're not in a project context here.

...