OBSOLETE
This is the data model implied by the wire frame Platform UI found at http://pixeltheoryinc.com/clients/sage/platform/
circa Dec. 2010
The source file for the image is attached, and the tool for editing it can be freely downloaded: http://www.umlet.com/
...
"Posted" means uploaded to the Platform.
"Published" (for something posted) means to be made available to all Platform users. "Release" is synonymous with "Publish".
"Downloadability" is synonymous with read-access.
A Dataset is associated with a Project in one of two ways: (1) it is accessible by a Group created for a project team, or (2) it is used in a Workstream owned by the Project. The latter is the definition of a Project "using" a dataset.
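The two kinds of association can be sketched as follows. This is only an illustrative sketch, not the Platform's schema: the class and field names (`team_group`, `workstream_datasets`, `readable_by_groups`) are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    name: str
    team_group: str  # hypothetical: the Group created for the project team
    # hypothetical: ids of datasets used in Workstreams owned by this Project
    workstream_datasets: set = field(default_factory=set)


@dataclass
class Dataset:
    dataset_id: str
    # hypothetical: names of Groups that have access to this dataset
    readable_by_groups: set = field(default_factory=set)


def project_uses(project, dataset):
    """A Project 'uses' a Dataset only when one of its Workstreams does."""
    return dataset.dataset_id in project.workstream_datasets


def is_associated(project, dataset):
    """Associated = accessible by the team Group, or used in a Workstream."""
    group_access = project.team_group in dataset.readable_by_groups
    return group_access or project_uses(project, dataset)
```

Under these assumptions, a dataset that is only readable by the team Group is associated with the Project but not "used" by it.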
Access control is on Projects, Datasets (incl. Dataset Layers and Analysis Results), and Scripts/Algorithms. (Note: there is no such thing as a 'private' resource. Scripts and data are always available at least to the project team, and later to the public.)
Versioning is on Data Layers (including Analysis Results, Networks, Gene Lists, and *maybe* (TBD) Workstreams or Workstream Steps) and Scripts/Algorithms. A Dataset inherits the versions of its layers, e.g. when a layer is revised the version of the Dataset is (conceptually) incremented. (Note: a likely design decision is to delegate document versioning to Google Apps and version control systems.)
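One way to picture a Dataset inheriting the versions of its layers is sketched below. The integer counters and the `revise_layer` method are assumptions made for illustration; the notes above do not say how versions are actually stored.

```python
from dataclasses import dataclass, field


@dataclass
class DataLayer:
    name: str
    version: int = 1  # assumption: versions are simple integer counters


@dataclass
class Dataset:
    name: str
    layers: dict = field(default_factory=dict)  # layer name -> DataLayer
    version: int = 1

    def revise_layer(self, layer_name):
        """Revise one layer; the Dataset version is incremented with it."""
        layer = self.layers[layer_name]
        layer.version += 1
        self.version += 1  # the Dataset (conceptually) inherits the revision
        return self.version
```

For example, revising the only layer of a fresh Dataset moves both the layer and the Dataset from version 1 to version 2.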
Following applies to Projects, Workstreams, and Datasets, as detailed in this page: http://pixeltheoryinc.com/clients/sage/platform/Profile.html
Commenting can be done on Datasets, Dataset Layers, Scripts/Algorithms, Projects, and Workstreams.
Study size of a Dataset is the number of subjects involved in the study. (Note this is not equivalent to the number of samples, which may be more.)
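The difference between subjects and samples can be made concrete with a small sketch. It assumes, hypothetically, that each Data Layer lists its samples as `(sample_id, subject_id)` pairs; neither that representation nor the function names come from the wireframes.

```python
def study_size(layers):
    """Count distinct subjects across the samples of all layers.

    layers: mapping of layer name -> iterable of (sample_id, subject_id).
    """
    subjects = set()
    for samples in layers.values():
        for _sample_id, subject_id in samples:
            subjects.add(subject_id)
    return len(subjects)


def sample_count(layers):
    """Count distinct samples across all layers, for comparison."""
    return len({sid for samples in layers.values() for sid, _ in samples})


layers = {
    "expression": [("s1", "subjA"), ("s2", "subjA"), ("s3", "subjB")],
    "genotype": [("s4", "subjB"), ("s5", "subjC")],
}
print(study_size(layers), sample_count(layers))
```

Here three subjects contribute five samples, so the study size is smaller than the sample count.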
A Dataset may have a single "study area / disease" attribute. (This is TBD.)
Answered Questions:
- are data-sets associated with projects? if so, is the association optional?
Yes and yes.
Can a ds be associated with multiple projects?
Yes. Idea is there is a global listing of datasets which may be public, or restricted to certain groups / individuals. When you browse the datasets tab you are browsing this global, Sage-curated and approved library of datasets.
We have also heard the need for project teams to upload their own data sets, which would be limited in scope to the project. Idea is to give users the ability to quickly start working with data without being gated by curation. Users could then "publish" the data from the project to the global library, which requires the same curation process as any other data (although hopefully the project teams get it in reasonable shape to begin with.)
...
- Is dataset "download availability" = (release date != null)?
I think so... not exactly sure of the question here.
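If that tentative reading holds, availability to all Platform users could be derived from the release date rather than stored separately. The sketch below assumes a nullable `release_date` field; the property name is made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    name: str
    release_date: Optional[date] = None  # None until the dataset is released

    @property
    def is_downloadable_by_all(self) -> bool:
        """Derived flag: released (published) datasets are readable by all."""
        return self.release_date is not None
```

Access granted to specific Groups before release would need separate handling and is not covered by this sketch.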
- Re "Release Notes: 3", what does it mean to "release", can a DS be released more than
once? Does it mean another sample has been added to the data set, that the DS has
been updated, or something else?
Probably most commonly means a new layer added (especially a QC layer); new samples are also possible. Any change at all to the dataset needs to be versioned: since a goal is to be able to reproduce analysis, we have to know exactly what data was available when a particular analysis was run, even if the data set changes later.
Practically, I don't see a large number of versions of a dataset occurring. This will be an infrequent event and many datasets may only have a v1 release, but I think we still need to code for it.
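A minimal sketch of how an analysis might stay reproducible is to have each run record the exact dataset version it read. All names below (`DatasetVersion`, `DatasetHistory`, `AnalysisRun`) are hypothetical, and versions are assumed to be immutable snapshots numbered from 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    dataset: str
    version: int
    layers: tuple  # names of the layers available in this version


class DatasetHistory:
    """Append-only list of released versions of one dataset."""

    def __init__(self, dataset, layers):
        self.dataset = dataset
        self._versions = [DatasetVersion(dataset, 1, tuple(layers))]

    def release(self, layers):
        """Record a new version, e.g. after a QC layer is added."""
        number = len(self._versions) + 1
        new_version = DatasetVersion(self.dataset, number, tuple(layers))
        self._versions.append(new_version)
        return new_version

    def latest(self):
        return self._versions[-1]

    def get(self, version):
        return self._versions[version - 1]


@dataclass(frozen=True)
class AnalysisRun:
    name: str
    dataset_version: DatasetVersion  # pinned at the time of the run
```

With this shape, a later release does not alter what an earlier run refers to, and the common case of a dataset with only a v1 release costs nothing extra.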
- What is a "contributor" -- Someone who uploads data? someone who analyses the data?
Is it something more restrictive, like the PI of the experiment who generated the data?
What should be the granularity of 'contribution' -- by the sample, the layer, the
data set, some sort of ds revision, ...?
The contributor is the person who provided us the data, most likely the PI of the lab that generated it. It's the person we thank profusely for his contribution, and encourage users of the data to cite when they publish work that uses the data. The data is actually sent to us by some grad student or post doc in the contributor's lab.
I've been seeing one contributor for the dataset in the examples I've seen, though I guess multiple contributors could be possible.
- In the DatasetMyers screen, what does "Modified" mean? Is it the latest date
something changed in the project? If so, what are the things whose modifications
need to be tracked? Do they also need to be versioned?
Something changed in the curated data set layer. We're not in a project context here.
Open Questions:
- are data-sets associated with projects? if so, is the association optional?
Can a ds be associated with multiple projects?
(Note the "DatasetMyers" screen says "Projects Using this Dataset" suggesting
there's a "uses" relationship bet. projects and ds's.)
- what does "suggest a dataset" mean?
- do species and tissuetype go in the dataset or in the layer?
same Q for StudySize
that is, is it correct to say that a dataset specifies the
subjects, tissue types, etc., and the layers represent
the assays (GE, GT, sequencing)? Or can different
layers have different subjects, tissue types, phenotypes
and numbers of samples?
- may be multiple diseases, species, tissue types, platforms
- what does it mean for a dataset to be "posted"?
Aren't different layers posted at different times?
- Is the dataset "description" (seen in the Datasets screen) the same as the
dataset "Overview" (see in the DatasetMyers screen)?
- What does "Data type: Clinical phenotypes" mean as associated with a data *set*?
- should 'posted' and 'curated' time stamps be associated w/ Datasets or
with dataset layers?
- Is dataset "download availability" = (release date != null)?
- Re "Release Notes: 3", what does it mean to "release", can a DS be released more than
once? Does it mean another sample has been added to the data set, that the DS has
been updated, or something else?
- What is a "contributor" -- Someone who uploads data? someone who analyses the data?
Is it something more restrictive, like the PI of the experiment who generated the data?
What should be the granularity of 'contribution' -- by the sample, the layer, the
data set, some sort of ds revision, ...?
- In the DatasetMyers screen, what does "Modified" mean? Is it the latest date
something changed in the project? If so, what are the things whose modifications
need to be tracked? Do they also need to be versioned?
- Are the attributes for a dataset well defined (species, tissue type, ...) or should the list be open ended?
- what does "suggest a project" mean? ("Send Stephen an email"?)
- Is it OK that a project doesn't have the following, but rather inherits it from its data sets?
diseases / study areas
# of followers
last activity
- what are the 'activities' that need to be tracked in the Project History, e.g.
- create a workstream in a project
- add a layer to a dataset that's used in a project
- add a sample to a dataset (layer?) that's used in a project
- grant access to a document to a Project's group
- grant access to a dataset to a Project's group
- send an email to a group associated with a project (probably too minor...)
- are there any objects to be 'versioned' besides data sets, layers, and scripts/algorithms?
- If the start data for a workstream changes late in a workstream's progress, is the change reflected in an update to the current workstream (like changes propagating through a xls sheet) or rather in a new workstream (i.e. the workstream is tied to a particular revision of each input data layer)?
- Are workstreams 'versioned'? If so, what constitutes a revision: a change in algorithm, a change in data, both, ...?
- can you Follow something that's not versioned? What events are there besides revisions?
- can document versioning be delegated to the document collaboration system (Google Apps)?
- how is 'last modified' defined for a workstream?
- In the workstream example, is "Created" really defined for an analysis (e.g. "Correlation Network Analysis") or just for the analysis output ("Network")?
- Is every data set and layer in a project in some workstream, or can a project have ds,dl's that aren't in any workstream?
(My guess is the latter since a project can have new, unused data, and perhaps 'scratch' analyses.)
- Are workstream steps 'predefined' or are they created during analysis? E.g. do you have blank steps in a workstream
while you're waiting for an analysis to be continued?
...
- What are the valid "statuses" for an algorithm, besides "Unpublished"?
- What can be published besides data sets, data layers, and scripts/algorithms?
- what does it mean to "Remove" Network1 or GeneList1?
...
- Networks page: The "Project" of a network could be the Project owning the workstream whose step
created the network or, if the network was not created as a workstream step, then a comma
delimited list of all Projects using the dataset which is the parent of the network.
- Should the Platform have "Scripts" as a generalization of "algorithm" or are all scripts that can process
data considered algorithms?
- What are the types of access to resources by users, e.g. is there 'read-only' access to a resource for some groups and read-write for others? Is there a separate permission to 'publish'?
- What are the possible statuses for a dataset?