The data collection lifecycle describes the following steps in a study:
performance of a scheduled assessment → upload → export → indexing for analysis → post-processing (including validation by Bridge)
We have discussed a few aspects of how this might work:
How data is grouped during upload is not important to scientists, as long as it is grouped in a useful manner when it is made available to them (this does not necessarily mean grouped together physically… an index between grouping concepts and files, like an index of the files from a single assessment, is okay). As one scientist said during one of our meetings: "colocation is the biggest win for researchers." Having said that, the only way we may be able to provide colocation is by ensuring that data from a particular bounded unit (an assessment, a session instance) is stored together.
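As a rough sketch of that index idea (the field names here are hypothetical, not an agreed-upon schema), an index record might tie one bounded unit to the files it produced, wherever those files physically live:

```python
# Hypothetical index record linking one bounded unit (an assessment
# performed within a session instance) to the files it produced.
# Field names are illustrative only, not an agreed-upon schema.
assessment_index_record = {
    "sessionInstanceGuid": "run-7f3a",       # the "run" id discussed below
    "assessmentInstanceGuid": "asmt-19c2",
    "assessmentId": "number-match",
    "files": [
        "answers.json",
        "motion.json",
        "taskData.json",
    ],
}
```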
The semantics of the data generated by an assessment should be documented as part of the assessment’s development. Documentation support is being built into the design of assessments, and that documentation can then be made public or shared for other studies to use.
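A rough sketch of what that documentation could look like, with entirely hypothetical field names: each file the assessment produces is described alongside its meaning, and the whole description can be published with the assessment.

```python
# Hypothetical, illustrative only: documentation of an assessment's
# output files, authored as part of the assessment's development and
# publishable for other studies to reuse.
assessment_doc = {
    "identifier": "number-match",
    "revision": 3,
    "files": [
        {"name": "taskData.json",
         "description": "Per-trial responses and reaction times."},
        {"name": "motion.json",
         "description": "Raw accelerometer and gyroscope samples."},
    ],
}
```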
Syntactic validation will probably center on some simple checks of the ZIP file, the ZIP file entries, and the file types in an upload (for example, does a file exist, does it have some bytes in it). This processing would happen on the Bridge worker platform as part of post-processing, with errors reported back to the app development team.
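A minimal sketch of those checks, assuming the upload arrives as a ZIP archive; the expected entry names are illustrative, not a fixed contract:

```python
import io
import zipfile

# Entry names expected in an upload; illustrative only, not a fixed contract.
EXPECTED_ENTRIES = {"metadata.json", "taskData.json"}

def validate_upload(zip_bytes: bytes) -> list[str]:
    """Return a list of syntactic errors; an empty list means the upload passed."""
    errors = []
    try:
        archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
    except zipfile.BadZipFile:
        return ["upload is not a readable ZIP file"]
    names = set(archive.namelist())
    for missing in sorted(EXPECTED_ENTRIES - names):
        errors.append(f"missing entry: {missing}")
    for info in archive.infolist():
        if info.file_size == 0:
            errors.append(f"entry is empty: {info.filename}")
    return errors
```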
Reports also need documentation, including some validation support, such as providing schemas that client developers can use to validate their report data.
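One way to provide that support, sketched here with the jsonschema package and a made-up report shape (the fields are assumptions, not a real report definition):

```python
from typing import Optional

from jsonschema import ValidationError, validate

# Hypothetical schema that could be published alongside a report definition
# so client developers can validate report data before writing it.
REPORT_SCHEMA = {
    "type": "object",
    "required": ["reportDate", "score"],
    "properties": {
        "reportDate": {"type": "string"},
        "score": {"type": "number", "minimum": 0},
    },
}

def check_report(report: dict) -> Optional[str]:
    """Return an error message, or None if the report matches the schema."""
    try:
        validate(instance=report, schema=REPORT_SCHEMA)
    except ValidationError as err:
        return err.message
    return None
```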
A lifecycle for assessments might help, for example, in determining whether or not to validate (though I doubt this, since we also have examples of needing to troubleshoot in production, e.g. running code in a simulator; in that environment some files might be missing, and that shouldn’t generate spurious validation errors).
In addition to these concerns, we can ask how data has to be grouped…
Data from a single assessment should ideally be in a single upload. Assuming we want this, study designers can create new assessments out of existing assessments, and we’ll need to know what the structure of the ZIP file (and of any related report) should look like when this happens.
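One possible convention, offered only as a guess at what that structure might be: each component assessment keeps its own entry prefix inside the combined ZIP, so the per-assessment structure survives.

```python
# Hypothetical layout for a combined assessment's upload: each component
# assessment keeps its own entry prefix inside the single ZIP.
COMBINED_ZIP_ENTRIES = [
    "metadata.json",               # metadata for the combined assessment
    "number-match/taskData.json",  # files from the first component assessment
    "number-match/motion.json",
    "spelling/taskData.json",      # files from the second component assessment
    "spelling/answers.json",
]
```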
Data from a session instance should be identified by a common session instance identifier (currently called a “run” id).
There are other ways that researchers may want to access data: by participant, by study arm, by demographic characteristics, by protocol, or by type of assessment. In essence we want to post-process the data so it is “indexed” by the metadata characteristics of the data files. This would give access to the session instance relationship, among other things.
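A sketch of that kind of indexing (the metadata fields and file paths are invented for illustration): group the exported files by whichever metadata field a researcher wants to query on.

```python
from collections import defaultdict

# Invented metadata records for two exported files, for illustration only.
FILE_METADATA = [
    {"file": "u1/run-7f3a/asmt-19c2.zip", "participant": "u1",
     "studyArm": "control", "assessmentId": "number-match",
     "sessionInstanceGuid": "run-7f3a"},
    {"file": "u2/run-88d1/asmt-40aa.zip", "participant": "u2",
     "studyArm": "treatment", "assessmentId": "number-match",
     "sessionInstanceGuid": "run-88d1"},
]

def build_index(records, key):
    """Group file paths by the value of one metadata field."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record["file"])
    return dict(index)

by_assessment = build_index(FILE_METADATA, "assessmentId")
by_session_instance = build_index(FILE_METADATA, "sessionInstanceGuid")
```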
Data needs to go into projects in Synapse. We also have a question as to how the data should be divided between projects.
Dwayne has suggested a model where one app context goes to one Project, but the data is hierarchically accessible(?), with the default protocol saving data at the top level of a virtual filesystem on S3 and other protocols saving data to subdirectories of that main directory. This bears a relationship to how we currently allow access to data through substudy associations, and it is similar to what we will do with organizations and protocols.
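As I understand that suggestion, the path construction might look something like the sketch below (the app and protocol identifiers are made up):

```python
def data_prefix(app_id: str, protocol_id: str, default_protocol_id: str) -> str:
    """Where a protocol's data would land in the app's virtual filesystem."""
    if protocol_id == default_protocol_id:
        return f"{app_id}/"                # default protocol: top level
    return f"{app_id}/{protocol_id}/"      # other protocols: a subdirectory

data_prefix("example-app", "default", "default")     # -> "example-app/"
data_prefix("example-app", "protocol-b", "default")  # -> "example-app/protocol-b/"
```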