Data Collection Lifecycle

The data collection lifecycle describes the following steps in a study:

performance of a scheduled assessment → upload → export → post-processing (including things like validation or indexing by Bridge)

We have discussed a few aspects of how this might work:

  • How data is grouped during upload is not important to scientists, as long as it is grouped in a useful manner when it is made available to them (this does not necessarily mean grouped together physically; an index between grouping concepts in the files, like an index of the files from a single assessment, would be okay). As Dan said during one of our meetings: "colocation is the biggest win for researchers." Having said that, the only way we may be able to provide colocation is by ensuring that data from a particular bounded unit (the assessment, a session instance) is stored together.

  • The semantics of data generated from an assessment should be documented as part of the assessment’s development. Documentation support is being built into the design of assessments, which can then be made public or shared for other studies to use.

  • Syntactic validation will probably center on some simple checks of the ZIP files, their entries, and the file types in an upload (for example: does a file exist, does it have some bytes in it?). This processing would happen on the Bridge worker platform as part of post-processing, with any errors sent back to the app development team (see the sketch after this list). Down the road it should be possible for studies to add custom validation to the post-processing pipeline for more specific needs.

  • Assessments also have reports that need to be documented, including some validation support, like providing schemas that client developers can use to validate their report data. Bridge would not know about this; it's for the use of client developers. The assessment API is being rewritten in part so that assessments can have server-managed state, similar to what the reports API is being used for now. Ideally (to me), the assessment history API would be used to hold assessment state and the reports API would be used to hold… reports.

  • A lifecycle for assessments might help, for example, in determining whether or not to validate (though I doubt this, since we also have examples of needing to troubleshoot in production, e.g. running code in a simulator; in that environment, some files might be missing, and that shouldn't generate spurious validation errors).
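To make the syntactic checks above concrete, here is a minimal sketch (in Java, assuming the worker has the upload as a ZIP file on disk and a list of expected entry names) of the kind of validation a Bridge worker could run. The class name, the error-list return value, and the specific rules are illustrative assumptions, not an existing Bridge API.

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Enumeration;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Hypothetical worker-side check: verify the upload is a readable ZIP, that the
    // expected entries are present, and that no file entry is empty.
    public class SyntacticValidator {

        public static List<String> validate(File upload, Set<String> expectedEntries) {
            List<String> errors = new ArrayList<>();
            Set<String> found = new HashSet<>();

            try (ZipFile zip = new ZipFile(upload)) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    found.add(entry.getName());
                    if (!entry.isDirectory() && entry.getSize() == 0) {
                        errors.add("Entry is empty: " + entry.getName());
                    }
                }
            } catch (IOException e) {
                errors.add("Upload is not a readable ZIP file: " + e.getMessage());
                return errors;
            }

            for (String name : expectedEntries) {
                if (!found.contains(name)) {
                    errors.add("Expected entry is missing: " + name);
                }
            }
            return errors; // an empty list means the upload passed syntactic validation
        }
    }

Returning a list of errors (rather than failing on the first problem) makes it easier to report everything wrong with an upload to the app development team at once.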

Validation

If you think of the various environments in which the client app can run as profiles, there are really only a couple of values that might change from profile to profile: whether or not a file is required, and whether or not it should be validated in some manner.

Simple profiles:

  • development

  • published

  • simulator

Things that can change by profile:

  • is it required?

  • what validation should occur (none, or something else)?
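A minimal sketch of how these profiles and per-file flags might be modeled (in Java; the names Profile, Validation, and FileRule, and the simulator defaults shown, are assumptions for illustration):

    import java.util.Map;

    public class ValidationProfiles {

        // The environments the client app can run in, treated as profiles.
        enum Profile { DEVELOPMENT, PUBLISHED, SIMULATOR }

        // What kind of validation should occur for a file, if any.
        enum Validation { NONE, SYNTACTIC }

        // The two values that can change by profile for a given file.
        record FileRule(boolean required, Validation validation) {}

        // Hypothetical rules for one file: in the simulator some files may be
        // missing, so nothing is required and nothing is validated there.
        static final Map<Profile, FileRule> MOTION_JSON_RULES = Map.of(
            Profile.DEVELOPMENT, new FileRule(true, Validation.SYNTACTIC),
            Profile.PUBLISHED,   new FileRule(true, Validation.SYNTACTIC),
            Profile.SIMULATOR,   new FileRule(false, Validation.NONE));
    }

Attaching a table like this to each file an assessment declares would let the same upload be validated strictly for published builds while tolerating missing files from the simulator.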

Export

In addition to these concerns, we can ask how data has to be grouped…

Data from a single assessment is ideally in a single upload. Assuming we want this, study designers can create new assessments out of old assessments, and we'll need to know what the structure of a zip file should look like when that happens (one possible layout is sketched below). (And the structure of a related report… though maybe the server doesn't need to know any of this if it's just storing the file.)
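Purely as an illustration of what a composite assessment's archive could look like, one possible convention is for each component assessment to keep the file structure it would have had as a standalone assessment, nested under a subdirectory named for its identifier, with a top-level metadata file describing the composition. None of the names below are a decided format.

    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical ZIP layout for a composite assessment, e.g. "walk-and-balance"
    // built from the existing "walk" and "balance" assessments:
    //   metadata.json          (composite id, component ids, schedule context)
    //   walk/motion.json       (the "walk" component's files, unchanged)
    //   walk/steps.csv
    //   balance/motion.json    (the "balance" component's files, unchanged)
    public class CompositeArchiveLayout {

        // Prefix a component's expected entry names with its identifier, so the
        // composite archive can be checked against each component's own expectations.
        public static List<String> entriesForComponent(String assessmentId, List<String> componentEntries) {
            return componentEntries.stream()
                .map(name -> assessmentId + "/" + name)
                .collect(Collectors.toList());
        }
    }

One appeal of a convention like this is that the documentation and validation written for a standalone assessment would still apply when it is reused as a component.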

Data from a session instance should be identified by a common session instance identifier.

There are other ways that researchers may want to group and access data: by participant; by study arm; by demographic characteristics; by protocol; by type of assessment. In essence we want to post-process the data so it is “indexed” by the metadata characteristics of the data files (a sketch follows). The session instance relationship would be covered by this, and the approach is flexible enough to cover the other groupings as well.
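A minimal sketch of the kind of metadata “index” entry an exporter could attach to each data file (for example as annotations in Synapse, or as rows in an index table). The field names are assumptions, not an agreed schema:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical metadata an exporter could attach to each exported file, so
    // researchers can query by participant, study arm, protocol, session instance,
    // assessment, etc., regardless of how the bytes are physically grouped.
    public class ExportIndexEntry {

        public static Map<String, String> annotationsFor(String participantId, String studyArm,
                String protocolId, String sessionInstanceId, String assessmentId) {
            Map<String, String> annotations = new LinkedHashMap<>();
            annotations.put("participantId", participantId);
            annotations.put("studyArm", studyArm);
            annotations.put("protocolId", protocolId);
            annotations.put("sessionInstanceId", sessionInstanceId);
            annotations.put("assessmentId", assessmentId);
            return annotations;
        }
    }

Researchers could then filter on any combination of these fields (including session instance) without caring how the underlying files are physically stored.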

Data needs to go into projects in Synapse, and there is a question of how the data should be divided between projects. Three possibilities we've discussed:

Data from an app to one project. This is similar to what we do now. Dwayne has suggested a model where the data is hierarchically accessible(?), with the default protocol saving data at the top level of a virtual S3 filesystem and other protocols saving data to subdirectories under the main directory (a possible key layout is sketched below). This bears a relationship to how we currently allow data access through a substudy: data not marked with a substudy is available as global data to global users, who can also see data marked with a substudy, while users of a substudy only see that substudy's data.

In fact, “global” organizational users are Sage employees or other primary administrators of an app, while other organizations are scoped to see only their own data. However, none of this is enforced by having all data in a single Synapse project.
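A sketch of that virtual-filesystem idea, assuming the exporter writes S3-style object keys; the exact key format is an assumption for illustration:

    // Hypothetical key layout for the "one project per app" option:
    //   <appId>/<fileName>                  for the default protocol
    //   <appId>/<protocolId>/<fileName>     for any other protocol
    public class ExportKeys {

        public static String keyFor(String appId, String protocolId, boolean isDefaultProtocol, String fileName) {
            return isDefaultProtocol
                ? appId + "/" + fileName
                : appId + "/" + protocolId + "/" + fileName;
        }
    }

Access scoping like the substudy behavior described above could then map onto key prefixes: global users read the whole <appId>/ prefix, while another organization's users are limited to their own protocols' subdirectories.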

Data from an organization to one project. Data from any protocol that is owned by an organization goes into a Synapse project for that organization and app. As all protocols will be associated with an organization, there's no “default” project (even though one protocol will be considered the default in certain scenarios, like a user who downloads the app from the App Store with no further instructions).

On the downside, if one organization wants another organization’s data, even the data of the main study, they’d have to ask for it. We could create a system where an organization could grant access to another organization, and copy uploaded data (or pointers to that data) into the projects for both those organizations. But that’s only a partial solution since many of those arrangements could be made after the data was exported to Synapse.

Data from a protocol to one project. This would be the most fine-grained separation of data from an app. Even if a single organization updated a protocol (the study was renewed with slight changes, or changes were made mid-study as part of an app release), the data would go to separate projects. I haven't thought much further about this option because it doesn't seem useful.