Upload Format v2

Goals

Primary goals:

  • Data exported to Synapse has a more researcher-friendly format

  • Health data (survey answers, sensor data, activity data) is submitted by apps in a more parseable, more portable format that is not specific to ResearchKit/AppCore.

Secondary Goals - Potential future improvements that we should do, but will not materially impact the design of the core goals of Upload Format v2:

  • Allow study developers to rapidly set up new studies with a shared library of surveys, activities, and schemas - Spun off into a separate design project: Shared Module Library

  • Validating within JSON blobs and CSVs, possibly using Open mHealth (http://www.openmhealth.org/) and/or FHIR (https://www.hl7.org/fhir/)

    • We'll want the option for validated JSON and unvalidated JSON. The former will be useful for importing from the shared module. The latter will be useful for rapid prototyping. In general, we as the platform should allow study developers to be as restrictive or as permissive as they'd like.

  • List of released app versions in Researcher Portal, in Synapse.

  • Test data should go to separate Synapse tables from production data. We'll want to key off data group and possibly app version.

    • This came up with both FPHS and with Android Mole Mapper.

Non-goals:

  • Removing schemas - Bridge will continue to need schemas for the foreseeable future. Schemas are needed so Bridge knows how to parse incoming data from apps (auto-detection has proven to be unreliable) and how to export data to Synapse.

Overview

We want to change both (a) the data format being exported to Synapse and (b) the data format and APIs used by apps to submit health data to Bridge. There is no strict requirement to keep or replace the current Upload Schema system. However, keeping the current Upload Schemas and making incremental changes to it will allow us to modify the back-end without affecting the front-end and vice versa.

Phase 1 is updating the Synapse export. Phase 2 is updating the app data format. Upload Schemas are the anchor to hold everything in place. We want to do the back-end before the front-end so that when we update the front-end, the back-end is already ready to receive the data.

Phase 1: Synapse Export

Subtask 1a: Field Types

Tracking JIRA: BRIDGE-1288

For an example of old and new field types in Synapse, see https://www.synapse.org/#!Synapse:syn5853757. A few notes:

  • There are 4 old attachment types and one new attachment type that differ in Synapse only by MIME type and file extension. For simplicity, only ATTACHMENT_BLOB and ATTACHMENT_V2 are represented, using the audio example to highlight the difference in file name and extension. (Synapse Web Client doesn't let me pick the MIME type, but that difference is largely quality-of-life anyway.)

  • Multi-Choice Int isn't represented in the example Synapse table, as this is no longer being supported. It doesn't look sufficiently different from Multi-Choice String to be interesting anyway.

  • Short Inline JSON and Strings look the same between v1 and v2; the only difference is the threshold for what's considered "short". As such, in the example Synapse table, we will only have one example of "short" for both v1 and v2.

  • The Synapse table doesn't have old examples of time and duration, as there are very few examples of time in the real world and no examples of duration. Instead, we only have the new examples to pin down what they should look like.

Type Overview

Conceptual Type | Bridge Survey Type | Bridge Upload Schema Type | Synapse Table Column Type
Boolean (unchanged) | dataType=BOOLEAN | BOOLEAN | BOOLEAN
Decimal (unchanged) | dataType=DECIMAL | FLOAT | DOUBLE
Int (unchanged) | dataType=INTEGER | INT | INTEGER
Attachment (any) (old) | N/A | ATTACHMENT_BLOB | FILEHANDLEID, mimeType=application/octet-stream
Attachment (CSV) (old) | N/A | ATTACHMENT_CSV | FILEHANDLEID, mimeType=text/csv
Attachment (JSON) (old) | N/A | ATTACHMENT_JSON_BLOB | FILEHANDLEID, mimeType=text/json
Attachment (JSON table) (old) | N/A | ATTACHMENT_JSON_TABLE | FILEHANDLEID, mimeType=text/json
Attachment (new) | N/A | ATTACHMENT_V2, mimeType=*, fileExtension=* | FILEHANDLEID, mimeType=*
Multi-Choice String (old) | MultiValueConstraints, allowMultiple=true, dataType=STRING | ATTACHMENT_JSON_BLOB | FILEHANDLEID, mimeType=text/json
Multi-Choice Int w/ > 20 enum values (old, no longer supported in v2) | MultiValueConstraints, allowMultiple=true, dataType!=STRING, enumeration.size > 20 | ATTACHMENT_JSON_BLOB | FILEHANDLEID, mimeType=text/json
Multi-Choice Int w/ <= 20 enum values (old, no longer supported in v2) | MultiValueConstraints, allowMultiple=true, dataType!=STRING, enumeration.size <= 20 | INLINE_JSON_BLOB | STRING, maxLength=100
Single-Choice (old) | MultiValueConstraints, allowMultiple=false | INLINE_JSON_BLOB | STRING, maxLength=100
Multi-Choice (new) | MultiValueConstraints, allowMultiple=true | MULTI_CHOICE | multiple BOOLEANs
Single-Choice (new) | MultiValueConstraints, allowMultiple=false | SINGLE_CHOICE, maxLength=* | STRING, maxLength=*
String, maxLength <= 100 (old) | dataType=STRING, maxLength <= 100 | STRING | STRING, maxLength=100
String, maxLength > 100 (old) | dataType=STRING, maxLength > 100 or maxLength not defined | ATTACHMENT_BLOB | FILEHANDLEID, mimeType=application/octet-stream
String, maxLength <= 1000 (new) | dataType=STRING, maxLength <= 1000 | STRING, maxLength=* | STRING, maxLength=*
String, maxLength > 1000 (new) | dataType=STRING, maxLength > 1000 or maxLength not defined | STRING, maxLength=* | BLOB
Inline JSON, maxLength <= 100 (old) | N/A | INLINE_JSON_BLOB | STRING, maxLength=100
Inline JSON, maxLength > 100 (old) | N/A | ATTACHMENT_BLOB | FILEHANDLEID, mimeType=application/octet-stream
Inline JSON, maxLength <= 1000 (new) | N/A | INLINE_JSON_BLOB, maxLength=* | STRING, maxLength=*
Inline JSON, maxLength > 1000 (new) | N/A | INLINE_JSON_BLOB, maxLength=* | BLOB
Date (YYYY-MM-DD) (old) | dataType=DATE | CALENDAR_DATE | STRING, maxLength=10
Date (YYYY-MM-DD) (new) | dataType=DATE | DATE_V2 | CALENDAR_DATE (because "DATE" is already used for date-time)
Date-Time (old) | dataType=DATETIME | TIMESTAMP | DATE (internally, this is epoch time in milliseconds)
Date-Time (new) | dataType=DATETIME | TIMESTAMP (changes to Bridge-EX) | DATE for timestamp + STRING maxLength=5 for timezone (+ZZZZ)
Time (old) | dataType=TIME | STRING | STRING, maxLength=100
Time (new) | dataType=TIME | TIME_V2 | STRING, maxLength=12 (hh:mm:ss.sss)
Duration (old) | dataType=DURATION | STRING | STRING, maxLength=100
Duration (new) | dataType=DURATION | DURATION_V2 | STRING, maxLength=24

Highlighted cells indicate a new feature in the type system.

Unchanged Types

Booleans, ints, and floats are straightforward and are unchanged between v1 and v2.

Attachments

In v1, we had multiple attachment types that mapped to specific MIME types. We had a MIME type for JSON and a MIME type for CSV, but not for other file types (notably audio files). In v2, we're consolidating all attachment types to ATTACHMENT_V2, which will have metadata for MIME type and for file extension. While MIME type and file extension don't affect the researcher's ability to download the data, having the correct MIME type and file extension provides "quality of life" improvements.

Concrete example, for the audio_audio.m4a file in the Voice Activity:

  • In v1, the Bridge type is ATTACHMENT_BLOB, and it's exported with MIME type application/octet-stream and file name audio_audio.mp4-[guid].tmp.

  • In v2, the Bridge type is ATTACHMENT_V2 w/ MIME type audio/mp4 and file extension m4a, and it's exported with MIME type audio/mp4 and file name audio_audio-[guid].m4a.

Multiple Choice

Terminology note: "Multi-Choice" refers to a multiple choice question where users can select multiple answers (equivalent to allowMultiple=true in the survey model). "Single-Choice" refers to a multiple choice question where users can only select a single answer (equivalent to allowMultiple=false in the survey model). This terminology comes from ResearchKit/AppCore.

In v1, Multi-Choice answers with string types could potentially be very long (notable example question: "What sports do you play?"; example response: [ "football", "fencing", "swimming", "running", "ballet" ]). In order to fit within Synapse table row width limits, we wrote these to a file handle. This is not researcher-friendly because you can't query on file handles, and downloading hundreds of small files is very inefficient. In v2, to make these queryable and to fit within row width limits, we're exporting this as multiple boolean columns, corresponding to whether the user selected the choice or not. (See below for example.)

Also, in v1, Multi-Choice answers were sometimes represented as ints, where the ints mapped to enum values. This only occurred in client-side hardcoded surveys and is considered not very useful, as ints generally won't have meaning to researchers if they can't map the answer to a value. This is no longer being directly supported, but is still indirectly supported as ints can be trivially converted to strings.

For Single-Choice questions, ResearchKit/AppCore submits answers as a JSON array with only one element. When we first implemented Uploads and Bridge-EX, there was no documentation on the ResearchKit/AppCore data formats, so we decided to touch the submitted JSON as little as possible. As such, Bridge would pass along this JSON array as is. (Example question: "What is your gender?"; ResearchKit/AppCore submits: [ "Male" ]; Bridge passes this along as is.) These extra brackets were inconvenient to researchers and were unnecessary, and now that we have a year's worth of experience working with the format, we believe we can remove the brackets altogether. In v2, same example question, ResearchKit/AppCore submits [ "Male" ]; Bridge passes along the value Male without quotes or brackets.
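The two v2 conversions described above can be sketched as follows. This is a minimal illustration, not Bridge-EX's actual code: the function names are hypothetical, and it assumes the submitted answer arrives as a JSON array string and that the list of possible choices is known from the schema.

```python
import json


def export_multi_choice(field_name, choices, raw_answer_json):
    """Expand a Multi-Choice answer into one boolean column per choice."""
    selected = set(json.loads(raw_answer_json))
    return {f"{field_name}.{choice}": choice in selected for choice in choices}


def export_single_choice(raw_answer_json):
    """Unwrap a one-element JSON array into its bare value (no quotes, no brackets)."""
    answer = json.loads(raw_answer_json)
    if isinstance(answer, list):
        if len(answer) != 1:
            raise ValueError("single-choice answer must have exactly one element")
        answer = answer[0]
    return str(answer)


multi_row = export_multi_choice(
    "multi-choice-new",
    ["football", "fencing", "swimming"],
    '[ "fencing", "swimming" ]',
)
print(multi_row)
print(export_single_choice('[ "Male" ]'))
```

Each choice becomes its own queryable column, so the row width depends on the number of choices rather than on the length of the answer text.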

Example table:

multi-choice-old | multi-choice-new.football | multi-choice-new.fencing | multi-choice-new.swimming | single-choice-old | single-choice-new
file handle: multi-choice-old-1234.tmp | false | true | true | [ "Male" ] | Male

Contents of multi-choice-old-1234.tmp: [ "fencing", "swimming" ]

Strings and Inline JSON

In v1, Strings had a max length of 100 chars. Anything longer than that had to be an Attachment (File Handle in Synapse). Researchers were expected to know the difference when setting up their schemas. This is a problem because (a) File Handles aren't queryable, and having hundreds of small File Handles is inefficient; (b) it's too easy for researchers to accidentally break themselves; and (c) if the string is always very short, we waste lots of bytes that aren't being used.

In v2, Bridge types will have metadata to know String length, so we can use a Synapse column size that is just big enough. (Note that Bridge survey string fields have a maxLength property, but Bridge schema string fields do not.) This allows us to scale up to 1000 chars or down to 1 char as needed. Also, for Strings longer than 1000 chars (but not more than, say, 1 MB), Synapse will have a new data type called a Blob which will fit this use case.

The same issues and changes are being applied to inline JSON. However, inline JSON was generally not used directly by researchers, but rather indirectly for multi-choice and single-choice survey questions.
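As a rough sketch of the v2 sizing rule for Strings and inline JSON, the column choice could look like the following. The function name and the returned dictionary keys are hypothetical and are not Synapse API names; the 1000-char threshold and the fallback to a Blob when maxLength is undefined come from the type overview above.

```python
SYNAPSE_STRING_LIMIT = 1000  # v2 threshold for what counts as "short"


def synapse_column_for_string(max_length=None):
    """Pick a column type for a STRING or INLINE_JSON_BLOB schema field."""
    if max_length is not None and max_length < 1:
        raise ValueError("maxLength must be positive")
    if max_length is None or max_length > SYNAPSE_STRING_LIMIT:
        # Too long (or unknown length) for a string column: use the proposed Blob type.
        return {"type": "BLOB"}
    # Use a column that is just big enough for the declared length.
    return {"type": "STRING", "maxLength": max_length}


print(synapse_column_for_string(24))
print(synapse_column_for_string(5000))
print(synapse_column_for_string())
```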

See also PLFM-3808.

Date and Time

There are several issues with how Surveys, Schemas, and Synapse tables currently handle date and time.

  • The terminology is inconsistent across all three, so we're standardizing on a consistent set of terms: "date", "date-time", and "time".

  • Upload schemas don't support times or durations.

  • Synapse tables don't support dates, times, or durations. Additionally, date-times are represented as epoch times, which means they lose timezone information. Also, the Synapse Web Client represents date-times in one format and queries them using another format, and neither format is ISO 8601.

Additional details below:

Date: Also called calendar date. A date is an abstract date without a timezone. It's used to represent moments in time where the exact time and time zone are ambiguous, unknown, or unnecessary. For example: birthdates, holidays, diagnosis dates, release dates, etc. All systems should use the ISO 8601 date format YYYY-MM-DD. Concrete example: 2016-04-01.

Dates can't be converted into date-times and vice versa. A date-time can be represented as 2 (or potentially 3) dates, depending on the time zone and semantics. A date can only be represented as a date-time by choosing an arbitrary time and time zone. Problems can occur when a date is converted to a date-time using one set of semantics, then converted back to a date using a different set of semantics. This is frequently a problem when one system uses the local timezone but another system forces UTC. To avoid these problems and ambiguities, we should avoid converting between dates and date-times.
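To make the round-trip problem concrete, here is a small sketch using only the Python standard library. The two "systems" are hypothetical: one stores a date as midnight UTC, the other reads the date-time back in a fixed -0700 offset.

```python
from datetime import date, datetime, time, timedelta, timezone

original = date(2016, 4, 1)

# System A converts the date to a date-time by choosing midnight UTC.
as_date_time = datetime.combine(original, time(0, 0), tzinfo=timezone.utc)

# System B reads the date-time in a -0700 local offset and takes the date part.
local_offset = timezone(timedelta(hours=-7))
round_tripped = as_date_time.astimezone(local_offset).date()

print(original, "->", as_date_time.isoformat(), "->", round_tripped)
```

The round-tripped value is one day earlier than the original, which is the kind of silent drift that keeping dates and date-times separate avoids.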

Synapse currently doesn't have a native calendar date column type, but there is a use case for this outside of Bridge. We can bake the design for a calendar date column in Synapse into this design and use Bridge as the driver for this feature. See also PLFM-3816.

Date-Time: Also called timestamp. A date-time is a specific moment in time with a timezone. It's generally used to represent things where both the date and relative time of day are important. For example: activity start and end times, medication times, etc. All systems should use the ISO 8601 date-time format YYYY-MM-DDThh:mm:ss.sss+zzzz as it is both human readable and machine readable. Concrete example: 2016-04-01T16:15:00.000-0700. Variations on time and time zone (such as 2016-04-01T23:15Z) are also acceptable.

Aside from representational differences, a date-time differs from epoch time (milliseconds since 1970-01-01T00:00Z) in that date-times include time zone information while epoch times are always implicitly in UTC. We want to preserve the time zone rather than force UTC because the time zone includes information that is relevant to research. For example, knowing User X took their medication at YYYY-MM-DDT15:00-0700 tells us that the user took their medication in the mid-afternoon, while YYYY-MM-DDT22:00Z is ambiguous, depending on where the user lives.

Because Synapse tables are backed by a SQL database, it's infeasible to have a single Synapse table column represent both timestamp and timezone. In v2, Bridge will export the timezone as a separate field (5 chars). We can even re-use the old TIMESTAMP type in Upload Schemas, since this change is entirely additive.

Example (all times represent 2016-04-04T20:30-0700):

date-time-old | date-time-new | date-time-new.timezone
1459827000000 | 1459827000000 | -0700
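The split into an epoch-milliseconds column and a timezone column can be sketched as below. The function name is hypothetical, and the sketch assumes the input is in the full YYYY-MM-DDThh:mm:ss.sss+zzzz form; shorter variations would need additional parsing.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def export_date_time(field_name, iso_string):
    """Split an ISO 8601 date-time into epoch milliseconds plus a +zzzz offset."""
    parsed = datetime.strptime(iso_string, "%Y-%m-%dT%H:%M:%S.%f%z")
    epoch_millis = (parsed - EPOCH) // timedelta(milliseconds=1)
    return {
        field_name: epoch_millis,
        f"{field_name}.timezone": parsed.strftime("%z"),
    }


row = export_date_time("date-time-new", "2016-04-04T20:30:00.000-0700")
print(row)
```

For the example date-time this produces 1459827000000 and -0700, matching the table above.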

Additional notes:

  • Currently, you can query timestamps in Synapse using longs (epoch milliseconds) or strings. Querying by strings uses the format "YYYY-MM-DD hh:mm:ss". This is functional, but may require some fiddling, as most DateTime libraries use ISO 8601. See also PLFM-3814.

  • The Synapse web client uses "YYYY/MM/DD hh:mm:ss" to display timestamps, and it doesn't display the timezone (even though it's not necessarily in UTC). It was decided that this was fine (in fact, Synapse users complain the least when the timestamp is in this format), since most analysis will be done programmatically, where the timestamp is a long.

Time: Also called time-of-day. A time is a relative moment within a day. It's generally used to represent things where the exact day doesn't matter, but knowing what time of day it happened is relevant. Examples: wake up time, sleep time, lunch time, etc. Similar to date-time, all systems should use the ISO 8601 time format hh:mm:ss.sss.

These use cases are rare enough that it's not worth building a first-class type in Synapse. In Bridge, this was previously a passthrough of whatever ResearchKit/AppCore provided (which was in some cases unparseable garbage) as a string with maxLength=100. In v2, we will implement a TIME_V2 type to support surveys and to signal to Bridge to parse the time and format it into a consistent format ("hh:mm:ss.sss"). This will still be a string, but the maxLength can be reduced to 12, as that's the longest time format.
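A minimal sketch of the normalization a TIME_V2 field implies is shown below. It assumes the submitted value is already a parseable ISO 8601 time and relies on Python's time.fromisoformat; the function name is hypothetical, and values that cannot be parsed raise ValueError rather than being passed through.

```python
from datetime import time


def normalize_time(raw):
    """Parse a time-of-day and re-format it as hh:mm:ss.sss (always 12 chars)."""
    parsed = time.fromisoformat(raw.strip())
    millis = parsed.microsecond // 1000
    return parsed.strftime("%H:%M:%S") + f".{millis:03d}"


print(normalize_time("16:15"))
print(normalize_time("07:05:09.250"))
```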

Duration: A duration is a length of time without a fixed start or end time. Examples: ???? All systems should use the ISO 8601 duration format, see https://en.wikipedia.org/wiki/ISO_8601#Durations.

Similar to time, durations will not have a Synapse column type, but will have a Bridge schema type to parse and export the data in a consistent format. This will be a string with maxLength=24, as that's the longest duration format.

Subtask 1b: Mutable Schemas

Tracking JIRA: BRIDGE-1289

Schema revisions were originally immutable in response to the original ResearchKit/AppCore data formats being in flux and due to various design decisions made during the initial rapid prototyping of the Bridge Data Pipeline. With a year's worth of learnings, we've decided to lift these restrictions and allow adding new fields and reordering fields in schema revisions.

Mutable Schema Details and Restrictions

  • The new update API will detect added fields by sorting the field names in string-sort order. This will allow us to detect added fields without using an O(N^2) diff algorithm.

  • Existing fields cannot be deleted. This is because (a) older app versions may still be using that field and (b) deleting a Synapse table column will result in data loss.

    • Similarly, existing fields cannot be changed. The only exception is the "maxAppVersion" attribute.

  • Fields will have minAppVersion and maxAppVersion attributes, which for required fields tell Strict Validation which app versions should expect those fields. (Used for adding and "deleting" schema fields.)

    • For v1 uploads, we parse the app version from info.json appVersion field, which is in the format "version 1.3, build 42". The build number should match the version number in the User-Agent string.

  • For survey-to-schema conversion, we'll need to compare the newly published survey to the previous schema revision and keep old deleted fields in the new schema. (We don't need to worry about min/maxAppVersion as survey questions are always optional.)

    • Alternatively, if the researchers wish to cut a new schema revision (if the new survey is drastically different from the old one, for example), they can pass in a query param flag to specify whether they want to cut a new schema revision. If the parameter is not provided, we default to keeping the old schema.

    • We'll need to extend schemas to know their survey guid and createdOn, if applicable. See also BRIDGE-810.

NOTE: These restrictions only apply if we're modifying an existing schema revision in place. Creating new schema revisions will never have any restrictions. Creating a new schema revision will always create a new Synapse table.
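The in-place update rules above can be sketched as a sorted merge over field names. This is an illustration under assumptions, not the actual Bridge implementation: fields are modeled as plain dictionaries, and the function names are hypothetical.

```python
def _without_max_app_version(field):
    """Copy of a field definition that ignores the one mutable attribute."""
    return {k: v for k, v in field.items() if k != "maxAppVersion"}


def validate_schema_update(old_fields, new_fields):
    """Return the names of added fields, or raise ValueError for an illegal change."""
    old_sorted = sorted(old_fields, key=lambda f: f["name"])
    new_sorted = sorted(new_fields, key=lambda f: f["name"])
    added = []
    i = j = 0
    # Walk both sorted lists in step, so each field is visited once.
    while i < len(old_sorted) and j < len(new_sorted):
        old, new = old_sorted[i], new_sorted[j]
        if old["name"] == new["name"]:
            if _without_max_app_version(old) != _without_max_app_version(new):
                raise ValueError(f"field {old['name']!r} cannot be changed")
            i += 1
            j += 1
        elif new["name"] < old["name"]:
            added.append(new["name"])
            j += 1
        else:
            raise ValueError(f"field {old['name']!r} cannot be deleted")
    if i < len(old_sorted):
        raise ValueError(f"field {old_sorted[i]['name']!r} cannot be deleted")
    added.extend(f["name"] for f in new_sorted[j:])
    return added


old = [
    {"name": "b", "type": "INT", "required": True},
    {"name": "a", "type": "STRING", "required": False},
]
new = [
    {"name": "a", "type": "STRING", "required": False},
    {"name": "c", "type": "BOOLEAN", "required": False},
    {"name": "b", "type": "INT", "required": True, "maxAppVersion": 42},
]
print(validate_schema_update(old, new))
```

Because both lists are sorted first, reordering fields is accepted automatically and the comparison itself is a single linear pass.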

Revision Number Handling

In v1, schema revision numbers had to be sequential. A newly created schema was always revision 1. Subsequent revisions would be 2, 3, 4, and so forth. Attempting to update a revision out of sequence would throw a 400 Bad Request exception.

In v2, study developers can create schemas with any (positive) revision number they choose. The revision numbers don't need to start at 1, and they don't need to be sequential. For example, you can create Tapping Activity v5 without needing v1-4 to exist in the study. This could be useful, for example, if a study developer is copying a schema from another study, or if they have multiple studies with the same activity and they want the schema revision to be the same across all of them.

If the schema revision is not specified, the behavior falls back to v1. Specifically, if a revision number is not specified when the schema is created, it defaults to previous revision number + 1, or 1 if this is a new schema.
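The revision rules can be sketched as below. The function name is hypothetical, and two points are assumptions rather than statements from this design: "previous revision number" is taken to mean the highest existing revision, and creating a revision that already exists is rejected.

```python
def resolve_revision(existing_revisions, requested=None):
    """Pick the revision number for a newly created schema revision."""
    if requested is None:
        # v1 fallback: previous revision + 1, or 1 for a brand-new schema.
        return max(existing_revisions, default=0) + 1
    if requested < 1:
        raise ValueError("revision must be positive")
    if requested in existing_revisions:
        # Assumption: creating over an existing revision is an error.
        raise ValueError(f"revision {requested} already exists")
    return requested


print(resolve_revision([]))         # new schema, no revision given
print(resolve_revision([1, 2]))     # next sequential revision
print(resolve_revision([], 5))      # e.g. Tapping Activity v5 without v1-4
```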

Upload Schema Fields

Schema

Field | Type | Description
fieldDefinitions | List<UploadDefinition> | List of fields this schema contains, corresponds to Synapse table columns.
key (hash key) | String | DDB hash key, in the format "[studyId]:[schemaId]". This is not exposed outside of DDB.
name | String | User-friendly schema name.
revision (range key) | int | Revision number. This is a secondary ID used to partition different Synapse tables based on breaking changes in a schema.
schemaId | String | Schema ID, unique within a study.
schemaType | Enum | Backwards-compatible schema type for v1 uploads from ResearchKit/AppCore. This is needed because v1 uploads don't specify whether they are surveys or not, so we needed to encode this data in the schema. Valid values are IOS_DATA and IOS_SURVEY. For v2, this is meaningless, but we should introduce a NEW value UPLOAD_V2 as a placeholder.
surveyGuid (NEW) | String | Used to identify the survey from the schema.
surveyCreatedOn (NEW) | DateTime | Used to identify the survey from the schema. Represented in DDB as a long. Represented in JSON as a String in ISO8601 format.
studyId | String | Study that the schema lives in.

UploadDefinition

Field | Type | Description
fileExtension (NEW) | String | Used for ATTACHMENT_V2 types. Used as a hint by BridgeEX to preserve the file extension as a quality-of-life improvement. Optional, defaults to no extension.
mimeType (NEW) | String | Used for ATTACHMENT_V2 types. Used as a hint by BridgeEX to mark a Synapse file handle with the correct MIME type as a quality-of-life improvement. Optional, defaults to "application/octet-stream".
minAppVersion (NEW) | int | The oldest app version number for which this field is required. App versions before this will treat this field as optional, as it doesn't exist yet. Does nothing if required is false.
maxAppVersion (NEW) | int | Similar to minAppVersion. This is used for when required fields are removed from the app, but we want to re-use the old Synapse table.
maxLength (NEW) | int | Used for STRING, SINGLE_CHOICE, and INLINE_JSON_BLOB types. This is a hint for BridgeEX to create a Synapse column with the right width.
name | String | Field name. Must be unique. (Is actually an identifier.)
required | boolean | True if Strict Validation should reject the upload when the field is missing.
type | Enum | Field type. See table above for valid field types.
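The interaction between required, minAppVersion, and maxAppVersion can be sketched as follows, together with parsing the build number from a v1 info.json appVersion string. The function names are hypothetical. The sketch assumes that both version bounds are inclusive, that a missing bound means "no limit", and that a missing required attribute means the field is optional; none of these is stated explicitly in this design.

```python
import re

_BUILD_PATTERN = re.compile(r"build\s+(\d+)")


def parse_build_number(app_version_string):
    """Extract 42 from a v1 appVersion such as 'version 1.3, build 42'."""
    match = _BUILD_PATTERN.search(app_version_string)
    if match is None:
        raise ValueError(f"no build number in {app_version_string!r}")
    return int(match.group(1))


def is_field_required(field, app_version):
    """True if Strict Validation should reject an upload missing this field."""
    if not field.get("required", False):
        return False  # min/maxAppVersion do nothing for optional fields
    min_version = field.get("minAppVersion")
    max_version = field.get("maxAppVersion")
    if min_version is not None and app_version < min_version:
        return False  # field did not exist yet in this app version
    if max_version is not None and app_version > max_version:
        return False  # field was "deleted" from later app versions
    return True


field = {"name": "heartRate", "type": "INT", "required": True,
         "minAppVersion": 40, "maxAppVersion": 50}
build = parse_build_number("version 1.3, build 42")
print(build, is_field_required(field, build))
```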

New APIs

We will need new APIs for creating and updating schemas. These APIs will have the new semantics. This will also allow us to clean up the schema APIs. See also BRIDGE-966.

(NOTE: Only the create and update APIs are changing. The get, list, and delete APIs will remain the same.)

POST /v4/schemas - Creates a new schema rev using the new Upload v2 semantics described above. Uses the schemaId and rev (if specified) of the request body JSON. Can be used for creating an entirely new schema or for creating a new rev of an existing schema.

POST /v4/schemas/[schemaId]/revisions/[rev] - Updates a schema rev using the new Upload v2 semantics described above. Will validate that the schema rev can be updated using the request body JSON and will throw a 400 Bad Request if invalid.
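As a rough illustration of how a client might build requests for these two endpoints, the sketch below constructs (but does not send) the HTTP requests with the Python standard library. The host name is a placeholder, authentication is omitted, and the JSON property names are assumed to mirror the schema and field tables above; the real request format may differ.

```python
import json
import urllib.request

BASE_URL = "https://bridge.example.org"  # placeholder host, not a real endpoint


def _post(url, schema):
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url=url,
        data=json.dumps(schema).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def build_create_request(schema):
    return _post(f"{BASE_URL}/v4/schemas", schema)


def build_update_request(schema):
    url = f"{BASE_URL}/v4/schemas/{schema['schemaId']}/revisions/{schema['revision']}"
    return _post(url, schema)


schema = {
    "name": "Tapping Activity",
    "schemaId": "tapping-activity",
    "revision": 5,
    "schemaType": "UPLOAD_V2",
    "fieldDefinitions": [
        {"name": "audio", "type": "ATTACHMENT_V2", "required": True,
         "mimeType": "audio/mp4", "fileExtension": "m4a"},
        {"name": "notes", "type": "STRING", "required": False, "maxLength": 250},
    ],
}
create = build_create_request(schema)
update = build_update_request(schema)
print(create.get_method(), create.full_url)
print(update.get_method(), update.full_url)
```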

Related Tasks

Subtask 1c: Export Survey Questions to Synapse

See also BRIDGE-1095.

Scenario: A data analyst is looking at the data for the first time in Synapse. They open a table and they see a bunch of survey answers. They're trying to make sense of what they're looking at. Unfortunately, the Synapse table only contains the answers, not the questions, so they have to reference some other page somewhere else.

One possible solution is for tables and table columns to have descriptions that are rendered in the Synapse Web Client. This allows data analysts to understand the data without having to cross-ref to another source. See

Table Description Example

Survey Name: Test Survey

Survey GUID: 1c215650-52db-40dd-b4ca-1bf26d8eb6be

Survey created on: 2016-04-12T01:39:39.926Z

Note: These fields come from the Bridge survey object.

Column Description Example

Survey Question GUID: cdb3e3d2-5037-429b-9ca2-dc2ff5b92892

Prompt: How are you feeling today? - Select an answer below based on how you're feeling today.

Answers: 1 - Awesome; 2 - Good; 3 - Okay; 4 - Bad; 5 - Terrible

Type: SINGLE_CHOICE

Notes:

  • GUID comes from the Bridge survey question.

  • Prompt includes both the prompt and the prompt detail.

  • Answers are only present if it's a multi-choice or single-choice question.

  • Answer format has the server value, followed by the label, followed by the detail if present.

  • For multi-choice questions, this is replicated across all of the columns.

  • If the server value is the same as the label and detail (or no label or detail), we just have a flat list for answers.
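The column description format in the notes above can be sketched like this. The dictionary keys (guid, prompt, promptDetail, choices, value, label, detail, type) and the function names are hypothetical stand-ins for the Bridge survey question model. The sketch interprets the last note as: a label or detail that is missing or identical to the server value is simply left out.

```python
def format_answer(choice):
    """Server value, then label, then detail, skipping parts that add nothing."""
    value = str(choice["value"])
    parts = [value]
    for key in ("label", "detail"):
        text = choice.get(key)
        if text and text != value:
            parts.append(text)
    return " - ".join(parts)


def column_description(question):
    """Build the description text for a survey question's Synapse column."""
    prompt = question["prompt"]
    if question.get("promptDetail"):
        prompt = f"{prompt} - {question['promptDetail']}"
    lines = [f"Survey Question GUID: {question['guid']}", f"Prompt: {prompt}"]
    choices = question.get("choices")
    if choices:  # only present for multi-choice and single-choice questions
        lines.append("Answers: " + "; ".join(format_answer(c) for c in choices))
    lines.append(f"Type: {question['type']}")
    return "\n".join(lines)


question = {
    "guid": "cdb3e3d2-5037-429b-9ca2-dc2ff5b92892",
    "prompt": "How are you feeling today?",
    "promptDetail": "Select an answer below based on how you're feeling today.",
    "type": "SINGLE_CHOICE",
    "choices": [
        {"value": 1, "label": "Awesome"},
        {"value": 2, "label": "Good"},
        {"value": 3, "label": "Okay"},
        {"value": 4, "label": "Bad"},
        {"value": 5, "label": "Terrible"},
    ],
}
description = column_description(question)
print(description)
```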