Upload Format v2
Goals
Primary goals:
- Data exported to Synapse has a more researcher-friendly format
- Health data (survey answers, sensor data, activity data) is submitted by apps in a more parseable, more portable format that is not specific to ResearchKit/AppCore.
Secondary Goals - Potential future improvements that we should pursue, but that will not materially affect the design of the core goals of Upload Format v2:
- Allow study developers to rapidly set up new studies with a shared library of surveys, activities, and schemas - Spun off into a separate design project: Shared Module Library
- Validating data within JSON blobs and CSVs, possibly using Open mHealth (http://www.openmhealth.org/) and/or FHIR (https://www.hl7.org/fhir/)
- We'll want the option for validated JSON and unvalidated JSON. The former will be useful for importing from the shared module. The latter will be useful for rapid prototyping. In general, we as the platform should allow study developers to be as restrictive or as permissive as they'd like.
- List of released app versions in the Researcher Portal and in Synapse.
- Test data should go to separate Synapse tables from production data. We'll want to key off data group and possibly app version.
- This came up with both FPHS and with Android Mole Mapper.
Non-goals:
- Removing schemas - Bridge will continue to need schemas for the foreseeable future. Schemas are needed so Bridge knows how to parse incoming data from apps (auto-detection has proven unreliable) and how to export data to Synapse.
Overview
We want to change both (a) the data format being exported to Synapse and (b) the data format and APIs used by apps to submit health data to Bridge. There is no strict requirement to keep or replace the current Upload Schema system. However, keeping the current Upload Schemas and making incremental changes to it will allow us to modify the back-end without affecting the front-end and vice versa.
Phase 1 is updating the Synapse export. Phase 2 is updating the app data format. Upload Schemas are the anchor to hold everything in place. We want to do the back-end before the front-end so that when we update the front-end, the back-end is already ready to receive the data.
Phase 1: Synapse Export
Subtask 1a: Field Types
Tracking JIRA: BRIDGE-1288
For an example of old and new field types in Synapse, see https://www.synapse.org/#!Synapse:syn5853757. A few notes:
- There are 4 old attachment types and one new attachment type that differ in Synapse only by MIME type and file extension. For simplicity, only ATTACHMENT_BLOB and ATTACHMENT_V2 are represented, using the audio example to highlight the difference in file name and extension. (The Synapse Web Client doesn't let me pick the MIME type, but that difference is largely quality-of-life anyway.)
- Multi-Choice Int isn't represented in the example Synapse table, as this is no longer being supported. It doesn't look sufficiently different from Multi-Choice String to be interesting anyway.
- Short Inline JSON and Strings look the same between v1 and v2; the only difference is the threshold for what's considered "short". As such, the example Synapse table has only one "short" example covering both v1 and v2.
- The Synapse table doesn't have old examples of time and duration, as there are very few real-world examples of time and no examples of duration. Instead, we only have the new examples to pin down what the format should look like.
Type Overview
Conceptual Type | Bridge Survey Type | Bridge Upload Schema Type | Synapse Table Column Type |
---|---|---|---|
Boolean (unchanged) | dataType=BOOLEAN | BOOLEAN | BOOLEAN |
Decimal (unchanged) | dataType=DECIMAL | FLOAT | DOUBLE |
Int (unchanged) | dataType=INTEGER | INT | INTEGER |
Attachment (any) (old) | N/A | ATTACHMENT_BLOB | FILEHANDLEID, mimeType=application/octet-stream |
Attachment (CSV) (old) | N/A | ATTACHMENT_CSV | FILEHANDLEID, mimeType=text/csv |
Attachment (JSON) (old) | N/A | ATTACHMENT_JSON_BLOB | FILEHANDLEID, mimeType=text/json |
Attachment (JSON table) (old) | N/A | ATTACHMENT_JSON_TABLE | FILEHANDLEID, mimeType=text/json |
Attachment (new) | N/A | ATTACHMENT_V2, mimeType=*, fileExtension=* | FILEHANDLEID, mimeType=* |
Multi-Choice String (old) | MultiValueConstraints, allowMultiple=true, dataType=STRING | ATTACHMENT_JSON_BLOB | FILEHANDLEID, mimeType=text/json |
Multi-Choice Int w/ > 20 enum values (old, no longer supported in v2) | MultiValueConstraints, allowMultiple=true, dataType!=STRING, enumeration.size > 20 | ATTACHMENT_JSON_BLOB | FILEHANDLEID, mimeType=text/json |
Multi-Choice Int w/ <= 20 enum values (old, no longer supported in v2) | MultiValueConstraints, allowMultiple=true, dataType!=STRING, enumeration.size <= 20 | INLINE_JSON_BLOB | STRING, maxLength=100 |
Single-Choice (old) | MultiValueConstraints, allowMultiple=false | INLINE_JSON_BLOB | STRING, maxLength=100 |
Multi-Choice (new) | MultiValueConstraints, allowMultiple=true | MULTI_CHOICE | multiple BOOLEANs |
Single-Choice (new) | MultiValueConstraints, allowMultiple=false | SINGLE_CHOICE, maxLength=* | STRING, maxLength=* |
String, maxLength <= 100 (old) | dataType=STRING, maxLength <= 100 | STRING | STRING, maxLength=100 |
String, maxLength > 100 (old) | dataType=STRING, maxLength > 100 || maxLength not defined | ATTACHMENT_BLOB | FILEHANDLEID, mimeType=application/octet-stream |
String, maxLength <= 1000 (new) | dataType=STRING, maxLength <= 1000 | STRING, maxLength=* | STRING, maxLength=* |
String, maxLength > 1000 (new) | dataType=STRING, maxLength > 1000 || maxLength not defined | STRING, maxLength=* | BLOB |
Inline JSON, maxLength <= 100 (old) | N/A | INLINE_JSON_BLOB | STRING, maxLength=100 |
Inline JSON, maxLength > 100 (old) | N/A | ATTACHMENT_BLOB | FILEHANDLEID, mimeType=application/octet-stream |
Inline JSON, maxLength <= 1000 (new) | N/A | INLINE_JSON_BLOB, maxLength=* | STRING, maxLength=* |
Inline JSON, maxLength > 1000 (new) | N/A | INLINE_JSON_BLOB, maxLength=* | BLOB |
Date (YYYY-MM-DD) (old) | dataType=DATE | CALENDAR_DATE | STRING, maxLength=10 |
Date (YYYY-MM-DD) (new) | dataType=DATE | DATE_V2 | CALENDAR_DATE (because "DATE" is already used for date-time) |
Date-Time (old) | dataType=DATETIME | TIMESTAMP | DATE (internally, this is epoch time in milliseconds) |
Date-Time (new) | dataType=DATETIME | TIMESTAMP (changes to Bridge-EX) | DATE for timestamp + STRING maxLength=5 for timezone (+ZZZZ) |
Time (old) | dataType=TIME | STRING | STRING, maxLength=100 |
Time (new) | dataType=TIME | TIME_V2 | STRING, maxLength=12 (hh:mm:ss.sss) |
Duration (old) | dataType=DURATION | STRING | STRING, maxLength=100 |
Duration (new) | dataType=DURATION | DURATION_V2 | STRING, maxLength=24 |
Rows marked (new) indicate a new feature in the type system.
Unchanged Types
Booleans, ints, and floats are straightforward and are unchanged between v1 and v2.
Attachments
In v1, we had multiple attachment types that mapped to specific MIME types. We had a MIME type for JSON and a MIME type for CSV, but not for other file types (notably audio files). In v2, we're consolidating all attachment types to ATTACHMENT_V2, which will have metadata for MIME type and for file extension. While MIME type and file extension don't affect the researcher's ability to download the data, having the correct MIME type and file extension provides "quality of life" improvements.
Concrete example, for the audio_audio.m4a file in the Voice Activity:
- In v1, the Bridge type is ATTACHMENT_BLOB, and it's exported with MIME type application/octet-stream and file name audio_audio.mp4-[guid].tmp.
- In v2, the Bridge type is ATTACHMENT_V2 w/ MIME type audio/mp4 and file extension m4a, and it's exported with MIME type audio/mp4 and file name audio_audio-[guid].m4a.
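A rough sketch of how an exporter could build the v2 file name and MIME type from this metadata (the helper names are illustrative, not the actual Bridge-EX code; the defaults come from the UploadDefinition fields described later):

```java
// Illustrative only: builds an export file name like "audio_audio-[guid].m4a"
// from the ATTACHMENT_V2 field metadata described above.
static String attachmentFileName(String fieldName, String guid, String fileExtension) {
    String base = fieldName + "-" + guid;
    // fileExtension is optional on the field definition; the default is no extension
    return (fileExtension == null || fileExtension.isEmpty()) ? base : base + "." + fileExtension;
}

static String attachmentMimeType(String mimeType) {
    // mimeType is optional; defaults to application/octet-stream, as in v1
    return (mimeType == null || mimeType.isEmpty()) ? "application/octet-stream" : mimeType;
}
```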
Multiple Choice
Terminology note: "Multi-Choice" refers to a multiple choice question where users can select multiple answers (equivalent to allowMultiple=true in the survey model). "Single-Choice" refers to a multiple choice question where users can only select a single answer (equivalent to allowMultiple=false in the survey model). This terminology comes from ResearchKit/AppCore.
In v1, Multi-Choice answers with string types could potentially be very long (notable example question: "What sports do you play?"; example response: [ "football", "fencing", "swimming", "running", "ballet" ]). In order to fit within Synapse table row width limits, we wrote these to a file handle. This is not researcher-friendly because you can't query on file handles, and downloading hundreds of small files is very inefficient. In v2, to make these queryable and to fit within row width limits, we're exporting this as multiple boolean columns, corresponding to whether the user selected the choice or not. (See below for example.)
Also, in v1, Multi-Choice answers were sometimes represented as ints, where the ints mapped to enum values. This only occurred in client-side hardcoded surveys and is considered not very useful, as ints generally won't have meaning to researchers if they can't map the answer to a value. This is no longer being directly supported, but is still indirectly supported as ints can be trivially converted to strings.
For Single-Choice questions, ResearchKit/AppCore submits answers as a JSON array with only one element. When we first implemented Uploads and Bridge-EX, there was no documentation on the ResearchKit/AppCore data formats, so we decided to touch the submitted JSON as little as possible. As such, Bridge would pass along this JSON array as is. (Example question: "What is your gender?"; ResearchKit/AppCore submits: [ "Male" ]; Bridge passes this along as is.) These extra brackets were inconvenient to researchers and were unnecessary, and now that we have a year's worth of experience working with the format, we believe we can remove the brackets altogether. In v2, same example question, ResearchKit/AppCore submits [ "Male" ]; Bridge passes along the value Male without quotes or brackets.
Example table:
multi-choice-old | multi-choice-new.football | multi-choice-new.fencing | multi-choice-new.swimming | single-choice-old | single-choice-new |
---|---|---|---|---|---|
file handle: multi-choice-old-1234.tmp | false | true | true | [ "Male" ] | Male |
Contents of the file multi-choice-old-1234.tmp: [ "fencing", "swimming" ]
Strings and Inline JSON
In v1, Strings had a max length of 100 chars. Anything longer than that had to be an Attachment (File Handle in Synapse). Researchers were expected to know the difference when setting up their schemas. This is a problem because (a) File Handles aren't queryable, and having hundreds of small File Handles is inefficient; (b) it's too easy for researchers to accidentally break themselves; and (c) if the string is always very short, we waste lots of bytes that aren't being used.
In v2, Bridge types will have metadata to know String length, so we can use a Synapse column size that is just big enough. (Note that Bridge survey string fields have a maxLength property, but Bridge schema string fields do not.) This allows us to scale up to 1000 chars or down to 1 char as needed. Also, for Strings longer than 1000 chars (but not more than, say, 1mb), Synapse will have a new data type called a Blob which will fit this use case.
The same issues and changes are being applied to inline JSON. However, inline JSON was generally not used directly by researchers; rather, it was used indirectly for multi-choice and single-choice survey questions.
See also PLFM-3808.
Date and Time
There are several issues with how Surveys, Schemas, and Synapse tables currently handle date and time.
- The terminology is inconsistent across all three, so we're standardizing on a consistent set of terminology: "date", "date-time", and "time".
- Upload schemas don't support times or durations.
- Synapse tables don't support dates, times, or durations. Additionally, date-times are represented as epoch times, which means they lose timezone information. Also, the Synapse Web Client represents date-times in one format and queries them using another format, and neither format is ISO 8601.
Additional details below:
Date: Also called calendar date. A date is an abstract date without a timezone. It's used to represent moments in time where the exact time and time zone are ambiguous, unknown, or unnecessary. For example: birthdates, holidays, diagnosis dates, release dates, etc. All systems should use the ISO 8601 date format YYYY-MM-DD. Concrete example: 2016-04-01.
Dates can't be converted into date-times and vice versa. A date-time can be represented as 2 (or potentially 3) dates, depending on the time zone and semantics. A date can only be represented as a date-time by choosing an arbitrary time and time zone. Problems can occur when a date is converted to a date-time using one set of semantics, then converted back to a date using a different set of semantics. This is frequently a problem when one system uses the local timezone but another system forces UTC. To avoid these problems and ambiguities, we should avoid converting between dates and date-times.
Synapse currently doesn't have a native calendar date column type, but there is a use case for this outside of Bridge. We can bake the design for a calendar date column in Synapse into this design and use Bridge as the driver for this feature. See also PLFM-3816.
Date-Time: Also called timestamp. A date-time is a specific moment in time with a timezone. It's generally used to represent things where both the date and relative time of day are important. For example: activity start and end times, medication times, etc. All systems should use the ISO 8601 date-time format YYYY-MM-DDThh:mm:ss.sss+zzzz as it is both human readable and machine readable. Concrete example: 2016-04-01T16:15:00.000-0700. Variations on time and time zone (such as 2016-04-01T23:15Z) are also acceptable.
This differs from epoch time (milliseconds since 1970-01-01T00:00Z) in that date-times include time zone information, while epoch times are always implicitly in UTC (aside from representational differences). We want to preserve the time zone rather than force UTC because the time zone includes information that is relevant to research. For example, knowing User X took their medication at YYYY-MM-DDT15:00-0700 tells us that the user took their medication in the mid-afternoon, while YYYY-MM-DDT22:00Z is ambiguous, depending on where the user lives.
Because Synapse tables are backed by a SQL database, it's infeasible to have a single Synapse table column represent both timestamp and timezone. In v2, Bridge will export the timezone as a separate field (5 chars). We can even re-use the old TIMESTAMP type in Upload Schemas, since this change is entirely additive.
Example (all times represent 2016-04-04T20:30-0700):
date-time-old | date-time-new | date-time-new.timezone |
---|---|---|
1459827000000 | 1459827000000 | -0700 |
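A sketch of that export split using java.time (the library choice and method names are illustrative; Bridge-EX may do this differently):

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Splits an ISO 8601 date-time into the two exported columns: epoch milliseconds
// (Synapse DATE column) and a "+ZZZZ"-style timezone string.
static Object[] splitDateTime(String iso8601) {
    OffsetDateTime odt = OffsetDateTime.parse(iso8601);             // e.g. "2016-04-04T20:30:00-07:00"
    long epochMillis = odt.toInstant().toEpochMilli();              // 1459827000000
    String timezone = odt.format(DateTimeFormatter.ofPattern("Z")); // "-0700"
    return new Object[] { epochMillis, timezone };
}
```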
Additional notes:
- Currently, you can query timestamps in Synapse using longs (epoch milliseconds) or strings. Querying by strings uses the format "YYYY-MM-DD hh:mm:ss". This is functional, but may require fiddling, as most DateTime libraries use ISO 8601. See also PLFM-3814.
- The Synapse web client uses "YYYY/MM/DD hh:mm:ss" to display timestamps, and it doesn't display the timezone (even though it's not necessarily in UTC). It was decided that this was fine (in fact, Synapse users complain the least when the timestamp is in this format), since most analysis will be done programmatically, where the timestamp is a long.
Time: Also called time-of-day. A time is a relative moment within a day. It's generally used to represent things where the exact day doesn't matter, but knowing what time of day it happened is relevant. Examples: wake up time, sleep time, lunch time, etc. Similar to date-time, all systems should use the ISO 8601 time format hh:mm:ss.sss.
These use cases are rare enough that it's not worth building a first-class type in Synapse. In Bridge, this was previously a passthrough of whatever ResearchKit/AppCore provided (which was in some cases unparseable garbage) as a string with maxLength=100. In v2, we will implement a TIME_V2 type to support surveys and to signal to Bridge to parse the time and format it into a consistent format ("hh:mm:ss.sss"). This will still be a string, but the maxLength can be reduced to 12, as that's the longest time format.
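A minimal sketch of the normalization TIME_V2 implies, using java.time (the exact parsing behavior in Bridge may differ):

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Parses a time-of-day answer and re-formats it into the canonical "hh:mm:ss.sss"
// form (24-hour, no timezone), e.g. "16:22" -> "16:22:00.000".
static String canonicalizeTime(String submitted) {
    LocalTime time = LocalTime.parse(submitted);  // accepts "16:22", "16:22:09", "16:22:09.263", ...
    return time.format(DateTimeFormatter.ofPattern("HH:mm:ss.SSS"));
}
```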
Duration: A duration is a length of time without a fixed start or end time. Examples: ???? All systems should use the ISO 8601 duration format, see https://en.wikipedia.org/wiki/ISO_8601#Durations.
Similar to time, durations will not have a Synapse column type, but will have a Bridge schema type to parse and export the data in a consistent format. This will be a string with maxLength=24, as that's the longest duration format.
Subtask 1b: Mutable Schemas
Tracking JIRA: BRIDGE-1289
Schema revisions were originally immutable in response to the original ResearchKit/AppCore data formats being in flux and due to various design decisions made during the initial rapid prototyping of the Bridge Data Pipeline. With a year's worth of learnings, we've decided to lift these restrictions and allow adding new fields and reordering fields in schema revisions.
Mutable Schema Details and Restrictions
- The new update API will detect added fields by sorting the field names in string-sort order. This allows us to detect added fields without an N^2 diff algorithm (see the sketch after this list).
- Existing fields cannot be deleted. This is because (a) older app versions may still be using that field and (b) deleting a Synapse table column will result in data loss.
- Similarly, existing fields cannot be changed. The only exception is the "maxAppVersion" attribute.
- Fields will have a minAppVersion and maxAppVersion field, which for required fields tells Strict Validation which app versions should expect those fields. (Used for adding and "deleting" schema fields.)
- For v1 uploads, we parse the app version from info.json appVersion field, which is in the format "version 1.3, build 42". The build number should match the version number in the User-Agent string.
- For survey-to-schema conversion, we'll need to compare the newly published survey to the previous schema revision and keep old deleted fields in the new schema. (We don't need to worry about min/maxAppVersion as survey questions are always optional.)
- Alternatively, if the researchers wish to cut a new schema revision (if the new survey is drastically different from the old one, for example), they can pass in a query param flag to specify whether they want to cut a new schema revision. If the parameter is not provided, we default to keeping the old schema.
- We'll need to extend schemas to know their survey guid and createdOn, if applicable. See also BRIDGE-810.
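A rough sketch of that update check, using a simple name-to-type map as a stand-in for the real field definition objects (illustrative only, and ignoring the maxAppVersion exception for brevity):

```java
import java.util.Map;
import java.util.TreeMap;

// Validates an in-place schema revision update: existing fields cannot be deleted or
// changed; whatever is left over in the new field set counts as an addition.
static Map<String, String> validateUpdate(Map<String, String> oldFields, Map<String, String> newFields) {
    // TreeMap keeps field names in string-sort order, so we can line the two sides
    // up by name instead of running an N^2 diff.
    TreeMap<String, String> remaining = new TreeMap<>(newFields);
    for (Map.Entry<String, String> oldField : oldFields.entrySet()) {
        String newType = remaining.remove(oldField.getKey());
        if (newType == null) {
            throw new IllegalArgumentException("Existing field cannot be deleted: " + oldField.getKey());
        }
        if (!newType.equals(oldField.getValue())) {
            throw new IllegalArgumentException("Existing field cannot be changed: " + oldField.getKey());
        }
    }
    // Whatever remains in the new map are the added fields.
    return remaining;
}
```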
NOTE: These restrictions only apply if we're modifying an existing schema revision in place. Creating new schema revisions will never have any restrictions. Creating a new schema revision will always create a new Synapse table.
Revision Number Handling
In v1, schema revision numbers had to be sequential. A newly created schema was always revision 1. Subsequent revisions would be 2, 3, 4, and so forth. Attempting to update a revision out of sequence would throw a 400 Bad Request exception.
In v2, study developers can create schemas with any (positive) revision number they choose. The revision numbers don't need to start at 1, and they don't need to be sequential. For example, you can create Tapping Activity v5 without v1-4 existing in the study. This could be useful, for example, if a study developer is copying a schema from another study, or if they have multiple studies with the same activity and they want the schema revision to be the same across all of them.
If the schema revision is not specified, the behavior falls back to v1. Specifically, if a revision number is not specified when the schema is created, it defaults to previous revision number + 1, or 1 if this is a new schema.
Upload Schema Fields
Schema
Field | Type | Description |
---|---|---|
fieldDefinitions | List<UploadDefinition> | List of fields this schema contains, corresponds to Synapse table columns. |
key (hash key) | String | DDB hash key, in the format "[studyId]:[schemaId]". This is not exposed outside of DDB. |
name | String | User-friendly schema name. |
revision (range key) | int | Revision number. This is a secondary ID used to partition different Synapse tables based on breaking changes in a schema. |
schemaId | String | Schema ID, unique within a study. |
schemaType | Enum | Backwards-compatible schema type for v1 uploads from ResearchKit/AppCore. This is needed because v1 uploads don't specify whether they are surveys or not, so we needed to encode this data in the schema. Valid values are IOS_DATA and IOS_SURVEY. For v2, this is meaningless, but we should introduce a NEW value UPLOAD_V2 as a placeholder. |
surveyGuid NEW | String | Used to identify the survey from the schema. |
surveyCreatedOn NEW | DateTime | Used to identify the survey from the schema. Represented in DDB as a long. Represented in JSON as a String in ISO8601 format. |
studyId | String | Study that the schema lives in. |
UploadDefinition
Field | Type | Description |
---|---|---|
fileExtension NEW | String | Used for ATTACHMENT_V2 types. Used as a hint by BridgeEX to preserve the file extension as a quality-of-life improvement. Optional, defaults to no extension. |
mimeType NEW | String | Used for ATTACHMENT_V2 types. Used as a hint by BridgeEX to mark a Synapse file handle with the correct MIME type as a quality-of-life improvement. Optional, defaults to "application/octet-stream". |
minAppVersion NEW | int | The oldest app version number for which this field is required. App versions before this will treat this field as optional, as it doesn't exist yet. Does nothing if required is false. |
maxAppVersion NEW | int | Similar to minAppVersion. This is used for when required fields are removed from the app, but we want to re-use the old Synapse table. |
maxLength NEW | int | Used for STRING, SINGLE_CHOICE, and INLINE_JSON_BLOB types. This is a hint for BridgeEX to create a Synapse column with the right width. |
name | String | Field name. Must be unique. (Is actually an identifier.) |
required | boolean | True if Strict Validation should reject the upload when the field is missing. |
type | Enum | Field type. See table above for valid field types. |
New APIs
We will need new APIs for creating and updating schemas. These APIs will have the new semantics. This will also allow us to clean up the schema APIs. See also BRIDGE-966. (NOTE: Only the create and update APIs are changing. The get, list, and delete APIs will remain the same.)
POST /v4/schemas - Creates a new schema rev using the new Upload v2 semantics described above. Uses the schemaId and rev (if specified) from the request body JSON. Can be used for creating an entirely new schema or for creating a new rev of an existing schema.
POST /v4/schemas/[schemaId]/revisions/[rev] - Updates a schema rev using the new Upload v2 semantics described above. Will validate that the schema rev can be updated using the request body JSON and will throw a 400 Bad Request if invalid.
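For illustration, a hypothetical request body for the create call, using the fields from the Schema and UploadDefinition tables above (the identifiers and values are made up):
{ "name":"Test Survey", "schemaId":"testSurvey", "revision":2, "schemaType":"UPLOAD_V2", "surveyGuid":"1c215650-52db-40dd-b4ca-1bf26d8eb6be", "surveyCreatedOn":"2016-04-12T01:39:39.926Z", "fieldDefinitions":[ { "name":"AAA", "type":"INT", "required":true }, { "name":"BBB", "type":"SINGLE_CHOICE", "maxLength":100, "required":false } ] }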
Related Tasks
- Update BridgeEX to create columns in the Synapse table when the schema changes.
- Status: Done (https://github.com/Sage-Bionetworks/Bridge-Exporter/pull/11)
Subtask 1c: Export Survey Questions to Synapse
See also BRIDGE-1095.
Scenario: A data analyst is looking at the data for the first time in Synapse. They open a table and they see a bunch of survey answers. They're trying to make sense of what they're looking at. Unfortunately, the Synapse table only contains the answers, not the questions, so they have to reference some other page somewhere else.
One possible solution is for tables and table columns to have descriptions that are rendered in the Synapse Web Client. This allows data analysts to understand the data without having to cross-reference another source. See PLFM-3831.
Table Description Example
Survey Name: Test Survey
Survey GUID: 1c215650-52db-40dd-b4ca-1bf26d8eb6be
Survey created on: 2016-04-12T01:39:39.926Z
Note: These fields come from the Bridge survey object.
Column Description Example
Survey Question GUID: cdb3e3d2-5037-429b-9ca2-dc2ff5b92892
Prompt: How are you feeling today? - Select an answer below based on how you're feeling today.
Answers: 1 - Awesome; 2 - Good; 3 - Okay; 4 - Bad; 5 - Terrible
Type: SINGLE_CHOICE
Notes:
- GUID comes from the Bridge survey question.
- Prompt includes both the prompt and the prompt detail.
- Answers are only present if it's a multi-choice or single-choice question.
- Answer format has the server value, followed by the label, followed by the detail if present.
- For multi-choice questions, this is replicated across all of the columns.
- If the server value is the same as the label and detail (or no label or detail), we just have a flat list for answers.
- Example:
Answers: fencing; football; swimming
- Type corresponds to the Bridge schema type.
Subtask 1d: Other Tasks
Unique Key Constraints for Record ID
Tracking JIRA: BRIDGE-1290
In the early days of Bridge, it was possible for Bridge to export a partial record and then export that same record again in a later redrive. This would result in a table with the same record ID twice. For example:
recordId | foo-field | bar-field |
---|---|---|
9988e2fc-51ef-4cdb-8e12-03eacf7e6684 | incomplete | |
9988e2fc-51ef-4cdb-8e12-03eacf7e6684 | incomplete | until now |
This would cause problems in analysis scripts, which frequently assumed that record IDs were unique. This is also semantically incorrect, as record IDs are in fact supposed to be unique; duplicate record IDs are an artifact of early data issues.
To prevent this issue in the future, we'd need unique key constraints in Synapse, and we'd need Bridge-EX to use overwrite semantics when uploading a row with an existing record ID. See also PLFM-3815.
Phase 2: App Health Data Submission
Tracking JIRA: BRIDGE-1291
Sub-Goals
- Reconcile survey data format with non-survey data format.
- De-couple from ResearchKit/AppCore data format.
- Allow Bridge to know how to parse data before processing the upload bundle.
- Allow apps to specify the data format via API.
- Allow apps to choose whether they want to submit health data synchronously (ideal for small submissions like surveys) or asynchronously (ideal for bundles like walking activity or during connectivity issues).
New Survey Format
See Bridge Upload Data Format#Surveys for v1 survey format, as originally implemented by ResearchKit/AppCore. This is a problem because:
- Surveys and non-surveys are parsed differently. This means the parsing code behaves differently for surveys than for non-surveys.
- Survey format is needlessly wordy. The survey format contains extra fields that provide no value, such as the numeric questionType.
- Survey format is needlessly complicated. Boolean questions have a "booleanAnswer" field that must be parsed. Multi-choice questions have a "choiceAnswer" field that must be parsed. And so forth.
The new survey format should simply be a JSON object whose key corresponds to the survey question identifier and the value corresponds with the survey question answer. This puts it in line with non-surveys. Example:
{ "foo-question":42, "bar-question":"Male", "baz-question":["fencing", "swimming"] }
Type Spec
We need to maintain strong typing to ensure the highest quality of data is exported to Synapse. As such, this section will serve as a spec for the expected formatting of each type. This section focuses on schema types and the expected format the app should submit. For a bigger overview of schema types and how they relate to surveys and Synapse table columns, see the previous section Subtask 1a: Field Types above.
Schema Type | App Submission Format | Example | NOTES |
---|---|---|---|
ATTACHMENT_* | any type | | It is recommended that apps submit attachments as a top-level file in a bundle rather than inline in a synchronous submission. However, we will support this case regardless to allow app developers to choose their means of health data submission. |
BOOLEAN | JSON boolean | true or false | If the app submits a number (ints only), we treat 0 as false and non-zero as true. If the app submits a string, we accept "true" and "false" (ignoring case), but not things like "Yes" and "No" or "Ja" and "Nein" as there are way too many possibilities. |
CALENDAR_DATE / DATE_V2 | string in YYYY-MM-DD format | "2016-04-12" | If the app submits a date-time (as described in the TIMESTAMP row), we ignore the time (and timezone) part and just use the calendar date part. |
DURATION_V2 | string in ISO8601 duration format | | See https://en.wikipedia.org/wiki/ISO_8601#Durations for details. |
FLOAT | JSON decimal | 3.14 | Ints are trivially converted into floats. If the app submits a string, we use Java's BigDecimal to parse it. |
INLINE_JSON_BLOB | any type | ||
INT | JSON int | 42 | If the app submits a float, we truncate using Java's double to int semantics. If the app submits a string, we use Java's BigDecimal to parse it, then truncate to an int. Supports up to 64-bit ints (longs in Java). |
MULTI_CHOICE | array of strings | [ "fencing", "swimming" ] | The values inside the array are expected to be strings. If they are not, they are trivially converted to strings. |
SINGLE_CHOICE | JSON string | "Male" | For backwards compatibility, if the app submits an array, we use the 0th element of the array. If the array doesn't have exactly 1 element, this is an error. If the app submits a value that's neither a string nor an array, the value is trivially converted to a string. |
STRING | JSON string | "This is a string" | If the app submits a value that's not a string, we trivially convert it to a string. |
TIME_V2 (Time w/o Date) | string in hh:mm:ss.sss format | "16:22:09.263" | This is always in 24-hour format and contains no timezone. If the app submits a date-time (as described in the TIMESTAMP row), we ignore the date and timezone and just use the time part. |
TIMESTAMP (Date-Time) | string in ISO8601 format OR epoch milliseconds (implicitly in UTC) | "2016-04-12T16:22:09.263-0700" OR 1460503329263 | For ISO8601 date-times, the timezone will generally be the timezone set on the user's phone. Apps should avoid "canonicalizing" the timezone to UTC or the server time, as this is a lossy conversion. |
Our philosophy here is to be as lenient as possible. This is because whatever is generating the data may not fit our platonic ideals for what data should look like, and rather than force every app to conform to our ideals, we aim to be lenient and convert data into our ideal form (at least within reason).
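A minimal sketch of the lenient coercion for booleans and ints described in the table above (helper names are illustrative; the real upload handlers may differ):

```java
import java.math.BigDecimal;

// Coerces a submitted value to a boolean per the BOOLEAN row: integers treat 0 as
// false and non-zero as true; strings accept only "true"/"false" (ignoring case).
static boolean coerceBoolean(Object value) {
    if (value instanceof Boolean) {
        return (Boolean) value;
    }
    if (value instanceof Integer || value instanceof Long) {
        return ((Number) value).longValue() != 0;
    }
    String str = String.valueOf(value);
    if ("true".equalsIgnoreCase(str) || "false".equalsIgnoreCase(str)) {
        return Boolean.parseBoolean(str);
    }
    throw new IllegalArgumentException("Cannot coerce to boolean: " + value);
}

// Coerces a submitted value to an int per the INT row: floats are truncated using
// Java's narrowing semantics; strings are parsed with BigDecimal and then truncated.
static long coerceInt(Object value) {
    if (value instanceof Number) {
        return ((Number) value).longValue();
    }
    return new BigDecimal(String.valueOf(value)).longValue();
}
```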
Synchronous Health Data Submission
New API: POST /v4/records
Field | Type | Description |
---|---|---|
appVersion | int | Optional. App version number that created the health data submission. This is needed because the app may create health data, then get upgraded, then submit the health data with the new version. It's important to know which version of the app created the health data so we can process and analyze it properly. If not specified, this defaults to the app version specified in the User-Agent header. This also corresponds to the build number in the legacy app version format ("version 1.0.2, build 8"). |
format | Enum | Valid values are "LEGACY_RK", "SIMPLE_RECORD", or "BUNDLE_ZIP". "LEGACY_RK" and "BUNDLE_ZIP" are not supported for this API. In the future, we may add other formats. |
phoneInfo | String | Optional. String describing phone hardware and/or OS, passed through to Synapse. Example: "iPhone 6". This must be 48 chars or less. If this is greater than 48 chars, it is truncated to 48 chars. |
schemaId | String | Tells Bridge which schema to parse the health data for. |
schemaRevision | int | Tells Bridge which schema revision. |
data | JSON object | JSON object w/ key-value pairs corresponding survey answers / schema fields / Synapse table columns. |
type | String | "HealthDataSubmission" |
IMPORTANT NOTE: The request body will need to be encrypted with the study's public key, available from the Researcher UI or from /v3/studies/self/publicKey
Example body (unencrypted):
{ "appVersion":42, "format":"SIMPLE_RECORD", "phoneInfo":"iPhone 6", "schemaId":"testSurvey", "schemaRevision":1, "data":{ "AAA":42, "BBB":"Male", "CCC":["fencing", "swimming"] }, "type":"HealthDataSubmission" }
Response: Record ID (guid).
NOTE: This API replaces the defunct Survey Response API. See BRIDGE-737.
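A rough client-side sketch of preparing the request body, assuming it is CMS-encrypted with the study's public-key certificate in the same way existing upload files are (the BouncyCastle classes are real, but the envelope format for this new API is an assumption):

```java
import java.nio.charset.StandardCharsets;
import java.security.cert.X509Certificate;
import org.bouncycastle.cms.CMSAlgorithm;
import org.bouncycastle.cms.CMSEnvelopedData;
import org.bouncycastle.cms.CMSEnvelopedDataGenerator;
import org.bouncycastle.cms.CMSProcessableByteArray;
import org.bouncycastle.cms.jcajce.JceCMSContentEncryptorBuilder;
import org.bouncycastle.cms.jcajce.JceKeyTransRecipientInfoGenerator;

// Encrypts the HealthDataSubmission JSON with the study's public key (available from
// /v3/studies/self/publicKey) before POSTing it to /v4/records.
static byte[] encryptSubmission(String submissionJson, X509Certificate studyCert) throws Exception {
    CMSEnvelopedDataGenerator generator = new CMSEnvelopedDataGenerator();
    generator.addRecipientInfoGenerator(new JceKeyTransRecipientInfoGenerator(studyCert));
    CMSEnvelopedData enveloped = generator.generate(
            new CMSProcessableByteArray(submissionJson.getBytes(StandardCharsets.UTF_8)),
            new JceCMSContentEncryptorBuilder(CMSAlgorithm.AES256_CBC).build());
    return enveloped.getEncoded();
}
```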
Asynchronous Bundle Upload
New API: POST /v4/uploads
Field | Type | Description |
---|---|---|
appVersion | int | Optional. App version number that created the health data submission. Same as the synchronous health data submission. |
contentLength | int | File length in bytes. Used to validate the file was uploaded correctly. |
contentMd5 | String | Base64-encoded MD5 hash of the file contents. Used to validate the file was uploaded correctly. |
format | Enum | Valid values are LEGACY_RK, SIMPLE_RECORD, and BUNDLE_ZIP. See below for details. |
mainRecord | String | Optional. Only used for BUNDLE_ZIP. Specifies a JSON file in the bundle that contains the main health record data. Keys in this JSON file should correspond with survey question identifiers / schema field names / Synapse column names. A bundle zip can have at most one main record. It also may contain no main record. |
phoneInfo | String | Optional. String describing phone hardware and/or OS. Same as the synchronous health data submission. |
schemaId | String | Tells Bridge which schema to parse the health data for. |
schemaRevision | int | Tells Bridge which schema revision. |
supplementalRecords | array of Strings | Optional. Only used in BUNDLE_ZIP. Specifies a list of JSON files that contain additional health record data. Field names are determined by pre-pending the filename to the JSON key name. A bundle zip can have any number of supplemental records, including no supplemental records. See example below. |
type | String | "HealthDataBundle" |
IMPORTANT NOTE: The request body will need to be encrypted with the study's public key, available from the Researcher UI or from /v3/studies/self/publicKey
The return value is an upload session containing an S3 pre-signed URL, as specified in the Bridge REST API. The client is then expected to upload the bundle file to S3. For LEGACY_RK, this looks exactly like the old upload zip file specified in Bridge Upload Data Format.
For SIMPLE_RECORD, this is an encrypted JSON file (not a zip file). The file is decrypted and submitted as though it were a synchronous health data submission. (Although obviously this API will do so asynchronously.)
For BUNDLE_ZIP, this is an encrypted zip file. The zip file is expected to include the files listed in mainRecord and supplementalRecords as described above. The zip file can contain zero or more additional files, which are treated as opaque attachments (file handles in Synapse). This is used for audio files, accelerometer data (large JSON blobs), CSVs, etc. A bundle zip can have any number of attachments, including no attachments. Note that if the bundle contains attachments or supplemental records that aren't in the schema and strict validation is turned on for the study, the strict validation handler will error out and reject the bundle.
Beyond this, uploading the file to the S3 pre-signed URL, upload complete, and upload validation status look the same as in the Bridge REST API docs.
NOTE: For test purposes, we'll add a flag to the Upload Complete API, which performs upload processing and validation synchronously and returns the upload validation status. This is used to make app development easier. This should be safe, since log scan shows that in the last week, of all the uploads in Prod, only 4 were in the 10+ second range, and none went over 50 seconds.
Simple Record Upload Example
{ "appVersion":42, "contentLength":110, "contentMd5":"EB33q+K8dczcAMevQXRi0w==", "format":"SIMPLE_RECORD", "phoneInfo":"iPhone 6", "schemaId":"testSurvey", "schemaRevision":1, "type":"HealthDataBundle" }
Encrypted file contents:
{ "AAA":42, "BBB":"Male", "CCC":["fencing", "swimming"] }
Bundle Zip Upload Example
{ "appVersion":42, "contentLength":204765, "contentMd5":"9BeTdzsHdltD0oC1ICQ4Zg==", "format":"BUNDLE_ZIP", "mainRecord":"walking-main.json", "phoneInfo":"iPhone 6", "schemaId":"WalkingActivity", "schemaRevision":7, "supplementalRecords":["medication.json"], "type":"HealthDataBundle" }
Zip file contains: accelerometer.json (attachment), medication.json, motion.json (attachment), pedometer.json (attachment), walking-main.json
walking-main.json:
{ "startDateTime":"2016-04-12T17:20:23.849-0700", "endDateTime":"2016-04-12T17:21:05.972-0700", "numSteps":23 }
medication.json:
{ "medication":"I do not take Parkinson medication" }
Example record:
Field | Value |
---|---|
accelerometer.json | (attachment) |
endDateTime | "2016-04-12T17:21:05.972-0700" |
medication.json.medication | "I do not take Parkinson medication" |
motion.json | (attachment) |
numSteps | 23 |
pedometer.json | (attachment) |
startDateTime | "2016-04-12T17:20:23.849-0700" |