Health Data Code API
Background
Bridge users save health data about themselves and then interact with that data through Bridge, both to learn more about their medical conditions and to contribute the data to researchers studying their community. The data they save must be anonymous (someone who broke into the database should not be able to derive whose data it is), yet all the records from one individual must be related, so that researchers can look at a set of records and know they describe a single person.
Toward this end, we propose assigning a "health data code" to each participant, probably scoped to an individual study. (Note: for reasons, we're not supposed to refer to Bridge participants as "patients", so we talk about health data codes here, not patient codes.) Records will be created under this code rather than under any other identifier for that user.
This API generates the code under which anonymous health data records are stored in Bridge. These codes must have the following properties:
- Given a base string and a user, the API should always return the same code;
- As with any strong hash, it should not be possible to determine the original user from the code;
- It must be possible to change the code the API returns for a user in the event the Bridge data is compromised. This doesn't have to be efficient or easy, just possible.
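The three properties above can be sketched with a keyed hash. This is a minimal illustration, not the proposed implementation: the document does not name a hashing function, so HMAC-SHA256 is an assumption here, as are the function name `health_data_code` and the `user_id` parameter.

```python
import hashlib
import hmac


def health_data_code(seed: bytes, user_id: str, base: str) -> str:
    """Derive a stable code for (user, base) under a secret per-user seed.

    Assumption: HMAC-SHA256 stands in for "a hash of some kind".
    """
    # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
    parts = (user_id.encode("utf-8"), base.encode("utf-8"))
    message = b"".join(len(p).to_bytes(4, "big") + p for p in parts)
    return hmac.new(seed, message, hashlib.sha256).hexdigest()


# Same seed, user, and base always give the same code; a new seed changes it.
stable = health_data_code(b"seed-1", "user-42", "study-a")
rotated = health_data_code(b"seed-2", "user-42", "study-a")
```

Because the seed is secret, the code cannot be recomputed from the user identifier alone, and replacing the seed yields a fresh set of codes for the same user.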
Since Bridge uses Synapse for user authentication, it has no information about users in its data stores, and putting the health data code API in Synapse would keep it that way (it is a very small API). The API can also reuse some of the password hashing code that already exists in Synapse.
API
URL | HTTP method | Description |
---|---|---|
/auth/healthdata/code | POST | {"base": "baseValue", "provisionalSeed": "seedValue"} Given a base string and a specific user, always returns the same health data code (a hash of some kind). The provisionalSeed property is optional; if submitted, the health data code returned will use that seed value without persisting it (see /auth/healthdata/seed). If omitted, the user's private seed is used. If the user has no seed yet, a first seed is generated on demand within Synapse and persisted (much like the current password hashing seed). |
/auth/healthdata/seed | POST | {"seed": "seedValue"} Submits a new seed value that will be persisted for this user and used when returning health data codes from then on. |
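The behavior of the two calls can be modeled in memory as follows. This is a sketch under stated assumptions: the class `HealthDataCodeService`, its method names, the `user_id` argument (in practice the authenticated user), and the use of HMAC-SHA256 are all hypothetical; only the request properties `base`, `provisionalSeed`, and `seed` come from the table above.

```python
import hashlib
import hmac
import secrets
from typing import Dict


class HealthDataCodeService:
    """In-memory stand-in for the two proposed endpoints (hypothetical)."""

    def __init__(self) -> None:
        self._seeds: Dict[str, str] = {}  # user id -> persisted private seed

    def post_code(self, user_id: str, body: dict) -> str:
        """Models POST /auth/healthdata/code."""
        seed = body.get("provisionalSeed")
        if seed is None:
            seed = self._seeds.get(user_id)
            if seed is None:
                # First seed is generated on demand and persisted.
                seed = secrets.token_hex(32)
                self._seeds[user_id] = seed
        # A provisional seed is used for this call only and never stored.
        message = "\0".join((user_id, body["base"])).encode("utf-8")
        return hmac.new(seed.encode("utf-8"), message, hashlib.sha256).hexdigest()

    def post_seed(self, user_id: str, body: dict) -> None:
        """Models POST /auth/healthdata/seed."""
        self._seeds[user_id] = body["seed"]
```

Note that a code computed with a provisional seed matches what /code returns after that same seed has been submitted to /seed, which is what makes the reset procedure possible.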
Resetting Health Data Codes
If the Bridge system were compromised, or a specific user wished to reset his or her password, we would initiate a worker process (on behalf of that user) that goes through all the studies, finding any health data codes that exist for the current user. For each one found, a POST to /code would be made with a new seed, and the new health data code returned would be used to update those rows in DynamoDB (re-create and then delete). Once all rows were verified as updated, a POST to /seed would be made to set the new seed as the seed for future health data codes. The Bridge worker (not Synapse) would be responsible for the transactionality and reliability of finishing the update.
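The reset procedure might look roughly like the sketch below. It is illustrative only: the function `reset_codes`, the dictionary standing in for DynamoDB, and the `get_code` and `set_seed` callables (wrappers around the /code and /seed calls) are hypothetical, and it assumes the base string is the study identifier.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Stand-in for DynamoDB: study id -> {health data code -> records}.
Store = Dict[str, Dict[str, List[dict]]]


def reset_codes(
    store: Store,
    studies: List[str],
    get_code: Callable[[str, Optional[str]], str],  # wraps POST /code
    set_seed: Callable[[str], None],                # wraps POST /seed
    new_seed: str,
) -> None:
    migrated: List[Tuple[str, str, str]] = []
    for study in studies:
        rows = store.get(study, {})
        old = get_code(study, None)          # code under the current seed
        if old not in rows:
            continue                         # user has no data in this study
        new = get_code(study, new_seed)      # code under the provisional seed
        if new == old:
            raise ValueError("new seed did not change the code")
        rows[new] = list(rows[old])          # re-create under the new code
        del rows[old]                        # then delete the old rows
        migrated.append((study, old, new))

    # Verify every row moved before the seed is persisted.
    for study, old, new in migrated:
        rows = store[study]
        if new not in rows or old in rows:
            raise RuntimeError(f"rows not updated for study {study}")

    set_seed(new_seed)                       # future codes use the new seed
```

Persisting the seed last means that a failure partway through leaves the old seed in place, so the worker can still compute both the old and the new codes and retry.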
Security
Outside of this process of resetting the seed, the seed used to create the health data code is not passed through the API. The first seed will be generated on demand within Synapse, so in the normal case the seed on which the hashing function depends would never be exposed through the API. No information linking a user to his or her health data codes would be stored in Bridge. To compromise these codes, someone would need to gain access to both the Bridge and Synapse persistence stores. This is not quite as secure as the password hash in Synapse (because that is based on an unknown password), but it should ensure anonymity for Bridge health data given the constraint that these codes need to be stable.
Changes to Synapse
- New end-to-end code for the two REST calls, probably drawing on existing code to create password hashes and the like
- A new column in the N table, "health data code salt" (wherever the password salt is persisted).