Overview
The Bridge User Data Download (BUDD) Service is the name of a stack that includes the following pieces:
- A way for the user to request user data.
- An asynchronous BUDD Daemon, which processes the request and gathers health data from AWS.
- A way for the user to get their data back from the BUDD Daemon.
Specifics are in the following sections.
Requesting User Data
Alternative #1: BridgePF API and SQS Queue
The user calls the requestUserData API, which lives in BridgePF. The user will pass in start date and end date (we can limit the time range to something like a week to make sure they don't brown out our system; also they can use this to get incremental updates of their data). BridgePF will then write the start and end date, along with health code and email address, to an SQS queue and return OK to the user.
The SQS queue will contain entries representing a request for health data, containing health code, email address, start date, and end date. These entries will be consumed by the BUDD Daemon.
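As a rough sketch of what such a queue entry and the suggested time-range limit could look like: the field names (`healthCode`, `emailAddress`, `startDate`, `endDate`), the class name, and the exact 7-day limit below are assumptions for illustration, since the design only says the range could be limited to "something like a week".

```python
import json
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed limit; the design only suggests "something like a week".
MAX_RANGE = timedelta(days=7)


@dataclass(frozen=True)
class UserDataRequest:
    """Hypothetical in-memory form of one user data request."""
    health_code: str
    email: str
    start_date: date
    end_date: date

    def validate(self) -> None:
        # Reject inverted ranges and ranges longer than the assumed limit.
        if self.end_date < self.start_date:
            raise ValueError("end date is before start date")
        if self.end_date - self.start_date > MAX_RANGE:
            raise ValueError("date range exceeds the allowed maximum")

    def to_message_body(self) -> str:
        # Serialize to the JSON string that would be written to the queue.
        self.validate()
        return json.dumps({
            "healthCode": self.health_code,
            "emailAddress": self.email,
            "startDate": self.start_date.isoformat(),
            "endDate": self.end_date.isoformat(),
        }, sort_keys=True)
```

Validating before serializing means an over-long range is rejected in the API layer and never reaches the queue, which is the "brown out" protection described above.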
Pros:
- We only have one front-end for all user interaction (BridgePF).
- Apps can integrate with this API without having to configure an entirely separate end point.
- Much cheaper to build compared to a web portal.
Cons:
- Potential security risk from having both email address and health code in SQS, even if only temporarily. (Even just the email address may be a security risk.)
Alternative #2: Web Portal
Users log into a web portal using their Bridge credentials and submit a user data request through the web portal. This request will include the start date and end date. (Again, we can limit the time range.) The BUDD Daemon is an asynchronous process on the same host that handles this request.
Pros:
- If we're building a web portal for returning user data, might as well use the same web portal for requesting it too.
- Security is simpler, since everything lives in a single system that's protected by the user's Bridge credentials.
Cons:
- More expensive compared to building a BridgePF API and an SQS queue.
- Users need to get their Bridge credentials from the apps, which may or may not be easy.
Recommendation
Building a web portal is significantly more expensive than building a BridgePF API and an SQS queue. However, the security benefit of not having email address and health code (or even just email address) living in SQS while "in transit" potentially more than justifies the cost.
Recommendation: Web Portal
BUDD Daemon
The BUDD Daemon is either a process running in EC2 that consumes an SQS queue or a process running in the web portal, depending on other design decisions. (TBD What framework? How is this deployed?) The BUDD Daemon processes user data requests by querying the Upload table in DDB and the Upload bucket in S3, downloading the files, decrypting them, bundling them up, and uploading the bundle to S3.
One possible optimization is to make Upload Validation save the decrypted archive file in S3, probably in a structure that looks like: /[healthCode]/[date], so the BUDD Daemon doesn't have to do this work again. However, we'd have to backfill about 4 months worth of data, and we expect user data requests to be sparse, so it's not clear whether this is worth the effort to backfill.
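To illustrate the proposed layout, here is a minimal sketch of how keys of the form [healthCode]/[date] could be built and enumerated for a requested date range. The function names and the ISO 8601 date format are assumptions; the leading slash from the text is dropped here because S3 object keys conventionally do not start with one.

```python
from datetime import date, timedelta
from typing import Iterator


def decrypted_archive_key(health_code: str, upload_date: date) -> str:
    """Build the key [healthCode]/[date] for one day's decrypted archive."""
    if not health_code or "/" in health_code:
        raise ValueError("health code must be non-empty and contain no '/'")
    # ISO 8601 (YYYY-MM-DD) is an assumed date format.
    return f"{health_code}/{upload_date.isoformat()}"


def keys_for_range(health_code: str, start: date, end: date) -> Iterator[str]:
    """Yield one key per calendar day in the inclusive range [start, end]."""
    day = start
    while day <= end:
        yield decrypted_archive_key(health_code, day)
        day += timedelta(days=1)
```

With a layout like this, the daemon could look up a user's data for a date range by key alone, without querying the Upload table or decrypting again.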
Returning User Data
Alternative #1: Email with S3 Pre-signed URL
The BUDD Daemon generates an S3 pre-signed URL that points to the user's requested data, and emails it to that user. For security, the pre-signed URL expires after 24 hours.
Pros:
- Much cheaper compared to building a web portal.
Cons:
- If the email is intercepted, somebody could use the S3 link to get the user's data.
- If the link expires, the user either needs to request their data again (starting the whole process over again), or we need a way to create a new S3 pre-signed URL from an existing user data request (which increases complexity and cost of this solution).
Alternative #2: Web Portal
The BUDD Daemon notifies the user via email that their data is available on the web portal. The user logs into the web portal and goes to their completed request. The web portal generates an S3 pre-signed URL, which the user can click to download. For security, the pre-signed URL expires after 5 minutes (we can tweak this up or down depending on security concerns vs user convenience). If the user needs to download their data again, they can log in again and go to the completed request, and the web portal will generate a new S3 pre-signed URL.
Pros:
- If we're building a web portal for requesting user data, might as well use the same web portal for returning it too.
- Better security around the S3 link, since it's not sent out externally, and the expiry time can be much shorter.
- Better story for the user generating a new S3 link after the old one expires.
Cons:
- More expensive compared to sending an email with an S3 pre-signed URL.
- Users need to get their Bridge credentials from the apps, which may or may not be easy.
Recommendation
Again, building a web portal is significantly more expensive than just sending out an email. But the web portal has a much better story for security and for generating a new link after the old one expires. Plus, if we're already building a web portal for requesting user data, we might as well use it for returning it too.
Recommendation: Web Portal
Web Portal Design
The web portal will be built with the following components:
- Spring Web MVC framework w/ Spring Boot - Given the lack of support for Play Framework, we decided to try using Spring MVC instead.
- AWS Elastic Beanstalk - We found that Heroku is too simple and too restrictive for our use-case. On the other end of the spectrum, AWS OpsWorks is powerful but incredibly complex. We chose Elastic Beanstalk because it gives us enough power to do what we need to do without the restrictions of Heroku. Additionally, Heroku is currently only available in the US and EU, while Elastic Beanstalk is available anywhere AWS is available.
- Travis - Travis is a hosted solution while Jenkins requires us to build and maintain our own infrastructure. In the interest of rapid development, we chose to use Travis over Jenkins. We also considered AWS CodePipeline. Unfortunately, CodePipeline doesn't host any builds itself and requires Jenkins as a build solution.
- Maven - Maven was chosen as a dependency management system because it's well-supported and most (if not all) Sage engineers are already familiar with Maven.
Future Improvements
These are things we want for future versions, but not things we need for the initial launch.
...
Data Flow
- User requests their data in the app, specifying the start date and end date. (App may or may not supply default start and end dates.)
- The app calls the Bridge REST API with the start date and end date (requires user authentication).
- Bridge Server writes the request to an internal SQS queue. This request contains study ID, username, start date, and end date.
- BUDD reads from the SQS queue, aggregates the requested data (which takes roughly a minute, depending on the amount of data), and sends an email to the user with a link to where they can download the data. This link will expire after 12 hours.
BUDD Internal Structure
The main entry point into BUDD is BridgeUddWorker (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/worker/BridgeUddWorker.java). This contains:
- A loop which polls SQS for requests and parses those requests. (There is a wait time configured so that while testing, if you Ctrl+C out of the process, you don't get a bunch of errors from "can't connect to SQS".)
- Gets the study from DynamoDB (because accounts and data are partitioned by study).
- Gets the account from Stormpath by email address. (The code says username, but we recently changed Bridge Server so that all usernames are the same as email address, and everything just keys off email address.)
- Gets the health ID from the user's account and queries DDB with the health ID to get the health code.
- Queries DDB SynapseTables to get a list of all Synapse health data tables for that study and SynapseSurveyTables to get a list of all survey metadata tables for that study.
- Calls the SynapsePackager (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/synapse/SynapsePackager.java) to download all the user's data from Synapse (within the specified date range). The SynapsePackager does the following.
  - It kicks off a bunch of async tasks for each Synapse table and for each survey table.
  - Some of these tasks are SynapseDownloadFromTableTask (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/synapse/SynapseDownloadFromTableTask.java), which queries a Synapse table for the user's data by health code and date range, downloads the results as a TSV, and downloads all the file handles.
  - Some of these tasks are SynapseDownloadSurveyTask (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/synapse/SynapseDownloadSurveyTask.java), which downloads the complete table of survey metadata as a TSV.
  - Zips up all files (TSVs and file handles) into a master zip file.
  - Uploads the master zip file to S3.
  - Creates a pre-signed URL for the master zip file and returns the pre-signed URL to BridgeUddWorker.
- Emails the pre-signed URL back to the user.
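The fan-out-then-zip pattern described for the SynapsePackager can be sketched roughly as follows. This is not the actual Java implementation: `package_user_data`, the `download_table` callback, and the zip file name are hypothetical, and the Synapse download itself is left to the caller so the sketch stays self-contained.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def package_user_data(
    table_ids: Iterable[str],
    download_table: Callable[[str, Path], List[Path]],
    work_dir: str,
    zip_name: str = "userdata.zip",
) -> Path:
    """Download every table concurrently, then bundle the files into one zip.

    download_table is a hypothetical callback that fetches one table's TSV
    (plus any file handles) into work_dir and returns the paths it wrote.
    """
    work_path = Path(work_dir)
    files: List[Path] = []
    # One async task per table, mirroring the per-table tasks described above.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(download_table, tid, work_path) for tid in table_ids]
        for future in futures:
            # result() re-raises any exception from the download task.
            files.extend(future.result())
    zip_path = work_path / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as master_zip:
        for path in files:
            # Assumes file names are unique across tables.
            master_zip.write(path, arcname=path.name)
    return zip_path
```

Uploading the resulting zip to S3 and generating the pre-signed URL would follow as separate steps and are omitted here.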
Development
Local Development
Create a fork from https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService. Follow the steps in https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/README.md. (If you're only planning on running the code but not on editing, you should be able to pull from the root fork directly.)
Testing
First, make sure your test account has uploads for the time range you want to test with.
To test through the Bridge Server, use the following example request:
POST https://webservices.sagebridge.org/v3/users/self/emailData
{
"startDate":"2015-08-15",
"endDate":"2015-08-19",
"type":"DateRange"
}
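If you prefer to script this request, the following sketch builds it with Python's standard library. It only constructs the request object; the authentication headers are left to the caller because the header scheme is not described here, and the function name is hypothetical.

```python
import json
import urllib.request
from typing import Dict

BRIDGE_URL = "https://webservices.sagebridge.org/v3/users/self/emailData"


def build_email_data_request(
    start_date: str, end_date: str, auth_headers: Dict[str, str]
) -> urllib.request.Request:
    """Build (but do not send) the POST request shown above.

    auth_headers is whatever your authenticated Bridge session requires;
    its contents are an assumption left to the caller.
    """
    body = json.dumps({
        "startDate": start_date,
        "endDate": end_date,
        "type": "DateRange",
    }).encode("utf-8")
    headers = {"Content-Type": "application/json", **auth_headers}
    return urllib.request.Request(BRIDGE_URL, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send the request.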
To test against BUDD directly, log into the AWS Console, go to the SQS dashboard, and submit the following example request as an SQS message:
{
"studyId":"api",
"username":"dwayne.jeng+test01@sagebase.org",
"startDate":"2015-07-23",
"endDate":"2015-07-30"
}
Either method will send an email to your registered email address.
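For reference, an SQS message in the format shown above could be parsed and sanity-checked along these lines. The class and function names are hypothetical and this is not the actual BUDD parsing code (which is Java); it only illustrates the message shape.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BuddRequest:
    """Hypothetical parsed form of one BUDD SQS message."""
    study_id: str
    username: str
    start_date: date
    end_date: date


def parse_budd_request(message_body: str) -> BuddRequest:
    """Parse the JSON message body; raises KeyError or ValueError on bad input."""
    raw = json.loads(message_body)
    request = BuddRequest(
        study_id=raw["studyId"],
        username=raw["username"],
        # Dates are expected as YYYY-MM-DD, as in the example message.
        start_date=date.fromisoformat(raw["startDate"]),
        end_date=date.fromisoformat(raw["endDate"]),
    )
    if request.end_date < request.start_date:
        raise ValueError("endDate is before startDate")
    return request
```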
Deploy to Dev
Submit your code changes to your own personal fork. Create a pull request to the root fork. Once the pull request has been merged, Travis will automatically build and deploy to the dev server on Elastic Beanstalk.
Deploy to Staging/Prod
- Create a workspace from the root fork (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService) if you don't already have one.
- Make sure all branches are up to date (git pull as necessary).
- Go to the staging branch (git checkout uat), merge from develop (git merge --ff-only develop).
- Push back to GitHub (git push). This should trigger Travis to automatically build and deploy to the staging server on Elastic Beanstalk.
Follow similar steps for Prod.
Rolling Back Deployments
- Log into the AWS Console and go to the Elastic Beanstalk Dashboard.
- In the top nav bar drop down, go to Application Versions.
- You'll see a list of versions named travis-[git commit hash]-[timestamp in epoch seconds]. Check the version you want to roll back to and click Deploy.
- Select the environment from the drop down and click Deploy.
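When choosing a version to roll back to, it can help to decode the version name into a commit hash and a readable deploy time. The sketch below assumes the naming pattern described above, with a hexadecimal commit hash; the function and type names are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# Pattern: travis-[git commit hash]-[timestamp in epoch seconds].
# A hexadecimal commit hash is an assumption.
VERSION_RE = re.compile(r"^travis-(?P<commit>[0-9a-f]+)-(?P<epoch>\d+)$")


class AppVersion(NamedTuple):
    commit: str
    built_at: datetime


def parse_version_label(label: str) -> AppVersion:
    """Split a version name into its commit hash and UTC timestamp."""
    match = VERSION_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized version label: {label!r}")
    built_at = datetime.fromtimestamp(int(match["epoch"]), tz=timezone.utc)
    return AppVersion(match["commit"], built_at)
```

Sorting the parsed versions by `built_at` gives the deploy history in chronological order.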
Access Logs
Logs can be found at https://logentries.com/. Credentials to the root Logentries account can be found at belltown:/work/platform/PasswordsAndCredentials/passwords.txt. Alternatively, get someone with account admin access to add your user account to Logentries.
If for some reason the logs aren't showing up in Logentries, file a support ticket with Logentries. The alternate steps to reach the logs are below:
- Log into the AWS Console and go to the Elastic Beanstalk Dashboard.
- Select the environment you want to view logs for.
- Click on Logs in the left nav bar.
- In the drop down (top right), click Request Logs.
- Last 100 Lines will give you a link to a page with the logs on screen.
- Full Logs will give you a link to a zip file you can download.
If this still doesn't work, you can SSH directly into BUDD hosts (see below) and find logs at /var/log/tomcat8/catalina.out.
Logging Into BUDD Hosts
You may need to be in the Fred Hutch intranet or logged into the Fred Hutch VPN for this to work.
- Log into the AWS Console and go to the EC2 Dashboard.
- Click on Instances in the left nav bar.
- In the table, find the host(s) with the name Bridge-UDD-Worker-Dev (or whatever environment you want to log into). Select that host. (If there's more than one in the environment you want, select just one host.)
- In the information panel on the bottom, find the field Public DNS host. This is the hostname you want to SSH into. But first, you'll need the PEM file to log in.
- Log into belltown and download the PEM files from /work/platform/PasswordsAndCredentials
- On your machine, run ssh -i [path to PEM file] ec2-user@[EC2 hostname]
You can save yourself some time with an entry in your ~/.ssh/config that looks like
host Bridge-UDD-Dev
    HostName ec2-52-20-91-245.compute-1.amazonaws.com
    User ec2-user
    IdentityFile ~/Bridge-UDD-Dev.pem
Now you can just run ssh Bridge-UDD-Dev.
Next Steps
Short/Medium-Term
- Add User Data Download to iOS SDK (BRIDGE-735)
- Log archiving and alarms (BRIDGE-761)
- Monitoring (BRIDGE-762)
- Refactor shared copy-pasted code into shared package (BRIDGE-763)
- Audit IAM credentials (BRIDGE-764)
- Move Stormpath keys from env vars to key management solution (BRIDGE-765)
Long-Term
- Performance improvements - Multi-threading? Map-Reduce?
- Web Portal - Better user interface than email?
- Data visualization - More useful than raw JSON dump
- Caching/De-duping - If the user requests the same data again, use the existing master zip file instead of generating a new one. Also helpful if their link expires and they want to get the data again.
- Cleanup task to delete old user requests and user-requested data.
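As one possible direction for the caching/de-duping idea, the master zip file could be stored under a name derived deterministically from the request, so an identical request maps to the same file. This is only a sketch under that assumption; the function name and naming scheme are hypothetical.

```python
import hashlib


def master_zip_cache_key(study_id: str, username: str, start_date: str, end_date: str) -> str:
    """Derive a stable file name from the request fields.

    The same study, user, and date range always produce the same name,
    so an existing master zip file can be looked up before regenerating it.
    """
    canonical = "\n".join([study_id, username, start_date, end_date])
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"userdata-{digest}.zip"
```

One caveat: a cached file goes stale if the user uploads more data inside an already-requested date range, so a real implementation would need some form of invalidation.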
See Also
Original design doc: Design Doc