This is the documentation for the Bridge User Data Download (BUDD) Service.
Data Flow
- User requests their data in the app, specifying the start date and end date. (App may or may not supply default start and end dates.)
- The app calls the Bridge REST API with the start date and end date (requires user authentication).
- Bridge Server writes the request to an internal SQS queue. This request contains study ID, username, start date, and end date.
- BUDD reads the request from the SQS queue, aggregates the requested data, and emails the user a link where they can download the data. This link expires after 24 hours.
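To make the shape of the queued request concrete, here is a minimal sketch that models it as a JSON message carrying the four fields listed above. The JSON encoding, the field names (studyId, username, startDate, endDate), the ISO date format, and the UserDataRequest and parse_request names are all assumptions for illustration. They are not taken from the Bridge code, and the real service is written in Java.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UserDataRequest:
    """Hypothetical model of a BUDD request; field names are assumptions."""
    study_id: str
    username: str
    start_date: date
    end_date: date


def parse_request(message_body: str) -> UserDataRequest:
    """Parse a JSON message body into a request, rejecting inverted ranges."""
    raw = json.loads(message_body)
    request = UserDataRequest(
        study_id=raw["studyId"],
        username=raw["username"],
        start_date=date.fromisoformat(raw["startDate"]),
        end_date=date.fromisoformat(raw["endDate"]),
    )
    # Start and end dates are both inclusive, so equal dates are allowed.
    if request.start_date > request.end_date:
        raise ValueError("start date must not be after end date")
    return request
```

Validating the date range when the message is parsed means a malformed request fails before any data is fetched.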
What BUDD Does Internally
- Parses the request from SQS (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/helper/SqsHelper.java)
- Gets the study from Dynamo DB Study table (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/dynamodb/DynamoHelper.java#L41) - This is needed to get the Stormpath information to get the user's account, as well as to get the configured "from" email address.
- Gets the user's health ID (used to obtain health code) and email address from Stormpath (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/accounts/StormpathHelper.java)
- Gets the user's health code from the Dynamo DB HealthId table and queries for uploads with that health code and with upload dates within the start and end date (inclusive) from the Dynamo DB Upload table, index healthCode-uploadDate-index (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/dynamodb/DynamoHelper.java#L59)
- S3 Packager (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/s3/S3Packager.java). This does the following:
- Downloads the uploads from S3.
- Decrypts the uploads and writes them to a temp directory. (A new temp directory is created for every request.) Individual files are named in the format YYYY-MM-DD-UploadId.zip, so users can organize their uploads by date.
- Errors in downloading or decrypting are written to an error.log in the temp directory, which is included in the master zip file.
- Creates a master zip file called userdata-YYYY-MM-DD-to-YYYY-MM-DD-randomGuid.zip (start date and end date) and zips all upload files and error.log into the master zip file.
- Uploads the master zip file to S3.
- Creates a pre-signed URL for the master zip file, for HTTP GET only and with an expiration date 24 hours from now.
- Deletes the temp files and temp directories.
- Emails the S3 pre-signed URL to the user's registered email address (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/src/main/java/org/sagebionetworks/bridge/udd/helper/SesHelper.java)
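The packaging steps above can be sketched roughly as follows. This is not the S3Packager implementation (which is Java); it only illustrates the file naming, error logging, master zip, and cleanup steps using the Python standard library. The package_uploads function and its fetch_and_decrypt parameter are hypothetical: fetch_and_decrypt stands in for the S3 download and decryption. The sketch also assumes that error.log is only created when at least one upload fails, and it writes the master zip to a caller-supplied directory instead of uploading it to S3.

```python
import shutil
import tempfile
import uuid
import zipfile
from datetime import date
from pathlib import Path
from typing import Callable, Iterable, Tuple


def package_uploads(
    uploads: Iterable[Tuple[str, date]],
    fetch_and_decrypt: Callable[[str], bytes],
    start_date: date,
    end_date: date,
    output_dir: Path,
) -> Path:
    """Build the master zip from (upload_id, upload_date) pairs.

    fetch_and_decrypt is a hypothetical stand-in for the S3 download and
    decryption steps; it returns decrypted bytes or raises on failure.
    """
    temp_dir = Path(tempfile.mkdtemp(prefix="budd-"))  # one temp dir per request
    try:
        errors = []
        for upload_id, upload_date in uploads:
            try:
                data = fetch_and_decrypt(upload_id)
            except Exception as exc:
                # Record the failure and keep going with the other uploads.
                errors.append(f"{upload_id}: {exc}")
                continue
            # Individual files are named YYYY-MM-DD-UploadId.zip.
            name = f"{upload_date.isoformat()}-{upload_id}.zip"
            (temp_dir / name).write_bytes(data)

        if errors:
            (temp_dir / "error.log").write_text("\n".join(errors) + "\n")

        master_name = (
            f"userdata-{start_date.isoformat()}-to-{end_date.isoformat()}"
            f"-{uuid.uuid4()}.zip"
        )
        master_path = output_dir / master_name
        with zipfile.ZipFile(master_path, "w", zipfile.ZIP_DEFLATED) as master:
            for path in sorted(temp_dir.iterdir()):
                master.write(path, arcname=path.name)
        return master_path
    finally:
        # Delete the temp files and temp directory, even if packaging failed.
        shutil.rmtree(temp_dir, ignore_errors=True)
```

Collecting errors instead of aborting on the first failure mirrors the behavior described above: the user still receives every upload that could be processed, plus a log of what went wrong.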
Development
Local Development
Create a fork from https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService. Follow the steps in https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService/blob/develop/README.md. (If you only plan to run the code and not edit it, you should be able to pull from the root fork directly.)
Deploy to Dev
Submit your code changes to your own personal fork. Create a pull request to the root fork. Once the pull request has been merged, Travis will automatically build and deploy to the dev server on Elastic Beanstalk.
Deploy to Staging/Prod
- Create a workspace from the root fork (https://github.com/Sage-Bionetworks/BridgeUserDataDownloadService) if you don't already have one.
- Make sure all branches are up to date (git pull as necessary).
- Switch to the staging branch (git checkout uat) and merge from develop (git merge --ff-only develop).
- Push back to GitHub (git push). This should trigger Travis to automatically build and deploy to the staging server on Elastic Beanstalk.
Follow similar steps for Prod.
Rolling Back Deployments
- Log into the AWS Console and go to the Elastic Beanstalk Dashboard.
- In the top nav bar drop down, go to Application Versions.
- You'll see a list of versions named travis-[git commit hash]-[timestamp in epoch seconds]. Check the version you want to roll back to and click Deploy.
- Select the environment from the drop down and click Deploy.
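When choosing a version to roll back to, it can help to decode the version label into its commit hash and build time. The sketch below parses the travis-[git commit hash]-[timestamp in epoch seconds] format described above. The parse_version_label helper and the sample label are hypothetical, and the pattern assumes the commit hash is lowercase hexadecimal.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple


class AppVersion(NamedTuple):
    commit_hash: str
    built_at: datetime


# Assumes a lowercase hex commit hash followed by an epoch-seconds timestamp.
_LABEL_PATTERN = re.compile(r"^travis-([0-9a-f]+)-(\d+)$")


def parse_version_label(label: str) -> AppVersion:
    """Split a version label into its commit hash and UTC build time."""
    match = _LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognized version label: {label!r}")
    built_at = datetime.fromtimestamp(int(match.group(2)), tz=timezone.utc)
    return AppVersion(commit_hash=match.group(1), built_at=built_at)
```

For example, parse_version_label("travis-0123abc-1440000000") yields the commit hash 0123abc and a build time of 2015-08-19 16:00:00 UTC, which you can compare against the git log before deploying.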
Access Logs
- Log into the AWS Console and go to the Elastic Beanstalk Dashboard.
- Select the environment you want to view logs for.
- Click on Logs in the left nav bar.
- In the drop down (top right), click Request Logs.
- Last 100 Lines will give you a link to a page with the logs on screen.
- Full Logs will give you a link to a zip file you can download.
Logging Into BUDD Hosts
- Log into the AWS Console and go to the EC2 Dashboard.
- Click on Instances in the left nav bar.
- In the table, find the host(s) with the name Bridge-UDD-Worker-Dev (or whatever environment you want to log into). Select that host. (If there's more than one in the environment you want, select just one host.)
- In the information panel on the bottom, find the field Public DNS host. This is the hostname you want to SSH into. But first, you'll need the PEM file to log in.
- Log into belltown and download the PEM files from /work/platform/PasswordsAndCredentials
- On your machine, run ssh -i [path to PEM file] ec2-user@[EC2 hostname]
You can save yourself some time with an entry in your ~/.ssh/config that looks like
Host Bridge-UDD-Dev
    HostName ec2-52-20-91-245.compute-1.amazonaws.com
    User ec2-user
    IdentityFile ~/Bridge-UDD-Dev.pem
Now you can simply run ssh Bridge-UDD-Dev.
Next Steps
Short/Medium-Term
- BRIDGE-735 - Add User Data Download to iOS SDK
- BRIDGE-761 - Log archiving and alarms
- BRIDGE-762 - Monitoring
- BRIDGE-763 - Refactor shared copy-pasted code into shared package
- BRIDGE-764 - Audit IAM credentials
- BRIDGE-765 - Move Stormpath keys from env vars to key management solution
Long-Term
- Performance improvements - Multi-threading? Map-Reduce?
- Web Portal - Better user interface than email?
- Data visualization - More useful than raw JSON dump
- Caching/De-duping - If the user requests the same data again, use the existing master zip file instead of generating a new one. Also helpful if their link expires and they want to get the data again.
- Cleanup task to delete old user requests and user-requested data?
See Also
Original design doc: Design Doc