Transferring Egress Costs to Data Accessor
Transferring data out of S3 incurs egress charges of roughly $0.09 per GB (https://aws.amazon.com/s3/pricing/), so downloading 10 TB costs about $900. In multiple collaborations we need to host data and make it accessible while protecting ourselves from egress costs. Two technical approaches are described below, followed by a third option that has not yet been worked out in detail. In each case assume:
the data has been contributed by a provider who declines to pay egress charges;
the data has been placed in a 'private' bucket, linked to Synapse (so custom bucket settings are a possibility);
data accessors are to navigate content through Synapse (using folders or file views);
data accessors need to use the data in compute locations of their choice, including their institution's compute center;
simplifying the steps necessary to download data will increase data accessors' willingness to follow the process we require.
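As a quick check of the egress estimate given in the introduction, the arithmetic can be sketched as follows. This assumes the flat $0.09 per GB rate quoted above and decimal units (1 TB = 1,000 GB); actual AWS pricing is tiered and varies by region, and the function name is illustrative only.

```python
# Flat rate assumed from the text above; real AWS pricing is tiered and region-dependent.
EGRESS_USD_PER_GB = 0.09


def estimate_egress_cost(size_gb: float, usd_per_gb: float = EGRESS_USD_PER_GB) -> float:
    """Return the estimated egress charge in USD, rounded to cents."""
    return round(size_gb * usd_per_gb, 2)


if __name__ == "__main__":
    # 10 TB taken as 10,000 GB (decimal units).
    print(estimate_egress_cost(10_000))
```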
1. Region restriction
The private bucket is configured to disallow egress from its region, so if the bucket is in us-east-1, data can only be moved to a bucket or EC2 instance in us-east-1. An STS folder is configured in the bucket so that data accessors can obtain S3 access to it through Synapse.
The data accessor does the following:
in an AWS account that is prepared to pay egress charges, creates a bucket in the same region as the data bucket;
uses Synapse to find a file of interest and obtain the underlying S3 file key;
uses the STS feature to get S3 access to the folder, then uses the S3 client to copy the file to the second bucket;
downloads the file from the second bucket using the S3 client, paying egress charges;
optionally sets a retention policy on their bucket to delete the duplicated files after a short time period, avoiding excessive storage costs.
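The copy, download, and retention steps above might look roughly like the sketch below. It is not a tested procedure: the function names are hypothetical, and `copy_s3` and `own_s3` stand for boto3 S3 clients created by the caller. It assumes the credentials behind `copy_s3` can both read the data bucket folder (via the STS feature) and write to the accessor's bucket; how that write permission is arranged is not specified here.

```python
from typing import Any


def lifecycle_expire_after(days: int) -> dict[str, Any]:
    """Lifecycle configuration that expires every object `days` days after creation."""
    return {
        "Rules": [
            {
                "ID": f"expire-staged-copies-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix applies the rule to the whole bucket
                "Expiration": {"Days": days},
            }
        ]
    }


def stage_and_download(copy_s3: Any, own_s3: Any, data_bucket: str, key: str,
                       staging_bucket: str, local_path: str) -> None:
    """Two-hop download: same-region copy first, then a download the accessor pays for."""
    # Hop 1: server-side copy into the accessor's bucket (same region, so no egress).
    copy_s3.copy_object(
        Bucket=staging_bucket,
        Key=key,
        CopySource={"Bucket": data_bucket, "Key": key},
    )
    # Hop 2: download from the accessor's own bucket; egress is billed to their account.
    own_s3.download_file(Bucket=staging_bucket, Key=key, Filename=local_path)


def set_staging_expiry(own_s3: Any, staging_bucket: str, days: int = 7) -> None:
    """Apply the retention policy so duplicated files do not accumulate."""
    own_s3.put_bucket_lifecycle_configuration(
        Bucket=staging_bucket,
        LifecycleConfiguration=lifecycle_expire_after(days),
    )
```

Passing the clients in as arguments keeps the sketch independent of how credentials are obtained; in practice `own_s3` could be `boto3.client("s3")` for the accessor's account, and `copy_s3` a client built from the temporary credentials Synapse issues.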
Pros:
Access controls on the data bucket (ACLs and access restrictions) are enforced by Synapse as with any other file.
Cons:
Complexity: Getting each file requires two steps.
Duplication: Each downloaded file is copied into the downloading user's bucket, so the number of duplicate S3 objects is the number of files downloaded times the number of users downloading them.
Performance: The "two hop" nature of the solution slows downloads.
2. Requester pays
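The data bucket is configured as Requester Pays (request payment Payer=Requester rather than the default, BucketOwner), so the requester's AWS account is charged for data transfer.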
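As a rough illustration of this setting, the sketch below uses the S3 `put_bucket_request_payment` call through a boto3 S3 client supplied by the caller. The helper names are hypothetical; the second helper shows the `RequestPayer` parameter that a requester must include on each request to a Requester Pays bucket.

```python
from typing import Any


def enable_requester_pays(s3: Any, bucket: str) -> None:
    """Switch the bucket so that requesters, not the owner, pay for transfer."""
    s3.put_bucket_request_payment(
        Bucket=bucket,
        RequestPaymentConfiguration={"Payer": "Requester"},
    )


def requester_pays_get_params(bucket: str, key: str) -> dict[str, str]:
    """Parameters for a GetObject call that acknowledges the requester is billed."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}
```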
The data accessor does the following:
in an AWS account that is prepared to pay egress charges, creates a role that is allowed to download data from the data bucket and that trusts Synapse to assume it;
in Synapse, associates this role with their account and the data bucket (the bucket owner then adds the role to the bucket's policy);
uses Synapse to find a file of interest and download the file.
Synapse assumes the data accessor's role (with ExternalId set to their user ID) and creates a presigned URL to download the file. Egress costs are charged to the data accessor's AWS account.
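The role and bucket policy involved in this setup could be shaped roughly as in the sketch below, which only builds the policy documents as Python dictionaries. The function names and the `synapse_principal_arn` parameter are placeholders: the actual Synapse principal to trust is not given here and would have to be supplied.

```python
import json


def role_trust_policy(synapse_principal_arn: str, external_id: str) -> dict:
    """Trust policy for the accessor's role: Synapse may assume it with this ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": synapse_principal_arn},  # placeholder ARN
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


def bucket_policy_statement(role_arn: str, bucket: str) -> dict:
    """Statement the bucket owner adds so the accessor's role can read objects."""
    return {
        "Sid": "AllowAccessorRoleDownload",
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "s3:GetObject",
        # Bucket-wide read: the role is not limited to what Synapse would share.
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }


if __name__ == "__main__":
    # All identifiers below are made-up examples.
    print(json.dumps(role_trust_policy("arn:aws:iam::111111111111:root", "1234567"), indent=2))
    print(json.dumps(
        bucket_policy_statement("arn:aws:iam::222222222222:role/accessor", "data-bucket"),
        indent=2,
    ))
```

When Synapse then generates the presigned URL with the assumed role, the request would also need to carry the requester-pays acknowledgement (`RequestPayer="requester"` in boto3 terms) for S3 to accept it.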
Pros:
Simplicity: The downloaded file is not copied. There is no extra file handle or file entity created. All Synapse features for data download can be used.
Cons:
Access control: Once set up, the collaborator has access to the entire bucket in S3. Synapse's sharing settings and governance controls limit the creation of presigned URLs but do not limit the data accessor's ability to access the bucket directly through S3.
3. AWS Marketplace metering
Another way would be to have customers subscribe to a Sage SaaS listing in the AWS Marketplace. You could use the Marketplace metering API to bill customers for accessing data, charging only enough to recoup your costs (for example, customer Dave accessed 10 GB in the last hour, so at $0.09 per GB we bill him $0.90). Marketplace would add the Sage charges to the customer's AWS bill.
Details: TBD
Pros: TBD
Cons: TBD
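Pending those details, the metering call might be sketched as below. This is only an assumption about how the option could work: it uses the Marketplace Metering `batch_meter_usage` operation through a boto3 `meteringmarketplace` client supplied by the caller, and the dimension name `egress_gb`, the product code, and the helper names are all hypothetical.

```python
import math
from datetime import datetime, timezone
from typing import Any, Optional


def usage_record(customer_identifier: str, gigabytes: float,
                 dimension: str = "egress_gb",
                 when: Optional[datetime] = None) -> dict[str, Any]:
    """Build one usage record; quantities are whole units, so round GB up."""
    return {
        "Timestamp": when or datetime.now(timezone.utc),
        "CustomerIdentifier": customer_identifier,
        "Dimension": dimension,  # hypothetical dimension defined on the listing
        "Quantity": math.ceil(gigabytes),
    }


def report_usage(metering: Any, product_code: str, records: list[dict[str, Any]]) -> Any:
    """Send the records to AWS Marketplace, which adds the charge to each customer's bill."""
    return metering.batch_meter_usage(UsageRecords=records, ProductCode=product_code)
```

The price per unit would be set on the Marketplace listing rather than in this code, so the sketch reports only the amount of data accessed.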