Streaming data from an Aspera host to an S3 bucket
One way of executing a large data transfer (e.g. for a sequencing project) is to stream the data directly from the data provider to an S3 bucket, without first downloading it. The data is never written to the machine running the transfer; it goes straight to the S3 bucket. This requires that the data provider has an Aspera license. Reportedly this works equally well for streaming to Google Cloud Storage, but I have not tested it.
Step-by-step guide
To run the data transfer from an AWS EC2 machine:
- Create an EC2 instance with permissions to access the S3 bucket.
- Install the Aspera software.
- Copy the Aspera license to /opt/aspera/etc/aspera-license.
- Install the AWS CLI.
- Check that you can access the S3 bucket, for example by running "aws s3 ls s3://YOUR_BUCKET/".
- Run the data copy:
/opt/aspera/bin/astrap-config.sh enable
ascp -A
ascp -k 3 -v -p -l400m --mode recv --user="USERNAME" --host=ASPERA.HOST.ADDRESS DIRECTORY_TO_DOWNLOAD s3://s3.amazonaws.com/YOUR_BUCKET/
Where
- USERNAME is your login ID on the Aspera server.
- ASPERA.HOST.ADDRESS is the address of the Aspera server.
- DIRECTORY_TO_DOWNLOAD is the name of the file or directory you want to copy to S3.
- YOUR_BUCKET is the name of the S3 bucket to which the data will be copied.
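As a rough sketch, the bucket check and the copy step could be wrapped in a small script so the placeholders are filled in from one place. The variable names (ASPERA_USER, ASPERA_HOST, SOURCE_PATH, BUCKET), the DRY_RUN switch, and the build_ascp_cmd helper are my own illustrative choices and are not part of Aspera or the AWS CLI. The ascp flags are copied unchanged from the command above. By default the script only prints the command it would run.

```shell
#!/bin/sh
# Sketch only. The variable names, DRY_RUN, and build_ascp_cmd are
# illustrative assumptions; the ascp flags are taken from the guide as-is.

ASPERA_USER="${ASPERA_USER:-USERNAME}"
ASPERA_HOST="${ASPERA_HOST:-ASPERA.HOST.ADDRESS}"
SOURCE_PATH="${SOURCE_PATH:-DIRECTORY_TO_DOWNLOAD}"
BUCKET="${BUCKET:-YOUR_BUCKET}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = run the transfer

# Print the transfer command with the current values filled in (display only).
build_ascp_cmd() {
    printf 'ascp -k 3 -v -p -l400m --mode recv --user=%s --host=%s %s s3://s3.amazonaws.com/%s/\n' \
        "$ASPERA_USER" "$ASPERA_HOST" "$SOURCE_PATH" "$BUCKET"
}

if [ "$DRY_RUN" = "1" ]; then
    build_ascp_cmd
else
    # Confirm the instance can list the bucket before starting the transfer.
    if aws s3 ls "s3://$BUCKET/" >/dev/null; then
        # Setup commands as listed in the guide.
        /opt/aspera/bin/astrap-config.sh enable
        ascp -A
        # The transfer itself, writing directly to the bucket.
        ascp -k 3 -v -p -l400m --mode recv \
            --user="$ASPERA_USER" --host="$ASPERA_HOST" \
            "$SOURCE_PATH" "s3://s3.amazonaws.com/$BUCKET/"
    else
        echo "Cannot list s3://$BUCKET/; check the instance permissions." >&2
    fi
fi
```

Running it with the defaults prints the command for review. Setting DRY_RUN=0 along with the four values in the environment would run the bucket check and then the transfer.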
Note that I was provided a temporary copy of the full Aspera software and license in order to test this procedure. At the time (fall 2014), I was told by the CEO that the company was interested in offering a product for this type of transfer.
I tested this procedure from an Amazon EC2 instance using ami-8997afe0. I do not know what performance to expect when running it from belltown or other Hutch local compute.