
Aspera Transfer to EC2

Purpose
This page documents the setup and commands for transferring data from Aspera to an AWS EC2 instance (provisioned from the Sage Service Catalog) and then on to Synapse.

 Instructions

Provision an EC2 instance on Sage Service Catalog

  1. If you have not used the Service Catalog before, follow the instructions to provision an EC2 instance within the Service Catalog.

Set up an ‘EC2 Linux Docker’ instance of type m5n.4xlarge with 5 TB of storage (the current maximum), and add the tags ‘SysBio’ and ‘amp-aim’.



Installing Aspera Connect on an EC2 instance

  1. Go to your EC2 instance in the Service Catalog by clicking on “Provisioned Products”, select your EC2 instance, scroll to the bottom of the Provisioned product details page and click on “Connection URI”. 

NOTE: The EC2 instance can take a minute to set up.  You may need to refresh the page to see where to spin up the shell session in the browser with a “Connection URI”.

  1. Change users to become the “ec2-user” as:
    sudo su - ec2-user

  2. Run the command to download Aspera:
    wget -qO- https://d3gcli72yxqn2z.cloudfront.net/connect_latest/v4/bin/ibm-aspera-connect_4.1.0.46-linux_x86_64.tar.gz | tar xvz

    where the URL is the download link for the version of Aspera Connect to install on the EC2 instance.

     

NOTE: Additional versions of Aspera Connect can be found on the All Aspera Connect Installers page. Check “Change download options” in the upper right-hand corner to make sure HTTPS is selected as the transfer type. Copy the download link and use it in the “wget” command on the EC2 instance. You may be prompted to create an account to access the installer page. If so, the account is free and registration is straightforward.
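If you pick a different installer link, the name of the install script used in the next steps changes with it. The sketch below derives that name from the download URL. It assumes the tarball unpacks to a .sh file with the same base name (check with ls after extracting); installer_script_from_url is a hypothetical helper, not part of Aspera Connect.

```shell
# Hypothetical helper: derive the installer script name from a download URL.
# Assumption: the tarball unpacks to "<tarball base name>.sh".
installer_script_from_url() (
  url="$1"
  tarball="${url##*/}"                     # keep only the text after the last "/"
  printf '%s\n' "${tarball%.tar.gz}.sh"    # swap the .tar.gz suffix for .sh
)

installer_script_from_url "https://d3gcli72yxqn2z.cloudfront.net/connect_latest/v4/bin/ibm-aspera-connect_4.1.0.46-linux_x86_64.tar.gz"
```

The printed name is what you would substitute for <YOUR-IBM-CONNECT-VERSION>.sh, provided the assumption about the archive contents holds.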

  1. Once the IBM Aspera Connect archive has been downloaded and extracted, run the following commands to install Aspera Connect and tmux:

    1. Make the shell script executable:
      chmod +x <YOUR-IBM-CONNECT-VERSION>.sh

    2. Run the shell script
      ./<YOUR-IBM-CONNECT-VERSION>.sh
      NOTE: The script printed an error message when we ran it, but ascp was still installed.

    3. Add the Aspera Connect binaries to your PATH, both for the current session and for future logins:
      export PATH=$PATH:~/.aspera/connect/bin/
      echo 'export PATH=$PATH:~/.aspera/connect/bin/' >> ~/.bash_profile

    4. Type ascp to see if the command is recognized

    5. Add your password as an environment variable so you don’t need to type it manually after each ascp call:
      export ASPERA_SCP_PASS=yourPasswordHere

    6. To install tmux (which keeps a long job running if your connection is interrupted), enter the following:
      sudo yum install tmux
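Before starting a transfer, it can help to confirm that the installation steps above took effect. This is a minimal sketch; missing_tools is a hypothetical helper that only checks whether each named command can be found on the PATH.

```shell
# Hypothetical helper: print the names of any commands not found on PATH.
missing_tools() (
  missing=""
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  printf '%s\n' "${missing# }"             # drop the leading space
)

missing=$(missing_tools ascp tmux)
if [ -n "$missing" ]; then
  echo "Not on PATH: $missing"
else
  echo "ascp and tmux found"
fi
```

If ascp is reported as missing, re-check the PATH export and open a new login shell so that ~/.bash_profile is read.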

Transfer the Data from Aspera to EC2

  1. Go to this document that tracks the progress of the data transfer and choose a folder that is not yet transferred.

  2. Using the web-based Aspera client, navigate to a high-level folder and start downloading it, then open the ‘Transfer Monitor’ (leftmost button) to see the total size of the folder’s contents. Cancel the download as soon as you have the size. If the folder is smaller than 5 TB, proceed. Otherwise, navigate down the hierarchy and transfer individual subfolders that are each smaller than 5 TB.
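The size check in step 2 can be repeated on the EC2 side by comparing the folder size against the space actually free on the instance. The sketch below is one way to do that; has_room is a hypothetical helper, and the 2 TB figure is only an illustrative value for what the Transfer Monitor might report.

```shell
# Hypothetical helper: succeed if DIR's filesystem has at least NEEDED_KB free.
# Usage: has_room NEEDED_KB [DIR]
has_room() (
  needed_kb="$1"
  dir="${2:-.}"
  # df -Pk prints one line per filesystem in 1024-byte blocks;
  # column 4 of the second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge "$needed_kb" ]
)

# Illustrative value: a folder reported as 2 TB, expressed in 1024-byte blocks.
needed_kb=$((2 * 1024 * 1024 * 1024))
if has_room "$needed_kb" .; then
  echo "enough space"
else
  echo "not enough space; pick a smaller folder"
fi
```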



  1. Run the command to transfer the data within a “tmux” session to keep the connection alive once you are logged out of the server. To do this, just enter tmux and a new screen will appear where you can enter the ascp command.

  2. Run the ascp command to transfer the data as:

tmux
ascp --mode recv -P 33001 --user=<YOUR-ASPERA-USERNAME> --host=aspera-immport.niaid.nih.gov <PATH-TO-TRANSFER-FILE> <LOCATION-TO-TRANSFER-TO>

Note that it is typically helpful to ‘compose’ your command in a text file to check it first, and then you also have a record of what you entered.
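One way to ‘compose’ the command first is to print it instead of running it. The sketch below does that with a hypothetical compose_ascp helper. It only echoes the command text, using the same options as the template above, so nothing is transferred.

```shell
# Hypothetical helper: print (do not run) an ascp command for review.
# $1 = Aspera username, $2 = remote path, $3 = local destination
compose_ascp() {
  printf 'ascp --mode recv -P 33001 --user=%s --host=aspera-immport.niaid.nih.gov "%s" "%s"\n' "$1" "$2" "$3"
}

# Placeholders are left as-is; replace them with your own values.
compose_ascp "<YOUR-ASPERA-USERNAME>" "<PATH-TO-TRANSFER-FILE>" "."
```

Redirecting the output to a text file (for example with >> and a file name of your choice) gives you the record of what you entered.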


Initial working example to transfer data to current directory:

ascp --mode recv -P 33001 --user=aspera.twhetzel --host=aspera-immport.niaid.nih.gov "AMP_RA_SLE.Phase2/Histology/PHASE 2 301/301-157.svs" .

Preferred command, which includes the “-l” parameter to set the target transfer rate (in Kbps):

tmux
ascp --policy=fixed -l 300000 --mode recv -P 33001 --user=aspera.dwebster --host=aspera-immport.niaid.nih.gov "AMP_RA_SLE.Phase1/RA" .
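To decide whether a folder is worth starting late in the day, a rough duration estimate can be useful. The sketch below assumes the “-l” value is in Kbps, that the rate is sustained for the whole transfer, and that sizes use decimal units (1 GB = 8,000,000 kilobits). estimate_hours is a hypothetical helper, and real transfers may be slower.

```shell
# Hypothetical helper: whole hours (rounded up) to move SIZE_GB at RATE_KBPS.
# Usage: estimate_hours SIZE_GB RATE_KBPS
estimate_hours() (
  size_gb="$1"
  rate_kbps="$2"
  seconds=$(( size_gb * 8000000 / rate_kbps ))   # kilobits divided by Kbps
  echo $(( (seconds + 3599) / 3600 ))            # round up to whole hours
)

# A full 5 TB (5000 GB) volume at the rate used above
estimate_hours 5000 300000   # prints 38
```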



  1. Check the size of the data transferred to the EC2 instance using the command:

du -sh * | sort -h

(Check remaining space on EC2 instance: df -h)

 

 

EC2 Instance to Synapse Transfer

  1. Install the develop version of the synapse client
    pip3 install git+https://github.com/Sage-Bionetworks/synapsePythonClient@develop

  2. Remove characters from filenames that will disrupt the upload to Synapse by finding and renaming the affected files within a given high-level directory (e.g. ‘RA’). The commands below produce no output. To preview which files contain these characters, run the same commands without ‘-exec’ and everything after it.

    find 'RA' -iname "*#*" -exec rename '#' '_num_' '{}' \;
    find 'RA' -iname "*&*" -exec rename '&' '_and_' '{}' \;
    find 'RA' -iname "*[*" -exec rename '[' '_bracketF_' '{}' \;
    find 'RA' -iname "*]*" -exec rename ']' '_bracketR_' '{}' \;

  3. Create the high-level folder you are going to transfer on Synapse using the Synapse web UI (folder tools). Make sure it has the same name as the folder on Aspera that you transferred to your EC2 instance. The synID of the folder you just created is the ‘parentID’ in the following command. This is NOT the overall Synapse ID of the project shown in the page header. Use the folder’s own synID, which is the one the red arrow points to in the screenshot.
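The filename clean-up in step 2 can be previewed without touching any files. The sketch below applies the same four substitutions to a name passed as text. sanitize_name is a hypothetical helper, and unlike the rename calls above it replaces every occurrence of each character, so treat its output as a preview rather than an exact prediction.

```shell
# Hypothetical helper: show what a filename would look like after clean-up.
# Uses the same replacement tokens as the find/rename commands in step 2.
sanitize_name() {
  printf '%s\n' "$1" | sed \
    -e 's/#/_num_/g' \
    -e 's/&/_and_/g' \
    -e 's/\[/_bracketF_/g' \
    -e 's/]/_bracketR_/g'
}

sanitize_name 'slide#3 [A&B].svs'
```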

 

  1. Generate the manifest file that will be used to catalog what is uploaded to Synapse:
    synapse manifest --parent-id syn012345678 --manifest-file manifest.tsv /path/to/folder/

    A concrete example:
    synapse manifest --parent-id syn26324313 --manifest-file manifest_phase1_clinicalData.tsv 'Clinical Data'

  2. If you are not already in a tmux session, start one, then run the synapse sync command with your manifest file:
    tmux
    synapse sync manifest_phase1_clinicalData.tsv
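Before starting a long sync, a quick check of the manifest header can catch a malformed file early. The sketch below assumes the manifest is tab-separated with a header row that includes ‘path’ and ‘parent’ columns. manifest_missing_columns is a hypothetical helper and not part of the synapse client.

```shell
# Hypothetical helper: print any expected column names absent from the
# header row of a tab-separated manifest.
# Usage: manifest_missing_columns FILE COLUMN...
manifest_missing_columns() (
  file="$1"
  shift
  header=$(head -n 1 "$file")
  missing=""
  for col in "$@"; do
    # Split the header on tabs and look for an exact, whole-line match
    printf '%s\n' "$header" | tr '\t' '\n' | grep -qx "$col" || missing="$missing $col"
  done
  printf '%s\n' "${missing# }"
)

# Example usage (file name taken from the step above):
#   manifest_missing_columns manifest_phase1_clinicalData.tsv path parent
```

An empty result means every named column was found in the header.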

 

You can leave the session by closing the browser window. To check on progress later, log back in to the instance, switch to ec2-user again (sudo su - ec2-user), reattach to the session with tmux attach, and use the df and top commands to monitor the transfer.

  1. To confirm that you’ve transferred everything from Aspera, run the following command, which produces a file manifest that serves as an ‘auditable record’ for comparison:
    ascp --policy=fixed -l 300000 --mode recv -P 33001 -k 3 --file-manifest aspera_manifest_clinicalData.txt --user=aspera.dwebster --host=aspera-immport.niaid.nih.gov "AMP_RA_SLE.Phase1/Clinical Data" .
    Note that you can use -k 2 if the manifest creation is taking too long

  2. The above command will generate a file in the current directory that looks something like the following: aspera-transfer-f7479cee-f568-45fd-b543-90cd281c06c5.manifest.txt

    Then rename the file to have the same file naming convention as the previous manifest, with this command:
    mv aspera-transfer-f7479cee-f568-45fd-b543-90cd281c06c5.manifest.txt aspera_manifest_phase1_clinicalData.txt

  3. Once you have completed your transfers to Synapse, enter the information in the Transfer progress doc

  4. Create a manifest file of your manifest files, and sync these to Synapse in the following folder: syn26338257

  5. Add the synIDs for the manifest and the Aspera transfer manifest to this Google Doc
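As an extra sanity check before recording a transfer as complete, the number of rows in the Synapse manifest can be compared with the number of files on disk. The sketch below assumes a manifest with a single header row and one file per line. count_manifest_rows and count_files are hypothetical helpers, and filenames containing newlines would throw off the file count.

```shell
# Hypothetical helper: number of non-empty data rows (header excluded).
count_manifest_rows() {
  tail -n +2 "$1" | grep -c .
}

# Hypothetical helper: number of regular files under a directory.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Example usage (names taken from the steps above):
#   count_manifest_rows manifest_phase1_clinicalData.tsv
#   count_files 'Clinical Data'
```

If the two numbers differ, regenerate the manifest or investigate the missing files before updating the tracking documents.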




Ref. ONGOING: Data Transfer Technical Issues Learning Document