Content Comparison

...

Jira tickets:

Jira Legacy

server	System JIRA
columns	key,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
serverId	ba6fb084-9827-3160-8067-8ac7470f78b2
key	PLFM-5314

Jira Legacy

server	System JIRA
serverId	ba6fb084-9827-3160-8067-8ac7470f78b2
key	PLFM-5315

Background Jiras:

Jira Legacy

server	System JIRA
columns	key,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
maximumIssues	20
jqlQuery	key in (PLFM-5009, PLFM-5082, PLFM-5085, PLFM-5108, PLFM-5201, PLFM-5108)
serverId	ba6fb084-9827-3160-8067-8ac7470f78b2

Background

See Use Case 1 in "Service to migrate content at the S3/Storage Level: Use Cases" for use case notes.

In Synapse, the top 10 projects by file size account for about three quarters of the data in our S3 bucket. These projects can be very expensive to store, so we need a way to determine per-project costs. By moving this content into new buckets, we can leverage AWS Cost Allocation Tags to facet S3 costs by bucket (and thus, by groups of content).

Using file metadata, we can determine approximate size very easily. Egress is more difficult to determine, but per the analysis in PLFM-5009, storage accounts for approximately 80% of our bill. There is currently no need to be especially precise, so we may simply ignore egress for now. For the moment, the cost of egress can be assumed to be distributed proportionally to the cost of storage.
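The proportional assumption above can be sketched as follows. This is a minimal illustration, not part of the proposed service: the function name, project IDs, and bill amount are hypothetical, and the whole bill (storage plus egress) is simply split by each project's share of stored bytes.

```python
from typing import Dict


def allocate_costs(sizes_bytes: Dict[str, int], total_bill: float) -> Dict[str, float]:
    """Split a total bill across projects in proportion to stored bytes.

    Egress is not measured; it is assumed to follow the same proportions
    as storage, as described in the text.
    """
    total_bytes = sum(sizes_bytes.values())
    if total_bytes == 0:
        # Nothing is stored, so there is nothing to attribute.
        return {project: 0.0 for project in sizes_bytes}
    return {
        project: total_bill * size / total_bytes
        for project, size in sizes_bytes.items()
    }


# Hypothetical numbers, for illustration only.
example = allocate_costs({"syn1": 750, "syn2": 250}, total_bill=1000.0)
```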

API

Create a new asynchronous Storage Report service in Synapse that can be used by members of a "Synapse Reports Team", a bootstrapped team whose users have the authority to create these reports.

A member of the Reports Team can make an asynchronous query to retrieve a CSV report about the usage of the Synapse S3 bucket with project-level resolution across all projects.

A member of the Cost Allocation Team can create new cost allocations with a descriptive name (e.g. ampad, nih-r01-12345) that matches how costs should be broken down in Synapse. When a new cost allocation is created, a new bucket is provisioned with an AWS Cost Allocation Tag. After a cost allocation is created, a project can be assigned to it. Cost allocations can have many projects, but a project can be assigned to at most one cost allocation. When a project is assigned to a cost allocation, the underlying files associated with that project will eventually be moved to the new S3 bucket, and the file handles will be updated. After all files have been moved, the AWS bill and cost explorer tools can facet costs by cost allocation, as desired.

...

Verb	URI	Request body	Response body	Notes
GET	/costAllocation	None	PaginatedList<CostAllocation>	Lists all existing CostAllocations.
POST	/entity/{entityId}/costAllocation	name: String	CostAllocation (id: String; name: String; bucket: String; eTag: String; createdBy: Long; createdOn: Date)	Associates a project with a cost allocation. The underlying files that are associated with the project and in Synapse storage (either the default bucket or a different cost allocation) will be moved to the provisioned storage for the cost allocation. Name is case-insensitive (will be coerced to lowercase) and can include alphanumeric characters, "-", ".", and "_".
DELETE	/entity/{entityId}/costAllocation	None	None	Removes the cost allocation tied to a project. The contents of the project that are in the cost allocation storage location will be moved to the default Synapse storage. After all of the contents have been moved, the project is removed from the cost allocation.
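The naming rule for cost allocations (case-insensitive, coerced to lowercase, limited to alphanumerics, "-", ".", and "_") could be checked roughly as below. This is a sketch under assumptions: the function name is hypothetical, "alphanumeric" is taken to mean ASCII letters and digits, and empty names are rejected, none of which the design states explicitly.

```python
import re

# Assumption: only ASCII letters and digits count as alphanumeric.
_ALLOWED_NAME = re.compile(r"[a-z0-9._-]+")


def normalize_cost_allocation_name(name: str) -> str:
    """Lowercase the name and reject characters outside the allowed set."""
    lowered = name.lower()
    if not _ALLOWED_NAME.fullmatch(lowered):
        raise ValueError(f"Invalid cost allocation name: {name!r}")
    return lowered
```

Coercing to lowercase before validating means "AMPAD" and "ampad" resolve to the same cost allocation.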

Implementation Details

Workflow

This gives a general overview of the process required to apply a cost allocation to a project.

A user applies a cost allocation to the project. If the cost allocation does not exist, a new bucket is created. If the bucket is successfully created or if the cost allocation exists, the call succeeds for the user. A successful call triggers the next step.
A worker collects the file handles in that project that are stored in Synapse storage and are not in the cost allocation. These file handles are sent to another worker to accomplish the next step.
A worker is given a file handle and a destination bucket. If the file is not already in that bucket, it is copied to the destination bucket. If the copy is successful, the file handle is modified to point to the new location. The old file is archived and marked for later deletion.
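The last two steps of this workflow can be sketched with an in-memory stand-in for S3. Everything here is hypothetical (the `FileHandle` fields, `FakeStorage`, and both function names); it only illustrates the ordering the steps describe: copy first, repoint the file handle only after the copy succeeds, then mark the old object for later deletion.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class FileHandle:
    id: str
    bucket: str
    key: str


@dataclass
class FakeStorage:
    """In-memory stand-in for S3: maps (bucket, key) to object bytes."""
    objects: Dict[Tuple[str, str], bytes] = field(default_factory=dict)
    pending_deletion: Set[Tuple[str, str]] = field(default_factory=set)

    def copy(self, src_bucket: str, key: str, dst_bucket: str) -> None:
        # Raises KeyError if the source object is missing, i.e. the copy fails.
        self.objects[(dst_bucket, key)] = self.objects[(src_bucket, key)]


def files_to_move(handles: List[FileHandle], synapse_buckets: Set[str],
                  destination_bucket: str) -> List[FileHandle]:
    """Select handles stored in Synapse storage but not yet in the destination."""
    return [h for h in handles
            if h.bucket in synapse_buckets and h.bucket != destination_bucket]


def move_file(handle: FileHandle, destination_bucket: str,
              storage: FakeStorage) -> bool:
    """Copy one file to the destination bucket and repoint its handle.

    Returns False when the file is already in the destination bucket.
    """
    if handle.bucket == destination_bucket:
        return False
    old_location = (handle.bucket, handle.key)
    storage.copy(handle.bucket, handle.key, destination_bucket)
    # Reached only if the copy succeeded.
    handle.bucket = destination_bucket
    storage.pending_deletion.add(old_location)
    return True
```

Because `move_file` is a no-op for files that are already in place, re-running the workers after a partial failure is safe, which is what makes the process eventually consistent.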

Verb	URI	Request body	Response body	Notes
GET	/storageReport/csv/async/get/{token}	None	DownloadStorageReportResponse (resultsFileHandleId: String; timestamp: Date)	Gets an object containing a file handle that points to a Storage Report CSV. The caller can download the CSV with the file handle ID.
POST	/storageReport/csv/async/start	DownloadStorageReportRequest (type: Enum (ALL_PROJECTS))	AsyncJobId	Initiates a job to create a CSV report of the sizes of projects in Synapse (where size is usage of the Synapse S3 bucket). Specifying ALL_PROJECTS creates a report about all projects. The enum allows requests for different types of reports (for example, project groups, if that gets implemented).
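A caller of the two storage report endpoints would start a job and then poll for the result. The sketch below shows that flow with an injected transport function so that it runs without a network. Several details are assumptions not stated in this document: the `send` callable and its signature, the `token` field on AsyncJobId, and the convention that `send` returns `None` while the job is still running.

```python
import time
from typing import Callable, Optional

# Hypothetical transport: (method, path, json_body) -> parsed JSON, or None if not ready.
Transport = Callable[[str, str, Optional[dict]], Optional[dict]]


def fetch_storage_report(send: Transport, poll_interval_s: float = 1.0,
                         max_polls: int = 60) -> dict:
    """Start an ALL_PROJECTS report job and poll until its response is available."""
    job = send("POST", "/storageReport/csv/async/start", {"type": "ALL_PROJECTS"})
    token = job["token"]  # Assumed field name on AsyncJobId.
    for _ in range(max_polls):
        response = send("GET", f"/storageReport/csv/async/get/{token}", None)
        if response is not None:
            # Expected to carry resultsFileHandleId and timestamp.
            return response
        time.sleep(poll_interval_s)
    raise TimeoutError(f"Storage report job {token} did not finish in time")
```

The returned `resultsFileHandleId` would then be used to download the CSV through the usual file handle download path.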

Sample Report

Type: ALL_PROJECTS

Project ID	Name	Size (B)
syn5382532	Cool Project 1	424483013985391
syn635535	NIH-Grant 53532 Public Data Repository	53579813875383
syn9359135	Dr. Smith's Private FASTQ files	31482472428417
...	...	...
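A consumer of this report might rank projects by their share of total storage. The sketch below assumes the CSV header row uses the column names shown in the sample ("Project ID", "Name", "Size (B)"); the actual CSV layout is not specified here, and the function name is hypothetical.

```python
import csv
import io
from typing import List, Tuple


def top_projects(report_csv: str, n: int = 10) -> List[Tuple[str, int, float]]:
    """Return (project ID, size in bytes, share of total) for the n largest projects."""
    rows = csv.DictReader(io.StringIO(report_csv))
    sizes = [(row["Project ID"], int(row["Size (B)"])) for row in rows]
    total = sum(size for _, size in sizes)
    ranked = sorted(sizes, key=lambda item: item[1], reverse=True)[:n]
    return [(project_id, size, size / total if total else 0.0)
            for project_id, size in ranked]
```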

Implementation Details

Note
This section is unrelated to the API. Feel free to ignore it if it is not within your scope of concern.

Detailed Requirements

Creation of a new bootstrapped "Reports Team" to have access to these APIs.
Retrieval of file handle metadata, particularly project association and file size
- File Replication Table
  - Columns from File Table: ID, ETAG, PREVIEW_ID, CREATED_ON, CREATED_BY, METADATA_TYPE, CONTENT_TYPE, CONTENT_SIZE, CONTENT_MD5, BUCKET_NAME, NAME, KEY, STORAGE_LOCATION_ID, ENDPOINT
  - Primary key ID
- Creation of this table allows retrieval of file metadata by joining it with the entity replication table. This allows us to find all of the file handles and their metadata for a particular project in one database call. Without this table, we must query the tables database to find the entities in a project, and then separately query the repo database to retrieve the metadata of those files.
- PLFM-4148 is another issue that may benefit from this
Enumeration of cost allocations
- Cost allocation table: ID, NAME, BUCKET, CREATED_BY, CREATED_ON
Associate cost allocations and projects
- Cost Allocation association table
  - Columns: COST_ALLOCATION_ID, PROJECT_ID
  - Primary key: PROJECT_ID (a project may have no more than one cost allocation)
Overriding uploads to default Synapse storage to redirect to the cost allocation bucket.
An eventually consistent algorithm to move files to a new bucket and then modify the file handle to reflect the move
- Asynchronous worker scans file replication table for files that are in the project with the cost allocation AND are in Synapse storage AND are not in the correct cost allocation bucket
  - This worker creates bundles of files that should be updated and sends these bundles to another asynchronous worker
- Asynchronous worker finds each file in the bundle in the files table and verifies that it is not in the correct cost allocation bucket
  - The worker copies the file to the new S3 bucket and updates the file handle, archiving/deleting the old underlying file.
  - The actual copying may be fairly trivial with AWS Batch Operations for S3 Buckets and Copy Objects Between S3 Buckets With Lambda
    - We still must somehow mark the old file for deletion, perhaps by moving it to low-cost storage with a lifecycle deletion policy
    - If we can track the batch operations, we can modify file handles one-by-one. Otherwise we may have to find another solution, or wait for the batch operation to complete
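The single-call lookup that the file replication table is meant to enable can be sketched with an in-memory SQLite database. The table names, the entity replication columns (`ID`, `PROJECT_ID`, `FILE_ID`), and the bucket name in the usage note are hypothetical; only `ID`, `CONTENT_SIZE`, and `BUCKET_NAME` come from the column list above. File handles are de-duplicated per project so that a file referenced by several entities is counted once.

```python
import sqlite3
from typing import List, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    """Create minimal stand-ins for the two replication tables."""
    conn.executescript("""
        CREATE TABLE FILE_REPLICATION (
            ID INTEGER PRIMARY KEY,
            CONTENT_SIZE INTEGER NOT NULL,
            BUCKET_NAME TEXT NOT NULL
        );
        CREATE TABLE ENTITY_REPLICATION (
            ID INTEGER PRIMARY KEY,
            PROJECT_ID INTEGER NOT NULL,
            FILE_ID INTEGER NOT NULL
        );
    """)


def project_sizes(conn: sqlite3.Connection, bucket: str) -> List[Tuple[int, int]]:
    """Return (project ID, total bytes in the given bucket), largest first."""
    query = """
        SELECT p.PROJECT_ID, SUM(f.CONTENT_SIZE) AS SIZE_BYTES
        FROM (SELECT DISTINCT PROJECT_ID, FILE_ID FROM ENTITY_REPLICATION) p
        JOIN FILE_REPLICATION f ON f.ID = p.FILE_ID
        WHERE f.BUCKET_NAME = ?
        GROUP BY p.PROJECT_ID
        ORDER BY SIZE_BYTES DESC
    """
    return conn.execute(query, (bucket,)).fetchall()
```

For example, after `create_schema(conn)` on a `sqlite3.connect(":memory:")` connection and a few inserts, `project_sizes(conn, "synapse-default")` returns one row per project with its summed file sizes.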

Concerns

This method will not accurately capture egress. It simply calculates proportions of cost based on storage.

Version	Old Version 2	New Version Current
Changes made by	Nick Grosenbacher	Nick Grosenbacher
Saved on	Dec 04, 2018	Jan 10, 2019
