Related Jira issues: PLFM-5009, PLFM-5082, PLFM-5085, PLFM-5108, PLFM-5201

...

In Synapse, the top 10 projects by file size account for about 3/4 of the total size of our S3 bucket. These projects can be very expensive, so we need a way to determine the cost of individual projects. By moving this content into new buckets, we can leverage AWS Cost Allocation Tags to facet S3 costs by bucket (and thus by groups of content). Using file metadata, we can determine approximate size very easily. Egress is more difficult to determine, but per the analysis in PLFM-5009, storage accounts for approximately 80% of our bill. We do not currently need to be especially precise, so we may simply ignore egress for now and assume that the cost of egress is distributed proportionally to the cost of storage.
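As a minimal sketch of the proportional assumption above, the following Python function splits a total bill across groups of content according to their stored bytes. The function name, the group names, and the dollar figure in the usage note are hypothetical and are not part of the proposed service.

```python
from typing import Dict


def allocate_bill(sizes_bytes: Dict[str, int], total_bill: float) -> Dict[str, float]:
    """Split total_bill across groups in proportion to stored bytes.

    Assumption: egress and all other costs are distributed in the same
    proportion as storage, as described above.
    """
    total_size = sum(sizes_bytes.values())
    if total_size <= 0:
        raise ValueError("total stored size must be positive")
    return {
        name: total_bill * size / total_size
        for name, size in sizes_bytes.items()
    }
```

For example, `allocate_bill({"ampad": 3, "unallocated": 1}, 100.0)` attributes 75.0 to `ampad` and 25.0 to `unallocated`.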

API

We can create a new Cost Allocation service in Synapse that can be used by members of a "Synapse Cost Allocation Team", a bootstrapped team with users that have the authority to manage these cost allocations.

The members of the Cost Allocation Team can create new cost allocations with a descriptive name (e.g. ampad, nih-r01-12345) that matches how costs should be broken down in Synapse. When a new cost allocation is created, a new bucket is provisioned with an AWS Cost Allocation Tag.

After a cost allocation is created, a project can be assigned to it. Cost allocations can have many projects, but a project can be assigned to at most one cost allocation. When a project is assigned to a cost allocation, the underlying files associated with that project (that is, the file handles pointed to by all versions of all file entities in a project, and their previews) will eventually be moved to the new S3 bucket, and the file handles will be updated. After all files have been moved, the AWS bill and cost explorer tools can facet costs by cost allocation, as desired.

Verb	URI	Request body	Response body	Notes
GET	/costAllocation	None	CostAllocationPage body: Array<CostAllocation> nextPageToken: String	Lists all existing CostAllocations
POST	/costAllocation/report/csv/async/start	CostAllocationReportRequest numberOfResults: Long allocated: Boolean	AsyncJobId	Initiates a job to create a CSV report for the sizes of cost allocations or unallocated projects in Synapse. The results will contain the top <numberOfResults> cost allocations/unallocated projects by descending size. If allocated is true, the report will include the largest cost allocations. If allocated is false, the report will include the largest projects that are not currently assigned to a cost allocation. This request can only be made by a member of the Synapse Cost Allocation Team
GET	/costAllocation/report/csv/async/get/{token}	None	CostAllocationReportResult: resultsFileHandleId: String timestamp: Date	Get an object containing a file handle that points to a Cost Allocation Report CSV
POST	/entity/{entityId}/costAllocation	CostAllocation	CostAllocation id: String bucket: String projects: Array<String> eTag: String createdBy: Long createdOn: Date	Associates a project with a cost allocation. If the cost allocation doesn't exist, a new one is created. If the project is currently associated with a different cost allocation, that association is replaced with the new one. The files belonging to entities in the project and in Synapse storage (either the default bucket or a different cost allocation) will be moved to the provisioned storage for the specified cost allocation. The name is case-insensitive (it will be coerced to lowercase) and can include alphanumeric characters, "-", ".", and "_".
GET	/entity/{entityId}/costAllocation	None	CostAllocation	Gets the cost allocation for a specific project.
DELETE	/entity/{entityId}/costAllocation	None		Removes the cost allocation tied to a project. The contents of the project that are in the cost allocation storage location will be moved to the default Synapse storage. After all of the contents have been moved, the project is removed from the cost allocation.
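The naming rule for cost allocations (case-insensitive, coerced to lowercase, limited to alphanumeric characters, "-", ".", and "_") could be enforced along the following lines. This is a sketch only: the function name is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits and that empty names are rejected, neither of which is stated above.

```python
import re

# Assumption: only ASCII letters and digits count as alphanumeric.
_ALLOWED_NAME = re.compile(r"[a-z0-9._-]+")


def normalize_cost_allocation_name(name: str) -> str:
    """Coerce a cost allocation name to lowercase and validate its characters."""
    lowered = name.lower()
    if not _ALLOWED_NAME.fullmatch(lowered):
        raise ValueError(
            "name may only contain alphanumeric characters, '-', '.', and '_'"
        )
    return lowered
```

With this sketch, `normalize_cost_allocation_name("NIH-R01-12345")` returns `"nih-r01-12345"`, while a name containing a space raises `ValueError`.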

Sample Reports

CostAllocation report (default, allocated=TRUE)

Cost Allocation ID	Name	Size (B)	Proportion of Synapse Storage
1	cost_alloc_1	424483013985391	0.6534
2	amp-ad	53579813875383	0.0824
3	grant123	27148247242841	0.0414
...	...	...	...
0	unallocated	89573285798719	0.2139

Unallocated projects report (allocated=FALSE)

Project ID	Name	Size (B)	Proportion of Synapse Storage
syn123456	Cool Project 123	14483013985391	0.1244
syn999999	Research Group Data	7579813875383	0.0644
syn583725	Smith Lab Repository	3148247242841	0.0311
...	...	...	...
0	allocated	424483013985391	0.8538
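The body of either report can be derived from a list of (ID, name, size) rows: sort by descending size, keep the top numberOfResults, and divide each size by the total size of Synapse storage. The Python sketch below illustrates this; the function name and the comma-separated layout are assumptions, and the final summary row shown in the samples ("unallocated" or "allocated") is omitted for brevity.

```python
import csv
import io
from typing import List, Tuple


def build_report_csv(
    rows: List[Tuple[str, str, int]],
    total_storage_bytes: int,
    number_of_results: int,
) -> str:
    """Return a CSV report of the largest rows by descending size."""
    # Largest entries first, truncated to the requested number of results.
    top = sorted(rows, key=lambda row: row[2], reverse=True)[:number_of_results]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ID", "Name", "Size (B)", "Proportion of Synapse Storage"])
    for identifier, name, size in top:
        proportion = size / total_storage_bytes
        writer.writerow([identifier, name, size, f"{proportion:.4f}"])
    return buffer.getvalue()
```

The same function serves both report types; only the input rows differ (cost allocations when allocated is true, unallocated projects when it is false).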

Implementation Details

Note
This section is unrelated to the API. Feel free to ignore it if it is not within your scope of concern.

Workflow

This gives a general overview of the process required to apply a cost allocation to a project.

...


Detailed Requirements

Creation of a new bootstrapped "Cost Allocation Team" to have access to these APIs.
Retrieval of file handle metadata, particularly project association and file size
- File Replication Table
  - Columns from File Table: ID, ETAG, PREVIEW_ID, CREATED_ON, CREATED_BY, METADATA_TYPE, CONTENT_TYPE, CONTENT_SIZE, CONTENT_MD5, BUCKET_NAME, NAME, KEY, STORAGE_LOCATION_ID, ENDPOINT
  - Primary key ID
- Creation of this table allows retrieval of file metadata by joining it with the entity replication table. This allows us to find all of the file handles and metadata for a particular project in one database call. Without this table, we must query the tables database to find the entities in a project, and then separately query the repo database to retrieve the metadata of those files.
- Entity Replication Table
  - New column COST_ALLOCATION_ID
- PLFM-4148 is another issue that may benefit from this
Enumeration of cost allocations
- Cost allocation table: ID, NAME, BUCKET, CREATED_BY, CREATED_ON
Associate cost allocations and projects
- Cost Allocation association table
  - Columns: COST_ALLOCATION_ID, PROJECT_ID
  - Primary key: PROJECT_ID (a project may have no more than one cost allocation)
Overriding uploads to default Synapse storage to redirect to the cost allocation bucket.
An eventually consistent algorithm to move files to a new bucket and then modify the file handles to reflect the move
- Worker that is not a part of the stack scans file replication table for files that are in the project with the cost allocation AND are in Synapse storage AND are not in the correct cost allocation bucket
  - This worker creates bundles of files that should be updated and sends these bundles to another asynchronous worker
  - This cannot be part of the stack because it is possible for file handles on different stacks to point to the same underlying file (e.g. prod and staging). If a cost allocation were created on staging, then it would break the files in prod.
- Asynchronous worker finds each file in the bundle in the files table and verifies that it is not in the correct cost allocation bucket
  - The worker copies the file to the new S3 bucket and updates the file handle
  - The old underlying file is kept in Amazon Glacier for 12 months, stored under the path /key/filename. These files can be restored manually if necessary, and can be found using the updated file handle, which will have the same key and filename.
  - The actual copying may be fairly trivial with AWS Batch Operations for S3 Buckets and Copy Objects Between S3 Buckets With Lambda
Depending on the performance/reliability of the operation, a way to report progress of the job to ensure it has been completed.
- This may be as simple as noting CloudWatch metrics like the decreasing size of a prod bucket
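The scan step described above (find files that belong to a project with a cost allocation AND are in Synapse storage AND are not yet in the correct bucket, then bundle them) can be sketched with an in-memory SQLite database. Everything here is illustrative: the tables are cut down to a few columns, the FILE_ID column on the entity replication table and the default bucket name are assumptions, only one file per entity row is modeled (not all versions and previews), and "in Synapse storage" is approximated as "in the default bucket or in any cost allocation bucket".

```python
import sqlite3
from typing import Iterator, List, Tuple

DEFAULT_BUCKET = "default-synapse-bucket"  # hypothetical bucket name

# Simplified stand-ins for the tables listed in the requirements.
SCHEMA = """
CREATE TABLE COST_ALLOCATION (ID INTEGER PRIMARY KEY, NAME TEXT, BUCKET TEXT);
CREATE TABLE FILE_REPLICATION (ID INTEGER PRIMARY KEY, BUCKET_NAME TEXT);
CREATE TABLE ENTITY_REPLICATION (
    ID INTEGER PRIMARY KEY, FILE_ID INTEGER, COST_ALLOCATION_ID INTEGER
);
"""

# One query joins file metadata to entities and their cost allocation.
FIND_MISPLACED = """
SELECT DISTINCT f.ID, f.BUCKET_NAME, c.BUCKET
FROM ENTITY_REPLICATION e
JOIN FILE_REPLICATION f ON f.ID = e.FILE_ID
JOIN COST_ALLOCATION c ON c.ID = e.COST_ALLOCATION_ID
WHERE f.BUCKET_NAME <> c.BUCKET
  AND (f.BUCKET_NAME = ?
       OR f.BUCKET_NAME IN (SELECT BUCKET FROM COST_ALLOCATION))
ORDER BY f.ID
"""


def find_bundles(
    conn: sqlite3.Connection, bundle_size: int
) -> Iterator[List[Tuple[int, str, str]]]:
    """Yield bundles of (file ID, current bucket, target bucket) to move."""
    rows = conn.execute(FIND_MISPLACED, (DEFAULT_BUCKET,)).fetchall()
    for start in range(0, len(rows), bundle_size):
        yield rows[start:start + bundle_size]
```

Because the query only selects files that are not yet in the target bucket, rerunning the scan after a partial failure picks up whatever remains, which is what makes the overall process eventually consistent.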

Concerns

This method will not accurately capture egress. It simply calculates proportions of cost based on storage.
