...

After a cost allocation is created, a project can be assigned to it. Cost allocations can have many projects, but a project can be assigned to at most one cost allocation. When a project is assigned to a cost allocation, the underlying files associated with that project (that is, the file handles pointed to by all versions of all file entities in the project, and their previews) will eventually be moved to the new S3 bucket, and the file handles will be updated. After all files have been moved, the AWS bill and Cost Explorer tools can facet costs by cost allocation, as desired.

Verb: GET
URI: /costAllocation
Request body: None
Response body: CostAllocationPage
  body: Array<CostAllocation>
  nextPageToken: String
Notes: Lists all existing CostAllocations.

Verb: GET
URI: /entity/{entityId}/costAllocation
Request body: None
Response body: CostAllocation
  id: String
  name: String
  bucket: String
  projects: Array<String>
  eTag: String
  createdBy: Long
  createdOn: Date
Notes: Gets the cost allocation for a specific project.

Verb: POST
URI: /entity/{entityId}/costAllocation
Request body: name: String
Response body: CostAllocation
Notes: Associates a project with a cost allocation. The underlying files that are associated with the project and are in Synapse storage (either the default bucket or a different cost allocation's bucket) will be moved to the provisioned storage for the cost allocation. Name is case-insensitive (it will be coerced to lowercase) and can include alphanumeric characters, "-", ".", and "_".

Verb: DELETE
URI: /entity/{entityId}/costAllocation
Request body: None
Response body: None
Notes: Removes the cost allocation tied to a project. The contents of the project that are in the cost allocation storage location will be moved to the default Synapse storage. After all of the contents have been moved, the project is removed from the cost allocation.
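
As a rough illustration of how a client might call the proposed POST endpoint, here is a minimal Python sketch. The base URL, authentication header, and the assign_cost_allocation() helper are illustrative assumptions; only the request and response shapes come from the table above.

```python
# Hypothetical client call for the proposed POST /entity/{entityId}/costAllocation
# endpoint. The base URL and auth scheme are placeholders, not the real service.
import re
import requests

BASE_URL = "https://example.synapse.org/repo/v1"            # assumed host
AUTH_HEADERS = {"Authorization": "Bearer <access-token>"}    # assumed auth scheme

# Per the POST notes above: names are coerced to lowercase and may contain
# alphanumerics, "-", ".", and "_".
NAME_PATTERN = re.compile(r"^[a-z0-9._-]+$")


def assign_cost_allocation(entity_id: str, name: str) -> dict:
    """Assign the project `entity_id` to the cost allocation `name`."""
    name = name.lower()
    if not NAME_PATTERN.match(name):
        raise ValueError(f"invalid cost allocation name: {name!r}")
    resp = requests.post(
        f"{BASE_URL}/entity/{entity_id}/costAllocation",
        json={"name": name},
        headers=AUTH_HEADERS,
    )
    resp.raise_for_status()
    # Response body is a CostAllocation: id, name, bucket, projects, eTag,
    # createdBy, createdOn.
    return resp.json()


if __name__ == "__main__":
    allocation = assign_cost_allocation("syn123456", "Example-Grant.2024")
    print(allocation["bucket"])
```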

Implementation Details

Note

This section is unrelated to the API. Feel free to ignore it if it is not within your scope of concern.

Workflow

This gives a general overview of the process required to apply a cost allocation to a project.

  1. A user applies a cost allocation to the project. If the cost allocation does not exist, a new bucket is created. If the bucket is successfully created or if the cost allocation exists, the call succeeds for the user. A successful call triggers the next step.
  2. A worker (separate from the stack) collects the file handles in that project that are stored in Synapse storage and are not in the cost allocation. These file handles are sent to another worker to accomplish the next step. This worker can run when a cost allocation is created, as well as at regular intervals to ensure files are not missed.
  3. A worker on the stack is given a file handle and a destination bucket. If the file is not already in that bucket, it is copied to the destination bucket. If the copy is successful, the file handle is modified to point to the new location. The old file is archived and marked for later deletion. (A rough sketch of this step follows this list.)
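
Below is a minimal sketch of step 3, assuming boto3. The bucket names, the archive tag, and the update_file_handle_location() stub are illustrative assumptions rather than the actual worker implementation.

```python
# Sketch of the on-stack worker: copy a file handle's object into the cost
# allocation bucket, re-point the file handle, then tag the old object so a
# lifecycle rule can archive/expire it. Helper names are placeholders.
import boto3

s3 = boto3.client("s3")


def update_file_handle_location(file_handle_id: str, bucket: str, key: str) -> None:
    """Placeholder for the repo-database update that re-points the file handle."""
    print(f"file handle {file_handle_id} now -> s3://{bucket}/{key}")


def move_file_handle(file_handle: dict, dest_bucket: str) -> None:
    src_bucket, key = file_handle["bucketName"], file_handle["key"]
    if src_bucket == dest_bucket:
        return  # already in the cost allocation bucket; nothing to do

    # Copy the object into the cost allocation bucket under the same key.
    # (Objects over 5 GB would need a multipart copy, e.g. boto3's managed
    # copy() transfer or S3 Batch Operations.)
    s3.copy_object(
        Bucket=dest_bucket,
        Key=key,
        CopySource={"Bucket": src_bucket, "Key": key},
    )

    # Only after a successful copy, point the file handle at the new location.
    update_file_handle_location(file_handle["id"], dest_bucket, key)

    # Mark the old object so it can be archived and later deleted, e.g. by a
    # lifecycle rule keyed off this tag.
    s3.put_object_tagging(
        Bucket=src_bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "cost-allocation-archived", "Value": "true"}]},
    )
```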

Detailed Requirements 

...

  • Creation of a new bootstrapped "Cost Allocation Team" to have access to these APIs.
  • Retrieval of file handle metadata, particularly project association and file size 
    • File Replication Table
      • Columns from File Table: ID, ETAG, PREVIEW_ID, CREATED_ON, CREATED_BY, METADATA_TYPE, CONTENT_TYPE, CONTENT_SIZE, CONTENT_MD5, BUCKET_NAME, NAME, KEY, STORAGE_LOCATION_ID, ENDPOINT
      • Primary key ID
    • Creation of this table allows retrieval of file metadata by joining it with the entity replication table. This allows us to find all of the file handles and their metadata for a particular project in one database call. Without this table, we must query the tables database to find the entities in a project, and then separately query the repo database to retrieve the metadata of those files.
    • PLFM-4148 is another issue that may benefit from this
  • Enumeration of cost allocations
    • Cost allocation table: ID, NAME, BUCKET, CREATED_BY, CREATED_ON
  • Associate cost allocations and projects
    • Cost Allocation association table
      • Columns: COST_ALLOCATION_ID, PROJECT_ID
      • Primary key: PROJECT_ID (a project may have no more than one cost allocation)
  • Overriding uploads to default Synapse storage to redirect to the cost allocation bucket.
  • An eventually consistent algorithm to move files to a new bucket and then modify the file handles to reflect the move
    • An asynchronous worker that is not a part of the stack scans the file replication table for files that are in the project with the cost allocation AND are in Synapse storage AND are not in the correct cost allocation bucket (a query sketch follows this list)
      • This worker creates bundles of files that should be updated and sends these bundles to another asynchronous worker
      • This cannot be part of the stack because it is possible for file handles on different stacks to point to the same underlying file (e.g. prod and staging). If a cost allocation were created on staging, then it would break the files in prod.
    • An asynchronous worker on the stack finds each file in the bundle in the files table and verifies that it is not already in the correct cost allocation bucket
      • The file is copied to the new S3 bucket and the file handle is updated; the old underlying file is archived and marked for deletion
      • The old underlying file is archived in Amazon Glacier for 12 months, stored under a /key/filename path. These files can be restored manually if necessary, and can be found using the updated file handle, which will have the same key and filename. (A lifecycle configuration sketch follows this list.)
      • The actual copying may be fairly trivial with AWS Batch Operations for S3 buckets, or with a Lambda that copies objects between S3 buckets
      • We still must somehow mark the old file for deletion, perhaps by moving it to low-cost storage with a lifecycle deletion policy
      • If we can track the batch operations, we can modify file handles one-by-one. Otherwise we may have to find another solution, or wait for the batch operation to complete
  • Depending on the performance/reliability of the operation, a way to report progress of the job to ensure it has been completed.
    • This may be as simple as noting CloudWatch metrics like the decreasing size of a prod bucket
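
To make the off-stack scan concrete, here is a hedged sketch of the kind of query and batching it might use. The ENTITY_REPLICATION join column names (FILE_ID, PROJECT_ID), the parameter style, and the bundling helper are assumptions; only the file replication columns come from the list above.

```python
# Sketch of the scan that finds file handles in a project that are in Synapse
# storage but not yet in the cost allocation bucket, yielding them in bundles.
from typing import Any, Iterator, List, Tuple

SCAN_QUERY = """
SELECT F.ID, F.BUCKET_NAME, F.KEY, F.CONTENT_SIZE
  FROM FILE_REPLICATION F
  JOIN ENTITY_REPLICATION E ON E.FILE_ID = F.ID          -- assumed join column
 WHERE E.PROJECT_ID = %(project_id)s                      -- project under the cost allocation
   AND F.BUCKET_NAME != %(cost_allocation_bucket)s        -- not yet in the allocation bucket
   AND F.STORAGE_LOCATION_ID IN %(synapse_storage_ids)s   -- Synapse-managed storage only
"""


def scan_for_files_to_move(
    cursor: Any,
    project_id: int,
    cost_allocation_bucket: str,
    synapse_storage_ids: Tuple[int, ...],
    batch_size: int = 1000,
) -> Iterator[List[tuple]]:
    """Yield bundles of file handles that still need to be moved."""
    cursor.execute(SCAN_QUERY, {
        "project_id": project_id,
        "cost_allocation_bucket": cost_allocation_bucket,
        "synapse_storage_ids": synapse_storage_ids,
    })
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch  # each bundle goes to the on-stack copy worker
```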
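
One possible way to implement the "archive for 12 months, then delete" behavior described above is an S3 lifecycle rule on the source bucket. This sketch assumes the cost-allocation-archived tag used in the worker sketch earlier; the bucket name and rule ID are placeholders.

```python
# Lifecycle rule sketch: transition tagged (archived) objects to Glacier and
# expire them after roughly 12 months. Names are illustrative only.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-prod-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cost-allocation-archive",
                "Filter": {"Tag": {"Key": "cost-allocation-archived", "Value": "true"}},
                "Status": "Enabled",
                # Move archived copies to Glacier as soon as possible...
                "Transitions": [{"Days": 0, "StorageClass": "GLACIER"}],
                # ...and delete them after ~12 months.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```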

Concerns

  • AWS has a 100-bucket soft cap. They can increase the cap to 200 if they approve of your use case. Do we anticipate needing to store more than 100-200 internally managed projects that are 10+TB or larger? This may be a distant/unlikely problem. There are also many buckets that we can probably remove from our prod account (we currently have ~65)

...