Synapse Storage Reports: API Design Document

Jira tickets:

PLFM-5314

PLFM-5315

Background

See Use Case 1 in "Service to migrate content at the S3/Storage Level: Use Cases" for notes on this use case.

In Synapse, the 10 largest projects by file size account for about three quarters of the data in our S3 bucket. Storing these projects can be very expensive, so we need a way to estimate the cost of each project. Approximate project size is easy to determine from file metadata. Egress is harder to determine, but per the analysis in PLFM-5009, storage accounts for approximately 80% of our bill. High precision is not currently needed, so we can ignore egress for now and assume that the cost of egress is distributed across projects in proportion to the cost of storage.
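Under that proportional assumption, a project's share of the whole bill is simply its share of the stored bytes. The following is a minimal sketch of that arithmetic; the function name, project IDs, sizes, and bill amount are all made up for illustration and are not part of the design.

```python
def allocate_cost(sizes_bytes, total_bill):
    """Split total_bill across projects in proportion to stored bytes.

    sizes_bytes maps a project ID to its size in bytes.
    Egress is not measured; it is assumed to follow the same proportions.
    """
    total = sum(sizes_bytes.values())
    if total == 0:
        # Nothing stored, so nothing to attribute.
        return {project_id: 0.0 for project_id in sizes_bytes}
    return {
        project_id: total_bill * size / total
        for project_id, size in sizes_bytes.items()
    }


# Hypothetical inputs: two projects and a bill of 1000 (any currency).
sizes = {"syn1": 750, "syn2": 250}
costs = allocate_cost(sizes, total_bill=1000.0)
```

Here `syn1` holds three quarters of the bytes, so it is attributed three quarters of the bill.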

API

Create a new asynchronous Storage Report service in Synapse that can be used by members of a "Synapse Reports Team", a bootstrapped team whose members are authorized to create these reports.

A member of the Reports team can make an asynchronous query to retrieve a CSV report about the usage of the Synapse S3 bucket with project-level resolution across all projects.

Verb	URI	Request body	Response body	Notes
GET	/storageReport/csv/async/get/{token}	None	DownloadStorageReportResponse (resultsFileHandleId: String, timestamp: Date)	Get an object containing a file handle that points to a Storage Report CSV. The caller can download a CSV with the file handle ID.
POST	/storageReport/csv/async/start	DownloadStorageReportRequest (type: Enum (ALL_PROJECTS))	AsyncJobId	Initiates a job to create a CSV report for the sizes of projects in Synapse (where size is usage of the Synapse S3 bucket). The request will create a report about all projects when specifying ALL_PROJECTS. The enum allows requests for different types of reports (for example, project groups, if that gets implemented).
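The two calls are meant to be used together: start a job, then poll the get URI with the returned token. The sketch below illustrates that flow under several assumptions that this document does not specify: `post` and `get` are caller-supplied HTTP helpers returning parsed JSON, the AsyncJobId body exposes the token in a field named `token`, and `get` returns `None` while the job is still running. The fake helpers and their values exist only so the sketch runs without a network.

```python
import time


def fetch_storage_report(post, get, report_type="ALL_PROJECTS",
                         poll_seconds=1.0, max_polls=60):
    """Start a report job, then poll until a file handle ID is available."""
    job = post("/storageReport/csv/async/start", {"type": report_type})
    token = job["token"]  # ASSUMPTION: field name on AsyncJobId
    for _ in range(max_polls):
        response = get(f"/storageReport/csv/async/get/{token}")
        # ASSUMPTION: None means "not ready yet"; the real signal is unspecified.
        if response is not None:
            return response["resultsFileHandleId"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"report job {token} did not finish in time")


# In-memory stand-ins for the service (hypothetical values throughout).
poll_count = {"n": 0}


def fake_post(path, body):
    return {"token": "123"}


def fake_get(path):
    poll_count["n"] += 1
    if poll_count["n"] < 3:
        return None  # job still running
    return {"resultsFileHandleId": "987", "timestamp": "2019-01-01T00:00:00.000Z"}


file_handle_id = fetch_storage_report(fake_post, fake_get, poll_seconds=0)
```

The returned file handle ID is what the caller would then use to download the CSV.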

Sample Report

Type: ALL_PROJECTS

Project ID	Name	Size (B)
syn5382532	Cool Project 1	424483013985391
syn635535	NIH-Grant 53532 Public Data Repository	53579813875383
syn9359135	Dr. Smith's Private FASTQ files	31482472428417
...	...	...
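A consumer of the report might read it as follows. This sketch assumes the CSV is comma-separated with exactly the three headers shown in the sample; the actual file layout is not fixed by this document, and the function name is hypothetical.

```python
import csv
import io

# Rows copied from the sample report above, written as comma-separated text.
SAMPLE_CSV = """Project ID,Name,Size (B)
syn5382532,Cool Project 1,424483013985391
syn635535,NIH-Grant 53532 Public Data Repository,53579813875383
syn9359135,Dr. Smith's Private FASTQ files,31482472428417
"""


def read_report(text):
    """Parse report text into (project_id, name, size_bytes), largest first."""
    rows = [
        (row["Project ID"], row["Name"], int(row["Size (B)"]))
        for row in csv.DictReader(io.StringIO(text))
    ]
    return sorted(rows, key=lambda r: r[2], reverse=True)


rows = read_report(SAMPLE_CSV)
total_bytes = sum(size for _, _, size in rows)

for project_id, name, size in rows:
    # Show each size in TiB and as a fraction of the listed total.
    print(f"{project_id}\t{size / 2**40:.1f} TiB\t{size / total_bytes:.1%}\t{name}")
```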

Implementation Details

This section is unrelated to the API. Feel free to ignore it if it is not within your scope of concern.

Detailed Requirements

Creation of a new bootstrapped "Reports Team" to have access to these APIs.
Retrieval of file handle metadata, particularly project association and file size
- File Replication Table
  - Columns from File Table: ID, ETAG, PREVIEW_ID, CREATED_ON, CREATED_BY, METADATA_TYPE, CONTENT_TYPE, CONTENT_SIZE, CONTENT_MD5, BUCKET_NAME, NAME, KEY, STORAGE_LOCATION_ID, ENDPOINT
  - Primary key ID
- Creating this table lets us retrieve file metadata by joining it with the entity replication table, so all of the file handles and metadata for a particular project can be found in one database call. Without this table, we must first query the tables database to find the entities in a project, and then separately query the repo database to retrieve the metadata of those files.
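The single-call lookup described above can be sketched with an in-memory SQLite database. The table names, the `PROJECT_ID` and `FILE_ID` columns on the entity replication table, the bucket names, and all row values are assumptions for illustration; only `ID`, `CONTENT_SIZE`, and `BUCKET_NAME` come from the column list above, and the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ENTITY_REPLICATION (  -- hypothetical name and columns
    ID INTEGER PRIMARY KEY,
    PROJECT_ID INTEGER,
    FILE_ID INTEGER
);
CREATE TABLE FILE_REPLICATION (    -- subset of the columns listed above
    ID INTEGER PRIMARY KEY,
    CONTENT_SIZE INTEGER,
    BUCKET_NAME TEXT
);
INSERT INTO ENTITY_REPLICATION VALUES
    (10, 1, 100), (11, 1, 101), (12, 2, 102), (13, 2, 103);
INSERT INTO FILE_REPLICATION VALUES
    (100, 500, 'synapse-bucket'),
    (101, 250, 'synapse-bucket'),
    (102, 40, 'synapse-bucket'),
    (103, 999, 'external-bucket');
""")

# One query: join entities to their file handles and total the bytes per project,
# counting only files stored in the (hypothetically named) Synapse bucket.
project_sizes = conn.execute(
    """
    SELECT E.PROJECT_ID, SUM(F.CONTENT_SIZE) AS SIZE_BYTES
    FROM ENTITY_REPLICATION E
    JOIN FILE_REPLICATION F ON E.FILE_ID = F.ID
    WHERE F.BUCKET_NAME = ?
    GROUP BY E.PROJECT_ID
    ORDER BY SIZE_BYTES DESC
    """,
    ("synapse-bucket",),
).fetchall()
conn.close()
```

Note that a join like this counts a file handle once per entity that references it, so a handle shared by several entities would need de-duplication before summing.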

Concerns

This method does not measure egress. It attributes cost to each project in proportion to its storage alone, so projects with unusually high or low download traffic will be misestimated.