Synapse Storage Reports: API Design Document
Jira tickets:
- PLFM-5314
- PLFM-5315
Background Jiras:
Background
See Use Case 1 in "Service to migrate content at the S3/Storage Level: Use Cases" for use case notes.
In Synapse, the top 10 projects by file size account for about 3/4 of the total size of our S3 bucket. Storing these projects can be very expensive, so we need a way to determine the cost of individual projects. Using file metadata, we can easily determine a project's approximate size. Egress is more difficult to determine, but per the analysis in PLFM-5009, storage accounts for approximately 80% of our bill. There is currently no need for high precision, so we can ignore egress for now and assume that the cost of egress is distributed proportionally to the cost of storage.
API
Create a new asynchronous Storage Report service in Synapse that can be used by members of a "Synapse Reports Team", a bootstrapped team whose members are authorized to create these reports.
A member of the Reports team can make an asynchronous query to retrieve a CSV report about the usage of the Synapse S3 bucket with project-level resolution across all projects.
Verb | URI | Request body | Response body | Notes |
---|---|---|---|---|
GET | /storageReport/csv/async/get/{token} | None | DownloadStorageReportResponse: resultsFileHandleId (String), timestamp (Date) | Gets an object containing a file handle that points to a Storage Report CSV. The caller can download the CSV using the file handle ID. |
POST | /storageReport/csv/async/start | DownloadStorageReportRequest: type (Enum: ALL_PROJECTS) | AsyncJobId | Initiates a job to create a CSV report of the sizes of projects in Synapse (where size is usage of the Synapse S3 bucket). When ALL_PROJECTS is specified, the report covers all projects. The enum leaves room for other report types (for example, project groups, if that gets implemented). |
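As a rough illustration of how a client might drive the two endpoints above, the following Python sketch starts a job and polls for the result. The `post` and `get` callables are hypothetical stand-ins for an authenticated HTTP client. The `token` field on AsyncJobId and the use of HTTP 202 for an unfinished job are assumptions, not something this document specifies.

```python
import time
from typing import Callable, Tuple

# Hypothetical transport signatures: each callable returns (http_status, json_body).
Post = Callable[[str, dict], Tuple[int, dict]]
Get = Callable[[str], Tuple[int, dict]]


def fetch_storage_report(post: Post, get: Get,
                         poll_seconds: float = 1.0,
                         max_polls: int = 60) -> dict:
    """Start an ALL_PROJECTS report job and poll until its response is ready."""
    status, job = post("/storageReport/csv/async/start", {"type": "ALL_PROJECTS"})
    if status >= 400:
        raise RuntimeError(f"could not start report job: HTTP {status}")

    # Assumption: the AsyncJobId body exposes the job token as "token".
    token = job["token"]

    for _ in range(max_polls):
        status, body = get(f"/storageReport/csv/async/get/{token}")
        if status == 200:
            # Expected to contain resultsFileHandleId and timestamp.
            return body
        if status != 202:
            # Assumption: 202 means "still processing"; anything else is an error.
            raise RuntimeError(f"report job failed: HTTP {status}")
        time.sleep(poll_seconds)

    raise TimeoutError("storage report job did not finish in time")
```

The returned `resultsFileHandleId` would then be used to download the CSV through whatever file handle download mechanism the caller already has.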
Sample Report
Type: ALL_PROJECTS
Project ID | Name | Size (B) |
---|---|---|
syn5382532 | Cool Project 1 | 424483013985391 |
syn635535 | NIH-Grant 53532 Public Data Repository | 53579813875383 |
syn9359135 | Dr. Smith's Private FASTQ files | 31482472428417 |
... | ... | ... |
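Since costs are assumed to be proportional to storage, a consumer of the report might turn the sizes into cost shares along the following lines. This is a minimal sketch: the comma-delimited layout, the header names, and the `total_bill` figure are assumptions for illustration. The shares cover only the three sample rows listed above, not the elided ones.

```python
import csv
import io

# Assumed CSV layout, mirroring the columns of the sample report above.
SAMPLE_CSV = """Project ID,Name,Size (B)
syn5382532,Cool Project 1,424483013985391
syn635535,NIH-Grant 53532 Public Data Repository,53579813875383
syn9359135,Dr. Smith's Private FASTQ files,31482472428417
"""


def storage_shares(report_csv: str) -> dict:
    """Map each project ID to its fraction of the total bytes in the report."""
    rows = csv.DictReader(io.StringIO(report_csv))
    sizes = {row["Project ID"]: int(row["Size (B)"]) for row in rows}
    total = sum(sizes.values())
    if total == 0:
        return {project_id: 0.0 for project_id in sizes}
    return {project_id: size / total for project_id, size in sizes.items()}


def allocate_cost(shares: dict, total_bill: float) -> dict:
    """Split a bill across projects in proportion to their storage share."""
    return {project_id: share * total_bill for project_id, share in shares.items()}


if __name__ == "__main__":
    shares = storage_shares(SAMPLE_CSV)
    # 1000.0 is a made-up bill amount, used only to show the proportions.
    for project_id, cost in allocate_cost(shares, total_bill=1000.0).items():
        print(f"{project_id}: {cost:.2f}")
```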
Implementation Details
This section is unrelated to the API. Feel free to ignore it if it is not within your scope of concern.
Detailed Requirements
- Creation of a new bootstrapped "Reports Team" whose members have access to these APIs.
- Retrieval of file handle metadata, particularly project association and file size
  - File Replication Table
    - Columns from File Table: ID, ETAG, PREVIEW_ID, CREATED_ON, CREATED_BY, METADATA_TYPE, CONTENT_TYPE, CONTENT_SIZE, CONTENT_MD5, BUCKET_NAME, NAME, KEY, STORAGE_LOCATION_ID, ENDPOINT
    - Primary key: ID
    - Creating this table allows file metadata to be retrieved by joining it with the entity replication table, so we can find all of the file handles and metadata for a particular project in one database call. Without this table, we must query the tables database to find the entities in a project, and then separately query the repo database to retrieve the metadata of those files.
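The single-call join described above could look roughly like the following sketch, which uses an in-memory SQLite database so it can run anywhere. The table names `FILE_REPLICATION` and `ENTITY_REPLICATION`, the `PROJECT_ID` and `FILE_ID` columns on the entity side, and the bucket name are assumptions for illustration. The production schema and SQL dialect may differ.

```python
import sqlite3

# One query returns per-project sizes. Each (project, file handle) pair is
# counted once so that a handle shared by several entities is not double-counted;
# that deduplication is a choice made in this sketch, not a stated requirement.
PROJECT_SIZES_SQL = """
SELECT p.PROJECT_ID, SUM(f.CONTENT_SIZE) AS SIZE_BYTES
FROM (SELECT DISTINCT PROJECT_ID, FILE_ID
      FROM ENTITY_REPLICATION
      WHERE FILE_ID IS NOT NULL) p
JOIN FILE_REPLICATION f ON f.ID = p.FILE_ID
WHERE f.BUCKET_NAME = ?
GROUP BY p.PROJECT_ID
ORDER BY SIZE_BYTES DESC
"""


def build_demo_db() -> sqlite3.Connection:
    """Create hypothetical replication tables with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE FILE_REPLICATION (
            ID INTEGER PRIMARY KEY, CONTENT_SIZE INTEGER, BUCKET_NAME TEXT);
        CREATE TABLE ENTITY_REPLICATION (
            ID INTEGER PRIMARY KEY, PROJECT_ID INTEGER, FILE_ID INTEGER);
        INSERT INTO FILE_REPLICATION VALUES
            (100, 50, 'example-synapse-bucket'),
            (101, 30, 'example-synapse-bucket'),
            (102, 200, 'example-synapse-bucket'),
            (103, 999, 'some-external-bucket');
        INSERT INTO ENTITY_REPLICATION VALUES
            (10, 1, 100), (11, 1, 101), (12, 1, 100),
            (20, 2, 102), (21, 2, 103);
    """)
    return conn


def project_sizes(conn: sqlite3.Connection, bucket: str) -> list:
    """Return (project_id, size_in_bytes) rows, largest project first."""
    return conn.execute(PROJECT_SIZES_SQL, (bucket,)).fetchall()


if __name__ == "__main__":
    for project_id, size in project_sizes(build_demo_db(), "example-synapse-bucket"):
        print(project_id, size)
```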
Concerns
- This method will not accurately capture egress. It simply calculates proportions of cost based on storage.