
File Service API

Jira Epic: BRIDGE-2564

Overview

This new service will provide a means to host files (of any type) through Bridge, so they can be downloaded by clients. Since Bridge will know about the files, we can include them more easily in app configs and other APIs. However, the main benefit for developers is having access to file hosting without having to use another SDK.

Each file will have a series of immutable revisions, and clients will be able to link to a specific revision (as you'd do when including a file in an app config; this is the preferred approach), or there will be a way to get the latest revision of a file, if desired.

The API will use S3 presigned URLs to offload file upload tasks, and will provide CloudFront URLs to access the files through a CDN that fronts the S3 file store.

App configs differ from this API in that they are filtered through user criteria. We expect that files will often be included in app configs to consolidate all configuration for an app in one location.

Study reports differ from this API in that they provide data in time series and they expect JSON. 

Participant reports differ from this API in that they provide data in time series and expect JSON for a given user. We currently use fake dates to store records for a user through that API, but because it requires authentication, we wouldn't fold this into the file service API. The participant reports API should be augmented to allow non-time-series reports (and might be altered to allow non-JSON files).

REST API

Method | URL | Description | Permissions
GET | /v3/files | Get a list of files for the developer's study | developer
POST | /v3/files | Create a new file | developer
DELETE | /v3/files/<guid>?physical=<boolean> | Logical or physical delete of a file (revisions remain on S3 and remain accessible in case they are referenced by clients) | developer, admin
GET | /v3/files/<guid>/revisions | Get a paged list of revisions of a file | developer
POST | /v3/files/<guid>/revisions | Create a new revision (step 1: create metadata record and S3 presigned URL) | developer
GET | /v3/files/<guid>/revisions/<createdOn> | Get a single revision of a file | developer
POST | /v3/files/<guid>/revisions/<createdOn> | Update the revision's description (and any other metadata) | developer
DELETE | /v3/files/<guid>/revisions/<createdOn> | If the record is in the pending state and the presigned URL has expired, delete this record | developer
GET | /v3/studies/<studyId>/files/<guid> | Get the most recent available revision of this file. Returns a 302 with a Location header pointing to the CloudFront URL | PUBLIC
POST | /v3/studies/<studyId>/files/<guid>/revisions/<createdOn> | Step 2: mark the record as available | worker

Clients access the file (not the file metadata) via a link like http://docs.sagebridge.org/<studyId>/<guid>/<createdOn> (where the createdOn timestamp identifies a specific, available revision). This link is usually referenced somewhere like an app config, or hard-coded, with the exception of the one public URL above, which will return whatever is the latest available revision of a file.
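As a sketch, the download link can be assembled from a revision's fields like this (the class and method names here are hypothetical, not part of the API):

```java
public class DownloadUrl {

    static final String BASE = "http://docs.sagebridge.org";

    /** Builds the public CDN link for a specific revision of a file. */
    public static String of(String studyId, String guid, long createdOn) {
        return BASE + "/" + studyId + "/" + guid + "/" + createdOn;
    }
}
```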

Workflow

Most of this will look very familiar, but the workflow for uploading files will be an attempt to smooth out the usability of our upload workflow. 

For creating files:

  1. Create a metadata File record;
  2. Create a FileRevision record, which returns a presigned URL to upload the file;
  3. Upload the file to that URL;
  4. The S3 upload event puts a work item on a queue for the BridgeWorkerPlatform to mark the revision as available (see error handling below).
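The revision lifecycle implied by these steps can be sketched as a tiny state transition (a simplification; the class shape and field names are hypothetical):

```java
public class RevisionLifecycle {

    public enum Status { PENDING, AVAILABLE }

    public static class Revision {
        // Step 2 creates the record as PENDING, holding a presigned upload URL.
        public Status status = Status.PENDING;
        public String uploadURL = "https://s3.example/presigned"; // hypothetical URL

        /** Step 4: the worker marks the revision available and clears the upload URL. */
        public void markAvailable() {
            status = Status.AVAILABLE;
            uploadURL = null; // presigned URL is no longer needed once available
        }
    }
}
```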

Presigned URLs can be used more than once. If this happens, we should update the revision record.

When retrieving a revision record, if it is marked pending and the presigned URL is expired, we should update it with a new presigned URL and return that. Basically a pending revision record will always give a way to upload a file, until it's deleted. So the expiration period on this presigned URL can be very long. Once the revision record has been marked available, it can't be deleted.
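A minimal sketch of that retrieval rule, assuming the revision stores its status and the presigned URL's expiration as epoch milliseconds (names are illustrative only):

```java
public class PresignedUrlPolicy {

    /**
     * True when a pending revision's presigned URL has expired and should be
     * regenerated on read. Available revisions never need a new upload URL.
     */
    public static boolean needsNewUploadUrl(boolean pending, long urlExpiresOn, long now) {
        return pending && now > urlExpiresOn;
    }
}
```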

If the worker process fails to mark a pending record as available, we can check the status when an individual revision is requested through the API (GET /v3/files/<guid>/revisions/<createdOn>): look for the file on S3 and, if it exists, mark the record as available. Until we see how often this fails, this may be sufficient for admin users to clean up orphaned records (they can also delete them and try again).
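That repair-on-read check can be sketched with the S3 lookup abstracted behind a predicate (class and method names are hypothetical):

```java
import java.util.function.Predicate;

public class RevisionRepair {

    /**
     * If a revision is still PENDING but its file already exists on S3,
     * flip it to AVAILABLE. Returns the (possibly updated) status.
     */
    public static String reconcile(String status, String s3Key, Predicate<String> existsOnS3) {
        if ("PENDING".equals(status) && existsOnS3.test(s3Key)) {
            return "AVAILABLE";
        }
        return status;
    }
}
```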

Changelog

CREATE TABLE IF NOT EXISTS `Files` (
  `studyId` varchar(255) NOT NULL,
  `guid` varchar(60) NOT NULL,
  `name` varchar(255) DEFAULT NULL,
  `description` text,
  `mimeType` varchar(255) DEFAULT NULL,
  `deleted` tinyint(1) NOT NULL DEFAULT '0',
  `version` int(10) unsigned NOT NULL, -- optimistic lock
  PRIMARY KEY (`guid`),
  KEY `Studies_idx` (`studyId`) -- retrieve all for a study
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;


CREATE TABLE IF NOT EXISTS `FileRevisions` (
  `fileGuid` varchar(60) NOT NULL,
  `createdOn` BIGINT UNSIGNED NOT NULL,
  `description` text,
  `uploadURL` VARCHAR(512) DEFAULT NULL,
  `status` ENUM('PENDING','AVAILABLE') NOT NULL,
  PRIMARY KEY (`fileGuid`, `createdOn`),
  CONSTRAINT `Files-Guid-Constraint` FOREIGN KEY (`fileGuid`) REFERENCES `Files` (`guid`) ON DELETE CASCADE
) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Object Model

public interface File {

    String getName();
    void setName(String name);

    String getGuid();
    void setGuid(String guid);

    String getDescription();
    void setDescription(String description);

    /** Probably optional since S3 autodetects, but some files (e.g. without extensions) may need it. */
    String getMimeType();
    void setMimeType(String mimeType);

    /** Files support logical deletion. Revisions remain accessible on S3 for released clients and configurations. */
    boolean isDeleted();
    void setDeleted(boolean deleted);
}


public interface FileRevision {

    String getFileGuid();
    void setFileGuid(String fileGuid);

    /** To handle upload failures and upload URL expiration, we use the creation time of this record, not any timestamp of the S3 file itself. */
    DateTime getCreatedOn();
    void setCreatedOn(DateTime createdOn);

    String getDescription();
    void setDescription(String description);

    /** This is the presigned S3 URL to upload contents. Will be cleared when the revision is moved to "available." */
    String getUploadURL();
    void setUploadURL(String uploadURL);

    FileStatus getFileStatus();
    void setFileStatus(FileStatus status);

    /** Synthetically derived from other data as: http://docs.sagebridge.org/<studyId>/<guid>/<createdOn> */
    String getDownloadURL();
    void setDownloadURL(String downloadURL);
}


public enum FileStatus {
    PENDING,
    AVAILABLE
}

Services & DAO

FileDao {
    getFiles(StudyIdentifier studyId);
    createFile(File file);
    updateFile(File file);
    deleteFilePermanently(StudyIdentifier studyId, String guid);
}

FileRevisionDao {
    getFileRevisions(StudyIdentifier studyId, String guid);
    createFileRevision(FileRevision revision);
    updateFileRevision(FileRevision revision);
    deleteFileRevision(StudyIdentifier studyId, String guid, DateTime createdOn);
}

FileService {
    getFiles(StudyIdentifier studyId);
    createFile(File file);
    updateFile(File file);
    deleteFile(StudyIdentifier studyId, String guid);
    deleteFilePermanently(StudyIdentifier studyId, String guid);
}

FileRevisionService {
    getFileRevisions(StudyIdentifier studyId, String guid);
    createFileRevision(FileRevision revision);
    updateFileRevision(FileRevision revision);
    finishFileRevisionUpload(StudyIdentifier studyId, String guid, DateTime createdOn);
    deleteFileRevision(StudyIdentifier studyId, String guid, DateTime createdOn);
}

S3 File Organization

There's currently a bucket for all documents in each environment. Treating that as root, documents would be filed in the following directory structure:

/<file-guid>.<createdOn>

For other files we've used the long value for createdOn, but the DateTime as a UTC ISO 8601 string might be easier to find when debugging, so I would prefer to use that.
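For example, java.time can render the createdOn long as the UTC ISO 8601 key segment (a sketch; the helper name is hypothetical):

```java
import java.time.Instant;

public class S3KeyNames {

    /** Renders an epoch-millis createdOn as a UTC ISO 8601 string for the S3 key suffix. */
    public static String isoSuffix(long createdOn) {
        // Instant.toString() emits ISO 8601 in UTC, e.g. 2019-03-19T17:00:00Z
        return Instant.ofEpochMilli(createdOn).toString();
    }
}
```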

Note that these files will be served to clients through the CDN, so not all requests will hit S3 (this helps keep download times acceptable regardless of where the client is located).