ISB Project Background and Use Cases

This is a DRAFT. 

This document captures the background and use cases that arise in the ISB project. It will help the Synapse Engineering team understand the outstanding technical unknowns and problems for this project.

While reading this document, please keep in mind that much of the business logic is yet to be determined. Even though I am trying to capture Milen's vision, not all details are fixed, so please feel free to suggest changes.

Background

Milen Nikolov, please correct me if I'm wrong here. Please feel free to reword my words in this section.
We have different groups of collaborators with data stored in their own storage locations, including hard drives, AWS S3, and Google Cloud Storage. There is valuable metadata that can be applied to these data. The goal of the project is to bring these datasets together, apply metadata to them, and enable searching for datasets based on the metadata and launching analysis tools like Google BigQuery on the data. We want to use Synapse Entity Annotations to track metadata on the datasets, and Synapse Teams to manage access to the datasets.

There is a list of known datasets and metadata. (Yet to be mapped?)

As part of this project, we are working on defining the list of supported storage locations.

Workflow

In the sections below, the "Metadata Service" may be a stand-alone service outside of Synapse, built and maintained by the ISB working group and Sage.

Data Submission

  1. A data contributor first makes a call to the Metadata Service, asking: "given this dataset, what metadata should I specify?"
    The Metadata Service responds with a manifest schema (the list of annotations).
  2. The data contributor fills in the manifest file with explicit details for each file in the dataset and submits the manifest file to the Metadata Service.
    The Metadata Service validates the submitted manifest file and responds with either "the manifest file is complete" or "changes need to be made to the manifest".
    This step may repeat multiple times before the manifest file is complete.
  3. The data contributor makes a call to the Metadata Service, asking: "where should this dataset be uploaded?"
    The Metadata Service responds with a Synapse Project, an AWS S3 bucket, or a Google Cloud Storage bucket.
  4. Either:
    a. The data contributor uploads the dataset to Synapse; or
    b. The data contributor uploads the dataset to an AWS S3 bucket or a Google Cloud Storage bucket that Synapse does not have write permissions to. The data contributor then creates external File Handles in Synapse that point to the uploaded data.
  5. The data contributor annotates the dataset/links using the manifest file.
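The request/validate loop in steps 1-2 can be sketched as follows. This is a minimal illustration only: the Metadata Service API is not yet defined, and the schema fields, function names, and validation rules below are assumptions, not the real interface.

```python
# Hypothetical sketch of the manifest request/validate loop (steps 1-2).
# The field names in REQUIRED_FIELDS are assumed, not the real schema.

REQUIRED_FIELDS = ["fileName", "assay", "species"]  # assumed manifest schema

def get_manifest_schema():
    """Step 1: the service returns the list of annotations to fill in."""
    return REQUIRED_FIELDS

def validate_manifest(rows):
    """Step 2: the service either accepts the manifest or lists needed changes."""
    problems = []
    for i, row in enumerate(rows):
        missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
    return ("complete", []) if not problems else ("changes needed", problems)

# The contributor iterates until the manifest is complete:
manifest = [{"fileName": "sample1.bam", "assay": "RNA-seq", "species": ""}]
status, issues = validate_manifest(manifest)   # "changes needed"
manifest[0]["species"] = "Homo sapiens"
status, issues = validate_manifest(manifest)   # "complete"
```

In practice the loop would run over HTTP against the service; the point here is only the shape of the exchange: schema out, manifest in, accept-or-list-changes back.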

A data contributor may perform multiple Data Submissions. After a Data Submission, the data is called staging data. Members of the ISB project will curate the staging data before pushing it to production, where it becomes discoverable by the target audience.

Manage Access

ISB members will set up:

  • Synapse Projects with different storage locations and different permissions for different Synapse Teams.
  • AWS S3 bucket policies.
  • Google Cloud Storage policies.

ISB members want to use Synapse Teams as the central tool to manage access to all datasets in this project.
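As one concrete reference point for the S3 side of this setup, here is a sketch of a minimal read-only bucket policy document. The bucket name and principal ARN are placeholders, and how such policies would be derived from Synapse Team membership is still to be determined.

```python
import json

# Illustrative only: a minimal read-only S3 bucket policy ISB members might
# attach to an ISB-managed bucket. Bucket name and principal ARN are
# placeholders; the Synapse-Teams-driven access model is still TBD.
def read_only_policy(bucket, principal_arn):
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "ISBReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": principal_arn},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
        }],
    }

policy_json = json.dumps(
    read_only_policy("isb-staging-bucket",
                     "arn:aws:iam::123456789012:role/isb-reader"))
# This JSON would then be applied to the bucket via the AWS console, CLI,
# or an SDK call such as boto3's put_bucket_policy.
```

A Google Cloud Storage bucket would need an equivalent IAM policy expressed in GCP's own policy format, which is one reason the "central tool" question above matters.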

Access Data

A data consumer enters the production data collections on Synapse and explores the data for the ISB project.

  1. S/he may want to export the manifest file that represents the metadata for the dataset/data collection.
  2. S/he will submit this manifest file to the Metadata Service to get a graph object that represents the metadata of the data.
    S/he would then import this graph object into a graph tool to explore the data.
  3. S/he may want to run a Google BigQuery query on the dataset/data collections.
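The "graph object" in step 2 is not yet specified. One way to picture it is a simple node/edge structure derived from manifest rows, where each file is linked to its annotation values. The sketch below is an assumption about that shape, not the service's actual output format.

```python
# Sketch of what the step-2 "graph object" might look like: manifest rows
# turned into nodes (files and annotation values) and edges (file ->
# annotation). The real format returned by the Metadata Service is TBD.

def manifest_to_graph(rows):
    nodes, edges = set(), set()
    for row in rows:
        file_node = ("file", row["fileName"])
        nodes.add(file_node)
        for key, value in row.items():
            if key == "fileName" or not value:
                continue
            annot_node = (key, value)
            nodes.add(annot_node)
            edges.add((file_node, annot_node))  # file --annotated-with--> value
    return {"nodes": sorted(nodes), "edges": sorted(edges)}

graph = manifest_to_graph([
    {"fileName": "sample1.bam", "assay": "RNA-seq", "species": "Homo sapiens"},
    {"fileName": "sample2.bam", "assay": "RNA-seq", "species": "Mus musculus"},
])
# Files sharing an annotation value (here, assay "RNA-seq") share a neighbor,
# which is what makes the graph useful to explore in a graph tool.
```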

Synapse Related Use Cases

  1. Link existing datasets from an AWS S3 bucket to Synapse.
  2. Link existing datasets from Google Cloud Storage to Synapse.
  3. Upload data from a collaborator's drive to a Synapse Project, storing the data in Synapse Storage, an ISB-managed AWS S3 bucket, or ISB-managed Google Cloud Storage.
  4. Upload data from a collaborator-managed AWS S3 bucket to a Synapse Project, storing the data in Synapse Storage, an ISB-managed AWS S3 bucket, or ISB-managed Google Cloud Storage.
  5. Upload data from a collaborator-managed Google Cloud Storage bucket to a Synapse Project, storing the data in Synapse Storage, an ISB-managed AWS S3 bucket, or ISB-managed Google Cloud Storage.
  6. Run Google BigQuery on data that is indexed in a Synapse Project and stored in Google Cloud Storage, with Synapse authorizing access.
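Use cases 1-2 amount to creating external File Handles that point at objects Synapse does not store. The sketch below shows the shape of such a request body; the field names follow the Synapse ExternalFileHandle model as I understand it, but should be checked against the current Synapse REST API docs, and the URLs are placeholders.

```python
# Sketch of use cases 1-2: indexing data that already lives in S3 or GCS by
# creating external File Handles in Synapse. Field names follow the Synapse
# ExternalFileHandle model (verify against current API docs); URLs are fake.

def external_file_handle(url, file_name, content_type="application/octet-stream"):
    return {
        "concreteType": "org.sagebionetworks.repo.model.file.ExternalFileHandle",
        "externalURL": url,
        "fileName": file_name,
        "contentType": content_type,
    }

# One S3 object and one GCS object, indexed without Synapse holding the bytes:
handles = [
    external_file_handle(
        "https://s3.amazonaws.com/example-bucket/sample1.bam", "sample1.bam"),
    external_file_handle(
        "https://storage.googleapis.com/example-bucket/sample2.bam", "sample2.bam"),
]
# Each body would be POSTed to the Synapse file handle service, then wrapped
# in a File entity under the target Project so it is annotatable/searchable.
```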

Questions

  1. Can we treat Google Cloud Storage as an external S3 bucket?
    Are our S3-like bucket APIs sufficient to support Google Cloud Storage as an S3-like storage location?
  2. Can we set up Synapse as an Authorization Server for Google Cloud Storage?
  3. What does BigQuery do?
    How does it differ from other AWS tools for analysis?
    Does any AWS tool or combination of tools replace the need for BigQuery?
  4. Is there a technical reason why we need to host data on Google Cloud Storage?