Design for Records-Based Metadata Curation
Use Case
A new Synapse project is established to capture data and curate metadata for a new study. The study involves collecting biological samples and for each sample gathering whole genome sequencing and RNA sequencing data. The project will capture and curate three types of data/metadata files:
Sample Metadata - CSV files where each row represents a sample. This metadata includes basic information about each sample such as when it was collected. Each sample row has a sample identifier.
Whole Genome Sequence - A file for each sample that contains the sample’s whole genome sequencing data. The metadata for each file must include the sample’s identifier.
RNA Sequencing - A file for each sample that contains RNA sequencing data. The metadata for each file must include the sample’s identifier.
For this project, data curators will be expected to curate both the data files’ metadata and the sample metadata. All metadata will be validated against a JSON Schema. Over time, both new sequence data files and new sample metadata files will need to be added to the project. New sample files might contain updates to existing samples in addition to new samples. Therefore, the system must correctly update and insert (upsert) the records in such sample files.
Note: In this use case we only have one type of CSV metadata (samples). However, we should expect that many projects will have multiple types of CSV metadata. Each type of CSV metadata would be maintained separately.
Definitions
File-based Metadata - Metadata that describes the contents of a data file. In Synapse, such metadata is typically added as annotations on a data file. For this use case, each type of sequencing data file will have metadata annotations that define the file’s data type and its sample identifier.
Data Normalization - One way to capture sample metadata on the sequencing data files would be to add a “copy” of the sample’s metadata as annotations on each data file. However, for this use case, the sample metadata would then be duplicated for each type of sequencing file. Any correction would have to be made in multiple locations, making it easy to create inconsistency if a location is missed. Data normalization is the process of storing data in a manner that eliminates duplicates. Since there is only one copy of the data, corrections can be made without causing inconsistency.
Record-based Metadata - A “record” is metadata that has been normalized to eliminate duplication and prevent inconsistency. Each row of a sample CSV is an example of a metadata record. Each sequencing data file “references” a sample record using the sample’s identifier (instead of duplicating the sample metadata). A project might have multiple record-based metadata types that would each receive separate treatment.
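The difference between duplicated and normalized metadata can be sketched in a few lines of Python. This is only an illustration: the column name collectedOn, the file names, and the resolve helper are assumptions made for the example and are not part of Synapse.

```python
# Hypothetical sample records, keyed by sample identifier (one copy per sample).
samples = {
    "S1": {"sampleId": "S1", "collectedOn": "2024-01-15"},
    "S2": {"sampleId": "S2", "collectedOn": "2024-02-03"},
}

# File annotations carry only the data type and a reference to the sample.
files = [
    {"name": "S1.wgs.bam", "dataType": "WholeGenomeSequence", "sampleId": "S1"},
    {"name": "S1.rna.fastq", "dataType": "RNASequencing", "sampleId": "S1"},
    {"name": "S2.wgs.bam", "dataType": "WholeGenomeSequence", "sampleId": "S2"},
]


def resolve(file_annotations, sample_records):
    """Join a file's annotations with the sample record it references."""
    sample = sample_records[file_annotations["sampleId"]]
    return {**sample, **file_annotations}


# A correction is made once, on the record...
samples["S1"]["collectedOn"] = "2024-01-16"

# ...and every file that references S1 sees the corrected value.
resolved = [resolve(f, samples) for f in files]
```

Because both S1 files only hold a reference, there is no second copy of collectedOn that could be missed during the correction.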
Goals
Add Record-Based Support
Synapse has established systems for managing and curating file-based metadata as annotations including the following:
Upsert - When new data files are added or updated within a File View’s scope, the File View is automatically updated to capture the changes from the Upsert. This provides a tabular “view” of all files within a scope where the file’s metadata are shown as columns of the view.
JSON Schema Support - If a JSON Schema is bound to a project/folder/file, Synapse will automatically check each file’s annotations against the bound schema. The validation results for each file are then available through the API and UI.
Grid Data Curation (new) - The contents of a file view can be loaded into a new data grid that allows a user to curate the metadata with real-time JSON Schema validation and AI agent assistance (upcoming). The final curated results can then be “exported” to update the annotations of the original files.
Synapse currently lacks equivalent systems for record-based metadata (like the sample files in the use case). An upsert can only be achieved using specialized command line tools. If a JSON schema is bound to a sample CSV, Synapse will validate the annotations on the CSV, but not the file’s content. Finally, we cannot currently populate a Synapse data grid with the contents of record-based metadata files. Therefore, the goal of this design is to describe a technical approach to supporting all three (upsert, schema validation, and grid support) for record-based metadata files.
Record-Based Flow
A record-based metadata flow starts with a data manager defining a “template”. The record-based template is simply a CSV file that contains a “header” row only (no record rows). The header row defines the ordered names of the record columns. A JSON schema can be bound to the template to define the schema of each record to be added to the template. Finally, the data manager defines the “upsert” key to be used to insert and update records into the template.
When a data contributor is ready to add metadata records they will have two options:
Enter new metadata records manually by inserting new rows into the template.
Upsert an existing CSV of metadata records into the template.
Each time a data contributor adds new metadata records the new records are merged into the template. This means the template is the “truth” of the metadata records for a specific type.
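The merge described above might look roughly like the following sketch. It assumes a single-process, in-memory merge using Python's csv module; the function name upsert and the sample column names are hypothetical, and the real implementation is expected to run as a job inside the grid rather than on raw CSV text.

```python
import csv
import io


def upsert(template_csv, incoming_csv, upsert_key):
    """Merge incoming rows into the template; returns (merged_csv, inserted, updated)."""
    reader = csv.DictReader(io.StringIO(template_csv))
    header = reader.fieldnames
    # Index the existing records by their upsert-key values, preserving row order.
    records = {tuple(row[k] for k in upsert_key): row for row in reader}

    inserted = updated = 0
    for row in csv.DictReader(io.StringIO(incoming_csv)):
        key = tuple(row[k] for k in upsert_key)
        if key in records:
            records[key].update(row)  # same key: update the existing record
            updated += 1
        else:
            # New key: insert, keeping only the template's columns.
            records[key] = {name: row.get(name, "") for name in header}
            inserted += 1

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records.values())
    return out.getvalue(), inserted, updated


template = "sampleId,collectedOn\n"  # header-only template
first = "sampleId,collectedOn\nS1,2024-01-15\nS2,2024-02-03\n"
second = "sampleId,collectedOn\nS2,2024-02-04\nS3,2024-03-01\n"

template, ins1, upd1 = upsert(template, first, ["sampleId"])
template, ins2, upd2 = upsert(template, second, ["sampleId"])
```

The first call inserts two records into the empty template. The second call updates S2 and inserts S3, so the template continues to hold exactly one row per sample.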
Metadata Tab
In the future, we plan to add support for a “data landscape survey” which will be used by data contributors to set up metadata tasks (see: https://www.canva.com/design/DAGuxJRcHiM/r9_N8Gl1oEItUZHJqulT2g/edit and Updated Decision Tree for Data Landscape Survey 📊 ). Both scenario A and scenario B end with the data contributor on the “metadata” tab. This tab lists tasks that will be assigned to users or teams that fulfill the data contributor or data curator role. Each task should guide the user in either uploading or curating data of a specific type. There are two main types of metadata tasks for the two types of metadata:
File-Based-Task - A task that informs a data contributor where to upload data files of a specific type. There would be a separate file-based-task for each type of data file. This task type would also provide a way for a contributor to join/start a grid session to curate the file metadata.
Records-Based-Task - A task that informs a data contributor where to edit or “upsert” changes for records of a specific type. Each type will have its own task. This task type would also provide a way for a contributor to join/start a grid session to curate the records.
For the MVP we do not need full support for scenario A, but we do need to add support for scenario B. This means the data landscape survey is out-of-scope, but we still need to add support for something like “tasks” under the metadata tab. The following task features are considered out-of-scope for the MVP:
Task assignment to users or teams.
Task state, such as progress towards completion.
Data landscape survey that creates new tasks.
The following task features are considered in-scope for the MVP:
Support for tasks for each type of file-based metadata.
Support for tasks for each type of record-based metadata.
Data managers should be able to perform task CRUD (Create, Read, Update, Delete). We will cover task creation in more detail in a later section.
Data contributors need a mechanism to match data types to the appropriate task under a project’s metadata tab.
Once a data contributor selects a task, it should guide them through the steps for uploading and curating data/metadata of that type.
Proposal
The following is a proposal of the API changes needed to meet the goals of both the addition of record-based support and metadata tasks.
RecordSet
We suggest adding a new Synapse Entity type called: RecordSet defined as follows:
{
"title": "RecordSet",
"description": "Captures record-based metadata as a special type of CSV.",
"extends": {
"$ref": "org.sagebionetworks.repo.model.FileEntity"
},
"properties": {
"upsertKey": {
"description": "One or more column names that define the upsert key for this set. This key is used to determine if a new record should be treated as an update or an insert.",
"type": "array",
"items": {
"type": "string"
}
},
"csvDescriptor": {
"description": "Defines how to both read from and write to this CSV file. If excluded then all of the default values will be used. Note: The 'isFirstLineHeader' must be true.",
"$ref": "org.sagebionetworks.repo.model.table.CsvTableDescriptor"
}
}
}
Since a RecordSet is a FileEntity, all features available to FileEntities will be available to a RecordSet. This means each RecordSet will be added to a folder and will be managed similarly to a FileEntity. In fact, the following features of a RecordSet will behave exactly like a FileEntity:
Access Control List
Access Requirements
JSON Schema Binding (While the binding is the same, a bound schema to a RecordSet will be used to validate its rows instead of its annotations).
For the MVP, a data manager is expected to create a “template” as a RecordSet. The data manager is expected to do the following:
Create a CSV containing a single header row that defines the ordered column names, and use the resulting fileHandle ID to create the RecordSet
Set the CsvTableDescriptor, if anything other than the default values are used. For example, if you wish to use tabs instead of commas then the separator should be set to '\t'.
Add the RecordSet to the desired folder within the project.
Set the appropriate ACL to control access. Note: Since it is a FileEntity, users will need the “download” permission to be able to download the CSV.
Work with ACT to establish the appropriate AccessRequirements.
Bind a JSON Schema to either the RecordSet or its folder/hierarchy.
Set the ‘upsertKey’ that defines if new rows are inserts or updates.
Create a Metadata task for each RecordSet.
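The first few template steps can be sketched as follows. This is a local illustration only: it builds the header-only CSV and the RecordSet body but does not upload anything. The helper names, the placeholder IDs ("syn123", "456"), and the check that every upsertKey column appears in the header are assumptions made for the example. The name, parentId and dataFileHandleId fields are those inherited from FileEntity, and separator and isFirstLineHeader are CsvTableDescriptor fields.

```python
import csv
import io
import json


def build_template_csv(column_names, separator=","):
    """Return a CSV containing only the header row (no record rows)."""
    out = io.StringIO()
    csv.writer(out, delimiter=separator, lineterminator="\n").writerow(column_names)
    return out.getvalue()


def build_record_set(name, parent_id, file_handle_id, column_names, upsert_key, separator=","):
    """Assemble a RecordSet body; the IDs passed in are placeholders."""
    # Assumption: an upsert key only makes sense if its columns exist in the header.
    missing = [k for k in upsert_key if k not in column_names]
    if missing:
        raise ValueError(f"upsertKey columns not in header: {missing}")
    body = {
        "name": name,
        "parentId": parent_id,
        "dataFileHandleId": file_handle_id,
        "upsertKey": list(upsert_key),
    }
    if separator != ",":
        # Only needed when deviating from the defaults; the header line is mandatory.
        body["csvDescriptor"] = {"separator": separator, "isFirstLineHeader": True}
    return body


columns = ["sampleId", "collectedOn"]
template_csv = build_template_csv(columns, separator="\t")
record_set = build_record_set("samples", "syn123", "456", columns, ["sampleId"], separator="\t")
print(json.dumps(record_set, indent=2))
```

In practice the CSV text would be uploaded first, and the resulting fileHandle ID would replace the "456" placeholder.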
We will extend the new data grid feature as follows:
Data contributors will be able to start a new grid session from a RecordSet. The ‘csvDescriptor’ will be used to determine how to read the CSV into the grid.
We will add a new asynchronous job to support the upsert into the grid. This job will use the upsert key to determine if a row should be inserted or updated.
The grid export job will be extended to write a new fileHandle of a CSV that will be applied back to the original RecordSet as a new version. The CSV will be written using the ‘csvDescriptor’.
The JSON schema validation for a RecordSet will occur within the grid, if a JSON schema is bound. The export job will also add a ValidationSummaryStatistics object to a RecordSet. This provides a summary of the validation state of each record within the CSV.
The rows of a RecordSet should only be editable via the grid.
In the future we will add a RecordSetView that will be a “view” of a single RecordSet. This view will be updated anytime the source RecordSet changes.
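To make row-level validation concrete, the sketch below validates each record of a CSV and counts the results. It is a deliberately tiny stand-in: only 'required' and 'pattern' are checked, empty cells are treated as missing (an assumption), and the returned dictionary is a hypothetical summary rather than the actual fields of ValidationSummaryStatistics. A real implementation would use a full JSON Schema validator.

```python
import csv
import io
import re

# Hypothetical schema for a sample record; only 'required' and 'pattern' are
# checked below, which is far less than a real JSON Schema validator does.
SAMPLE_SCHEMA = {
    "required": ["sampleId", "collectedOn"],
    "properties": {
        "sampleId": {"type": "string", "pattern": r"^S\d+$"},
        "collectedOn": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
    },
}


def validate_record(record, schema):
    """Return a list of human-readable problems for one record (empty if valid)."""
    problems = []
    for name in schema.get("required", []):
        if not record.get(name):  # assumption: an empty cell counts as missing
            problems.append(f"missing required value: {name}")
    for name, rules in schema.get("properties", {}).items():
        value = record.get(name)
        pattern = rules.get("pattern")
        if value and pattern and not re.search(pattern, value):
            problems.append(f"{name} does not match {pattern}")
    return problems


def summarize(csv_text, schema):
    """Validate every row and count valid and invalid records."""
    results = [validate_record(row, schema) for row in csv.DictReader(io.StringIO(csv_text))]
    invalid = sum(1 for problems in results if problems)
    return {"total": len(results), "valid": len(results) - invalid, "invalid": invalid}


records_csv = "sampleId,collectedOn\nS1,2024-01-15\nS2,\nX3,2024-03-01\n"
summary = summarize(records_csv, SAMPLE_SCHEMA)
```

Here the S2 row is invalid because collectedOn is empty, and the X3 row is invalid because its identifier does not match the pattern.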
There was a suggestion that we consider using the CSVW standard as an alternative to both the ‘upsertKey’ and the ‘csvDescriptor’. This approach seems promising. However, CSVW is a large standard that potentially duplicates/overlaps with our use of JSON Schemas for validation. We estimate that it would be more work to adopt the CSVW standard than to proceed with the current suggestion. Therefore, for the MVP, we will proceed with both the ‘upsertKey’ and the ‘csvDescriptor’ instead of using CSVW. If we later want to adopt CSVW, we should be able to migrate RecordSets with ‘upsertKey’ and ‘csvDescriptor’ to a new CSVW solution.
Curation Task API
We are planning to add a new “Metadata” tab for Synapse projects. This tab will list the full menu of metadata-related “curation tasks” for that project. When a data contributor is ready to contribute either file-based data/metadata or record-based data, they will go to the metadata tab. They will then filter or scroll through the available tasks and find the task that matches the data type they wish to contribute. Once a task is selected, the task will guide the contributor through the process for data of that type. This guidance will greatly depend on the type of task, file-based or record-based:
File-based Guidance - For tasks of this type a data contributor will be instructed to upload data files to the folder configured for that task. After uploading to the folder, the contributor will be instructed to join/start a grid session to curate the files’ metadata. The contributor does not need to know about the FileView that defines which rows and columns should be included in the grid’s session. Once curation is complete, the grid session will be ‘exported’ to update the annotations of the files involved.
Record-Based Guidance - For tasks of this type a data contributor will be instructed to join/start a grid session. The contributor does not need to know about the RecordSet that is used to define the rows and columns of the grid session. When a data contributor wants to “upsert” records, they will do so from within the grid (open question: will the grid include an ‘upsert’ button?). Once curation is complete, the grid session will be ‘exported’ to create a new version of the RecordSet.
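One way a client could turn a task into guidance is to branch on the concreteType of its taskProperties, as in the sketch below. The function name guidance_steps, the wording of the instructions, and the "syn789" ID are assumptions made for illustration. The property names and concreteType values are taken from the task model in this proposal.

```python
FILE_BASED = "org.sagebionetworks.repo.model.curation.metadata.FileBasedMetadataTaskProperties"
RECORD_BASED = "org.sagebionetworks.repo.model.curation.metadata.RecordBasedMetadataTaskProperties"


def guidance_steps(task):
    """Return (instruction, target synId) pairs; the UI, not the user, handles the IDs."""
    props = task["taskProperties"]
    concrete_type = props["concreteType"]
    if concrete_type == FILE_BASED:
        return [
            ("Upload your data files", props["uploadFolderId"]),
            ("Join or start a grid session to curate the file annotations", props["fileViewId"]),
            ("Export the grid session to update the file annotations", props["fileViewId"]),
        ]
    if concrete_type == RECORD_BASED:
        return [
            ("Join or start a grid session", props["recordSetId"]),
            ("Upsert or edit records within the grid", props["recordSetId"]),
            ("Export the grid session to create a new RecordSet version", props["recordSetId"]),
        ]
    raise ValueError(f"unknown task type: {concrete_type}")


sample_task = {
    "dataType": "samples",
    "taskProperties": {"concreteType": RECORD_BASED, "recordSetId": "syn789"},
}
steps = guidance_steps(sample_task)
```

Keeping the synIds as targets of each step lets the UI open the right folder, view, or RecordSet without the contributor ever needing to know which one backs the task.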
We propose adding a task API that will allow data managers to create curation tasks for each type of data/metadata. This API will also be used to list tasks under the project’s metadata tab. We will provide task filtering if needed for projects with many types of data.
The following JSON schemas define the task model objects:
CurationTask.json:
{
"description": "The CurationTask provides instructions for a Data Contributor on how data or metadata of a specific type should be both added to a project and curated. There should be a CurationTask for each type of data/metadata to be contributed to a project. There are currently two categories of curation tasks: file-based metadata collection and record-based metadata collection. For each category there will be a concrete implementation of this interface. This interface defines the common fields of all CurationTasks.",
"properties": {
"taskId": {
"type": "integer",
"description": "The unique identifier issued to this task when it was created."
},
"dataType": {
"type": "string",
"description": "Will match the data type that a contributor plans to contribute. The dataType must be unique within a project."
},
"projectId": {
"type": "string",
"description": "The synId of the project."
},
"instructions": {
"type": "string",
"description": "Instructions to the data contributor."
},
"etag": {
"type": "string",
"description": "Synapse employs an Optimistic Concurrency Control (OCC) scheme to handle concurrent updates. Since the E-Tag changes every time an entity is updated it is used to detect when a client's current representation of an entity is out-of-date.",
"transient": true
},
"createdOn": {
"type": "string",
"format": "date-time"
},
"modifiedOn": {
"type": "string",
"format": "date-time"
},
"createdBy": {
"type": "string"
},
"modifiedBy": {
"type": "string"
},
"taskProperties": {
"$ref": "org.sagebionetworks.repo.model.curation.CurationTaskProperties"
}
}
}
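The etag field implies the usual optimistic-concurrency handshake on update: the client sends back the etag it last read, and the update is rejected if the stored etag has changed since. The sketch below illustrates that handshake with an in-memory stand-in. TaskStore, ConflictError, and the use of random UUIDs as etags are assumptions made for the example, not the service implementation.

```python
import uuid


class ConflictError(Exception):
    """Raised when an update carries a stale etag."""


class TaskStore:
    """In-memory stand-in for the service side of task create/update."""

    def __init__(self):
        self._tasks = {}
        self._next_id = 1

    def create(self, task):
        stored = {**task, "taskId": self._next_id, "etag": str(uuid.uuid4())}
        self._tasks[self._next_id] = stored
        self._next_id += 1
        return dict(stored)

    def update(self, task):
        current = self._tasks[task["taskId"]]
        if task.get("etag") != current["etag"]:
            raise ConflictError("task was modified by someone else; fetch it again")
        stored = {**task, "etag": str(uuid.uuid4())}  # every update issues a new etag
        self._tasks[task["taskId"]] = stored
        return dict(stored)


store = TaskStore()
created = store.create({"dataType": "samples", "projectId": "syn123", "instructions": "v1"})
stale = dict(created)  # a second client's copy, read before the first update

updated = store.update({**created, "instructions": "v2"})
try:
    store.update({**stale, "instructions": "v3"})
    conflict = False
except ConflictError:
    conflict = True
```

The second update fails because it still carries the etag from before the first update, which is exactly the out-of-date case the field description refers to.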
CurationTaskProperties.json:
{
"description": "The properties of a CurationTask",
"type": "interface",
"properties": {
"concreteType": {
"type": "string",
"description": "Indicates which implementation of CurationTaskProperties this object represents. Possible values include: 'org.sagebionetworks.repo.model.curation.metadata.FileBasedMetadataTaskProperties' and 'org.sagebionetworks.repo.model.curation.metadata.RecordBasedMetadataTaskProperties'."
}
}
}
FileBasedMetadataTaskProperties.json:
{
"description": "A CurationTaskProperties for file-based data, describing where data is uploaded and a view which contains the annotations.",
"implements": [
{
"$ref": "org.sagebionetworks.repo.model.curation.CurationTaskProperties"
}
],
"properties": {
"uploadFolderId": {
"type": "string",
"description": "The synId of the folder where data files of this type are to be uploaded."
},
"fileViewId": {
"type": "string",
"description": "The synId of the FileView that shows all data of this type. This FileView will be used to start new grid sessions for file annotation curation."
}
}
}
RecordBasedMetadataTaskProperties.json:
{
"description": "A CurationTaskProperties for record-based metadata",
"implements": [
{
"$ref": "org.sagebionetworks.repo.model.curation.CurationTaskProperties"
}
],
"properties": {
"recordSetId": {
"type": "string",
"description": "The synId of the RecordSet that will contain all record-based metadata for a specific type. This RecordSet will be used to start new grid sessions for both 'upsert' and record-based curation for this type."
}
}
}
ListCurationTaskRequest.json:
{
"description": "Request for a single page of CurationTasks for a project.",
"properties": {
"projectId": {
"type": "string",
"description": "The project ID. Required.",
"required": true
},
"nextPageToken": {
"type": "string",
"description": "Forward the returned 'nextPageToken' to get the next page of results."
}
}
}
ListCurationTaskResponse.json:
{
"description": "A single page of CurationTasks.",
"properties": {
"page": {
"type": "array",
"items":{
"$ref" : "org.sagebionetworks.repo.model.curation.CurationTask"
}
},
"nextPageToken": {
"type": "string",
"description": "Forward this token to get the next page of results."
}
}
}
Note: Each CurationTask belongs to a project.
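A client would page through a project's tasks by forwarding nextPageToken until none is returned. The loop below sketches this against a fake in-memory service. The function names, the page size, and the offset-style token are all assumptions, and the real token should be treated as opaque.

```python
def list_all_tasks(fetch_page, project_id):
    """Follow nextPageToken until the service stops returning one."""
    tasks, token = [], None
    while True:
        request = {"projectId": project_id}
        if token is not None:
            request["nextPageToken"] = token
        response = fetch_page(request)
        tasks.extend(response.get("page", []))
        token = response.get("nextPageToken")
        if not token:
            return tasks


def fake_fetch_page(request, page_size=2):
    """Hypothetical in-memory service: the token is simply an offset."""
    all_tasks = [{"taskId": i, "projectId": request["projectId"]} for i in range(1, 6)]
    start = int(request.get("nextPageToken", 0))
    response = {"page": all_tasks[start:start + page_size]}
    if start + page_size < len(all_tasks):
        response["nextPageToken"] = str(start + page_size)
    return response


tasks = list_all_tasks(fake_fetch_page, "syn123")
```

Passing the page-fetching function in as a parameter keeps the loop independent of how the request is actually sent.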
APIs:
Type | Response | URL | Request | Description |
---|---|---|---|---|
POST | ListCurationTaskResponse | curation/task/list | ListCurationTaskRequest | List the curation tasks for a project with filtering/sorting options |
POST | CurationTask | curation/task | CurationTask | Create a new CurationTask |
PUT | CurationTask | curation/task/<task_id> | CurationTask | Update a CurationTask |
GET | CurationTask | curation/task/<task_id> | | Get a CurationTask by its taskId. |
DELETE | | curation/task/<task_id> | | Delete a CurationTask by its taskId. |
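The calls in the table can be summarized as a small routing map. The sketch below only builds the (method, path, body) triple for each proposed call and sends nothing. The action names and the build_request helper are hypothetical, and the rule that POST/PUT calls need a body while GET/DELETE calls take none is inferred from the Request column.

```python
# Method and path for each proposed call, copied from the API table.
ROUTES = {
    "list": ("POST", "curation/task/list"),
    "create": ("POST", "curation/task"),
    "update": ("PUT", "curation/task/{task_id}"),
    "get": ("GET", "curation/task/{task_id}"),
    "delete": ("DELETE", "curation/task/{task_id}"),
}


def build_request(action, task_id=None, body=None):
    """Return the (method, path, body) triple for one of the proposed calls."""
    method, path = ROUTES[action]
    if "{task_id}" in path:
        if task_id is None:
            raise ValueError(f"'{action}' requires a task_id")
        path = path.format(task_id=task_id)
    if method in ("GET", "DELETE") and body is not None:
        raise ValueError(f"'{action}' does not take a request body")
    if method in ("POST", "PUT") and body is None:
        raise ValueError(f"'{action}' requires a request body")
    return method, path, body


list_call = build_request("list", body={"projectId": "syn123"})
update_call = build_request("update", task_id=42, body={"taskId": 42, "etag": "abc"})
delete_call = build_request("delete", task_id=42)
```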
FAQ