This page describes the workflow required to build, edit, and update the data model for MODEL-AD.

Schematic

Summary

Data modeling at Sage requires two in-house tools: Schematic and the Data Curator App (DCA). SCHEMATIC is an acronym for Schema Engine for Manifest Ingress and Curation. This Python-based tool is a schema-based metadata ingress ecosystem intended to streamline biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.

Install for data curator app:

Code Block
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install schematicpy

Setup Python Environment

Schematic runs on Python 3.10, so the Python environment must be controlled. pyenv is one option: https://fathomtech.io/blog/python-environments-with-pyenv-and-vitualenv/

Code Block
pyenv install 3.10.11
pyenv virtualenv 3.10.11 schematicpy_3_10_11
pyenv activate schematicpy_3_10_11
pip install schematicpy
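
Once the environment is active, the interpreter version can be checked from Python itself. The following is a minimal sketch, not part of Schematic: the helper name `is_supported_python` is hypothetical, and the check only encodes the statement above that Schematic runs on Python 3.10.

```python
import sys

# Assumption taken from this page: Schematic runs on Python 3.10.
SUPPORTED_MAJOR_MINOR = (3, 10)


def is_supported_python(version_info=None):
    """Return True when the major.minor version is 3.10."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == SUPPORTED_MAJOR_MINOR


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported_python() else "not supported"
    print(f"Python {current} is {status} for this workflow")
```

Running the script inside the activated virtualenv confirms that pyenv selected the intended interpreter before schematicpy is installed.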

Create a data model formatted as a CSV

Open question: where is the reference for how the data model needs to be formatted?

Convert data model from CSV to JSONLD

schematic schema convert input.csv output.jsonld

Upload Data

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/organize-your-data-upload#OrganizeyourDataUpload-FlattenedDataLayoutExample

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/uploading-data

https://dca-docs.scrollhelp.site/DCA/Working-version/ELITE/validate-and-submit-your-metadata

AD Data Models: https://github.com/adknowledgeportal/data-models

DCA app development version: https://dca-dev.app.sagebionetworks.org/

Abby's request for testing: https://sagebionetworks.slack.com/archives/C02A2FBN3G8/p1682116574295509

Edit Configuration

Set the required parameters in config.yml:

https://github.com/Sage-Bionetworks/schematic/blob/develop/config.yml

Using Schematic

Command Line Reference

https://sage-schematic.readthedocs.io/en/develop/cli_reference.html

Run commands from ~/schematic.

Data Model Development

A data model defines attributes (i.e. data elements) describing metadata associated with any given dataset type. The data model also describes relationships between these attributes.
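
To make the idea of attributes and relationships concrete, the sketch below reads a tiny two-column excerpt of a model and lists which attributes each attribute depends on. The sample rows are hypothetical, the function name `read_dependencies` is illustrative, and the assumption that DependsOn holds a comma-separated list of attribute names should be checked against the data model documentation.

```python
import csv
import io

# Hypothetical two-column excerpt of a data model; real models have more columns.
SAMPLE_MODEL = """Attribute,DependsOn
Patient,"PatientID, Sex"
PatientID,
Sex,
"""


def read_dependencies(csv_text):
    """Map each attribute to the list of attributes it depends on."""
    dependencies = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: DependsOn is a comma-separated list of attribute names.
        raw = row.get("DependsOn") or ""
        dependencies[row["Attribute"]] = [
            name.strip() for name in raw.split(",") if name.strip()
        ]
    return dependencies


if __name__ == "__main__":
    for attribute, depends_on in read_dependencies(SAMPLE_MODEL).items():
        print(attribute, "->", depends_on)
```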

Documentation

/wiki/spaces/SCHEM/pages/2473623559

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema#A.-Schema-properties-and-relationships

Create Data Model

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/Guide+How+to+use+Schematic+for+Data+Model+Development#Create-a-Data-Model

The data model is defined in a table, then stored (i.e. serialized) in a JSON-LD schema which specifies attributes as suggested by Schema.org.

https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g13aaf3b8358_0_0

https://github.com/adknowledgeportal/data-models

Sage Data Models for Reference

https://github.com/Sage-Bionetworks/1kD-model
https://portal.includedcc.org/dashboard
https://docs.google.com/spreadsheets/d/1w6zDfz3_yrCjjrqfpXBGNmd0LZL4B03gr1KfzJtk5Cs/edit#gid=674286209
https://docs.google.com/spreadsheets/d/1vDdcqt3Lgehyq1iCnlF1H9JZi63pLj-u/edit#gid=1939820452

Recommendations

Draw a diagram. A diagram is a useful reference when developing the model.
Start small with a basic skeleton and then build.
Use schematic in dev mode to convert the model to JSON-LD regularly and check for errors.

Requirements

The data model requires these columns:

Attribute
Description
ValidValues
DependsOn
required
source
parent
properties
dependsOnComponent
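
Before running a conversion, the CSV header can be checked against this list. This is a minimal sketch: the column names are copied verbatim from the list above, the helper `missing_columns` is hypothetical, and the exact spelling and capitalization expected by a given Schematic version may differ, so adjust the list to match the template in use.

```python
import csv
import io

# Column names copied verbatim from the list above (spelling is an assumption).
REQUIRED_COLUMNS = [
    "Attribute", "Description", "ValidValues", "DependsOn", "required",
    "source", "parent", "properties", "dependsOnComponent",
]


def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


if __name__ == "__main__":
    sample = "Attribute,Description,DependsOn\nPatient,A study participant,\n"
    print("Missing columns:", missing_columns(sample))
```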

Example Model

Github: https://github.com/Sage-Bionetworks/schematic/blob/develop/tests/data/example.model.csv
Formatted for readability:
https://docs.google.com/spreadsheets/d/1Wde5YBFtEa4GhO-smXgbVApGioBGNnc-95n4LY8YB_E/edit#gid=925738608

This model does NOT validate as provided.

Schematic DB

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema#Schemas-and-Schematic-DB

Schematic DB is a package used to ingress the manifests created by Schematic into a database.

Schematic DB will use any of these validation rules:
- str, float, num, int, date
- If no rule provided, defaults to a string type
- the attribute datatype is based on the rule
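
The defaulting behavior described above can be sketched as follows. The function `datatype_for` is hypothetical and is not Schematic DB code; it only illustrates the rule that an attribute without one of the listed rules is treated as a string. How Schematic DB maps each rule to a database column type is not covered here.

```python
# Rule names listed above; anything else falls back to the string type.
TYPE_RULES = ("str", "float", "num", "int", "date")
DEFAULT_TYPE = "str"


def datatype_for(validation_rule):
    """Return the type rule for an attribute, defaulting to 'str'."""
    rule = (validation_rule or "").strip()
    return rule if rule in TYPE_RULES else DEFAULT_TYPE


if __name__ == "__main__":
    for rule in ["int", "date", "", None]:
        print(repr(rule), "->", datatype_for(rule))
```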

Data Model Validation

/wiki/spaces/SCHEM/pages/2645262364

Data Model Visualization

Convert Data Model from CSV to JSON-LD

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/Guide+How+to+use+Schematic+for+Data+Model+Development#Convert-Data-Model

schematic schema convert model.csv
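
When the conversion is run often during model development, it can help to wrap the command in a small script. The sketch below only builds and runs the command shown above; the function names are hypothetical, and treating a nonzero exit code as a failed conversion is an assumption about the CLI's behavior.

```python
import shutil
import subprocess


def build_convert_command(csv_path):
    """Build the argument list for `schematic schema convert <csv_path>`."""
    return ["schematic", "schema", "convert", str(csv_path)]


def convert_model(csv_path):
    """Run the conversion; return True when the command exits with code 0."""
    if shutil.which("schematic") is None:
        raise FileNotFoundError(
            "schematic is not on PATH; activate the environment first"
        )
    result = subprocess.run(
        build_convert_command(csv_path), capture_output=True, text=True
    )
    if result.returncode != 0:
        # Assumption: errors from the CLI are written to stderr.
        print(result.stderr)
    return result.returncode == 0
```

Calling `convert_model("model.csv")` from the directory that contains the model mirrors the command-line invocation above.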

What is JSON-LD?

Data models are formatted in JSON-LD (JavaScript Object Notation for Linked Data). An advantage of using JSON-LD in schematic is its support by http://schema.org, which aids dataset discoverability in search engines such as Dataset Search.

Guide to Developing Data Models in JSON-LD

JSON-LD, or JavaScript Object Notation for Linked Data, is a JSON-based format for serializing Linked Data. It extends JSON with additional functionality to represent linked data structures, such as contexts, @id, and @type. JSON-LD is a lightweight and flexible format that can be used to represent a variety of data models.

JSON-LD Syntax

JSON-LD documents are valid JSON documents. They consist of key-value pairs, where the keys are strings and the values can be strings, numbers, booleans, null, objects, or arrays. JSON-LD documents can also contain keywords, which begin with @ and carry additional information about the data.

The following is an example of a simple JSON-LD document:

Code Block
{
  "@context": "https://schema.org/",
  "@id": "http://example.com/book1",
  "type": "Book",
  "name": "The Hitchhiker's Guide to the Galaxy",
  "author": "Douglas Adams"
}

This document describes a book with the following properties:

@context: The context URI specifies the vocabulary that is used to interpret the data. In this case, the vocabulary is Schema.org (https://schema.org/).
@id: The @id property uniquely identifies the resource. In this case, the resource is a book.
type: The type property specifies the type of the resource. In this case, the resource is a book.
name: The name property specifies the name of the book.
author: The author property specifies the author of the book.
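
Because a JSON-LD document is ordinary JSON, it can be loaded with any JSON parser. The sketch below uses Python's standard `json` module to separate the JSON-LD keywords (keys that start with @) from the remaining properties; the helper name `split_keys` is hypothetical, and the code does not perform any JSON-LD processing.

```python
import json

DOCUMENT = """
{
  "@context": "https://schema.org/",
  "@id": "http://example.com/book1",
  "type": "Book",
  "name": "The Hitchhiker's Guide to the Galaxy",
  "author": "Douglas Adams"
}
"""


def split_keys(document_text):
    """Separate JSON-LD keywords (keys starting with '@') from other keys."""
    data = json.loads(document_text)
    keywords = [key for key in data if key.startswith("@")]
    properties = [key for key in data if not key.startswith("@")]
    return keywords, properties


if __name__ == "__main__":
    keywords, properties = split_keys(DOCUMENT)
    print("keywords:", keywords)
    print("properties:", properties)
```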

JSON-LD Contexts

JSON-LD contexts map short, human-readable terms to IRIs (Internationalized Resource Identifiers). Contexts can also define prefixes for IRIs. This can make JSON-LD documents easier to read and write.

The @context property in a JSON-LD document supplies a context, either inline or as a URI. When a JSON-LD processor encounters a term or a prefixed name in a document, it uses the context to expand it to a full IRI.

For example, the following context defines a prefix for the Schema.org vocabulary:

Code Block
{
  "@context": {
    "schema": "https://schema.org/"
  }
}

Using this context, the following JSON-LD document can be interpreted:

Code Block
{
  "@context": {
    "schema": "https://schema.org/"
  },
  "@id": "http://example.com/book1",
  "@type": "schema:Book",
  "schema:name": "The Hitchhiker's Guide to the Galaxy",
  "schema:author": "Douglas Adams"
}

The type value is now written as schema:Book, a compact IRI that a processor expands to https://schema.org/Book using the prefix defined in the context.
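
The prefix mechanism can be illustrated with a few lines of Python. This is a deliberately simplified sketch of compact IRI expansion, not the full JSON-LD expansion algorithm, and the function name `expand_compact_iri` is hypothetical.

```python
def expand_compact_iri(value, context):
    """Expand 'prefix:suffix' using a prefix defined in the context.

    Values whose prefix is not defined in the context, such as absolute
    IRIs or plain terms, are returned unchanged.
    """
    prefix, separator, suffix = value.partition(":")
    if separator and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    return value


if __name__ == "__main__":
    context = {"schema": "https://schema.org/"}
    for value in ["schema:Book", "schema:Person", "http://example.com/book1"]:
        print(value, "->", expand_compact_iri(value, context))
```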

Modeling Entities and Relationships

Entities in a JSON-LD data model are represented by objects. Relationships between entities are represented by properties. For example, the following JSON-LD document describes a book and a person:

Code Block

{
  "@context": {
    "schema": "https://schema.org/"
  },
  "@id": "http://example.com/book1",
  "@type": "schema:Book",
  "schema:name": "The Hitchhiker's Guide to the Galaxy",
  "schema:author": {
    "@id": "http://example.com/douglas-adams",
    "@type": "schema:Person",
    "schema:name": "Douglas Adams"
  }
}

The author property in the book object refers to the person object. This indicates that Douglas Adams is the author of The Hitchhiker's Guide to the Galaxy.
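
Relationships of this kind can be listed by walking the document and recording every property whose value is a nested object with its own @id. The sketch below is illustrative only: the helper `find_relationships` is hypothetical, the sample document is a trimmed version of the one above, and the code handles nested objects but not arrays.

```python
BOOK = {
    "@context": {"schema": "https://schema.org/"},
    "@id": "http://example.com/book1",
    "@type": "schema:Book",
    "schema:author": {
        "@id": "http://example.com/douglas-adams",
        "@type": "schema:Person",
    },
}


def find_relationships(node):
    """Yield (subject @id, property, object @id) for nested objects."""
    subject = node.get("@id")
    for key, value in node.items():
        if key.startswith("@"):
            continue  # skip keywords such as @context, @id, and @type
        if isinstance(value, dict):
            if "@id" in value:
                yield (subject, key, value["@id"])
            yield from find_relationships(value)


if __name__ == "__main__":
    for subject, prop, obj in find_relationships(BOOK):
        print(subject, prop, obj)
```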

Using Vocabularies

Vocabularies are collections of terms and definitions that are used to describe data. JSON-LD data models can use vocabularies to provide a common understanding of the data.

There are many different vocabularies available. Some popular vocabularies include:

Schema.org
Dublin Core

The vocabulary should be relevant to the type of data that you are modeling.

Ontology Resources

Metadata Dictionary

AD Knowledge Portal Metadata Dictionary

https://sagebio.shinyapps.io/adknowledgeportalamp-ad-datametadata-curatordictionary/

Data Curator App

https://www.synapse.org/#!Synapse:syn33582398/wiki/619343

http://dca.app.sagebionetworks.org/

https://dca-dev.app.sagebionetworks.org

https://github.com/adknowledgeportal/data_curator

https://github.com/adknowledgeportal/test-data-model

Annotate study folder with contentType = 'dataset'
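
The annotation can be set in the Synapse web interface or from a script. The sketch below assumes the Python `synapseclient` package and its `get_annotations` and `set_annotations` methods; check these names against the installed client version. The function name and the folder ID in the usage comment are hypothetical.

```python
def annotate_as_dataset(syn, folder_id):
    """Set contentType = 'dataset' on a folder, keeping other annotations.

    `syn` is expected to be a logged-in synapseclient.Synapse instance.
    """
    annotations = syn.get_annotations(folder_id)
    annotations["contentType"] = "dataset"
    return syn.set_annotations(annotations)


# Usage (requires network access and Synapse credentials):
#   import synapseclient
#   syn = synapseclient.login()
#   annotate_as_dataset(syn, "syn00000000")  # hypothetical folder ID
```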

https://www.synapse.org/#!Synapse:syn36759435/tables/
Add CSV + JSONLD to github – models

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2458648589/Setting+up+a+DCC+Asset+Store#How-do-I-Structure-My-DCC-Synapse-Project-to-Work-with-the-Data-Curator-App%3F

Projects

Folder Structure

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/organize-your-data-upload#OrganizeyourDataUpload-FlattenedDataLayoutExample

Code Block

.
├── biospecimen_experiment_1
│   └── manifest1.csv
├── biospecimen_experiment_2
│   └── manifestA.csv
├── single_cell_RNAseq_batch_1
│   ├── manifestX.csv
│   ├── fileA.txt
│   ├── fileB.txt
│   ├── fileC.txt
│   └── fileD.txt
└── single_cell_RNAseq_batch_2
    ├── manifestY.csv
    └── file1.txt
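
A local copy of this layout can be checked before upload. The sketch below lists the CSV files found directly inside each top-level folder and reports folders that have none. The function names are hypothetical, and treating every CSV file in a dataset folder as a manifest is an assumption based on the example layout.

```python
from pathlib import Path


def manifests_by_folder(root):
    """Map each top-level dataset folder to the CSV files it contains."""
    layout = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            # Assumption: CSV files directly inside a folder are manifests.
            layout[folder.name] = sorted(
                path.name for path in folder.glob("*.csv")
            )
    return layout


def folders_without_manifest(root):
    """Return the names of dataset folders that contain no CSV manifest."""
    return [
        name
        for name, manifests in manifests_by_folder(root).items()
        if not manifests
    ]


if __name__ == "__main__":
    print(manifests_by_folder("."))
    print("Missing manifest:", folders_without_manifest("."))
```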