Last updated on 2023-10-25
This page describes the workflow required to build, edit, and update the data model for MODEL-AD.
Schematic
Summary
Data modeling at Sage requires two in-house tools: Schematic and the Data Curator App (DCA). SCHEMATIC is an acronym for Schema Engine for Manifest Ingress and Curation. The Python-based tool is a schema-based metadata ingress ecosystem intended to streamline biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.
Documentation
Guide: How to use Schematic for Data Model Development
Code in Github
https://github.com/Sage-Bionetworks/schematic
Installation
https://pypi.org/project/schematicpy/
Install for data curator app:
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install schematicpy
Setup Python Environment
Schematic runs on Python 3.10, so the Python environment must be controlled. pyenv is one option: https://fathomtech.io/blog/python-environments-with-pyenv-and-vitualenv/
pyenv install 3.10.10
pyenv virtualenv 3.10.10 py_3_10_10
pyenv activate py_3_10_10
pip install schematicpy
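Once the environment is active, a quick check from inside Python can confirm the interpreter version. The sketch below is illustrative only: the helper name is_supported_python is hypothetical, and the (3, 10) requirement is an assumption taken from the statement above rather than from Schematic's own packaging metadata.

```python
import sys

# Assumption: this page states that Schematic runs on Python 3.10.
REQUIRED_PYTHON = (3, 10)


def is_supported_python(version_info=None):
    """Return True when the major.minor version equals REQUIRED_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == REQUIRED_PYTHON


if __name__ == "__main__":
    if is_supported_python():
        print("Python version matches the one expected for Schematic")
    else:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        print(f"Expected Python 3.10, found {found}")
```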
Edit Configuration
The following parameters need to be set in the config.yml
https://github.com/Sage-Bionetworks/schematic/blob/develop/config.yml
Using Schematic
Command Line Reference
https://sage-schematic.readthedocs.io/en/develop/cli_reference.html
Commands need to be run from ~/schematic.
Data Model Development
A data model defines attributes (i.e. data elements) describing metadata associated with any given dataset type. The data model also describes relationships between these attributes.
Documentation
/wiki/spaces/SCHEM/pages/2473623559
Create Data Model
https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/Guide+How+to+use+Schematic+for+Data+Model+Development#Create-a-Data-Model
The data model is defined in a table, then stored (i.e. serialized) in a JSON-LD schema which specifies attributes as suggested by Schema.org.
/wiki/spaces/SCHEM/pages/2473623559
https://github.com/adknowledgeportal/data-models
Sage Data Models for Reference
Recommendations
Draw a diagram. A diagram is a useful reference when developing the model.
Start small with a basic skeleton and then build.
Use Schematic in dev mode to convert the model to JSON-LD regularly to check for errors.
Requirements
The data model requires these columns:
Attribute
Description
ValidValues
DependsOn
required
source
parent
properties
dependsOnComponent
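Before converting a model, it can help to confirm that the CSV header contains every column listed above. The following is a minimal sketch, not part of Schematic: missing_columns is a hypothetical helper, the column names are copied from this page, and the comparison ignores case and spaces on the assumption that headers such as "Valid Values" and "ValidValues" should be treated alike.

```python
import csv
import io

# Column names copied from the list on this page.
REQUIRED_COLUMNS = [
    "Attribute", "Description", "ValidValues", "DependsOn", "required",
    "source", "parent", "properties", "dependsOnComponent",
]


def _normalize(name):
    """Lowercase a header and drop spaces so spelling variants compare equal."""
    return name.replace(" ", "").lower()


def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {_normalize(name) for name in header}
    return [col for col in REQUIRED_COLUMNS if _normalize(col) not in present]
```

For a real file, the text can be read with open(path, newline="") and passed to missing_columns.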
Example Model
Github: https://github.com/Sage-Bionetworks/schematic/blob/develop/tests/data/example.model.csv
Formatted for readability:
This model does NOT validate as provided.
Schematic DB
Schematic DB is a package used to ingress the manifests created by Schematic into a database.
Schematic DB will use any of these validation rules:
str, float, num, int, date
If no rule is provided, the attribute defaults to a string type.
The attribute datatype is based on the rule.
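The rule-to-datatype behavior described above can be pictured as a small lookup with a string fallback. This is a sketch of the idea only, not Schematic DB's implementation: datatype_for is a hypothetical function, and it assumes an attribute's validation rules arrive as a list of rule names.

```python
# Type rules named on this page.
TYPE_RULES = ("str", "float", "num", "int", "date")


def datatype_for(validation_rules):
    """Return the first recognized type rule, falling back to "str"."""
    for rule in validation_rules or []:
        name = rule.strip()
        if name in TYPE_RULES:
            return name
    # No type rule was provided, so default to a string type.
    return "str"
```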
Data Model Validation
/wiki/spaces/SCHEM/pages/2645262364
Data Model Visualization
Convert Data Model from CSV to JSON-LD
https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/Guide+How+to+use+Schematic+for+Data+Model+Development#Convert-Data-Model
schematic schema convert model.csv
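To run the conversion regularly while editing, the command above can be wrapped in a short Python helper. This is a hedged sketch: build_convert_command and convert_model are hypothetical names, only the command shown above is used, and it assumes the schematic executable is on the PATH of the active environment.

```python
import subprocess


def build_convert_command(model_csv):
    """Build the argument list for the conversion command shown above."""
    return ["schematic", "schema", "convert", model_csv]


def convert_model(model_csv):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_convert_command(model_csv), check=True)
```

Calling convert_model("model.csv") after each round of edits surfaces conversion errors early, in line with the recommendation to convert often.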
What is JSON-LD?
Data models are formatted in JSON-LD (JavaScript Object Notation for Linked Data). An advantage of JSON-LD in Schematic is its support by http://schema.org, which aids dataset discoverability in search engines like Dataset Search.
Guide to Developing Data Models in JSON-LD
JSON-LD, or JavaScript Object Notation for Linked Data, is a JSON-based format for serializing Linked Data. It extends JSON with additional functionality to represent linked data structures, such as contexts, @id, and @type. JSON-LD is a lightweight and flexible format that can be used to represent a variety of data models.
JSON-LD Syntax
JSON-LD documents are valid JSON documents. They consist of key-value pairs, where the keys are strings and the values can be strings, numbers, objects, arrays, or booleans. JSON-LD documents can also contain additional keywords that provide additional information about the data.
The following is an example of a simple JSON-LD document:
{
  "@context": "https://schema.org/",
  "@id": "http://example.com/book1",
  "type": "Book",
  "name": "The Hitchhiker's Guide to the Galaxy",
  "author": "Douglas Adams"
}
This document describes a book with the following properties:
@context: The context URI specifies the vocabulary that is used to interpret the data. In this case, the vocabulary is http://Schema.org.
@id: The @id property uniquely identifies the resource. In this case, the resource is a book.
type: The type property specifies the type of the resource. In this case, the resource is a book.
name: The name property specifies the name of the book.
author: The author property specifies the author of the book.
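Because a JSON-LD document is valid JSON, Python's standard json module can load it. The sketch below parses the book example and separates the JSON-LD keywords (keys starting with @) from the ordinary terms. It does not interpret the context; a full JSON-LD processor would be needed for that.

```python
import json

BOOK_JSONLD = """
{
  "@context": "https://schema.org/",
  "@id": "http://example.com/book1",
  "type": "Book",
  "name": "The Hitchhiker's Guide to the Galaxy",
  "author": "Douglas Adams"
}
"""

doc = json.loads(BOOK_JSONLD)

# Keys beginning with "@" are JSON-LD keywords; the rest are terms
# whose meaning comes from the context.
keywords = sorted(key for key in doc if key.startswith("@"))
terms = sorted(key for key in doc if not key.startswith("@"))

print(keywords)
print(terms)
```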
JSON-LD Contexts
JSON-LD contexts map short, human-readable terms to IRIs (Internationalized Resource Identifiers). Contexts can also define prefixes for IRIs. This can make JSON-LD documents easier to read and write.
The @context property in a JSON-LD document specifies a context, either inline or by URI. When a JSON-LD processor encounters a term or prefixed name in a document, it uses the context to expand it to the full IRI.
For example, the following context defines a prefix for the http://Schema.org vocabulary:
{
  "@context": {
    "schema": "https://schema.org/"
  }
}
Using this context, the following JSON-LD document can be interpreted:
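{
  "@context": { "schema": "https://schema.org/" },
  "@id": "http://example.com/book1",
  "@type": "schema:Book",
  "schema:name": "The Hitchhiker's Guide to the Galaxy",
  "schema:author": "Douglas Adams"
}
Because this context defines only a prefix and no default vocabulary, the type is given with the @type keyword and each term is written with the schema: prefix, which a processor expands to https://schema.org/. The prefix keeps full IRIs out of the document body, which makes the document easier to read and understand.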
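Prefix expansion can be illustrated in a few lines of Python. This is a simplified sketch of one step that a JSON-LD processor performs, not a replacement for one: expand_compact_iri is a hypothetical helper and only handles simple string prefixes defined in the context.

```python
def expand_compact_iri(value, context):
    """Expand "prefix:suffix" to a full IRI when the prefix is in the context."""
    prefix, sep, suffix = value.partition(":")
    # Values such as "http://..." are already absolute IRIs, not compact ones.
    if sep and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    return value


context = {"schema": "https://schema.org/"}
print(expand_compact_iri("schema:Book", context))
```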
Modeling Entities and Relationships
Entities in a JSON-LD data model are represented by objects. Relationships between entities are represented by properties. For example, the following JSON-LD document describes a book and a person:
{
  "@context": { "schema": "https://schema.org/" },
  "@id": "http://example.com/book1",
  "@type": "schema:Book",
  "schema:name": "The Hitchhiker's Guide to the Galaxy",
  "schema:author": {
    "@id": "http://example.com/douglas-adams",
    "@type": "schema:Person",
    "schema:name": "Douglas Adams"
  }
}
The schema:author property in the book object refers to the person object. This indicates that Douglas Adams is the author of The Hitchhiker's Guide to the Galaxy.
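The relationship between the two entities can be read out of a parsed document by looking for property values that are themselves objects with an @id. The sketch below uses a hypothetical helper, iter_links, on a trimmed-down book and person document. It only follows nested objects, not arrays or references by IRI string.

```python
def iter_links(node):
    """Yield (subject_id, property, object_id) for each nested entity."""
    subject = node.get("@id")
    for key, value in node.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @id
        if isinstance(value, dict) and "@id" in value:
            yield (subject, key, value["@id"])
            # Recurse so that links inside the nested entity are found too.
            yield from iter_links(value)


book = {
    "@context": {"schema": "https://schema.org/"},
    "@id": "http://example.com/book1",
    "schema:name": "The Hitchhiker's Guide to the Galaxy",
    "schema:author": {
        "@id": "http://example.com/douglas-adams",
        "schema:name": "Douglas Adams",
    },
}

links = list(iter_links(book))
print(links)
```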
Using Vocabularies
Vocabularies are collections of terms and definitions that are used to describe data. JSON-LD data models can use vocabularies to provide a common understanding of the data.
There are many different vocabularies available. Some popular vocabularies include:
Dublin Core
The vocabulary should be relevant to the type of data that you are modeling.
Ontology Resources
Metadata Dictionary
AD Knowledge Portal Metadata Dictionary
https://sagebio.shinyapps.io/amp-ad-metadata-dictionary/
Data Curator App
http://dca.app.sagebionetworks.org
https://dca-dev.app.sagebionetworks.org
https://github.com/adknowledgeportal/data_curator
https://github.com/adknowledgeportal/data-models
Projects
Folder Structure
.
├── biospecimen_experiment_1
│   └── manifest1.csv
├── biospecimen_experiment_2
│   └── manifestA.csv
├── single_cell_RNAseq_batch_1
│   ├── manifestX.csv
│   ├── fileA.txt
│   ├── fileB.txt
│   ├── fileC.txt
│   └── fileD.txt
└── single_cell_RNAseq_batch_2
    ├── manifestY.csv
    └── file1.txt
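A local copy of a project laid out this way can be checked for dataset folders that lack a manifest. The sketch below is illustrative: folders_without_manifest and demo are hypothetical helpers, and the naming rule (a file whose name starts with "manifest" and ends in .csv) is an assumption based on the names in the tree above.

```python
import tempfile
from pathlib import Path


def folders_without_manifest(project_root):
    """Return names of top-level folders with no manifest*.csv file inside."""
    missing = []
    for folder in sorted(p for p in Path(project_root).iterdir() if p.is_dir()):
        has_manifest = any(
            f.is_file() and f.name.startswith("manifest") and f.suffix == ".csv"
            for f in folder.iterdir()
        )
        if not has_manifest:
            missing.append(folder.name)
    return missing


def demo():
    """Build a tiny throwaway tree and report folders lacking a manifest."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "biospecimen_experiment_1").mkdir()
        (root / "biospecimen_experiment_1" / "manifest1.csv").write_text("")
        (root / "single_cell_RNAseq_batch_2").mkdir()
        (root / "single_cell_RNAseq_batch_2" / "file1.txt").write_text("")
        return folders_without_manifest(root)
```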
Study Content
/wiki/spaces/AKP/pages/1057882353
Study Description in wiki
Methods description in each data folder
AMP-AD
Second Test
AD Portal DCA Test Project
Fileview: AD Portal DCA Test Project - Table
https://github.com/adknowledgeportal/test-data-model/blob/main/model-ad/model-ad.data.model.jsonld
https://github.com/adknowledgeportal/data-models/blob/main/README.md#editing-data-models
AD data model → modular
repo:
branch: test-split-csvs
folders:
modules/
..biospecimen/
..mouse/
- ADM-836
Term = Attribute in the data model where Parent = DataProperty
test-split-csvs branch
MODEL-AD
ELITE
Annotate study folder with contentType = 'dataset'
Flattened file structure
Create Project
Maintain File permission access easily
Top level: assay folders
All data files of one type in assay folder
These assay folder names will be displayed
data_folder/
Schematic configuration needed in config.yml:
master_fileview ‘synID’
which refers to this:
Fileview - Files and Folders https://www.synapse.org/#!Synapse:syn51753858/tables/
https://github.com/Sage-Bionetworks/data_curator_config
needs to point to this fileview and the data model
fork repo
edit dca-template-config.json
add MODEL-AD folder and edit configuration as needed
send a pull request
ADKP example
Fileview DCA Asset View that DCA uses
folder contentType = ‘dataset’
One project for all of AD
Templates
https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/uploading-data
https://dca-docs.scrollhelp.site/DCA/Working-version/ELITE/validate-and-submit-your-metadata
Resources
https://linkml.io/schemasheets/#examples
https://linkml.io/linkml/intro/tutorial.html
https://learnxinyminutes.com/docs/yaml/
Glossary
Template
Manifest - metadata table submitted for dataset