Schematic
Glossary
Manifest - metadata table submitted for datasets
Data Modeling involves using two Sage-built tools: Schematic and the Data Curator App (DCA). This document is being written as of 2023-09-14 to describe the workflow required to build, edit, and update the data models for MODEL-AD.
Schematic
Summary
SCHEMATIC is an acronym for Schema Engine for Manifest Ingress and Curation. The Python-based infrastructure provides a novel schema-based metadata ingress ecosystem that is intended to streamline the process of biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.
Documentation
/wiki/spaces/SCHEM/pages/2967568387
Code in Github
https://github.com/Sage-Bionetworks/schematic
Installation
https://pypi.org/project/schematicpy/
Install for data curator app:
    python3 -m venv .venv
    source .venv/bin/activate
    python3 -m pip install schematicpy
Setup Python Environment
Schematic runs on Python 3.10, so the Python environment must be controlled. pyenv is one option: https://fathomtech.io/blog/python-environments-with-pyenv-and-vitualenv/
    pyenv install 3.10.11
    pyenv virtualenv 3.10.11 schematic_3_10_11
    pyenv activate schematic_3_10_11
    python -m pip install schematicpy
Edit Configuration
The following parameters need to be set in the config.yml
https://github.com/Sage-Bionetworks/schematic/blob/develop/config.yml
Using Schematic
Command Line Reference
https://sage-schematic.readthedocs.io/en/develop/cli_reference.html
Commands need to be run from ~/schematic
Data Model
Summary
A data model defines attributes (i.e. data elements) describing metadata associated with any given dataset type. The data model also describes relationships between these attributes.
Documentation
/wiki/spaces/SCHEM/pages/2473623559
Build a Data Model
The data model is defined in a table, then stored (i.e. serialized) in a JSON-LD schema.
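As a rough illustration of what "serialized in a JSON-LD schema" means, the sketch below turns one row of a tabular model into a JSON-LD-style node using only the standard library. The `bts:` prefix, the `row_to_node` and `table_to_jsonld` helpers, the sample row, and the choice of keys are all assumptions made for illustration. The output schematic actually produces contains more fields and should be generated with the CLI rather than by hand.

```python
import csv
import io
import json

# Hypothetical one-row data model; a real model has more columns and rows.
MODEL_CSV = """Attribute,Description,Parent
Patient,A person enrolled in the study,DataType
"""


def row_to_node(row):
    """Map one table row to a JSON-LD-style node (illustrative keys only)."""
    return {
        "@id": "bts:" + row["Attribute"],
        "@type": "rdfs:Class",
        "rdfs:label": row["Attribute"],
        "rdfs:comment": row["Description"],
        "rdfs:subClassOf": [{"@id": "bts:" + row["Parent"]}],
    }


def table_to_jsonld(csv_text):
    """Serialize every row of the table into one JSON-LD-style document."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {
        # Assumed prefixes, shown only so the compact "@id" values resolve.
        "@context": {
            "bts": "http://schema.biothings.io/",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        },
        "@graph": [row_to_node(row) for row in rows],
    }


if __name__ == "__main__":
    print(json.dumps(table_to_jsonld(MODEL_CSV), indent=2))
```

Each table row becomes one entry in `@graph`, and the relationship to its parent is expressed as a link (`rdfs:subClassOf`) rather than as a plain text cell.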
...
Schematic DB
...
Schematic DB will use any of these validation rules:
str
float
num
int
date
If the attribute has none of the above rules, it uses a string type
Otherwise, the attribute datatype is determined by the rule
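The lookup described above can be sketched as a small mapping with a string fallback. The rule names come from the list above; the datatype names on the right-hand side and the `infer_datatype` helper are hypothetical placeholders, not Schematic DB's actual types or API.

```python
# Rule names are taken from the list above. The datatype names they map to
# are hypothetical placeholders, not Schematic DB's real type identifiers.
RULE_TO_DATATYPE = {
    "str": "string",
    "float": "float",
    "num": "number",
    "int": "integer",
    "date": "date",
}


def infer_datatype(validation_rules):
    """Return the datatype of the first recognized rule, else 'string'."""
    for rule in validation_rules:
        if rule in RULE_TO_DATATYPE:
            return RULE_TO_DATATYPE[rule]
    # No recognized rule: fall back to a string type.
    return "string"


if __name__ == "__main__":
    print(infer_datatype(["int"]))
    print(infer_datatype(["unique"]))
```

An attribute whose rules include none of the five names, or that has no rules at all, falls through to the string default.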
Build a Data Model
Documentation
/wiki/spaces/SCHEM/pages/2473623559
Recommendations
Draw a diagram for data model
Lucid.app - can use templates like ERD example
Start small - skeleton --> schema
Schema visualization tools?
Useful reference when building
Start from single table
Use schematic in dev mode to convert model to JSON-LD regularly to check for errors
Model Requirements
The data model requires these columns:
Attribute
Description
ValidValues
DependsOn
required
source
parent
properties
dependsOnComponent
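Before converting a model, it can help to confirm that the CSV header contains every required column. The following is a minimal sketch, assuming the column names are spelled exactly as in the list above; header spelling in an actual model file may differ, and `missing_columns` is a hypothetical helper, not part of schematic.

```python
import csv
import io

# Column names exactly as listed above. Header spelling in a real model
# file may differ, so adjust this set to match your CSV.
REQUIRED_COLUMNS = {
    "Attribute", "Description", "ValidValues", "DependsOn", "required",
    "source", "parent", "properties", "dependsOnComponent",
}


def missing_columns(csv_text):
    """Return the required columns absent from the CSV header, sorted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty file -> empty header
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


if __name__ == "__main__":
    sample = "Attribute,Description,ValidValues\n"
    print(missing_columns(sample))
```

An empty result means the header is complete; anything else names the columns still to be added.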
Data Model Validation
/wiki/spaces/SCHEM/pages/2645262364
Example Model
https://github.com/Sage-Bionetworks/schematic/blob/develop/tests/data/example.model.csv
https://ontofox.hegroup.org/
Namespace collisions - should use "Biothings" schema
Graph modeling requires unique names. Some names are protected. Do not use underscores.
Schematic dev mode helps find and deal with errors by iteratively checking the JSON-LD
Generate JSON-LD from CSV: schematic schema convert data_model.csv
Submit a manifest: schematic model --config config.yml submit --manifest_path manifest.csv --dataset_id synId --manifest_record_type table
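The naming constraints noted above (unique names, no underscores) can be checked before running the conversion. This is a minimal sketch with a hypothetical helper, `find_name_problems`; it does not check protected names, because that list is not given here.

```python
from collections import Counter


def find_name_problems(attribute_names):
    """Report duplicated names and names that contain underscores."""
    counts = Counter(attribute_names)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    with_underscores = sorted({name for name in attribute_names if "_" in name})
    return {"duplicates": duplicates, "underscores": with_underscores}


if __name__ == "__main__":
    # Hypothetical attribute names for illustration.
    print(find_name_problems(["Patient", "Sample_ID", "Patient"]))
```

Running a check like this on the Attribute column catches collisions earlier than waiting for the JSON-LD conversion to fail.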
Data Model Visualization
https://linkml.io/linkml/intro/tutorial.html
https://docs.google.com/spreadsheets/d/1vDdcqt3Lgehyq1iCnlF1H9JZi63pLj-u/edit#gid=1939820452
https://portal.includedcc.org/dashboard
https://linkml.io/schemasheets/#examples
https://docs.google.com/spreadsheets/d/1w6zDfz3_yrCjjrqfpXBGNmd0LZL4B03gr1KfzJtk5Cs/edit#gid=674286209
https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g4d21a8c2ba_0_11
...
/wiki/spaces/SCHEM/pages/2458419217
Glossary
Manifest - metadata table submitted for datasets
JSON-LD - JSON for Linking Data
JSON
https://cambridgesemantics.com/blog/semantic-university/learn-rdf/rdf-nuts-bolts-2/
One reason we use JSON-LD in schematic is its support by schema.org: https://schema.org/
And a reason for schema.org is dataset discoverability: https://datasetsearch.research.google.com/
JSON-LD is useful for search engines (schema.org)
For anyone who wants to learn more about "linked data", reading about ideas related to the "semantic web" (W3C standards) can be a fun rabbit hole to go down.
Error Troubleshooting
SchemaHub Documentation on Confluence. This includes definitions of data model components, such as validation rules
Github Tickets Sage-Bionetworks/schematic
Add ticket workflow
Click on Issue
Issue: Feature Request
Add title
Describe problem and a potential solution
Importance
Timeline
Additional Context
Attach any needed documents or screenshots
...
Where is the reference to how data model needs to be formatted?
Convert data model from CSV to JSON-LD
schematic schema convert input.csv output.jsonld
...
When developing a JSON-LD data model, it is important to choose the appropriate vocabulary. The vocabulary should be relevant to the type of data that you are modeling.
Best Practices
Upload Data
...
RegEx - extract individual and specimen ID from filenames
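A minimal sketch of this kind of extraction follows. The filename convention (`<individualID>_<specimenID>.<extension>`), the pattern, and the `extract_ids` helper are all hypothetical; the real pattern depends on how contributors name their files.

```python
import re

# Hypothetical convention: <individualID>_<specimenID>.<extension>
# Adjust the pattern to the naming scheme actually used in the dataset.
FILENAME_PATTERN = re.compile(r"^(?P<individual>[^_]+)_(?P<specimen>[^_.]+)\.")


def extract_ids(filename):
    """Return (individual_id, specimen_id), or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("individual"), match.group("specimen")


if __name__ == "__main__":
    print(extract_ids("IND001_SPEC042.fastq.gz"))
    print(extract_ids("badname.txt"))
```

Returning `None` for non-matching names makes it easy to list files that need manual review instead of silently assigning wrong IDs.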
Data Model
https://docs.google.com/document/d/1nZGLRKW5LXpY-LBrtrgs4MyO-fb0kDDeouEOvW36xo0/edit#heading=h.o7ihd22lafi
...
ELITE