Data modeling involves two Sage-built tools: Schematic and the Data Curator App (DCA). This document, written as of 2023-09-14, describes the workflow required to build, edit, and update the data models for MODEL-AD.

Schematic

Summary

SCHEMATIC is an acronym for Schema Engine for Manifest Ingress and Curation. The Python-based tool is a novel schema-based metadata ingress ecosystem intended to streamline biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.

Documentation

/wiki/spaces/SCHEM/pages/2967568387

Code in Github

https://github.com/Sage-Bionetworks/schematic

Installation

https://pypi.org/project/schematicpy/

Install for data curator app:

Code Block
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install schematicpy

Setup Python Environment

Schematic runs on Python 3.10, so the Python environment must be controlled. pyenv is one option: https://fathomtech.io/blog/python-environments-with-pyenv-and-vitualenv/

Code Block
pyenv install 3.10.11
pyenv virtualenv 3.10.11 schematic_3_10_11
pyenv activate schematic_3_10_11
python -m pip install schematicpy
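
Since the environment has to be pinned to Python 3.10, a quick check run inside the activated environment can confirm the interpreter before installing anything. This is a minimal sketch; the helper name `is_supported` is hypothetical and not part of Schematic.

```python
import sys

# Assumption taken from the text above: Schematic is run on Python 3.10.
REQUIRED = (3, 10)


def is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True when the interpreter's major.minor version matches `required`."""
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    if is_supported():
        print("Python version OK for Schematic")
    else:
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Expected Python 3.10.x, found {found}")
```

Running this with the interpreter from the pyenv virtualenv shows whether the environment was activated correctly.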

Edit Configuration

The following parameters need to be set in the config.yml

https://github.com/Sage-Bionetworks/schematic/blob/develop/config.yml

Using Schematic

Command Line Reference

https://sage-schematic.readthedocs.io/en/develop/cli_reference.html

Commands need to be run from ~/schematic.

Data Model

Summary

A data model defines attributes (i.e. data elements) describing metadata associated with any given dataset type. The data model also describes relationships between these attributes.

Documentation

/wiki/spaces/SCHEM/pages/2473623559

Build a Data Model

The data model is defined in a table, then stored (i.e. serialized) in a JSON-LD schema.
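
To make "defined in a table, then serialized as JSON-LD" more concrete, the sketch below turns one table row into a JSON-LD-style node. It only illustrates the general shape and is not Schematic's converter. The row values, the `bts` prefix URL, and the exact keys emitted are assumptions, and the real output of `schematic schema convert` may differ.

```python
import json

# One hypothetical row of a data model table (column names follow this page).
row = {
    "Attribute": "Sex",
    "Description": "Sex of the individual",
    "ValidValues": "Female, Male",
}


def row_to_node(row, prefix="bts"):
    """Turn one table row into a JSON-LD-style node (illustrative shape only)."""
    label = row["Attribute"].replace(" ", "")
    valid = [v.strip() for v in row.get("ValidValues", "").split(",") if v.strip()]
    return {
        "@id": f"{prefix}:{label}",
        "@type": "rdfs:Class",
        "rdfs:label": label,
        "rdfs:comment": row.get("Description", ""),
        "schema:rangeIncludes": [{"@id": f"{prefix}:{v}"} for v in valid],
    }


document = {
    # Prefix URLs are assumptions made for this sketch.
    "@context": {
        "bts": "http://schema.biothings.io/",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "schema": "http://schema.org/",
    },
    "@graph": [row_to_node(row)],
}

print(json.dumps(document, indent=2))
```

Each attribute in the table becomes one node in `@graph`, and relationships such as valid values become references between nodes.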

...

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema#A.-Schema-properties-and-relationships

Schematic DB

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema#Schemas-and-Schematic-DB

...

  • Schematic DB will use any of these validation rules:

    • str

    • float

    • num

    • int

    • date

    • If the attribute has none of the above rules, it uses a string type.

    • The attribute datatype is determined by the rule.
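
The rule-to-datatype behaviour listed above can be sketched as follows. This is an illustration of the described logic, not Schematic DB's code: the function name `infer_datatype` is hypothetical, and taking the first matching rule is an assumption because the notes do not state a precedence.

```python
# Type rules named in the list above.
TYPE_RULES = {"str", "float", "num", "int", "date"}


def infer_datatype(validation_rules):
    """Return the first type rule found among an attribute's rules, else 'str'."""
    for rule in validation_rules:
        parts = rule.strip().split()
        # Only the leading word of a rule is compared against the type rules.
        if parts and parts[0] in TYPE_RULES:
            return parts[0]
    # No type rule present: fall back to a string type, as described above.
    return "str"
```

For example, an attribute whose only rule is `int` resolves to `int`, while an attribute with no type rule resolves to `str`.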

Build a Data Model

https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g13aaf3b8358_0_0

Documentation

/wiki/spaces/SCHEM/pages/2473623559

Recommendations

  • Draw a diagram for data model

  • Lucid.app - can use templates like ERD example

  • Start small - skeleton --> schema

  • Schema visualization tools?

  • Useful reference when building

  • Start from single table

  • Use schematic in dev mode to convert model to JSON-LD regularly to check for errors

Model Requirements

The data model requires these columns:

  1. Attribute

  2. Description

  3. ValidValues

  4. DependsOn

  5. required

  6. source

  7. parent

  8. properties

  9. dependsOnComponent
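
Before converting a model, it can help to check that the CSV header contains every required column. The sketch below uses the column names exactly as listed above. The header spelling in a real model file may differ (see the example model linked below), and `missing_columns` is a hypothetical helper.

```python
import csv
import io

# Column names copied from the list above; real model files may spell them differently.
REQUIRED_COLUMNS = [
    "Attribute",
    "Description",
    "ValidValues",
    "DependsOn",
    "required",
    "source",
    "parent",
    "properties",
    "dependsOnComponent",
]


def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip() for name in header}
    return [column for column in REQUIRED_COLUMNS if column not in present]
```

To check a file on disk, read it with `open(path, newline="").read()` and pass the text to `missing_columns`. An empty list means the header is complete.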

Data Model Validation

/wiki/spaces/SCHEM/pages/2645262364

Example Model

https://github.com/Sage-Bionetworks/schematic/blob/develop/tests/data/example.model.csv

https://docs.google.com/spreadsheets/d/1Wde5YBFtEa4GhO-smXgbVApGioBGNnc-95n4LY8YB_E/edit#gid=925738608

https://ontofox.hegroup.org/

Namespace collisions - should use "Biothings" schema

Graph modeling requires unique names. Some names are protected, and names must not contain underscores.
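
The naming constraints above can be checked before conversion with a small script. This is a rough sketch: the set of protected names is not listed in these notes, so `protected` is a placeholder that the caller must supply, and `check_attribute_names` is a hypothetical helper.

```python
from collections import Counter


def check_attribute_names(names, protected=()):
    """Return a list of human-readable problems found in attribute names."""
    problems = []
    counts = Counter(names)
    for name, count in counts.items():
        # Names must be unique across the model.
        if count > 1:
            problems.append(f"duplicate name: {name} (x{count})")
    for name in counts:
        # Underscores are not allowed in names.
        if "_" in name:
            problems.append(f"underscore in name: {name}")
        # `protected` is supplied by the caller; no list is assumed here.
        if name in protected:
            problems.append(f"protected name: {name}")
    return problems
```

An empty result means none of the three issues were found in the given names.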

Schematic dev mode helps find and deal with errors by iteratively checking the JSON-LD.

Generate JSON-LD from CSV: schematic schema convert data_model.csv

Submit a manifest: schematic model --config config.yml submit --manifest_path manifest.csv --dataset_id synId --manifest_record_type table

command line reference

Data Model Visualization

https://linkml.io/linkml/intro/tutorial.html
https://docs.google.com/spreadsheets/d/1vDdcqt3Lgehyq1iCnlF1H9JZi63pLj-u/edit#gid=1939820452
https://portal.includedcc.org/dashboard
https://linkml.io/schemasheets/#examples
https://docs.google.com/spreadsheets/d/1w6zDfz3_yrCjjrqfpXBGNmd0LZL4B03gr1KfzJtk5Cs/edit#gid=674286209
https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g4d21a8c2ba_0_11

...

/wiki/spaces/SCHEM/pages/2458419217

Glossary

Manifest - metadata table submitted for datasets

JSON-LD - JSON for Linking Data

JSON
https://cambridgesemantics.com/blog/semantic-university/learn-rdf/rdf-nuts-bolts-2/
One reason we use JSON-LD in schematic is that it is supported by schema.org: https://schema.org/
A reason for using schema.org is dataset discoverability: https://datasetsearch.research.google.com/
JSON-LD is also useful for search engines (schema.org).
For anyone who wants to learn more about "linked data", reading about ideas related to the "semantic web" and the W3C standards can be a fun rabbit hole to go down.
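
As a small illustration of the discoverability point, schema.org defines a `Dataset` type that can be expressed in JSON-LD. The sketch below builds such a record with placeholder values. It is generic schema.org markup, not something Schematic produces.

```python
import json

# Placeholder values; only "@context" and "@type" follow schema.org conventions.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset",
    "description": "Placeholder description of a dataset.",
}

markup = json.dumps(dataset, indent=2)
print(markup)
```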

Error Troubleshooting

SchemaHub documentation on Confluence. This includes definitions of data model concepts such as validation rules.

Github Tickets Sage-Bionetworks/schematic

Add ticket workflow

Click on Issue
Issue: Feature Request
Add title
Describe problem and a potential solution
Importance
Timeline
Additional Context
Attach any needed documents or screenshots

...

Where is the reference to how data model needs to be formatted?

Convert data model from CSV to JSON-LD

schematic schema convert input.csv --output_jsonld output.jsonld

...

When developing a JSON-LD data model, it is important to choose the appropriate vocabulary. The vocabulary should be relevant to the type of data that you are modeling.

Best Practices

Upload Data

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/organize-your-data-upload#OrganizeyourDataUpload-FlattenedDataLayoutExample

...


RegEx - extract individual and specimen ID from filenames
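
A minimal sketch of the regex idea follows. It assumes a hypothetical naming convention of `<individualID>_<specimenID>_<rest>.<ext>`, which these notes do not define, so the pattern must be adapted to the filenames actually in use.

```python
import re

# Hypothetical convention: <individualID>_<specimenID>_<anything>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<individualID>[A-Za-z0-9]+)_(?P<specimenID>[A-Za-z0-9-]+)_.*\.\w+$"
)


def extract_ids(filename):
    """Return (individualID, specimenID) or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("individualID"), match.group("specimenID")
```

Filenames that do not follow the assumed convention return `None`, which makes it easy to list files that need manual annotation.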

Data Model

...

ELITE

https://drive.google.com/drive/folders/1M90FJX2seyb1s-QzKIHRrSCDuLC97NJO
