Data Model Workflow

This page describes the workflow required to build, edit, and update the data model for MODEL-AD.

Schematic

Summary

Data modeling at Sage requires two in-house tools: Schematic and the Data Curator App (DCA). SCHEMATIC is an acronym for Schema Engine for Manifest Ingress and Curation. This Python-based tool is a schema-based metadata ingress ecosystem intended to streamline biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.

Documentation

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387

Code on GitHub

https://github.com/Sage-Bionetworks/schematic

Installation

https://pypi.org/project/schematicpy/

Install for the Data Curator App:

    python3 -m venv .venv
    source .venv/bin/activate
    python3 -m pip install schematicpy

Setup Python Environment

Schematic runs on Python 3.10, so the Python environment must be controlled. pyenv is one option: https://fathomtech.io/blog/python-environments-with-pyenv-and-vitualenv/

    pyenv install 3.10.10
    pyenv virtualenv 3.10.10 py_3_10_10
    pyenv activate py_3_10_10
    pip install schematicpy
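Once the environment is active, a short check like the sketch below can confirm that the interpreter matches the version this page targets. The function name is_supported_python is hypothetical, and the strict 3.10 match is an assumption drawn from the statement above; Schematic's own packaging metadata remains the authority on supported versions.

```python
import sys

# Assumption: this page targets Python 3.10, so only major/minor are compared.
REQUIRED = (3, 10)


def is_supported_python(version_info=None):
    """Return True when the interpreter (or the given version tuple) is 3.10.x."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) == REQUIRED


if __name__ == "__main__":
    status = "OK" if is_supported_python() else "unsupported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Running the script inside the activated environment prints the detected version and whether it matches.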

Edit Configuration

Set the required parameters in config.yml. A reference copy is in the Schematic repository:

https://github.com/Sage-Bionetworks/schematic/blob/develop/config.yml

Using Schematic

Command Line Reference

https://sage-schematic.readthedocs.io/en/develop/cli_reference.html

Run commands from the ~/schematic directory.

Data Model Development

A data model defines attributes (i.e. data elements) describing metadata associated with any given dataset type. The data model also describes relationships between these attributes.

Documentation

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema#A.-Schema-properties-and-relationships

Create Data Model

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/How+to+use+Schematic+for+Data+Model+Development#Create-a-Data-Model

The data model is defined in a table, then stored (i.e. serialized) in a JSON-LD schema which specifies attributes as suggested by Schema.org.
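To make the table-to-JSON-LD idea concrete, here is a minimal sketch that turns a small model table into a JSON-LD-style document. It is not Schematic's serializer: the "ex:" prefix, the choice of keys, and the function names row_to_node and table_to_jsonld are all assumptions made for illustration, and the sample rows are invented. The file Schematic actually produces contains more fields and its own context.

```python
import csv
import io
import json

# Hypothetical two-row model table using a subset of the model columns
# (column spelling follows this page's list of required columns).
MODEL_CSV = """Attribute,Description,required
Patient,A study participant,TRUE
Sex,Biological sex of the patient,FALSE
"""


def row_to_node(row):
    """Map one table row to a JSON-LD-style node (illustrative keys only)."""
    return {
        "@id": "ex:" + row["Attribute"].replace(" ", ""),
        "@type": "rdfs:Class",
        "rdfs:label": row["Attribute"],
        "rdfs:comment": row["Description"],
        "ex:required": row["required"].strip().upper() == "TRUE",
    }


def table_to_jsonld(csv_text):
    """Serialize every row of the model table into one JSON-LD-style document."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {
        "@context": {
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "ex": "http://example.org/",  # placeholder namespace, not Schematic's
        },
        "@graph": [row_to_node(row) for row in rows],
    }


if __name__ == "__main__":
    print(json.dumps(table_to_jsonld(MODEL_CSV), indent=2))
```

Each table row becomes one node in "@graph", which is the general shape of the serialization step; the real conversion should always be done with Schematic itself.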

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559

https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g13aaf3b8358_0_0

https://github.com/adknowledgeportal/data-models

Sage Data Models for Reference

  • https://github.com/Sage-Bionetworks/1kD-model

  • https://portal.includedcc.org/dashboard

  • https://docs.google.com/spreadsheets/d/1w6zDfz3_yrCjjrqfpXBGNmd0LZL4B03gr1KfzJtk5Cs/edit#gid=674286209

Recommendations

  • Draw a diagram. A diagram is a useful reference when developing the model.

  • Start small with a basic skeleton and then build.

  • Use Schematic in dev mode to convert the model to JSON-LD regularly and check for errors.

Requirements

The data model requires these columns:

  1. Attribute

  2. Description

  3. ValidValues

  4. DependsOn

  5. required

  6. source

  7. parent

  8. properties

  9. dependsOnComponent
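Before handing a model table to Schematic, it can help to confirm that the header contains every column in the list above. The sketch below does an exact-match check on the names as written on this page; the function name missing_columns is hypothetical, and the header spelling in an actual model file (for example, the example model CSV in the Schematic repository) may differ, so adjust REQUIRED_COLUMNS to match.

```python
import csv
import io

# Column names copied from the list above; spelling is an assumption
# and may need adjusting to match a real model file.
REQUIRED_COLUMNS = [
    "Attribute", "Description", "ValidValues", "DependsOn", "required",
    "source", "parent", "properties", "dependsOnComponent",
]


def missing_columns(csv_text):
    """Return the required columns absent from the CSV header, in list order."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


if __name__ == "__main__":
    sample = "Attribute,Description,ValidValues\nPatient,A study participant,\n"
    print(missing_columns(sample))
```

An empty list means every required column is present; anything else names the columns still to be added.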

Example Model

  • GitHub: https://github.com/Sage-Bionetworks/schematic/blob/develop/tests/data/example.model.csv

  • Formatted for readability:

This model does NOT validate as provided.

Schematic DB

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema#Schemas-and-Schematic-DB

Schematic DB is a package used to ingress the manifests created by Schematic into a database.

  • Schematic DB will use any of these validation rules:

    • str, float, num, int, date

    • If no rule is provided, the attribute defaults to a string type.

    • The attribute datatype is based on the rule.
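The rule-to-datatype behavior described in this list can be sketched as a small lookup. This is an illustration only, not Schematic DB's code: the function name datatype_for is hypothetical, and choosing the first matching rule when several are present is an assumption, since this page does not say how multiple rules are resolved.

```python
# Type rules named in the list above.
TYPE_RULES = ("str", "float", "num", "int", "date")


def datatype_for(validation_rules):
    """Return the first type rule found, or "str" when none is given."""
    for rule in validation_rules:
        name = rule.strip().lower()
        if name in TYPE_RULES:
            return name
    # No type rule present: fall back to the string default described above.
    return "str"


if __name__ == "__main__":
    print(datatype_for(["int"]))
    print(datatype_for([]))
```

An attribute with no type rule falls through to "str", which matches the default stated above.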

Data Model Validation

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2645262364