This document was last updated on 2023-09-14. This page describes the workflow required to build, edit, and update the data model for MODEL-AD.
Summary
Data modeling at Sage requires two in-house tools: Schematic and the Data Curator App (DCA).
Schematic
Summary
SCHEMATIC is an acronym for Schema Engine for Manifest Ingress and Curation. The Python-based tool is a schema-based metadata ingress ecosystem intended to streamline biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.
...
Code Block |
---|
pyenv install 3.10.11
pyenv virtualenv 3.10.11 schematicpy_3_10_11
pyenv activate schematicpy_3_10_11
pip install schematicpy |
Edit Configuration
...
Run commands from ~/schematic.
Data Model
...
Development
A data model defines attributes (i.e. data elements) describing metadata associated with any given dataset type. The data model also describes relationships between these attributes.
...
/wiki/spaces/SCHEM/pages/2473623559
Build a Data Model
The data model is defined in a table, then stored (i.e. serialized) in a JSON-LD schema.
The JSON-LD schema follows the specifications from Schema.org for attributes.
Convert data model from CSV to JSON-LD
schematic schema convert input.csv output.jsonld
Schematic DB
Create Data Model
https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/Guide+How+to+use+Schematic+for+Data+Model+Development#Create-a-Data-...
Schematic DB is a package used to ingress the manifests created by Schematic into a database.
Schematic DB will use any of these validation rules:
str
float
num
int
date
The attribute datatype is determined by the rule.
If the attribute has none of the above rules, it uses a string type.
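The rule-to-datatype behavior described above can be sketched in a few lines of Python. This is only an illustration of the stated logic, not Schematic DB's actual code: the function name `infer_datatype`, the case and whitespace normalisation, and the example rule names `required` and `unique` are assumptions made for this sketch.

```python
# Rules that, per the list above, determine an attribute's datatype.
TYPE_RULES = ("str", "float", "num", "int", "date")


def infer_datatype(validation_rules):
    """Return the first type rule found, or "str" when none is present.

    Hypothetical helper: Schematic DB's real implementation may differ.
    """
    for rule in validation_rules:
        # Assumed normalisation: ignore case and surrounding whitespace.
        name = rule.strip().lower()
        if name in TYPE_RULES:
            return name
    # No type rule was supplied, so fall back to a string type.
    return "str"


print(infer_datatype(["required", "int"]))  # prints "int"
print(infer_datatype(["unique"]))           # prints "str"
```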
Documentation
/wiki/spaces/SCHEM/pages/2473623559
...
https://github.com/adknowledgeportal/data-models
Sage Data Models for Reference
https://docs.google.com/spreadsheets/d/1vDdcqt3Lgehyq1iCnlF1H9JZi63pLj-u/edit#gid=1939820452
Recommendations
Draw a diagram. A diagram is a useful reference when developing the model.
Start small with a basic skeleton and then build.
Use schematic in dev mode to convert the model to JSON-LD regularly and check for errors.
...
The data model requires these columns:
Attribute
Description
ValidValues
DependsOn
required
source
parent
properties
dependsOnComponent
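Before converting a model, it can help to confirm that the CSV header contains every column listed above. The sketch below uses only the Python standard library. The column spellings are copied from the list above and may not match the exact headers in the example model, so treat them as assumptions and adjust as needed. The function name `missing_columns` is hypothetical.

```python
import csv
import io

# Column names copied from the list above; the exact spelling and
# capitalisation in a real model CSV may differ (assumption).
REQUIRED_COLUMNS = [
    "Attribute",
    "Description",
    "ValidValues",
    "DependsOn",
    "required",
    "source",
    "parent",
    "properties",
    "dependsOnComponent",
]


def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


example = "Attribute,Description,ValidValues\nspecimenID,Specimen identifier,\n"
print(missing_columns(example))
```

An empty list means every listed column is present. To check a file on disk, pass in the text read from it.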
Example Model
GitHub: https://github.com/Sage-Bionetworks/schematic/blob/develop/tests/data/example.model.csv
Formatted for readability:
https://docs.google.com/spreadsheets/d/1Wde5YBFtEa4GhO-smXgbVApGioBGNnc-95n4LY8YB_E/edit#gid=925738608
...
Generate JSON-LD from CSV
schematic schema convert data_model.csv
This model does NOT validate as provided.
Data Model Validation
/wiki/spaces/SCHEM/pages/2645262364
Data Model Visualization
Convert Data Model from CSV to JSON-LD
https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/Guide+How+to+use+Schematic+for+Data+Model+Development#Convert-Data-Model
schematic schema convert model.csv
What is JSON-LD?
Data models are formatted in JavaScript Object Notation for Linked Data (JSON-LD). JSON-LD is supported by http://schema.org, which helps dataset discoverability in search engines such as Dataset Search.
Guide to Developing Data Models in JSON-LD
JSON-LD, or JavaScript Object Notation for Linked Data, is a JSON-based format for serializing Linked Data. It extends JSON with additional functionality to represent linked data structures, such as contexts, @id, and @type. JSON-LD is a lightweight and flexible format that can be used to represent a variety of data models.
This guide provides an introduction to developing data models in JSON-LD
...
JSON-LD syntax
JSON-LD contexts
Modeling entities and relationships
Using vocabularies
Best practices for developing JSON-LD data models
JSON-LD Syntax
JSON-LD documents are valid JSON documents. They consist of key-value pairs, where the keys are strings and the values can be strings, numbers, objects, arrays, or booleans. JSON-LD documents can also contain additional keywords that provide additional information about the data.
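Because a JSON-LD document is valid JSON, any JSON parser can read it. The keywords are simply keys that start with `@`. The sketch below parses a small document with the Python standard library. The document is an invented example: the `example.org` identifier and the `specimenID` attribute are placeholders, not entries from the MODEL-AD data model.

```python
import json

# A minimal, made-up JSON-LD document (placeholder identifiers).
doc_text = """
{
  "@context": {"name": "http://schema.org/name"},
  "@id": "http://example.org/attributes/specimenID",
  "@type": "http://www.w3.org/2000/01/rdf-schema#Class",
  "name": "specimenID"
}
"""

doc = json.loads(doc_text)

# JSON-LD keywords start with "@"; everything else is an ordinary key.
keywords = sorted(key for key in doc if key.startswith("@"))
plain_keys = sorted(key for key in doc if not key.startswith("@"))

print(keywords)    # prints ['@context', '@id', '@type']
print(plain_keys)  # prints ['name']
```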
...
Dublin Core
Friend of a Friend (FOAF)
GoodRelations
GeoNames
MusicBrainz
When developing a JSON-LD data model, it is important to choose the appropriate vocabulary. The vocabulary should be relevant to the type of data that you are modeling.
Ontology Resources
...
...
Error Checking
Schematic dev mode helps find and fix errors by iteratively checking data models.
References
Metadata Dictionary
AD Knowledge Portal Metadata Dictionary
https://sagebio.shinyapps.io/amp-ad-metadata-dictionary/
https://ontofox.hegroup.org/
https://linkml.io/linkml/intro/tutorial.html
Data Curator App
http://dca.app.sagebionetworks.org
https://dca-dev.app.sagebionetworks.org/
https://github.com/adknowledgeportal/data_curator
https://github.com/adknowledgeportal/data-models
https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2453176326
https://docs.google.com/spreadsheets/d/1vDdcqt3Lgehyq1iCnlF1H9JZi63pLj-u/edit#gid=1939820452
https://portal.includedcc.org/dashboard
https://linkml.io/schemasheets/#examples
https://docs.google.com/spreadsheets/d/1w6zDfz3_yrCjjrqfpXBGNmd0LZL4B03gr1KfzJtk5Cs/edit#gid=674286209
https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g4d21a8c2ba_0_11
/wiki/spaces/SCHEM/pages/2458419217
Glossary
Manifest - metadata table submitted for datasets
...
Projects
Folder Structure
https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/uploading-data
https://dca-docs.scrollhelp.site/DCA/Working-version/ELITE/validate-and-submit-your-metadata
...
Code Block |
---|
.
├── biospecimen_experiment_1
│   └── manifest1.csv
├── biospecimen_experiment_2
│   └── manifestA.csv
├── single_cell_RNAseq_batch_1
│   ├── manifestX.csv
│   ├── fileA.txt
│   ├── fileB.txt
│   ├── fileC.txt
│   └── fileD.txt
└── single_cell_RNAseq_batch_2
    ├── manifestY.csv
    └── file1.txt |
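To review a local copy of a folder layout like the one above, a short script can list which manifest files sit in each dataset folder. This is a minimal sketch. It assumes that manifests are CSV files whose names start with `manifest`, as in the tree above. The function name `manifests_by_folder` is hypothetical and is not part of Schematic or DCA.

```python
from pathlib import Path


def manifests_by_folder(root):
    """Map each top-level dataset folder to the manifest CSVs it contains.

    Assumes manifests are named manifest*.csv (as in the example tree).
    """
    result = {}
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        result[folder.name] = sorted(f.name for f in folder.glob("manifest*.csv"))
    return result


if __name__ == "__main__":
    # Report folders that do not contain exactly one manifest.
    for name, manifests in manifests_by_folder(".").items():
        if len(manifests) != 1:
            print(f"{name}: found {len(manifests)} manifest file(s)")
```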
Study Content
/wiki/spaces/AKP/pages/1057882353
Study Description in wiki
Methods description in each data folder
/wiki/spaces/EPD1/pages/2900819969
AMP-AD
Second Test
AD Portal DCA Test Project
Fileview: AD Portal DCA Test Project - Table
...
DCA app development version https://dca-dev.app.sagebionetworks.org/
Abby's request for testing
https://sagebionetworks.slack.com/archives/C02A2FBN3G8/p1682116574295509
https://github.com/adknowledgeportal/test-data-model/blob/main/model-ad/model-ad.data.model.jsonld
https://sagebio.shinyapps.io/adknowledgeportal-data-curator/
https://www.synapse.org/#!Synapse:syn33582398/wiki/619343
https://github.com/adknowledgeportal/data_curator
https://github.com/adknowledgeportal/test-data-model
https://github.com/adknowledgeportal/data-models/blob/main/README.md#editing-data-models
AD data model → modular
repo:
branch: test-split-csvs
folders:
modules/
..biospecimen/
..mouse/
Term = Attribute in the data model where Parent = DataProperty
test-split-csvs branch
MODEL-AD
ELITE
Annotate study folder with contentType = 'dataset'
Flattened file structure
Create Project
Maintain File permission access easily
Top level: assay folders
All data files of one type in assay folder
These assay folder names will be displayed
data_folder/
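The `contentType = 'dataset'` annotation mentioned above can be set programmatically. The sketch below assumes `syn` is an already logged-in `synapseclient.Synapse` object and uses its `get_annotations` and `set_annotations` methods. The function name `mark_as_dataset` and the Synapse ID in the usage comment are placeholders.

```python
def mark_as_dataset(syn, folder_id):
    """Annotate a Synapse folder with contentType = 'dataset'.

    `syn` is assumed to be a logged-in synapseclient.Synapse instance;
    `folder_id` is the Synapse ID of the study/dataset folder.
    """
    # Fetch the existing annotations so other keys are preserved.
    annotations = syn.get_annotations(folder_id)
    annotations["contentType"] = "dataset"
    # Store the updated annotations back on the folder.
    return syn.set_annotations(annotations)


# Usage (placeholder ID, requires synapseclient and valid credentials):
#   import synapseclient
#   syn = synapseclient.login()
#   mark_as_dataset(syn, "syn00000000")
```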
Schematic configuration needed in config.yml:
master_fileview: 'synID'
which refers to this:
Fileview - Files and Folders:
https://www.synapse.org/#!Synapse:syn36759435/tables/
https://www.synapse.org/#!Synapse:syn51753858/tables/
Add CSV + JSONLD to github – test-data-model
extract individual and specimen ID from filenames
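Extracting IDs from filenames usually comes down to one regular expression. The page does not state the filename convention, so the pattern below is purely hypothetical: it assumes names of the form `<individualID>_<specimenID>_<rest>`. The pattern and the function name `ids_from_filename` are illustrative and need to be adapted to the real naming scheme.

```python
import re

# Hypothetical convention: <individualID>_<specimenID>_<anything else>
FILENAME_PATTERN = re.compile(r"^(?P<individualID>[^_]+)_(?P<specimenID>[^_]+)_")


def ids_from_filename(filename):
    """Return the individual and specimen IDs, or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    return match.groupdict() if match else None


print(ids_from_filename("ind01_spec07_rnaseq.fastq.gz"))
# prints {'individualID': 'ind01', 'specimenID': 'spec07'}
print(ids_from_filename("README.txt"))
# prints None
```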
...
https://github.com/adknowledgeportal/test-data-model
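https://github.com/adknowledgeportal/data_curator/blob/18dc00723f2e95a98525ff695401ac67e7785475/schematic_config.yml#L31
https://github.com/Sage-Bionetworks/data_curator/blob/18dc00723f2e95a98525ff695401ac67e7785475/schematic_config.yml#L31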
Data Model Validation Rules
/wiki/spaces/SCHEM/pages/2645262364
...
needs to point to this fileview and the data model
fork repo
edit dca-template-config.json
add MODEL-AD folder and edit configuration as needed
send a pull request
ADKP example
Fileview DCA Asset View that DCA uses
folder contentType = ‘dataset’
One project for all of AD
Templates
/wiki/spaces/SCHEM/pages/2473623559
...
https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/uploading-data
...
https://dca-docs.scrollhelp.site/DCA/Working-version/ELITE/validate-and-submit-your-metadata
Resources
...
...
...
...
...
ELITE
Glossary
Template
Manifest - metadata table submitted for dataset