Data modeling involves two Sage-built tools: Schematic and the Data Curator App (DCA). This page, written as of 2023-09-14, describes the workflow required to build, edit, and update the data model for MODEL-AD.

Schematic

Summary

Data modeling at Sage requires two in-house tools: Schematic and the Data Curator App (DCA). SCHEMATIC is an acronym for Schema Engine for Manifest Ingress and Curation. This Python-based tool is a schema-based metadata ingress ecosystem intended to streamline biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.

...

Code Block
# Install the Python version and create a dedicated virtual environment
pyenv install 3.10.11
pyenv virtualenv 3.10.11 schematicpy_3_10_11
pyenv activate schematicpy_3_10_11
# Install Schematic from PyPI
pip install schematicpy

Edit Configuration

...

Commands need to be run from ~/schematic.

Data Model Development

Summary

A data model defines attributes (i.e. data elements) describing metadata associated with any given dataset type. The data model also describes relationships between these attributes.

...

/wiki/spaces/SCHEM/pages/2473623559

Build a Data Model

The data model is defined in a table, then stored (i.e. serialized) in a JSON-LD schema.

The JSON-LD schema follows the specifications from Schema.org for attributes.
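
The table-to-JSON-LD step can be pictured with a minimal sketch that uses only the Python standard library. It is not Schematic's converter: the two table rows, the `ex:` prefix, and the node layout are assumptions made for illustration, and the output Schematic actually produces may be structured differently.

```python
import csv
import io
import json

# Hypothetical two-row data model table (illustrative values only).
TABLE = """Attribute,Description,parent
individualID,Unique identifier for an individual,DataProperty
specimenID,Unique identifier for a specimen,DataProperty
"""


def row_to_node(row):
    """Turn one table row into a JSON-LD node (assumed layout, not Schematic's)."""
    return {
        "@id": "ex:" + row["Attribute"],
        "@type": "rdfs:Class",
        "rdfs:label": row["Attribute"],
        "rdfs:comment": row["Description"],
        "rdfs:subClassOf": [{"@id": "ex:" + row["parent"]}],
    }


def table_to_jsonld(text):
    """Serialize every row of the table into one JSON-LD graph."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        "@context": {
            "ex": "http://example.org/",  # placeholder namespace
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        },
        "@graph": [row_to_node(row) for row in rows],
    }


model = table_to_jsonld(TABLE)
print(json.dumps(model, indent=2))
```

Each row becomes one node in `@graph`, which is the sense in which the table is "serialized" into a JSON-LD schema.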

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema#A.-Schema-properties-and-relationships

Convert data model from CSV to JSONLD

schematic schema convert input.csv --output_jsonld output.jsonld

Schematic DB

Create Data Model

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/Guide+How+to+use+Schematic+for+Data+Model+Development#Create-a-Data-...

Schematic DB is a package used to ingress the manifests created by Schematic into a database.

  • Schematic DB uses any of these validation rules to determine an attribute's datatype:

    • str

    • float

    • num

    • int

    • date

    • If the attribute has none of the above rules, it uses a string type.

    • Otherwise, the attribute datatype is determined by the rule.
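
The datatype rule described above can be sketched as a small lookup. This is an illustration of the stated behavior, not Schematic DB's code: the function name `datatype_for` is hypothetical, and comparing only the first word of each rule string is an assumption.

```python
# Rule names listed on this page for Schematic DB.
TYPE_RULES = ("str", "float", "num", "int", "date")


def datatype_for(validation_rules):
    """Return the first type rule found, or "str" when none is present."""
    for rule in validation_rules:
        # Assumption: a rule string may carry extra words, so only the
        # first word is compared against the known type rules.
        words = rule.split()
        name = words[0] if words else ""
        if name in TYPE_RULES:
            return name
    # No type rule was given: fall back to a string type.
    return "str"


print(datatype_for(["int"]))
print(datatype_for(["list"]))
```

An attribute whose rules include none of the five names falls through to the string default, matching the last two bullets.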

Build a Data Model

The data model is defined in a table, then stored (i.e. serialized) in a JSON-LD schema which specifies attributes as suggested by Schema.org.

/wiki/spaces/SCHEM/pages/2473623559

https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g13aaf3b8358_0_0

Documentation

/wiki/spaces/SCHEM/pages/2473623559

...

https://github.com/adknowledgeportal/data-models

Sage Data Models for Reference

Recommendations

  • Draw a diagram. A diagram is a useful reference when developing the model.

  • Start small with a basic skeleton and then build.

  • Use schematic in dev mode to convert the model to JSON-LD regularly and check for errors.

...

The data model requires these columns:

  1. Attribute

  2. Description

  3. ValidValues

  4. DependsOn

  5. required

  6. source

  7. parent

  8. properties

  9. dependsOnComponent
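
Before converting a model, it can help to confirm that the CSV header contains every column in the list above. The sketch below is an assumption-laden helper, not part of Schematic: `missing_columns` is a hypothetical name, and the column spellings are copied from this page, so they may need adjusting to match the header the installed Schematic version expects.

```python
import csv
import io

# Column names exactly as listed on this page.
REQUIRED_COLUMNS = [
    "Attribute",
    "Description",
    "ValidValues",
    "DependsOn",
    "required",
    "source",
    "parent",
    "properties",
    "dependsOnComponent",
]


def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [column for column in REQUIRED_COLUMNS if column not in present]


partial = "Attribute,Description,parent\nindividualID,An identifier,DataProperty\n"
print(missing_columns(partial))
```

An empty result means the header is complete; anything else names the columns still to be added.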

Example Model

Generate JSON-LD from CSV

schematic schema convert data_model.csv

This model does NOT validate as provided.

Schematic DB

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema#Schemas-and-Schematic-DB

Data Model Validation

/wiki/spaces/SCHEM/pages/2645262364

Data Model Visualization

Convert Data Model from CSV to JSON-LD

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/Guide+How+to+use+Schematic+for+Data+Model+Development#Convert-Data-Model

schematic schema convert model.csv

What is JSON-LD?

Data models are formatted in JavaScript Object Notation for Linked Data (JSON-LD). JSON-LD is supported by http://schema.org, which aids dataset discoverability in search engines such as Dataset Search.

Guide to Developing Data Models in JSON-LD

JSON-LD, or JavaScript Object Notation for Linked Data, is a JSON-based format for serializing Linked Data. It extends JSON with additional functionality to represent linked data structures, such as contexts, @id, and @type. JSON-LD is a lightweight and flexible format that can be used to represent a variety of data models.
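
As a small illustration of those keywords, the sketch below builds a JSON-LD document as a plain Python dictionary and round-trips it through the standard `json` module. The dataset identifier and name are made-up placeholders; `schema:Dataset` refers to the Schema.org Dataset type.

```python
import json

# A minimal JSON-LD document: ordinary JSON plus "@" keywords.
doc = {
    "@context": {"schema": "https://schema.org/"},  # maps the prefix to a vocabulary
    "@id": "https://example.org/datasets/demo",     # placeholder identifier
    "@type": "schema:Dataset",                      # what kind of thing this node is
    "schema:name": "Demo dataset",                  # placeholder property value
}

text = json.dumps(doc, indent=2)
parsed = json.loads(text)  # any JSON parser can read a JSON-LD document

# Collect the JSON-LD keywords used at the top level.
keywords = sorted(key for key in parsed if key.startswith("@"))
print(keywords)
```

Because the document is valid JSON, no special parser is needed to read it; the `@` keys are what add the linked-data meaning.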

This guide provides an introduction to developing data models in JSON-LD. It covers the following topics:

...

JSON-LD syntax

...

JSON-LD contexts

...

Modeling entities and relationships

...

Using vocabularies

...

JSON-LD

...

JSON-LD Syntax

JSON-LD documents are valid JSON documents. They consist of key-value pairs, where the keys are strings and the values can be strings, numbers, objects, arrays, or booleans. JSON-LD documents can also contain additional keywords that provide additional information about the data.

...

  • http://Schema.org

  • Dublin Core

  • Friend of a Friend (FOAF)

  • GoodRelations

  • GeoNames

  • MusicBrainz

When developing a JSON-LD data model, it is important to choose the appropriate vocabulary. The vocabulary should be relevant to the type of data that you are modeling.
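
To show how a chosen vocabulary is referenced, here is a simplified sketch of prefix expansion against a context. It covers only the `prefix:term` case and is not the full JSON-LD expansion algorithm; the `expand` helper is hypothetical, while the three namespace IRIs are the published ones for Schema.org, Dublin Core terms, and FOAF.

```python
# Prefix-to-namespace mappings, as they would appear in an "@context".
CONTEXT = {
    "schema": "https://schema.org/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(term, context=CONTEXT):
    """Expand "prefix:local" to a full IRI when the prefix is known."""
    prefix, separator, local = term.partition(":")
    if separator and prefix in context:
        return context[prefix] + local
    # Unknown prefix or no prefix at all: leave the term unchanged.
    return term


print(expand("dcterms:title"))
print(expand("foaf:name"))
```

Picking a vocabulary therefore amounts to deciding which namespaces the model's terms should expand into.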

Ontology Resources

...

...

Error Checking

Schematic dev mode helps find and fix errors by iteratively checking data models.

Data Model Visualization

Data Model Validation

/wiki/spaces/SCHEM/pages/2645262364

...

Metadata Dictionary

AD Knowledge Portal Metadata Dictionary

https://sagebio.shinyapps.io/amp-ad-metadata-dictionary/
https://ontofox.hegroup.org/
https://linkml.io/linkml/intro/tutorial.html
https://docs.google.com/spreadsheets/d/1vDdcqt3Lgehyq1iCnlF1H9JZi63pLj-u/edit#gid=1939820452

Data Curator App

http://dca.app.sagebionetworks.org

https://dca-dev.app.sagebionetworks.org/

https://portal.includedcc.org/dashboard

https://github.com/adknowledgeportal/data_curator

https://linkml.io/schemasheets/#examples

https://github.com/adknowledgeportal/data-models

https://docs.google.com/spreadsheets/d/1w6zDfz3_yrCjjrqfpXBGNmd0LZL4B03gr1KfzJtk5Cs/edit#gid=674286209

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2453176326

https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g4d21a8c2ba_0_11

/wiki/spaces/SCHEM/pages/2458419217

Glossary

Manifest - metadata table submitted for datasets

Upload Data

2458648589/Setting+up+a+DCC+Asset+Store#How-do-I-Structure-My-DCC-Synapse-Project-to-Work-with-the-Data-Curator-App%3F

Projects

Folder Structure

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/organize-your-data-upload#OrganizeyourDataUpload-FlattenedDataLayoutExample

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/uploading-data

https://dca-docs.scrollhelp.site/DCA/Working-version/ELITE/validate-and-submit-your-metadata

...

Code Block
.
├── biospecimen_experiment_1
│   └── manifest1.csv
├── biospecimen_experiment_2
│   └── manifestA.csv
├── single_cell_RNAseq_batch_1
│   ├── manifestX.csv
│   ├── fileA.txt
│   ├── fileB.txt
│   ├── fileC.txt
│   └── fileD.txt
└── single_cell_RNAseq_batch_2
    ├── manifestY.csv
    └── file1.txt
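
The layout above can be checked with a short standard-library sketch that recreates the example tree in a temporary directory and lists the manifest found in each top-level folder. The `manifest*.csv` naming pattern is taken from the example tree and is an assumption, not a documented DCA requirement; the helper names are hypothetical.

```python
import tempfile
from pathlib import Path

# The example tree from this page: folder name -> files it contains.
LAYOUT = {
    "biospecimen_experiment_1": ["manifest1.csv"],
    "biospecimen_experiment_2": ["manifestA.csv"],
    "single_cell_RNAseq_batch_1": [
        "manifestX.csv", "fileA.txt", "fileB.txt", "fileC.txt", "fileD.txt",
    ],
    "single_cell_RNAseq_batch_2": ["manifestY.csv", "file1.txt"],
}


def build_layout(root, layout):
    """Create the folders and empty files under root."""
    for folder, files in layout.items():
        path = Path(root) / folder
        path.mkdir()
        for name in files:
            (path / name).write_text("")


def manifests_by_folder(root):
    """Map each top-level folder to the manifest CSV files inside it."""
    result = {}
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        result[folder.name] = sorted(f.name for f in folder.glob("manifest*.csv"))
    return result


with tempfile.TemporaryDirectory() as tmp:
    build_layout(tmp, LAYOUT)
    found = manifests_by_folder(tmp)

print(found)
```

Pointing `manifests_by_folder` at a local copy of a real upload folder gives a quick view of which folders still lack a manifest.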

Study Content

/wiki/spaces/AKP/pages/1057882353

  • Study Description in wiki

  • Methods description in each data folder

/wiki/spaces/EPD1/pages/2900819969

AMP-AD

...

https://dca-dev.app.sagebionetworks.org/
Abby's request for testing

https://sagebionetworks.slack.com/archives/C02A2FBN3G8/p1682116574295509
https://github.com/adknowledgeportal/test-data-model/blob/main/model-ad/model-ad.data.model.jsonld

https://sagebio.shinyapps.io/adknowledgeportal-data-curator/
https://www.synapse.org/#!Synapse:syn33582398/wiki/619343
https://github.com/adknowledgeportal/data_curator
https://github.com/adknowledgeportal/test-data-model
https://github.com/adknowledgeportal/data-models/blob/main/README.md#editing-data-models

AD data model → modular

repo:

branch: test-split-csvs

folders:

modules/

..biospecimen/

..mouse/

Jira: ADM-836

Term = Attribute in the data model where Parent = DataProperty
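
Read as a filter over the data model table, that definition can be sketched as follows. The three rows are invented for illustration, the `parent` column name follows the column list earlier on this page, and `terms` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical data model rows (illustrative values only).
TABLE = """Attribute,Description,parent
individualID,Unique identifier for an individual,DataProperty
Biospecimen,Template describing a specimen,Component
specimenID,Unique identifier for a specimen,DataProperty
"""


def terms(csv_text):
    """Return the attributes whose parent is DataProperty."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["Attribute"] for row in rows if row["parent"].strip() == "DataProperty"]


print(terms(TABLE))
```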

test-split-csvs branch

MODEL-AD

ELITE

Annotate study folder with contentType = 'dataset'

Flattened file structure

Create Project

Maintain File permission access easily

Top level: assay folders

All data files of one type in assay folder

These assay folder names will be displayed

data_folder/

Schematic Configuration needed config.yml

master_fileview ‘synID’

which refers to this:

Fileview - Files and Folders
https://www.synapse.org/#!Synapse:syn36759435/tables/
https://www.synapse.org/#!Synapse:syn51753858/tables/

Add CSV + JSONLD to github – test-data-model

extract individual and specimen ID from filenames

http://regex101.com
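
A pattern prototyped on regex101 can then be applied in Python. The sketch below assumes a hypothetical filename convention of `<individualID>_<specimenID>_<anything>`; the real MODEL-AD filenames may follow a different scheme, so the pattern would need to be adapted.

```python
import re

# Assumed convention: individualID, then specimenID, separated by underscores.
PATTERN = re.compile(
    r"^(?P<individualID>[A-Za-z0-9]+)_(?P<specimenID>[A-Za-z0-9]+)_.+$"
)


def extract_ids(filename):
    """Return (individualID, specimenID), or None if the name does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return match.group("individualID"), match.group("specimenID")


print(extract_ids("IND001_SPC0042_rnaseq.fastq.gz"))
print(extract_ids("notes.txt"))
```

Named groups keep the two identifiers labeled, which makes it easier to write them into the matching manifest columns.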

...

https://github.com/adknowledgeportal/test-data-model
https://github.com/Sage-Bionetworks/data_curator/blob/18dc00723f2e95a98525ff695401ac67e7785475/schematic_config.yml#L31

Data Model Validation Rules

/wiki/spaces/SCHEM/pages/2645262364

...

_config

needs to point to this fileview and the data model

fork repo

edit dca-template-config.json

add MODEL-AD folder and edit configuration as needed, then send a pull request

ADKP example

Fileview DCA Asset View that DCA uses

folder contentType = ‘dataset’

One project for all of AD

Templates

https://drive.google.com/drive/folders/1M90FJX2seyb1s-QzKIHRrSCDuLC97NJO

/wiki/spaces/SCHEM/pages/2473623559

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/organize-your-data-upload#OrganizeyourDataUpload-FlattenedDataLayoutExample

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/uploading-data

https://dca-docs.scrollhelp.site/DCA/Working-version/ELITE/validate-and-submit-your-metadata


Resources

https://linkml.io/schemasheets/#examples

https://linkml.io/linkml/intro/tutorial.html

https://docs.google.com/document/d/1nZGLRKW5LXpY-LBrtrgs4MyO-fb0kDDeouEOvW36xo0/edit#heading=h.o7ihd22lafi

https://learnxinyminutes.com/docs/yaml/

https://webprotege.stanford.edu/#projects/cb219a51-dd90-4921-bec4-c836bd96f680/edit/Properties?selection=ObjectProperty(%3Chttp://example.com/BallpointPenOntology/hasCharacteristic%3E)

ELITE

https://drive.google.com/drive/folders/1M90FJX2seyb1s-QzKIHRrSCDuLC97NJO

...

Glossary

Template

Manifest - metadata table submitted for dataset