Schematic + Data Curator App Workflow Notes


Schematic

Summary

SCHEMATIC is an acronym for Schema Engine for Manifest Ingress and Curation. The Python-based infrastructure provides a novel, schema-based metadata ingress ecosystem meant to streamline biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.

Documentation

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2967568387/Guide+How+to+use+Schematic+for+Data+Model+Development#About

Code in Github

https://github.com/Sage-Bionetworks/schematic

Installation

https://pypi.org/project/schematicpy/

pip install schematicpy

  • Requires Python version 3.10.*

Install for data curator app:

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install schematicpy

Setup Python Environment

pyenv install 3.10.11
pyenv virtualenv 3.10.11 schematic_3_10_11
pyenv activate schematic_3_10_11
pip install schematicpy

Edit Configuration

Set the required parameters in config.yml. The reference file is:

https://github.com/Sage-Bionetworks/schematic/blob/develop/config.yml

Command Line Reference

https://sage-schematic.readthedocs.io/en/develop/cli_reference.html

Run the commands from the ~/schematic directory.

Data Model

Summary

A data model defines attributes (i.e. data elements) describing metadata associated with any given dataset type. The data model also describes relationships between these attributes.

Documentation

/wiki/spaces/SCHEM/pages/2473623559

Build a Data Model

The data model is defined in a table, then stored (i.e. serialized) in a JSON-LD schema.

The JSON-LD schema follows the Schema.org conventions for specifying attributes:

https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema#A.-Schema-properties-and-relationships

/wiki/spaces/SCHEM/pages/2967568387

https://linkml.io/linkml/intro/tutorial.html
https://docs.google.com/spreadsheets/d/1vDdcqt3Lgehyq1iCnlF1H9JZi63pLj-u/edit#gid=1939820452
https://portal.includedcc.org/dashboard
https://linkml.io/schemasheets/#examples
https://docs.google.com/spreadsheets/d/1w6zDfz3_yrCjjrqfpXBGNmd0LZL4B03gr1KfzJtk5Cs/edit#gid=674286209
https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g4d21a8c2ba_0_11

/wiki/spaces/SCHEM/pages/2453176326

/wiki/spaces/SCHEM/pages/2458419217

Install Schematic

Schematic runs on Python 3.10, so the Python environment must be controlled. pyenv is one option: https://fathomtech.io/blog/python-environments-with-pyenv-and-vitualenv/

pyenv install 3.10.11
pyenv virtualenv 3.10.11 schematic_3_10_11
pyenv activate schematic_3_10_11
pip install schematicpy

Data model visualizer?

Build a Data Model

/wiki/spaces/SCHEM/pages/2473623559

Data Model Workshop
https://docs.google.com/presentation/d/129pSx58qDm7Y1OQmSSHKDq6tsoD3pW_gDRNXiX2rd0w/edit#slide=id.g13aaf3b8358_0_0

Diagramming - draw out model

Lucid.app - can use templates like ERD example

Can reference diagram when building data model

Schema visualization tool (data viz collaboration opportunity, Rich!!)

Start small - skeleton --> schema

Definitions on /wiki/spaces/SCHEM/pages/2473623559

Manifest - metadata table submitted for datasets
Data Model -
Data Schema -

Start from single table

CSV with basic column set: Attribute, Description, ValidValues, DependsOn, required, source, parent, properties, dependsOnComponent, validationRules
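
As a starting point, a skeleton CSV with this column set could be generated as in the sketch below. The column names are copied from the note above and should be checked against the Schematic documentation; the two example attributes (Patient, Sex) are hypothetical and only show the shape of a row.

```python
import csv
import io

# Column names copied from the note above; confirm spelling against the Schematic docs.
COLUMNS = [
    "Attribute", "Description", "ValidValues", "DependsOn", "required",
    "source", "parent", "properties", "dependsOnComponent", "validationRules",
]


def skeleton_csv(rows):
    """Return CSV text with the header row and one line per attribute dict."""
    buffer = io.StringIO()
    # restval="" fills any column that a row does not set.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical attributes, for illustration only.
example_rows = [
    {"Attribute": "Patient", "Description": "A study participant", "required": "TRUE"},
    {"Attribute": "Sex", "Description": "Reported sex",
     "ValidValues": "Female, Male, Other", "parent": "Patient"},
]

csv_text = skeleton_csv(example_rows)
print(csv_text)
```

Starting from a small file like this and adding a few attributes at a time keeps each conversion to JSON-LD easy to debug.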

Use schematic in dev mode to convert the model to JSON-LD regularly and check for errors

https://ontofox.hegroup.org/

Namespace collisions - should use "Biothings" schema

Graph modeling requires unique names. Some names are protected too. No underscores.
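
A quick pre-check of attribute names before conversion might look like the sketch below. It only covers the two rules stated above (unique names, no underscores); the list of protected names is not given in these notes, so it is not checked. The function name and sample names are hypothetical.

```python
from collections import Counter


def check_attribute_names(names):
    """Return a list of human-readable problems found in attribute names."""
    problems = []
    # Count names after trimming whitespace so "Sex " and "Sex" collide.
    counts = Counter(name.strip() for name in names)
    for name, count in counts.items():
        if not name:
            problems.append("empty name")
            continue
        if count > 1:
            problems.append(f"duplicate name: {name!r} appears {count} times")
        if "_" in name:
            problems.append(f"underscore in name: {name!r}")
    return problems


# Hypothetical attribute names, for illustration only.
problems = check_attribute_names(["Patient", "Sex", "Sex", "specimen_id"])
for problem in problems:
    print(problem)
```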

Schematic dev mode helps find and deal with errors by iteratively checking the JSON-LD

Generate JSON-LD from CSV: schematic schema convert data_model.csv

Submit a manifest: schematic model --config config.yml submit --manifest_path manifest.csv --dataset_id synId --manifest_record_type table

command line reference

JSON for Linking Data (JSON-LD)

JSON
https://cambridgesemantics.com/blog/semantic-university/learn-rdf/rdf-nuts-bolts-2/
One reason we use JSON-LD in schematic is that Schema.org supports it: https://schema.org/
One reason for using Schema.org is dataset discoverability: https://datasetsearch.research.google.com/
JSON-LD is useful for search engines (Schema.org).
For anyone who wants to learn more about "linked data", reading about ideas related to the "semantic web" and the W3C standards can be a fun rabbit hole to go down.

Error Troubleshooting

SchemaHub Documentation on Confluence. This includes definitions of data model like validation rules

Github Tickets Sage-Bionetworks/schematic

Add ticket workflow

Click on Issue
Issue: Feature Request
Add title
Describe problem and a potential solution
Importance
Timeline
Additional Context
Attach any needed documents or screenshots

http://regex101.com

Create a data model formatted as a CSV

Where is the reference to how data model needs to be formatted?

Convert data model from CSV to JSONLD

schematic schema convert input.csv --output_jsonld output.jsonld
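
After each conversion, a small structural check of the generated file can catch obvious problems early. The sketch below is not part of Schematic; it assumes the output is a JSON object with top-level "@context" and "@graph" keys and that each graph node carries an "@id". Verify those assumptions against an actual converted file before relying on it.

```python
import json
from collections import Counter


def check_jsonld(text):
    """Parse JSON-LD text and report basic structural problems as strings."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(doc, dict):
        return ["top level is not a JSON object"]

    problems = []
    # Assumed layout: a context plus a flat list of nodes under "@graph".
    for key in ("@context", "@graph"):
        if key not in doc:
            problems.append(f"missing {key}")

    nodes = [n for n in doc.get("@graph", []) if isinstance(n, dict)]
    ids = Counter(node.get("@id") for node in nodes)
    for node_id, count in ids.items():
        if node_id is None:
            problems.append(f"{count} node(s) without @id")
        elif count > 1:
            problems.append(f"duplicate @id: {node_id}")
    return problems


# Hypothetical miniature document, for illustration only.
sample = json.dumps({
    "@context": {"bts": "http://schema.biothings.io/"},
    "@graph": [{"@id": "bts:Patient"}, {"@id": "bts:Patient"}, {"name": "x"}],
})
report = check_jsonld(sample)
print(report)
```

In practice the text would come from reading output.jsonld from disk rather than from an inline sample.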

Guide to Developing Data Models in JSON-LD

JSON-LD, or JavaScript Object Notation for Linked Data, is a JSON-based format for serializing Linked Data. It extends JSON with additional functionality to represent linked data structures, such as contexts, @id, and @type. JSON-LD is a lightweight and flexible format that can be used to represent a variety of data models.

This guide provides an introduction to developing data models in JSON-LD. It covers the following topics:

  • JSON-LD syntax

  • JSON-LD contexts

  • Modeling entities and relationships

  • Using vocabularies

  • Best practices for developing JSON-LD data models

JSON-LD Syntax

JSON-LD documents are valid JSON documents. They consist of key-value pairs, where the keys are strings and the values can be strings, numbers, objects, arrays, or booleans. JSON-LD documents can also contain additional keywords that provide additional information about the data.

The following is an example of a simple JSON-LD document:

{
  "@context": "https://schema.org/",
  "@id": "http://example.com/book1",
  "@type": "Book",
  "name": "The Hitchhiker's Guide to the Galaxy",
  "author": "Douglas Adams"
}

This document describes a book using the following keys:

  • @context: The context URI specifies the vocabulary that is used to interpret the data. In this case, the vocabulary is Schema.org.

  • @id: The @id keyword uniquely identifies the resource. In this case, the resource is a book.

  • @type: The @type keyword specifies the type of the resource. In this case, the resource is a Book.

  • name: The name property specifies the name of the book.

  • author: The author property specifies the author of the book.
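
Because a JSON-LD document is ordinary JSON, it can be loaded with any JSON parser. The sketch below uses Python's standard json module to separate the JSON-LD keywords (keys that start with @) from the regular properties. It uses the @type keyword for the resource type and is only an illustration, not a JSON-LD processor.

```python
import json

book_json = """
{
  "@context": "https://schema.org/",
  "@id": "http://example.com/book1",
  "@type": "Book",
  "name": "The Hitchhiker's Guide to the Galaxy",
  "author": "Douglas Adams"
}
"""

book = json.loads(book_json)

# Keys beginning with "@" are JSON-LD keywords; everything else is a property.
keywords = {key: value for key, value in book.items() if key.startswith("@")}
properties = {key: value for key, value in book.items() if not key.startswith("@")}

print("keywords:", sorted(keywords))
print("properties:", sorted(properties))
```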

JSON-LD Contexts

JSON-LD contexts map short, human-readable terms to IRIs (Internationalized Resource Identifiers). Contexts can also define prefixes for IRIs. This makes JSON-LD documents easier to read and write.

The @context property in a JSON-LD document holds either a context URI or an inline context object. When a JSON-LD processor encounters a term or a prefixed name in a document, it uses the context to expand it to a full IRI.

For example, the following context defines a prefix for the Schema.org vocabulary:

{
  "@context": {
    "schema": "https://schema.org/"
  }
}

Using this context, the following JSON-LD document can be interpreted:

{
  "@context": {
    "schema": "https://schema.org/"
  },
  "@id": "http://example.com/book1",
  "@type": "schema:Book",
  "schema:name": "The Hitchhiker's Guide to the Galaxy",
  "schema:author": "Douglas Adams"
}

The type and the property names are now prefixed with schema:, so a processor can expand each one to a full Schema.org IRI (for example, schema:Book becomes https://schema.org/Book). The prefix keeps the document short and makes the source vocabulary explicit.
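
Prefix expansion can be illustrated with a few lines of Python. This is a deliberately simplified sketch: it handles only prefix:suffix names defined in a flat context dictionary, and a real JSON-LD processor implements many more rules. The function name expand_term is hypothetical.

```python
def expand_term(value, context):
    """Expand a compact IRI such as 'schema:Book' using prefixes in context.

    Values whose prefix is not defined in the context (including absolute
    IRIs such as 'http://...') are returned unchanged.
    """
    prefix, separator, suffix = value.partition(":")
    if separator and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    return value


context = {"schema": "https://schema.org/"}

expanded = expand_term("schema:Book", context)
print(expanded)
print(expand_term("http://example.com/book1", context))
```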

Modeling Entities and Relationships

Entities in a JSON-LD data model are represented by objects. Relationships between entities are represented by properties. For example, the following JSON-LD document describes a book and a person:

{
  "@context": {
    "schema": "https://schema.org/"
  },
  "@id": "http://example.com/book1",
  "@type": "schema:Book",
  "schema:name": "The Hitchhiker's Guide to the Galaxy",
  "schema:author": {
    "@id": "http://example.com/douglas-adams",
    "@type": "schema:Person",
    "schema:name": "Douglas Adams"
  }
}

The schema:author property in the book object refers to the person object. This indicates that Douglas Adams is the author of The Hitchhiker's Guide to the Galaxy.
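
One way to see the entities and relationships in such a document is to flatten it into (subject, predicate, object) statements. The sketch below walks nested objects only; it does not handle arrays or context expansion, and the function name iter_triples is hypothetical.

```python
def iter_triples(node):
    """Yield (subject, predicate, object) tuples from a nested JSON-LD-style dict."""
    subject = node.get("@id")
    for key, value in node.items():
        if key in ("@context", "@id"):
            continue  # not statements about the subject
        if isinstance(value, dict):
            # A nested object is another entity: link to it, then describe it.
            yield (subject, key, value.get("@id"))
            yield from iter_triples(value)
        else:
            yield (subject, key, value)


document = {
    "@context": {"schema": "https://schema.org/"},
    "@id": "http://example.com/book1",
    "@type": "schema:Book",
    "schema:name": "The Hitchhiker's Guide to the Galaxy",
    "schema:author": {
        "@id": "http://example.com/douglas-adams",
        "@type": "schema:Person",
        "schema:name": "Douglas Adams",
    },
}

triples = list(iter_triples(document))
for triple in triples:
    print(triple)
```

The statement linking the book to the person is the relationship; the remaining statements describe each entity.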

Using Vocabularies

Vocabularies are collections of terms and definitions that are used to describe data. JSON-LD data models can use vocabularies to provide a common understanding of the data.

There are many different vocabularies available. Some popular vocabularies include:

  • Schema.org

  • Dublin Core

  • Friend of a Friend (FOAF)

  • GoodRelations

  • GeoNames

  • MusicBrainz

When developing a JSON-LD data model, it is important to choose the appropriate vocabulary. The vocabulary should be relevant to the type of data that you are modeling.

Best Practices

Upload Data

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/organize-your-data-upload#OrganizeyourDataUpload-FlattenedDataLayoutExample

https://dca-docs.scrollhelp.site/DCA/Working-version/Project-Agnostic/uploading-data

https://dca-docs.scrollhelp.site/DCA/Working-version/ELITE/validate-and-submit-your-metadata

AD Data Models https://github.com/adknowledgeportal/data-models
DCA app development version

https://dca-dev.app.sagebionetworks.org/
Abby's request for testing

https://sagebionetworks.slack.com/archives/C02A2FBN3G8/p1682116574295509

https://github.com/adknowledgeportal/test-data-model/blob/main/model-ad/model-ad.data.model.jsonld

https://sagebio.shinyapps.io/adknowledgeportal-data-curator/
https://www.synapse.org/#!Synapse:syn33582398/wiki/619343
https://github.com/adknowledgeportal/data_curator
https://github.com/adknowledgeportal/test-data-model

Annotate study folder with contentType = 'dataset'

https://www.synapse.org/#!Synapse:syn36759435/tables/
Add CSV + JSONLD to github – test-data-model

https://github.com/adknowledgeportal/test-data-model
https://github.com/adknowledgeportal/data_curator/blob/18dc00723f2e95a98525ff695401ac67e7785475/schematic_config.yml#L31
Data Model Validation Rules

/wiki/spaces/SCHEM/pages/2645262364


RegEx - extract individual and specimen ID from filenames
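
A regular expression for this could look like the sketch below. The filename convention is not recorded in these notes, so the pattern assumes a hypothetical layout of individualID_specimenID_rest.ext with alphanumeric IDs; adjust the pattern to the project's real naming scheme (http://regex101.com is handy for testing).

```python
import re

# Hypothetical convention: <individualID>_<specimenID>_<anything>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<individual_id>[A-Za-z0-9]+)_(?P<specimen_id>[A-Za-z0-9]+)_"
)


def extract_ids(filename):
    """Return (individual_id, specimen_id), or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("individual_id"), match.group("specimen_id")


# Hypothetical filenames, for illustration only.
ids = extract_ids("IND001_SPEC042_rnaseq.fastq.gz")
print(ids)
print(extract_ids("README.txt"))
```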

Data Model

  • https://github.com/Sage-Bionetworks/1kD-model

  • /wiki/spaces/SCHEM/pages/2473623559

  • OWL Tutorial

  • https://schema.org/

  • https://linkml.io/linkml/

  • https://learnxinyminutes.com/docs/yaml/

  • https://json-ld.org/

  • http://vowl.visualdataweb.org/webvowl.html

  • https://medium.com/wallscope/understanding-linked-data-formats-rdf-xml-vs-turtle-vs-n-triples-eb931dbe9827

  • https://webprotege.stanford.edu/#projects/cb219a51-dd90-4921-bec4-c836bd96f680/edit/Properties?selection=ObjectProperty(%3Chttp://example.com/BallpointPenOntology/hasCharacteristic%3E)

  • https://sagebionetworks.jira.com/wiki/spaces/CDC

ELITE
