Proposal for Entity Schemas

Introduction

Most of the basic objects that Synapse currently supports are Entities. Each Entity has first class data that makes up the fields of an entity. All Entities also have Annotations that store additional data about an entity.

The following are the current Synapse Entities:

  • Project
  • Folder
  • Dataset
  • Layer
  • Location
  • EULA

Currently, all Entities are defined by "hard-coded" Java objects. The fields of these Java objects define the first class data of each entity. The only mechanism we have for constraining data of an Entity is to write Java code to do the validation. We also lack a mechanism to constrain or define annotations.

While defining entities using Java allowed us to quickly get a first version of Synapse built, we always planned on supporting a more dynamic approach to object definitions. Ideally, we would like our users to define entities without writing Java code. As it stands now, if our users want to add a field to an entity, an engineering task must be scheduled to get the change implemented. In theory, if we used a schema language like JSON Schema 03 for both entity definitions and data constraints, we could make changes to a schema with little or no engineering effort. Engineering would no longer be the bottleneck for the evolution of Synapse Entities and data.

Proposal

We are proposing to use JSON Schema 03 to define each Entity type. JSON Schema breaks an object definition into two major categories: properties and additional properties.

An example JSON Schema that describes products might look like:

{
     "name":"Product",
     "properties":{
       "id":{
         "type":"number",
         "description":"Product identifier",
         "required":true
       },
       "name":{
         "description":"Name of the product",
         "type":"string",
         "required":true
       },
       "price":{
         "type":"number",
         "minimum":0,
         "required":true
       },
       "tags":{
         "type":"array",
         "items":{
           "type":"string"
         }
       },
       "releaseStatus":{
         "type":"string",
         "description":"The release status of a product",
         "enum":[ "PROTOTYPE", "RELEASED", "RECALLED", "DEPRECIATED"]
       }
     },
     "additionalProperties":{
     }
}

In the above example, we can see how various types of data can be defined for a Product using JSON Schema. For example, "id" is a required number, while "releaseStatus" is an enumeration of strings. The "additionalProperties" block is left empty because it is not used in this example.
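To make the constraints concrete, here is a minimal sketch of how an instance could be checked against the Product schema. The `validate` and `validate_value` helpers are hypothetical and written only for illustration; they cover just the keywords used in the example ("required", "type", "minimum", "enum", "items") and are not part of Synapse or a complete JSON Schema 03 implementation.

```python
# Hypothetical, partial validator for the Product example above.
PRODUCT_SCHEMA = {
    "name": "Product",
    "properties": {
        "id": {"type": "number", "required": True},
        "name": {"type": "string", "required": True},
        "price": {"type": "number", "minimum": 0, "required": True},
        "tags": {"type": "array", "items": {"type": "string"}},
        "releaseStatus": {
            "type": "string",
            "enum": ["PROTOTYPE", "RELEASED", "RECALLED", "DEPRECIATED"],
        },
    },
}


def _type_ok(value, type_name):
    """Check a value against the few JSON Schema types used in the example."""
    if type_name == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "array":
        return isinstance(value, list)
    return True  # other types are not checked in this sketch


def validate_value(name, value, prop):
    """Return a list of error messages for one property value."""
    type_name = prop.get("type")
    if type_name and not _type_ok(value, type_name):
        return [f"{name}: expected {type_name}"]
    errors = []
    if "minimum" in prop and _type_ok(value, "number") and value < prop["minimum"]:
        errors.append(f"{name}: {value!r} is below minimum {prop['minimum']}")
    if "enum" in prop and value not in prop["enum"]:
        errors.append(f"{name}: {value!r} is not one of the allowed values")
    if type_name == "array" and "items" in prop:
        for index, item in enumerate(value):
            errors.extend(validate_value(f"{name}[{index}]", item, prop["items"]))
    return errors


def validate(instance, schema):
    """Return a list of error messages; an empty list means the instance is valid."""
    errors = []
    for name, prop in schema.get("properties", {}).items():
        if name not in instance:
            if prop.get("required"):
                errors.append(f"{name}: required property is missing")
            continue
        errors.extend(validate_value(name, instance[name], prop))
    return errors
```

Under these assumptions, `validate({"id": 1, "name": "Widget", "price": 9.5}, PRODUCT_SCHEMA)` returns an empty list, while a negative price or a missing "id" produces an error message for that property.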

We are proposing to use the "properties" to define the primary fields of a Synapse Entity. These primary fields can be considered the expected data of all instances of a given entity. Using the Product example from above, this implies that all instances of Product would have "id", "name", "price" and "tags".

Initially, we were planning to use "additionalProperties" to define the Annotations of a Synapse Entity, but this raised a fundamental issue. If the Annotations of an entity are provided for ad hoc user data, then formally defining them in the entity schema for all instances of a type seems like a poor fit. That said, we still have many use cases where we want to constrain the data of an annotation when it is added to an instance of an entity. Therefore, we are proposing that these annotation types be set on a per-instance basis rather than at the entity schema level. Annotation types are covered in a separate document: Proposal for Annotation Types

Schema Life-cycle

For the initial implementation, we are proposing that an Entity Schema can only be defined and edited as part of compiling Synapse. This means run-time edits or additions to a schema will not be possible. The reason for this limitation is to keep the life-cycle of the schema as simple as possible. As we will see, the life-cycle is already complicated even with this limitation.

Define Entities

A new entity will be created by first creating a new JSON text file in the lib-auto-generated project's src/main/resources folder. Folder hierarchies should be used to represent the equivalent of "packages" for each entity.
The following example shows where an Example entity might be created:

/lib-auto-generated/src/main/resources/org/sagebionetworks/entity/type/Example.json

Let's say we also want to define an Annotation type and use it to help define our Example.json. This annotation type definition JSON text file might be created in the following location:

/lib-auto-generated/src/main/resources/org/sagebionetworks/annotation/types/VertebrateOrganType.json

Before we look at the definition of our Example.json let's first look at the definition of our new VertebrateOrganType.json. For this example we want to use the Basic Vertebrate Anatomy ontology to define the valid values for Organs:
VertebrateOrganType.json

{
    "type":"string",
    "format":"uri",
    "enum":[
        {"XQUERY":"doc(http://rest.bioontology.org/bioportal/concepts/4531?conceptid=tbio:Organ&light=1&apikey=2fb9306a-7f3f-477a-821e-e3ccd7356a18)/success/data/classBean/relations/entry[string=Subclass]/list/classBean/fullId"}
    ]
}

In this example, the enumeration values are defined by an XQuery that is used to get the "fullId" (URI) of every subclass of the term "Organ" from the XML returned by NCBO's BioPortal Term services. Here is the XML returned by the term service for this example: http://rest.bioontology.org/bioportal/concepts/4531?conceptid=tbio:Organ&light=1&apikey=2fb9306a-7f3f-477a-821e-e3ccd7356a18.
Assuming the XQuery is set up correctly, the effective enum definition for this type would be:

"enum":[
	"http://www.co-ode.org/ontologies/basic-bio/basic-vertebrate-gross-anatomy.owl#Heart",
	"http://www.co-ode.org/ontologies/basic-bio/basic-vertebrate-gross-anatomy.owl#Pericardium",
	"http://www.co-ode.org/ontologies/basic-bio/basic-vertebrate-gross-anatomy.owl#Brain",
	"http://www.co-ode.org/ontologies/basic-bio/basic-vertebrate-gross-anatomy.owl#Stomach",
	"http://www.co-ode.org/ontologies/basic-bio/basic-vertebrate-gross-anatomy.owl#Lung",
	"http://www.co-ode.org/ontologies/basic-bio/basic-vertebrate-gross-anatomy.owl#Liver"
]
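The path expression in the query can be illustrated with a small sketch. The XML below is a hypothetical, heavily trimmed stand-in whose element names are taken only from the path in the query above; the real BioPortal response is not reproduced here, and the second "entry" with its "OtherRelation" label is invented to show the filter at work. The `resolve_enum` helper is likewise an assumption and uses the limited XPath support in Python's standard library rather than a full XQuery engine.

```python
import xml.etree.ElementTree as ET

BASE = "http://www.co-ode.org/ontologies/basic-bio/basic-vertebrate-gross-anatomy.owl"

# Hypothetical, trimmed response shaped to match the path in the XQuery.
SAMPLE_XML = f"""
<success>
  <data>
    <classBean>
      <relations>
        <entry>
          <string>Subclass</string>
          <list>
            <classBean><fullId>{BASE}#Heart</fullId></classBean>
            <classBean><fullId>{BASE}#Brain</fullId></classBean>
          </list>
        </entry>
        <entry>
          <string>OtherRelation</string>
          <list>
            <classBean><fullId>{BASE}#NotASubclass</fullId></classBean>
          </list>
        </entry>
      </relations>
    </classBean>
  </data>
</success>
"""


def resolve_enum(xml_text):
    """Return the fullId of each class listed under the 'Subclass' entry."""
    root = ET.fromstring(xml_text.strip())  # root element is <success>
    path = "./data/classBean/relations/entry[string='Subclass']/list/classBean/fullId"
    return [element.text for element in root.findall(path)]
```

With this sample input, `resolve_enum(SAMPLE_XML)` yields only the Heart and Brain URIs, which is the list that would be substituted into "enum".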

Now that we have defined an Annotation Type for Organ using the ontology we can use this type in the definition of the entity.
Here is our definition of our example Entity:
Example.json

{
    "extends":"org/sagebionetworks/entity/type/Entity.json",
    "name":"Example",
    "properties":{
        "id":{
            "type":"number",
            "description":"Example identifier",
            "required":true
        },
        "name":{
            "description":"Name of the Example",
            "type":"string",
            "required":true
        },
        "organ":{
            "$ref":"org/sagebionetworks/annotation/types/VertebrateOrganType.json"
        }
    }
}

The first thing to point out about our Example.json is that it extends Entity.json, which makes it a Synapse Entity. This implies that it inherits all of the properties of the base Entity. The second thing to point out is that the "organ" property is defined using the annotation type we created earlier.
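One way to picture how "extends" and "$ref" combine is to flatten a schema into its effective set of properties. The sketch below does this over an in-memory dictionary keyed by schema path. The `resolve` helper is hypothetical, and the contents of Entity.json (a single "etag" field) are an assumption made purely for illustration, since the base Entity schema is not defined in this document.

```python
# Hypothetical in-memory schema registry keyed by resource path.
SCHEMAS = {
    "org/sagebionetworks/entity/type/Entity.json": {
        # Assumed base field; the real Entity.json is not shown in this proposal.
        "properties": {"etag": {"type": "string"}},
    },
    "org/sagebionetworks/annotation/types/VertebrateOrganType.json": {
        "type": "string",
        "format": "uri",
    },
    "org/sagebionetworks/entity/type/Example.json": {
        "extends": "org/sagebionetworks/entity/type/Entity.json",
        "properties": {
            "id": {"type": "number", "required": True},
            "name": {"type": "string", "required": True},
            "organ": {
                "$ref": "org/sagebionetworks/annotation/types/VertebrateOrganType.json"
            },
        },
    },
}


def resolve(path, schemas):
    """Return a flattened copy of a schema with inherited and referenced parts inlined."""
    schema = schemas[path]
    properties = {}
    parent = schema.get("extends")
    if parent is not None:
        # Inherit the parent's (already flattened) properties first.
        properties.update(resolve(parent, schemas)["properties"])
    for name, prop in schema.get("properties", {}).items():
        if "$ref" in prop:
            prop = schemas[prop["$ref"]]  # replace the reference with its target
        properties[name] = prop
    resolved = {k: v for k, v in schema.items() if k not in ("extends", "properties")}
    resolved["properties"] = properties
    return resolved
```

Resolving Example.json this way gives a schema with "etag", "id", "name" and "organ", where "organ" carries the string/URI definition from VertebrateOrganType.json.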

Compile POJOs (First Time)

Since we still want Java POJOs to represent all entities, we will use the schema-to-pojo-maven-plugin to build these POJOs. This is done by simply adding the following to the lib-auto-generated/pom.xml file:

<plugin>
	<groupId>org.sagebionetworks</groupId>
	<artifactId>schema-to-pojo-maven-plugin</artifactId>
	<version>${schema-to-pojo.version}</version>
	<executions>
		<execution>
			<goals>
				<goal>generate</goal>
			</goals>
			<configuration>
				<sourceDirectory>src/main/resources</sourceDirectory>
				<packageName>org.sagebionetworks</packageName>
				<outputDirectory>target/auto-generated-pojos</outputDirectory>
			</configuration>
		</execution>
	</executions>
</plugin>

The plugin will automatically create a POJO class for each JSON schema found in the resources directory. These POJOs will be placed in the target/auto-generated-pojos directory.

Synapse Deploy (First Time)

The first time Synapse is deployed after creating Entities, the org.sagebionetworks.repo.model.bootstrap.EntityBootstrapper will read all JSON schema files found in the lib-auto-generated.jar file and create a Synapse SchemaEntity (to be defined) for each one, using the directory structure to create each path. All schema entities will be placed in the folder:

root/schemas

The resulting SchemaEntity objects from the two examples above would have the following paths:

root/schemas/org/sagebionetworks/entity/type/Example.json
root/schemas/org/sagebionetworks/annotation/types/VertebrateOrganType.json

Folder entities will be created as needed to create each path. By giving each SchemaEntity a unique path, we can use this path to reference a schema before we have an entity to represent it.
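The path mapping can be sketched as follows. The helper names `schema_entity_path` and `folders_to_create` are hypothetical, and the resource paths are assumed to be relative to the root of the jar; the actual behavior of the EntityBootstrapper is not specified beyond the description above.

```python
import posixpath

SCHEMA_ROOT = "root/schemas"


def schema_entity_path(resource_path):
    """Map a schema's resource path inside the jar to its SchemaEntity path."""
    return posixpath.join(SCHEMA_ROOT, resource_path.lstrip("/"))


def folders_to_create(resource_paths):
    """Return every folder needed under SCHEMA_ROOT, parents before children."""
    folders = set()
    for resource_path in resource_paths:
        folder = posixpath.dirname(schema_entity_path(resource_path))
        # Walk upward until we step above the schema root.
        while folder.startswith(SCHEMA_ROOT):
            folders.add(folder)
            folder = posixpath.dirname(folder)
    # A parent path is a prefix of its children, so it sorts first.
    return sorted(folders)
```

For the two example schemas, this produces the two SchemaEntity paths listed above and a folder list that starts at root/schemas and descends through org/sagebionetworks into the entity/type and annotation/types folders.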

The API user will be able to get the SchemaEntity objects, but they will be READ-ONLY copies. This is important because the "truth" of each entity is the JSON text file from the lib-auto-generated project. Hopefully, this will make more sense as the rest of the life-cycle is outlined.

Edit of a Schema

Imagine that we want to add a new primary field to our Example.json Entity. To do this, we need to modify the original JSON file in the lib-auto-generated project:

/lib-auto-generated/src/main/resources/org/sagebionetworks/entity/type/Example.json

We want to add a new required primary field called "status". Since "status" is required, we must provide a default value. This is a requirement because we already have instances of Example entities deployed to Synapse, and each of these must be given a default value. We will cover how these default values are applied shortly. Here is our new Example.json:
Example.json

{
    "extends":"org/sagebionetworks/entity/type/Entity.json",
    "name":"Example",
    "properties":{
        "id":{
            "type":"number",
            "description":"Example identifier",
            "required":true
        },
        "name":{
            "description":"Name of the Example",
            "type":"string",
            "required":true
        },
        "organ":{
            "$ref":"org/sagebionetworks/annotation/types/VertebrateOrganType.json"
        },
        "status":{
            "type":"string",
            "required":true,
            "enum":[
                "PROTOTYPE",
                "RELEASED",
                "RECALLED",
                "DEPRECIATED"
            ],
            "default":"PROTOTYPE"
        }
    }
}

Compile POJOs (Nth Time)

This time, when we compile the new Example.java POJO, the resulting POJO will have a new field called "status" with a default value of "PROTOTYPE".

Backup Deployed Synapse

Before we can deploy our updated schema, we must create a backup of the deployed Synapse. See: Repository+Administration
This is an important step. We will use this backup to deploy our changes to the repository.

Synapse Deploy (Nth Time)

Just like before, the bootstrap system will pre-populate all SchemaEntity objects on the new, empty repository. At this point we have an empty Synapse that is up-to-date with regard to the current schema.

Restore Synapse from Backup

After we have a clean repository, we can restore the backup from the earlier step. See: Repository+Administration

The restore daemon will start off by deleting all of the data in Synapse. It will then restore all entities, including the SchemaEntity objects. One of the main tasks of the restore daemon is to migrate data to the current version during the restoration process. This means we need to detect that a new property was added to the Example.json schema, and ensure that migrated Example entities have this new field with the default value.
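The migration step described here can be sketched as a comparison of the old and new schema followed by filling in defaults. The `added_properties` and `migrate_entity` helpers are hypothetical and assume entities are available as plain dictionaries; how the restore daemon would actually perform this is not defined in this proposal.

```python
import copy

# Trimmed versions of the old and new Example schema from this document.
OLD_SCHEMA = {
    "properties": {
        "id": {"type": "number", "required": True},
        "name": {"type": "string", "required": True},
    }
}
NEW_SCHEMA = {
    "properties": {
        "id": {"type": "number", "required": True},
        "name": {"type": "string", "required": True},
        "status": {"type": "string", "required": True, "default": "PROTOTYPE"},
    }
}


def added_properties(old_schema, new_schema):
    """Return the properties present in the new schema but not in the old one."""
    old = old_schema.get("properties", {})
    new = new_schema.get("properties", {})
    return {name: prop for name, prop in new.items() if name not in old}


def migrate_entity(entity, old_schema, new_schema):
    """Return a copy of the entity with defaults applied for newly added properties."""
    migrated = copy.deepcopy(entity)
    for name, prop in added_properties(old_schema, new_schema).items():
        if name in migrated:
            continue  # never overwrite data the entity already has
        if "default" in prop:
            migrated[name] = copy.deepcopy(prop["default"])
        elif prop.get("required"):
            # A new required property without a default cannot be migrated.
            raise ValueError(f"new required property {name!r} has no default")
    return migrated
```

Under these assumptions, an Example entity backed up under the old schema comes out of migration with "status" set to "PROTOTYPE", and a schema change that adds a required property without a default is rejected rather than silently producing invalid entities.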

Once all data has been migrated to the current schema, the old SchemaEntity objects can be replaced using the new JSON schemas from the lib-auto-generated.jar.