Introduction

Annotating files with metadata is a critical part of the data curation process. The metadata identifies key aspects of a file, such as its type, its origin, and who or what the data is about. Properly annotated files can be categorized for discovery and navigation. File metadata also serves as a guidepost for data consumers such as data processing pipelines and scientific analyses.

The Synapse platform has had APIs for managing file metadata in the form of key-value pair annotations since its first release. With the addition of Synapse File Views, data managers can set up tabular views of their annotated files. These file views serve as the core of the faceted file navigation in most of the Synapse portals and UIs. Data managers can set the schemas of their views by defining columns with names and types, and by identifying navigation-critical columns as facets. When Synapse builds a view, it attempts to match the annotation key-value pairs on the files to the columns defined in the view. As long as the files are annotated as expected, the view can be built. However, Synapse does not have a system that allows data managers to control how files are annotated. They can set a schema on a view, but they cannot set the schema of the file metadata. (This is a key limitation for a Data Coordination Center (DCC) using Synapse as its data contribution platform, because a DCC cannot currently require data contributors to correctly annotate the files they upload.) Currently, a data manager must develop external processes for any type of file metadata control. The purpose of this project is to bridge this gap and provide data managers with API services for controlling the file metadata on their projects.

Use cases

The metadata working group created the following document, which outlines and prioritizes the main use cases: https://docs.google.com/spreadsheets/d/1UgY3JR6WlHVgHkGTzs1wUpV5E7QFuOpwC3ph1e8-zl4/edit#gid=1371550427

User Types

Note: Each user type represents a broad category of Metadata Platform Personas:

| User Type | Possible Personas |
| --- | --- |
| Project Designers | DCC Data Curator / Community Manager, DCC Project Owner, DCC Governance, Data Portal Owner |
| Metadata Contributors | Center Data Liaison, Center Principal Investigator, Wet Lab Scientist, Clinical Research Coordinator, Data Processor (Bioinformatician), DCC Working Group (Participant) |
| Metadata Consumers | Data Engineer, Data Consumer - Programmatic, Data Consumer - Non-Programmatic, Funder. Note: All personas are metadata consumers. |

Phases

While the use cases cover a broad range of desired features and functionality, all of the use cases depend on a common foundation of three elements: defining and binding schemas, metadata management/validation, and navigation/view query.

All other features/functionality from the use cases will be extensions/additions to these foundational elements. Note: Each subsequent foundational element depends on the previous element. Therefore, the plan is to implement the three foundational elements in phases, starting with phase one: defining and binding schemas. Metadata management/validation will be implemented in phase two, while navigation/view query will be implemented in phase three.

Phase One

What is a schema in this context?

The basic building block of a schema is a single field. The field’s definition includes a name and a data type. For example, the schema of a person’s birthday might have a name of “birthday” and a type of “date”. These two pieces of information could be enough to construct a simple form that asks a user to provide their birthday. These simple fields are analogous to the ColumnModel used to define the schema of Synapse tables/views. The simplest representation of a schema would then be a flat list of these fields, again like the schemas of Synapse tables/views.

However, a flat list of fields is not a great fit for real world metadata. We will use an example to illustrate the subtle complexity inherent in real world metadata.

In this example a project designer would like to define a schema for annotating photographs of people’s pets. The following diagram shows the basic relationship of our schema elements:

In this example, any pet is expected to have a field for name, birthday, and type. A photo of a cat or dog would be expected to have all of the fields common to any pet (name, birthday, type) but would also have a field for breed (note that breed options differ between cats and dogs). You could also imagine additional fields specific to cats or dogs. There might even be sub-types based on breed, each with its own set of additional fields.

If we do not allow project designers to model class-like relationships in their schema and instead only support flat schemas we create several problems:

Finally, even though types need to be defined and maintained as class-like relationships, the types can still be presented to the contributors and consumers as a flat list. For example, a contributor providing metadata for a cat photo would see the flat list of: name, birthday, type, and breed (cat breed). The contributor does not need to understand that Cat extends Pet in order to provide a flat list of values.
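The flattening described above can be sketched in a few lines of Python. This is only an illustration: the `TYPES` dictionary and the `flatten_fields` helper are hypothetical names invented for this example and are not part of any Synapse API.

```python
# Hypothetical in-memory model of the pet example (illustrative names only).
TYPES = {
    "Pet": {"extends": None, "fields": ["name", "birthday", "type"]},
    "Cat": {"extends": "Pet", "fields": ["breed"]},
    "Dog": {"extends": "Pet", "fields": ["breed"]},
}


def flatten_fields(type_name, types=TYPES):
    """Return the inherited fields followed by the type's own fields."""
    definition = types[type_name]
    parent = definition["extends"]
    inherited = flatten_fields(parent, types) if parent else []
    # Skip any field the parent already contributed so the list stays flat.
    own = [field for field in definition["fields"] if field not in inherited]
    return inherited + own


cat_fields = flatten_fields("Cat")
print(cat_fields)  # ['name', 'birthday', 'type', 'breed']
```

A contributor filling in metadata for a cat photo only ever sees the resulting flat list; the Cat-extends-Pet relationship stays with the project designer.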

Which schema specification should we adopt?

We selected the JSON schema specification for two main reasons:

Note: We adopted the older draft-07 instead of the latest (draft 2019-09), as most implementations support the older version.

Name spacing

When defining types, it is often desirable to give a type a short, human-readable name. The name ‘breed’ from above is an example of a nice, short name. However, in this example we have two different definitions for breed, one for cats and another for dogs. It is also possible that another user will want to use the same short name. Most programming languages solve this type of problem with a name space scheme. The name spacing adopted for Synapse model objects is composed of two parts:

{organization_name}-{schema_name}

For example, here is the full name of a Synapse FileEntity:

org.sagebionetworks-repo.model.FileEntity

Note: The schema name can include additional path information (separated by dots not slashes), and not just the simple name of the file.

Since each full name is prefixed by a globally unique organization name (within Synapse), the full name is also globally unique and is the key used to reference other types. Note: Each type will be visible to everyone. We will cover create/update type permissions in the next section. Type reference will be covered in more detail in a later section.

Organizations

The first step in creating new type extensions in Synapse will be to set up and configure an organization. The name of the organization will serve as the root for each type managed by that organization. The organization name ‘org.sagebionetworks’ is reserved for the core Synapse model objects. Each organization will also have an Access Control List (ACL) that will control who can add type definitions to the organization. All types created under an Organization will be considered publicly readable and reference-able.

Organization REST APIs

Most of the Organization related APIs have now been implemented, see: JSON Schema Services.

Type Definitions

We will continue to use the pet photo example to illustrate how to define and extend types in Synapse. In this example we assume we have already created our Organization object (see above), with an organization name of ‘my.organization’.

In our photo example, we have three types: Pet, Cat, and Dog. Since both Cat and Dog extend the definition of Pet, we will start by defining the Pet type:

{
	"$schema": "http://json-schema.org/draft-07/schema",
	"$id": "my.organization-pets.Pet-1.0.3",
	"description": "Base type information shared by all pet types.",
	"allOf": [
		{
			"$ref": "org.sagebionetworks-repo.model.FileEntity-1.0.0"
		}
	],
	"properties": {
		"petName": {
			"type": "string",
			"description": "The name of the pet shown in the photo."
		},
		"birthday": {
			"description": "The birthday of the pet shown in the photo.",
			"type": "string",
			"format": "date-time"
		},
		"petType": {
			"$ref": "my.organization-pets.PetType-1.0.1"
		}
	}
}

Pet.json

After ‘$schema’, the first attribute of Pet.json is ‘$id’, which is the unique identifier for our pet schema definition. We will need the ‘$id’ value any time we want to reference our pet schema. According to the specification, the ‘$id’ must be a valid URI. The value of ‘my.organization-pets.Pet-1.0.3’ is a shortcut for the full URL using the following pattern:

https://repo-prod.prod.sagebase.org/repo/v1/schema/type/registered/{organizationName}-{schemaName}-{semanticVersion}

Following this pattern the full URL then becomes:

https://repo-prod.prod.sagebase.org/repo/v1/schema/type/registered/my.organization-pets.Pet-1.0.3

The full pseudo-BNF for the $id can be found here. We will use this $id value to reference this type when we extend it to define both the Cat and Dog types shortly. Note: The $id ends with the semantic version ‘1.0.3’, which we will cover in more detail in the version section.
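As a rough illustration of the identifier pattern, the following Python sketch splits an $id into its parts and expands it to the full URL. The regular expression is an assumed simplification (organization name, a hyphen, schema name, and an optional semantic version), not the official pseudo-BNF, and `parse_schema_id` and `to_full_url` are hypothetical helper names.

```python
import re

# Assumed simplification of the $id grammar; the official pseudo-BNF may differ.
ID_PATTERN = re.compile(
    r"^(?P<organization>[^-]+)-(?P<schema>[^-]+)(?:-(?P<version>\d+\.\d+\.\d+))?$"
)

# Base URL taken from the pattern shown above.
BASE_URL = "https://repo-prod.prod.sagebase.org/repo/v1/schema/type/registered/"


def parse_schema_id(schema_id):
    """Split a short $id into organization, schema name, and optional version."""
    match = ID_PATTERN.match(schema_id)
    if match is None:
        raise ValueError(f"not a recognized schema $id: {schema_id!r}")
    return match.groupdict()


def to_full_url(schema_id):
    """Expand a short $id into the full URL form."""
    parse_schema_id(schema_id)  # raises ValueError for malformed ids
    return BASE_URL + schema_id


print(parse_schema_id("my.organization-pets.Pet-1.0.3"))
print(to_full_url("my.organization-pets.Pet-1.0.3"))
```

An $id without a version suffix, such as ‘my.organization-pets.cat.Breed’, parses with a version of None.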

The ‘title’ attribute, which Pet.json omits but the later examples (such as PetType.json) include, is what will be displayed to metadata providers when they work with a type. The title is not part of the identifier and does not need to match any part of the identifier. Unlike the $id, the title can contain any characters, including spaces.

The ‘description’ attribute is used to provide helpful information to metadata providers and metadata consumers.

The next section is an array attribute named ‘allOf’, which we use to list the one or more type schemas that we wish to extend. In this example, the Pet type extends a Synapse FileEntity identified by the:

"$ref": "org.sagebionetworks-repo.model.FileEntity-1.0.0"

Designers will be able to extend types that they define or types defined by others. Initially, all user-defined types must either directly or indirectly extend a Synapse FileEntity, Folder, or Project. In the future, we plan to support the extension of other types of Synapse objects including, but not limited to, Evaluation Submissions and Access Requirements.

The next section of the Pet.json file is the ‘properties’ attribute. This properties object is where each field of a type is defined. Each property has a name, a type, and a description. The name of a property determines the final key of the resulting annotation key-value pair.

The type of the property controls the allowable values for the field. For example, a type of ‘number’ can only contain characters that represent a number such as digits, dot, signs, and exponential expressions.

For all of the supported types, see the API docs: Types.

In the Pet.json file the property ‘petType’ is an example of a reference to another type. The following is the full definition of the PetType enumeration:

{
	"$schema": "http://json-schema.org/draft-07/schema",
	"title": "Pet Type",
	"$id": "my.organization-pets.PetType-1.0.1",
	"type": "string",
	"description": "Identifies the type of pet shown in photo.",
	"enum": [
		"cat",
		"dog",
		"fish"
	]
}

PetType.json

Specifically, ‘petType’ is an enumeration that limits values to those listed in its definition, which includes both ‘cat’ and ‘dog’. Since ‘petType’ is a type definition, it also has an ‘$id’ and can be referenced in other types.

Now that we have defined the abstract Pet type, we can start to define the concrete types of Cat and Dog. A key part of both Cat and Dog is the ‘breed’ enumeration, each with its own distinct values. We will define these two enumerations separately.

{
	"$schema": "http://json-schema.org/draft-07/schema",
	"title": "Cat Breed",
	"$id": "my.organization-pets.cat.Breed",
	"description": "Enumeration of possible cat breeds.",
	"type": "string",
	"enum": [
		"Siamese",
		"Persian",
		"Maine Coon",
		"Ragdoll",
		"American Shorthair"
	]
}

CatBreed.json

{
	"$schema": "http://json-schema.org/draft-07/schema",
	"title": "Dog Breed",
	"$id": "my.organization-pets.dog.Breed",
	"description": "Enumeration of possible dog breeds.",
	"type": "string",
	"enum": [
		"Labrador Retriever",
		"German Shepherd",
		"Golden Retriever",
		"Bulldog",
		"Beagle"
	]
}

DogBreed.json

Both enumerations have the same name, ‘Breed’, but different path values, so the ID of each enumeration is still unique. With these enumeration types defined, we can define both the Cat and Dog types:

{
	"$schema": "http://json-schema.org/draft-07/schema",
	"title": "Cat",
	"$id": "my.organization-pets.cat.Cat",
	"description": "...",
	"allOf": [
		{
			"$ref": "my.organization-pets.Pet"
		}
	],
	"properties": {
		"breed": {
			"$ref": "my.organization-pets.cat.Breed"
		},
		"petType": {
			"const": "cat"
		}
	}
}

Cat.json

{
	"$schema": "http://json-schema.org/draft-07/schema",
	"title": "Dog",
	"$id": "my.organization-pets.dog.Dog",
	"description": "...",
	"allOf": [
		{
			"$ref": "my.organization-pets.Pet-1.0.3"
		}
	],
	"properties": {
		"breed": {
			"$ref": "my.organization-pets.dog.Breed"
		},
		"petType": {
			"const": "dog"
		}
	}
}

Dog.json

The definitions for both Cat and Dog extend our previously defined Pet type ($ref: my.organization-pets.Pet). This means both Cat and Dog will include all of the attributes defined in Pet. Note: The Dog.json reference includes a semantic version (1.0.3), while the Cat.json reference does not. We will discuss versions in a later section.

Note: Both Cat.json and Dog.json override the “petType” attribute (from Pet.json) with a constant to lock down the type that is appropriate. This means a valid cat must have “petType”: “cat” and a valid dog must have “petType”: “dog”.

Finally, both Cat and Dog include an attribute named ‘breed’, each referencing the enumeration appropriate to its type.

Now that we have defined our supported pet schemas (cat and dog), we are ready to put all of the parts together in a single schema definition:

{
	"$schema": "http://json-schema.org/draft-07/schema",
	"title": "Pet Photo",
	"$id": "my.organization-pets.PetPhoto",
	"description": "...",
	"oneOf": [
		{
			"$ref": "my.organization-pets.cat.Cat"
		},
		{
			"$ref": "my.organization-pets.dog.Dog"
		}
	]
}

PetPhoto.json

The first new term in PetPhoto.json is the ‘oneOf’ array. This array states that, for a pet photo to be valid, it must be valid against exactly one of the listed sub-schemas.

In summary, the PetPhoto.json schema encapsulates everything we have defined thus far and even outlines which sub-schema should be used under various conditions.

Alternative: Conditional Logic

One alternative to the base type (Pet.json) and extension types (Cat.json & Dog.json) is the use of JSON schema conditional logic: if/then/else. The following example is semantically equivalent to the combination of Pet.json, Cat.json, Dog.json, and PetPhoto.json, all in a single JSON schema:

{
	"$schema": "http://json-schema.org/draft-07/schema",
	"$id": "my.organization-pets.ConditionalAlternative",
	"description": "Conditional Alternative to base types with extensions.",
	"properties": {
		"petName": {
			"type": "string",
			"description": "The name of the pet shown in the photo."
		},
		"birthday": {
			"description": "The birthday of the pet shown in the photo.",
			"type": "string",
			"format": "date-time"
		},
		"petType": {
			"enum": [
				"cat",
				"dog",
				"fish"
			]
		}
	},
	"allOf": [
		{
			"$ref": "org.sagebionetworks-repo.model.FileEntity-1.0.0"
		},
		{
			"if": {
				"properties": {
					"petType": {
						"const": "cat"
					}
				}
			},
			"then": {
				"properties": {
					"breed": {
						"description": "Enumeration of possible cat breeds.",
						"type": "string",
						"enum": [
							"Siamese",
							"Persian",
							"Maine Coon",
							"Ragdoll",
							"American Shorthair"
						]
					}
				}
			}
		},
		{
			"if": {
				"properties": {
					"petType": {
						"const": "dog"
					}
				}
			},
			"then": {
				"properties": {
					"breed": {
						"description": "Enumeration of possible dog breeds.",
						"type": "string",
						"enum": [
							"Labrador Retriever",
							"German Shepherd",
							"Golden Retriever",
							"Bulldog",
							"Beagle"
						]
					}
				}
			}
		}
	]
}

ConditionalAlternative.json
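To make the if/then semantics concrete, here is a minimal Python sketch of how the two conditional blocks behave. It is a hand-written approximation for this one schema, not a general JSON schema evaluator, and the names `RULES`, `condition_holds`, `consequence_holds`, and `passes_conditionals` are hypothetical.

```python
CAT_BREEDS = {"Siamese", "Persian", "Maine Coon", "Ragdoll", "American Shorthair"}
DOG_BREEDS = {"Labrador Retriever", "German Shepherd", "Golden Retriever",
              "Bulldog", "Beagle"}

# Each rule pairs an "if" (required constant values) with a "then"
# (allowed values per property), mirroring the two blocks in the schema.
RULES = [
    ({"petType": "cat"}, {"breed": CAT_BREEDS}),
    ({"petType": "dog"}, {"breed": DOG_BREEDS}),
]


def condition_holds(metadata, condition):
    # As with "properties" + "const", a missing key does not fail the "if".
    return all(metadata.get(key, value) == value for key, value in condition.items())


def consequence_holds(metadata, allowed):
    # Only properties that are present are checked against their enum.
    return all(key not in metadata or metadata[key] in values
               for key, values in allowed.items())


def passes_conditionals(metadata):
    """True when every rule whose "if" matches also satisfies its "then"."""
    return all(consequence_holds(metadata, allowed)
               for condition, allowed in RULES
               if condition_holds(metadata, condition))


print(passes_conditionals({"petType": "cat", "breed": "Siamese"}))  # True
print(passes_conditionals({"petType": "dog", "breed": "Siamese"}))  # False
```

Note that a ‘fish’ matches neither condition, so no breed constraint applies to it in this sketch.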

Schema Versions

A new JSON schema is created by calling: POST /schema/type/create/async/start. The same call is also used to make changes to an existing JSON schema. However, the exact impact of a JSON schema change depends on the inclusion or exclusion of the optional semantic version suffix in the schema $id.

When a semantic version is included in a schema $id, its value serves as a human-readable reference to a specific schema change. The semantic version value must follow the rules of Semantic Versioning, with a major, minor, and patch number. If a semantic version is included in a JSON schema $id, its value must be unique within the schema $id space.

We will use the examples from above to illustrate exactly what it means when a semantic version is included in a schema $id.

We included a semantic version suffix in the $id of Pet.json (1.0.3) but not in most of the other types (PetType.json, at 1.0.1, is the exception). By including a semantic version in Pet.json, we made it possible for other schemas to reference that specific version. For example, the Dog $ref to the Pet schema includes the 1.0.3 semantic version, while the Cat $ref to Pet excludes the semantic version. This means the Dog type is locked to an explicit version (1.0.3) of the Pet type, while the Cat type will always reference the latest version of the Pet schema. For example, if we were to add a new property to the Pet schema and bump its semantic version to 1.0.4, the Cat type would include the new property but the Dog type would not (Dog is locked to Pet version 1.0.3). If we wanted to include the new Pet attribute in Dog, we would need to update the Dog type definition to either reference Pet 1.0.4 or remove the semantic version from the reference.

If a JSON schema is created with a semantic version in its $id, that version is immutable. Any attempt to update that version of a schema will be rejected. However, it is possible to delete a specific semantic version of a schema using DELETE /schema/type/registered/{organizationName}-{schemaName}-{semanticVersion} if there are no references to that version.

Creating a schema without a semantic version means only the latest version is maintained. Each update to a schema that excludes a semantic version will simply replace any existing non-versioned copy of that schema. If you wish to maintain the full history of your schemas, you will need to provide a semantic version in the $id of the schema for all updates.
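The versioning rules in this section can be modeled with a small in-memory registry. This Python sketch is a toy model under stated assumptions (in particular, that an unversioned reference resolves to the most recently registered copy); the `SchemaRegistry` class is hypothetical and does not reflect how Synapse stores schemas.

```python
class SchemaRegistry:
    """Toy in-memory model of the versioning rules described above."""

    def __init__(self):
        self._versions = {}  # (name, semantic version) -> schema
        self._latest = {}    # name -> most recently registered schema

    def put(self, name, schema, version=None):
        if version is not None:
            if (name, version) in self._versions:
                # A schema created with a semantic version is immutable.
                raise ValueError(f"{name}-{version} is immutable")
            self._versions[(name, version)] = schema
        # Assumption: versioned or not, the newest copy becomes "latest".
        self._latest[name] = schema

    def resolve(self, name, version=None):
        """Resolve a reference with or without a semantic version."""
        if version is None:
            return self._latest[name]
        return self._versions[(name, version)]


registry = SchemaRegistry()
registry.put("my.organization-pets.Pet", {"properties": ["petName"]}, "1.0.3")
registry.put("my.organization-pets.Pet", {"properties": ["petName", "weight"]}, "1.0.4")

# Cat references Pet without a version, so it sees the newest definition.
print(registry.resolve("my.organization-pets.Pet"))
# Dog references Pet-1.0.3, so it stays locked to the older definition.
print(registry.resolve("my.organization-pets.Pet", "1.0.3"))
```

The ‘weight’ property is an invented example of a new Pet attribute added in 1.0.4.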

With the completion of phase one, the basic create, read, update, delete (CRUD) APIs have been implemented. See: JSON Schema Services for more details.

Schema to Entity Binding

Once a JSON schema has been created, it can be bound to a Project, Folder or File Entity. The bound schema will be used to validate the Entity's metadata (annotations). Any child Entity will inherit its parent’s bound schema, unless a schema is explicitly bound to that child.

A simple algorithm is used to determine which, if any, schema is bound to an Entity:

For example, if you bind a schema to a Project, and there are no other schemas bound to any Entity in that Project, then all Folders and Files in that Project will inherit the schema bound to the Project. The Project-level binding can be overridden by binding another schema to a Folder; the Folder and all of its children (Files and Folders) will then inherit the schema bound to the Folder instead of the schema bound to the parent Project.
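The inheritance rule described above amounts to a walk from the Entity up through its ancestors until a binding is found. The following Python sketch illustrates that idea only; the entity names, the ‘my.organization-other.FolderSchema’ identifier, and the `find_bound_schema` helper are hypothetical and not taken from the Synapse implementation.

```python
def find_bound_schema(entity_id, parents, bindings):
    """Return the schema bound to the nearest entity on the path to the root.

    parents maps each entity id to its parent id (None for a Project);
    bindings maps an entity id to the $id of the schema bound directly to it.
    """
    current = entity_id
    while current is not None:
        if current in bindings:
            return bindings[current]
        current = parents.get(current)
    return None  # no schema is bound anywhere on the path


# Hypothetical hierarchy: project -> folderA -> file1, and project -> file2.
parents = {"project": None, "folderA": "project",
           "file1": "folderA", "file2": "project"}
bindings = {
    "project": "my.organization-pets.PetPhoto",
    "folderA": "my.organization-other.FolderSchema",  # overrides the project binding
}

print(find_bound_schema("file2", parents, bindings))  # my.organization-pets.PetPhoto
print(find_bound_schema("file1", parents, bindings))  # my.organization-other.FolderSchema
```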

Only a single schema can be bound to an Entity at a time. If you have more than one schema to bind to an Entity, you will need to create and bind a single composition schema that uses keywords such as ‘anyOf’, ‘allOf’, or ‘oneOf’ to define how the schemas should be used for validation. The PetPhoto.json from above is an example of a composition schema that could be bound to an Entity. The PetPhoto.json schema states that a valid Entity must match either Cat or Dog (from the ‘oneOf’ designation). In the PetPhoto.json example, both Cat and Dog transitively extend FileEntity (via Pet) and would therefore apply to FileEntities. To declare a schema for Folders, we would create a schema that extends Folder and add that schema to the ‘oneOf’ definition.

There are three APIs for binding schemas to Entities:

Phase Two

The goal of phase two is to support the validation of Entities based on bound JSON schemas.

Basic validation involves two parts: the JSON schema that defines how an object should be validated, and the actual JSON of the object to be validated. A JSON schema validation library takes both parts as input parameters and produces validation information. When the JSON object is found to be invalid, additional information is typically provided to help users understand why it is invalid.

Continuing with the Pets example above, let's assume that the PetPhoto.json schema has been bound to a sample project. Since there are no other bound schemas in the sample project, the PetPhoto schema will apply to all files and folders in the project.

For example, a photo of a cat uploaded to this example project might have the following metadata represented as a JSON object:

{
	"name": "Charity-jumping.png",
	"description": "A cat with things to do! She is too busy to stand still for more than a few seconds.",
	"id": "syn123",
	"etag": "some-etag",
	"createdOn": "2020-05-20T20:20:39+00:00",
	"modifiedOn": "2020-05-20T20:20:39+00:00",
	"createdBy": "123456789",
	"modifiedBy": "123456789",
	"parentId": "syn444",
	"versionLabel": "one",
	"versionComment": "leaving blank",
	"versionNumber": 1,
	"dataFileHandleId": "98765",
	"fileNameOverride": "",
	"concreteType": "org.sagebionetworks.repo.model.FileEntity",
	"petName": "Charity",
	"birthday": "2016-09-10T20:20:39+00:00",
	"petType": "cat",
	"breed": "American Shorthair"
}

Charity.json

Validation Schema

Our goal is to determine whether the file metadata in Charity.json is valid according to the PetPhoto.json schema, which states that it must match either the Cat or Dog schema. The validation code must consider each schema and all of its direct and indirect dependencies. Before Synapse can validate an object against a schema, it must first compile a ‘validation’ schema. The process of compiling a validation schema involves de-referencing all references ($ref) such that each reference resolves locally. Specifically, a copy of each referenced schema is added to the ‘definitions’ section, and all $refs are changed to point to the appropriate local copy. The resulting ‘validation’ schema is completely self-contained, making it easy for any third-party validation library to consume. You can fetch the compiled ‘validation’ schema for any JSON schema registered with Synapse using the POST /schema/type/validation/async/start asynchronous API. The following is an example of the compiled ‘validation’ schema for the PetPhoto schema.

{
     "$schema": "http://json-schema.org/draft-07/schema",
     "$id": "my.organization-pets.PetPhoto",
     "title": "Pet Photo",
     "description": "...",
     "oneOf": [
          {"$ref": "#/definitions/my.organization-pets.cat.Cat"},
          {"$ref": "#/definitions/my.organization-pets.dog.Dog"}
     ],
     "definitions": {
          "my.organization-pets.cat.Breed": {
               "$schema": "http://json-schema.org/draft-07/schema",
               "$id": "my.organization-pets.cat.Breed",
               "type": "string",
               "title": "Cat Breed",
               "description": "Enumeration of possible cat breeds.",
               "enum": [
                    "Siamese",
                    "Persian",
                    "Maine Coon",
                    "Ragdoll",
                    "American Shorthair"
               ]
          },
          "my.organization-pets.PetType": {
               "$schema": "http://json-schema.org/draft-07/schema",
               "$id": "my.organization-pets.PetType",
               "type": "string",
               "title": "Pet Type",
               "description": "Identifies the type of pet shown in photo.",
               "enum": [
                    "cat",
                    "dog",
                    "fish"
               ]
          },
          "org.sagebionetworks-repo.model.Entity": {
               "$schema": "http://json-schema.org/draft-07/schema#",
               "$id": "org.sagebionetworks-repo.model.Entity-1.0.0",
               "type": "object",
               "properties": {
                    "name": {
                         "type": "string",
                         "title": "Name",
                         "description": "The name of this entity.  Must be 256 characters or less."
                    },
                    "description": {
                         "type": "string",
                         "title": "Description",
                         "description": "The description of this entity.  Must be 1000 characters or less."
                    },
                    "id": {
                         "type": "string",
                         "description": "The unique immutable ID for this entity.  A new ID will be generated for new Entities.  Once issued, this ID is guaranteed to never change or be re-issued"
                    },
                    "etag": {
                         "type": "string",
                         "description": "Synapse employs an Optimistic Concurrency Control (OCC) scheme to handle concurrent updates. Since the E-Tag changes every time an entity is updated it is used to detect when a client's current representation of an entity is out-of-date."
                    },
                    "createdOn": {
                         "type": "string",
                         "title": "Created On",
                         "description": "The date this entity was created.",
                         "format": "date-time"
                    },
                    "modifiedOn": {
                         "type": "string",
                         "title": "Modified On",
                         "description": "The date this entity was last modified.",
                         "format": "date-time"
                    },
                    "createdBy": {
                         "type": "string",
                         "title": "Created By",
                         "description": "The ID of the user that created this entity."
                    },
                    "modifiedBy": {
                         "type": "string",
                         "title": "Modified By",
                         "description": "The ID of the user that last modified this entity."
                    },
                    "parentId": {
                         "type": "string",
                         "description": "The ID of the Entity that is the parent of this Entity."
                    },
                    "concreteType": {
                         "type": "string",
                         "description": "Indicates which implementation of Entity this object represents. It should be set to one of the following: org.sagebionetworks.repo.model.Project, org.sagebionetworks.repo.model.Folder, or org.sagebionetworks.repo.model.FileEntity."
                    }
               },
               "description": "This is the base interface that all Entities implement."
          },
          "org.sagebionetworks-repo.model.Versionable": {
               "$schema": "http://json-schema.org/draft-07/schema#",
               "$id": "org.sagebionetworks-repo.model.Versionable-1.0.0",
               "type": "object",
               "properties": {"versionNumber": {
                    "type": "number",
                    "description": "The version number issued to this version on the object."
               }},
               "description": "JSON schema for Versionable interface"
          },
          "org.sagebionetworks-repo.model.VersionableEntity": {
               "$schema": "http://json-schema.org/draft-07/schema#",
               "$id": "org.sagebionetworks-repo.model.VersionableEntity-1.0.0",
               "type": "object",
               "properties": {
                    "versionLabel": {
                         "type": "string",
                         "title": "Version",
                         "description": "The version label for this entity"
                    },
                    "versionComment": {
                         "type": "string",
                         "title": "Version Comment",
                         "description": "The version comment for this entity"
                    }
               },
               "description": "JSON schema for Versionable interface",
               "allOf": [
                    {
                         "$ref": "#/definitions/org.sagebionetworks-repo.model.Entity"
                    },
                    {
                         "$ref": "#/definitions/org.sagebionetworks-repo.model.Versionable"
                    }
               ]
          },
          "org.sagebionetworks-repo.model.FileEntity": {
               "$schema": "http://json-schema.org/draft-07/schema#",
               "$id": "org.sagebionetworks-repo.model.FileEntity-1.0.0",
               "properties": {
                    "dataFileHandleId": {
                         "type": "string",
                         "description": "ID of the file associated with this entity."
                    },
                    "fileNameOverride": {
                         "type": "string",
                         "description": "An optional replacement for the name of the uploaded file.  This is distinct from the entity name.  If omitted the file will retain its original name."
                    },
                    "concreteType": {
                         "type": "string",
                         "const": "org.sagebionetworks.repo.model.FileEntity"
                    }
               },
               "title": "File",
               "description": "JSON schema for File POJO",
               "allOf": [{
                    "$ref": "#/definitions/org.sagebionetworks-repo.model.VersionableEntity",
                    "properties": {"concreteType": {"type": "string"}}
               }]
          },
          "my.organization-pets.Pet": {
               "$schema": "http://json-schema.org/draft-07/schema",
               "$id": "my.organization-pets.Pet-1.0.3",
               "properties": {
                    "petName": {
                         "type": "string",
                         "description": "The name of the pet shown in the photo."
                    },
                    "birthday": {
                         "type": "string",
                         "description": "The birthday of the pet shown in the photo.",
                         "format": "date-time"
                    },
                    "petType": {"$ref": "#/definitions/my.organization-pets.PetType"}
               },
               "description": "Base type information shared by all pet types.",
               "allOf": [{"$ref": "#/definitions/org.sagebionetworks-repo.model.FileEntity"}]
          },
          "my.organization-pets.cat.Cat": {
               "$schema": "http://json-schema.org/draft-07/schema",
               "$id": "my.organization-pets.cat.Cat",
               "properties": {
                    "breed": {"$ref": "#/definitions/my.organization-pets.cat.Breed"},
                    "petType": {"const": "cat"}
               },
               "title": "Cat",
               "description": "...",
               "allOf": [{"$ref": "#/definitions/my.organization-pets.Pet"}]
          },
          "my.organization-pets.dog.Breed": {
               "$schema": "http://json-schema.org/draft-07/schema",
               "$id": "my.organization-pets.dog.Breed",
               "type": "string",
               "title": "Dog Breed",
               "description": "Enumeration of possible dog breeds.",
               "enum": [
                    "Labrador Retriever",
                    "German Shepherd",
                    "Golden Retriever",
                    "Bulldog",
                    "Beagle"
               ]
          },
          "my.organization-pets.Pet-1.0.3": {
               "$schema": "http://json-schema.org/draft-07/schema",
               "$id": "my.organization-pets.Pet-1.0.3",
               "properties": {
                    "petName": {
                         "type": "string",
                         "description": "The name of the pet shown in the photo."
                    },
                    "birthday": {
                         "type": "string",
                         "description": "The birthday of the pet shown in the photo.",
                         "format": "date-time"
                    },
                    "petType": {"$ref": "#/definitions/my.organization-pets.PetType"}
               },
               "description": "Base type information shared by all pet types.",
               "allOf": [{"$ref": "#/definitions/org.sagebionetworks-repo.model.FileEntity"}]
          },
          "my.organization-pets.dog.Dog": {
               "$schema": "http://json-schema.org/draft-07/schema",
               "$id": "my.organization-pets.dog.Dog",
               "properties": {
                    "breed": {"$ref": "#/definitions/my.organization-pets.dog.Breed"},
                    "petType": {"const": "dog"}
               },
               "title": "Dog",
               "description": "...",
               "allOf": [{"$ref": "#/definitions/my.organization-pets.Pet-1.0.3"}]
          }
     }
}

PetPhoto.json-Validation Schema

When we provide both Charity.json and the above validation schema to a JSON schema validation library, we find that Charity.json is valid.

If we change Charity.json such that “petType” is set to “dog” and run the same validation check, we find that it is invalid, because the breed “American Shorthair” is only valid for “petType”: “cat”. If Charity were a dog, we would need to select a valid dog breed.
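The outcome described above can be illustrated with a small hand-rolled check. This is only a sketch of the "exactly one of Cat or Dog must match" rule, not a JSON schema validator. The dog breeds are copied from the Breed enumeration in the validation schema; the cat breed set, the petName value, and the helper names are assumptions made for illustration.

```python
# Hand-rolled illustration of the "exactly one of Cat or Dog" rule.
# This is not a JSON schema validator; a real library would be used in practice.

# Copied from the my.organization-pets.dog.Breed enum in the validation schema.
DOG_BREEDS = {"Labrador Retriever", "German Shepherd", "Golden Retriever", "Bulldog", "Beagle"}
# Assumption: only the cat breeds that appear in this document's examples.
CAT_BREEDS = {"American Shorthair", "Maine Coon"}

SUB_SCHEMAS = {
    "my.organization-pets.cat.Cat": {"petType": "cat", "breeds": CAT_BREEDS},
    "my.organization-pets.dog.Dog": {"petType": "dog", "breeds": DOG_BREEDS},
}


def matching_schemas(entity):
    """Return the $ids of the sub-schemas whose petType constant and breed enum both accept the entity."""
    return [
        schema_id
        for schema_id, rules in SUB_SCHEMAS.items()
        if entity.get("petType") == rules["petType"] and entity.get("breed") in rules["breeds"]
    ]


def is_valid(entity):
    """Exactly one sub-schema must match for the entity to be valid."""
    return len(matching_schemas(entity)) == 1


# Hypothetical stand-ins for Charity.json before and after the change.
charity_cat = {"petName": "Charity", "petType": "cat", "breed": "American Shorthair"}
charity_dog = dict(charity_cat, petType="dog")

print(is_valid(charity_cat))
print(is_valid(charity_dog))
```

The second object fails because it satisfies the Dog petType constant but not the dog breed enumeration, and it no longer satisfies the Cat petType constant.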

Validation Schema APIs

| Response | URL | Request | Description |
| --- | --- | --- | --- |
| JSONObject | GET /entity/<id>/json | | Get the JSON object representing the given entity. For example, see Charity.json from above. |
| JSONObject | POST /entity/<id>/json | JSONObject | Update an Entity providing a JSONObject similar to the Charity.json from above. |
| JobToken | POST /schema/type/validation/async/start | $id of the JSON schema | Start a job to create the self-contained validation schema for a given JSON Schema. For example, the PetPhoto.json-Validation Schema was generated for PetPhoto.json. |
| JsonSchema | GET /schema/type/validation/async/get/{asyncToken} | JobToken | Get the results of the asynchronous job to create a validation schema. |
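The start/get pair follows a start-then-poll pattern. The sketch below shows one way a client might drive it. The two callables stand in for the HTTP calls and are hypothetical, as is the convention that "get" returns None while the job is still running; the document does not specify how an unfinished job is reported. An in-memory fake replaces the real service so the example runs on its own.

```python
import time


def run_async_job(start_job, get_result, request, poll_seconds=1.0, max_polls=60):
    """Start an asynchronous job and poll until its result is ready.

    start_job(request) -> job token
    get_result(token)  -> result, or None while the job is still running (assumed convention)
    Both callables are hypothetical stand-ins for the HTTP calls.
    """
    token = start_job(request)
    for _ in range(max_polls):
        result = get_result(token)
        if result is not None:
            return result
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {token} did not finish after {max_polls} polls")


class FakeValidationSchemaService:
    """In-memory stand-in for the two endpoints; each job finishes after a fixed number of polls."""

    def __init__(self, polls_until_done=2):
        self.polls_until_done = polls_until_done
        self.jobs = {}

    def start(self, schema_id):
        token = f"job-{len(self.jobs) + 1}"
        self.jobs[token] = {"schema_id": schema_id, "polls": 0}
        return token

    def get(self, token):
        job = self.jobs[token]
        job["polls"] += 1
        if job["polls"] < self.polls_until_done:
            return None
        # A real response would carry the full validation schema.
        return {"$id": job["schema_id"], "definitions": {}}


service = FakeValidationSchemaService()
schema = run_async_job(service.start, service.get, "my.organization-pets.PetPhoto", poll_seconds=0)
print(schema["$id"])
```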

Automatic Eventually Consistent Entity Validation

For any Entity with a bound JSON Schema, Synapse will automatically validate the Entity against the bound schema and record the results. For example, if a new JSON schema is bound to a Project with no other schema binding, Synapse will automatically start an asynchronous process to validate every Entity within the Project against the bound schema. In addition, any Entity added to or updated in the Project will also be automatically validated.

The validation results will be eventually consistent after any schema or Entity change. For example, for a Project that has a schema bound at the project level only, it will take time to re-validate all Entities in the Project when the schema changes. The validation results of each Entity in the Project will therefore be stale until the re-validation process completes.

Validation Results APIs

| Response | URL | Request | Description |
| --- | --- | --- | --- |
| ValidationResults | GET /entity/<id>/schema/validation | | Get the ValidationResults for the provided Entity. |
| ValidationSummaryStatistics | GET /entity/<id>/schema/validation/statistics | | Get the ValidationSummaryStatistics for an Entity container such as a Project or Folder. This method provides statistics about the validation status of the container's children. |
| PaginatedListOfIds | GET /entity/<id>/schema/invalid/children | | Get a single page of Entity IDs that are invalid according to their schema for the given Entity container. |

ValidationResults

| Type | Name | Description |
| --- | --- | --- |
| String | objectId | The identifier of the object that was validated. |
| ObjectType | objectType | The type of object that was validated. |
| String | etag | The etag of the object at the time validation was performed. If the etag does not match the current etag then the validation is out-of-date. |
| Date | validatedOn | The date-time when this object was validated. |
| Boolean | isValid | Is this object valid according to its schema? |
| String | validationErrorMessage | If the object is not valid according to the schema, a simple one-line error message will be provided. |
| List<String> | validationErrorMessageList | If the object is not valid according to the schema, a flat list of error messages will be provided with one error message per sub-schema. |
| ValidationException | validationException | If the object is not valid according to the schema, a recursive ValidationException will be provided that describes all violations in the sub-schema tree. |

ValidationException

| Type | Name | Description |
| --- | --- | --- |
| String | keyword | The JSON schema keyword which was violated. |
| String | pointerToViolation | A JSON Pointer denoting the path from the input document root to its fragment which caused the validation failure. |
| String | message | The description of the validation failure. |
| String | schemaLocation | A JSON Pointer denoting the path from the schema JSON root to the violated keyword. |
| ValidationException | causingExceptions | An array of sub-exceptions. |

With the example from above, where we changed Charity.json such that ‘petType’ equals ‘dog’, the following ValidationResults will be produced:

{
     "objectId": "syn123",
     "objectType": "entity",
     "objectEtag": "some-etag",
     "isValid": false,
     "validatedOn": "2020-06-30T18:56:03.959-07:00",
     "validationErrorMessage": "#: 0 subschemas matched instead of one",
     "allValidationMessages": [
          "#/petType: ",
          "#/breed: American Shorthair is not a valid enum value"
     ],
     "validationException": {
          "keyword": "oneOf",
          "pointerToViolation": "#",
          "message": "#: 0 subschemas matched instead of one",
          "schemaLocation": "#",
          "causingExceptions": [
               {
                    "keyword": "allOf",
                    "pointerToViolation": "#",
                    "message": "#: only 1 subschema matches out of 2",
                    "schemaLocation": "#/definitions/my.organization-pets.cat.Cat",
                    "causingExceptions": [{
                         "keyword": "const",
                         "pointerToViolation": "#/petType",
                         "message": "",
                         "schemaLocation": "#/definitions/my.organization-pets.cat.Cat/properties/petType",
                         "causingExceptions": []
                    }]
               },
               {
                    "keyword": "allOf",
                    "pointerToViolation": "#",
                    "message": "#: only 1 subschema matches out of 2",
                    "schemaLocation": "#/definitions/my.organization-pets.dog.Dog",
                    "causingExceptions": [{
                         "keyword": "allOf",
                         "pointerToViolation": "#/breed",
                         "message": "#: only 1 subschema matches out of 2",
                         "schemaLocation": "#/definitions/my.organization-pets.dog.Breed",
                         "causingExceptions": [{
                              "keyword": "enum",
                              "pointerToViolation": "#/breed",
                              "message": "American Shorthair is not a valid enum value",
                              "causingExceptions": []
                         }]
                    }]
               }
          ]
     }
}

Example ValidationResults
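Because validationException is recursive, a client that wants a flat list of the underlying violations has to walk the tree. The following is a minimal sketch of such a walk over an abridged copy of the example above. The function name is hypothetical, and treating an exception with no causingExceptions as a leaf is an assumption about how clients would interpret the structure.

```python
def leaf_violations(exception):
    """Depth-first walk of a ValidationException tree.

    Returns one (pointerToViolation, keyword, message) tuple for each
    exception that has no causingExceptions of its own.
    """
    children = exception.get("causingExceptions", [])
    if not children:
        return [(exception["pointerToViolation"], exception["keyword"], exception["message"])]
    leaves = []
    for child in children:
        leaves.extend(leaf_violations(child))
    return leaves


# Abridged copy of the validationException from the example above (schemaLocation omitted).
validation_exception = {
    "keyword": "oneOf",
    "pointerToViolation": "#",
    "message": "#: 0 subschemas matched instead of one",
    "causingExceptions": [
        {
            "keyword": "allOf",
            "pointerToViolation": "#",
            "message": "#: only 1 subschema matches out of 2",
            "causingExceptions": [
                {"keyword": "const", "pointerToViolation": "#/petType", "message": "", "causingExceptions": []}
            ],
        },
        {
            "keyword": "allOf",
            "pointerToViolation": "#",
            "message": "#: only 1 subschema matches out of 2",
            "causingExceptions": [
                {
                    "keyword": "allOf",
                    "pointerToViolation": "#/breed",
                    "message": "#: only 1 subschema matches out of 2",
                    "causingExceptions": [
                        {
                            "keyword": "enum",
                            "pointerToViolation": "#/breed",
                            "message": "American Shorthair is not a valid enum value",
                            "causingExceptions": [],
                        }
                    ],
                }
            ],
        },
    ],
}

for pointer, keyword, message in leaf_violations(validation_exception):
    print(f"{pointer} [{keyword}]: {message}")
```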

ValidationSummaryStatistics

| Type | Name | Description |
| --- | --- | --- |
| String | containerId | The syn ID of the container. |
| Date | updatedOn | The date-time when the statistics were last updated. |
| int | totalNumberOfChildren | The total number of child Entities contained in this container. |
| int | numberOfValidChildren | The number of children in this container that are valid according to their schema. |
| int | numberOfInvalidChildren | The number of children in this container that are invalid according to their schema. |
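As an illustration of how these fields relate to each other, the sketch below aggregates per-child validation states into the summary shape. The function name and the use of None for a child that has not been validated yet are assumptions. Because validation is eventually consistent, the valid and invalid counts may not add up to the total while re-validation is still running.

```python
from datetime import datetime, timezone


def summarize(container_id, child_results):
    """Build a summary-statistics dictionary for a container.

    child_results maps a child ID to True (valid), False (invalid),
    or None (not yet validated; an assumed third state).
    """
    valid = sum(1 for state in child_results.values() if state is True)
    invalid = sum(1 for state in child_results.values() if state is False)
    return {
        "containerId": container_id,
        "updatedOn": datetime.now(timezone.utc).isoformat(),
        "totalNumberOfChildren": len(child_results),
        "numberOfValidChildren": valid,
        "numberOfInvalidChildren": invalid,
    }


# Hypothetical validation states for four children of one container.
stats = summarize(
    "syn22691893",
    {"syn22691957": True, "syn22691978": False, "syn22691991": True, "syn22692010": None},
)
# Children that are neither valid nor invalid are still waiting to be validated.
pending = (
    stats["totalNumberOfChildren"]
    - stats["numberOfValidChildren"]
    - stats["numberOfInvalidChildren"]
)
print(stats["numberOfValidChildren"], stats["numberOfInvalidChildren"], pending)
```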

Automatic Validation Conclusion

We finished the implementation of the automatic validation of Entities in August of 2020. The following set of APIs are available for fetching the automatic validation results and statistics for Entities:

Phase Three

Use Cases

Goal

The goal for phase three is to provide services for defining Synapse Views using JSON schemas. We also want to make some of the schema information available for both display and filtering in Views. This could include information about the schema bound to the Entity, the type of the file, and the validation state of the Entity. For example, is the Entity valid according to its bound schema? Users will likely also want to filter by these additional fields.

As a stretch goal, we want to leverage the schema relationships in faceted view navigation. For example, it would be useful to filter all Entities in a project that extend our Pet.json schema, to discover which new pet types have been added. This could also be useful for discovering similar data in other projects. This might require a new UI for navigating the tree-like relationships that can be built with JSON schemas.

View Building Blocks

The basic building blocks of any Synapse View are its scope, the type of Entities to include, and its schema. These building blocks determine which columns and rows will be included in the view.

How do we want to define the columns and rows of a view driven by a JSON schema? It seems fairly obvious that we would want to use the JSON schema to drive a View’s columns, but how should we use the schema to define which rows should be included in the view? There are a few details we must consider before we can answer this question. In the next section we will set up an example that builds on all of the main JSON schema features we added in the first two phases.

All Pets Example

The following is a basic example that we will use throughout the rest of the document. See the following screen shot:

The above screen shot shows a folder named “All Pets” that contains four files: Alpha, Bravo, Charlie, and Delta. Each file represents a pet photo. We have bound the PetPhoto.json schema (defined above) to the “All Pets” folder. Recall that the PetPhoto.json schema states that each entity must be a Synapse FileEntity and that the metadata annotations must be valid against either the Cat.json or Dog.json schema.

Example Views of “All Pets”

The following table represents a possible view of the “All Pets” example folder using the Pet.json schema as a driver:

| name | id | parentId | petName | birthday | petType |
| --- | --- | --- | --- | --- | --- |
| Alpha.png | syn22691957 | syn22691893 | Alpha | 11/16/2013 | cat |
| Bravo.png | syn22691978 | syn22691893 | Bravo | 12/03/2014 | dog |
| Charlie.png | syn22691991 | syn22691893 | Charlie | 05/13/2018 | cat |
| Delta.png | syn22692010 | syn22691893 | Delta | 05/18/2016 | dog |

View of “All Pets” using the Pet.json schema as a driver.

Some of the columns from FileEntity are excluded in the above example for brevity including: etag, createdOn, createdBy, modifiedOn, modifiedBy, versionLabel, versionComment, versionNumber, fileHandleId, and concreteType. In addition to the columns from FileEntity, the three columns defined in Pet.json are included: petName, birthday, and petType. It is important to note that the view does not include any columns that are specific to the Cat.json or Dog.json schema, both of which extend the Pet.json schema.

The following are two examples of views of the “All Pets” folder, one using Cat.json as a driver and the other using Dog.json as a driver:

| name | id | parentId | petName | birthday | petType | Cat Breed |
| --- | --- | --- | --- | --- | --- | --- |
| Alpha.png | syn22691957 | syn22691893 | Alpha | 11/16/2013 | cat | Maine Coon |
| Charlie.png | syn22691991 | syn22691893 | Charlie | 05/13/2018 | cat | American Shorthair |

View of “All Pets” using the Cat.json schema as a driver

| name | id | parentId | petName | birthday | petType | Dog Breed |
| --- | --- | --- | --- | --- | --- | --- |
| Bravo.png | syn22691978 | syn22691893 | Bravo | 12/03/2014 | dog | Golden Retriever |
| Delta.png | syn22692010 | syn22691893 | Delta | 05/18/2016 | dog | Beagle |

View of “All Pets” using the Dog.json schema as a driver

The Cat.json driven view includes all columns defined in the org.sagebionetworks-repo.model.FileEntity, my.organization-pets.Pet-1.0.3, and my.organization-pets.cat.Cat schemas. In addition, it only includes rows for files that represent cats (Alpha & Charlie). The Dog.json driven view includes all columns defined in the org.sagebionetworks-repo.model.FileEntity, my.organization-pets.Pet-1.0.3, and my.organization-pets.dog.Dog schemas. In addition, it only includes rows for files that represent dogs (Bravo & Delta).

The final type of view we might want to create would be driven by the PetPhoto.json schema:

| name | id | parentId | petName | birthday | petType | Cat Breed | Dog Breed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Alpha.png | syn22691957 | syn22691893 | Alpha | 11/16/2013 | cat | Maine Coon | |
| Bravo.png | syn22691978 | syn22691893 | Bravo | 12/03/2014 | dog | | Golden Retriever |
| Charlie.png | syn22691991 | syn22691893 | Charlie | 05/13/2018 | cat | American Shorthair | |
| Delta.png | syn22692010 | syn22691893 | Delta | 05/18/2016 | dog | | Beagle |

View of “All Pets” driven by the PetPhoto.json schema

The PetPhoto.json driven view is something of a “sparse matrix”, where some of the columns have no meaning for some of the rows. Do we want to support “sparse matrix” type views?

Here is an example of additional columns we might want to support for views driven by JSON schemas:

| petName | birthday | petType | Validated Against Schema | isValid | Validation Error Message | Matches Schema |
| --- | --- | --- | --- | --- | --- | --- |
| Alpha | 11/16/2013 | cat | my.organization-pets.PetPhoto | true | | my.organization-pets.cat.Cat |
| Bravo | 12/03/2014 | guppy | my.organization-pets.PetPhoto | false | “guppy is not a valid petType, must be one of: [cat, dog, fish]” | ???? |
| Charlie | 05/13/2018 | cat | my.organization-pets.PetPhoto | true | | my.organization-pets.cat.Cat |
| Delta | 05/18/2016 | dog | my.organization-pets.PetPhoto | true | | my.organization-pets.dog.Dog |

Additional schema related columns that can be useful for navigation

Missing a Step?

In the “All Pets” folder example above, what can we conclude from the automatic JSON schema validation that Synapse will perform for that folder? If all four files are marked as valid, then we know that each file represents either a Cat or a Dog. The validation process does not tell us which schema was matched, only that it matched one or the other. This means we cannot tell if Alpha represents a Cat or a Dog, only that it matches either Cat or Dog. Even when the validation process marks a file as invalid, we only know it does not match Cat and does not match Dog. How do we determine the value of the “Matches Schema” column for each Entity in the view? Note: We would also need this value to correctly build the Cat.json driven view and the Dog.json driven view, as each view needs to exclude all rows that are not of that type.

Derive the Type from petType constant?

Can we leverage the constant definitions from the Cat.json and Dog.json schemas to determine the type? For example, Cat.json includes:

		"petType": {
			"const": "cat"
		}

And Dog.json includes:

		"petType": {
			"const": "dog"
		}

A JSON schema might include many constants such as petType. How do we know that petType is special? Maybe a type is defined by combining all constants in the schema? How do we determine the type if the user has not yet provided values for the constants?
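To make the "combine all constants" idea concrete, here is a minimal sketch. It treats every property that a schema pins with "const" as part of the type signature and reports which schemas an entity's values agree with. The function names are hypothetical, and the two schema dictionaries are trimmed copies of Cat.json and Dog.json. This is one possible reading of the question above, not a settled design.

```python
def constants_of(schema):
    """Collect every property that the schema pins to a single value with "const"."""
    return {
        name: definition["const"]
        for name, definition in schema.get("properties", {}).items()
        if isinstance(definition, dict) and "const" in definition
    }


def candidate_types(entity, schemas):
    """Return the $id of each schema whose constants all equal the entity's values."""
    matches = []
    for schema in schemas:
        constants = constants_of(schema)
        # A schema with no constants cannot be identified this way.
        if constants and all(entity.get(name) == value for name, value in constants.items()):
            matches.append(schema["$id"])
    return matches


# Trimmed copies of the Cat.json and Dog.json schemas.
CAT = {
    "$id": "my.organization-pets.cat.Cat",
    "properties": {
        "breed": {"$ref": "#/definitions/my.organization-pets.cat.Breed"},
        "petType": {"const": "cat"},
    },
}
DOG = {
    "$id": "my.organization-pets.dog.Dog",
    "properties": {
        "breed": {"$ref": "#/definitions/my.organization-pets.dog.Breed"},
        "petType": {"const": "dog"},
    },
}

print(candidate_types({"petName": "Alpha", "petType": "cat"}, [CAT, DOG]))
# An entity whose constants have not been filled in yet matches nothing.
print(candidate_types({"petName": "Echo"}, [CAT, DOG]))
```

The second call shows the gap raised in the last question: until the user supplies petType, no type can be derived from the constants alone.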

Add a special schemaType attribute to an Entity?

Should we require users to explicitly provide a “schemaType” on each Entity within the scope of a bound schema? This would require an extra step on the user’s part but it would remove all ambiguity.

Using a JSON schema to drive the columns of a View

How do we use a JSON schema to drive the columns of a View? The following is a list of possible requirements:

Proposed Services

Service to Generate ColumnModels from a JSON Schema

We propose adding a new REST API service that, given the $id of a JSON schema, will generate a list of ColumnModels that contain all of the details captured in the JSON schema. Project designers could then use this service to transfer the details of their JSON schemas to their view schemas. The clients could help the project designer apply these new ColumnModels to their views, similar to how the “Add All Annotations” button works in the existing view editor user interface.

It is important to note that this does not change the fundamental nature of views in any way. The new service is simply a tool to help transfer details of a JSON schema into a view schema.

If the project designer changes their JSON schema, the changes will not automatically propagate to their views. Instead, if a project designer wishes to apply JSON schema changes to their existing views, they would need to manually re-run the service to generate new ColumnModels. In theory, the view editor user interface would help the user manage the ColumnModel delta to apply to their view. This means the project designer will be in full control of what changes propagate to their views.

In addition to the service to generate ColumnModels from JSON schemas, we propose adding a new field to ColumnModel to allow the column to be linked to a JSON schema. Specifically, we propose adding a string field to ColumnModel called “derivedFrom$id”. The value of this field would be the $id of the JSON schema from which the column was derived. The service to generate ColumnModels from JSON schemas would automatically provide a value for this new field based on the $id passed as the parameter of the service.

REST API Additions

| Response | URL | Request | Description |
| --- | --- | --- | --- |
| JobToken | POST /schema/type/columnmodel/async/start | $id of the JSON schema | Start a job to create a list of ColumnModel objects that represent the properties of the given JSON schema. One ColumnModel will be returned for each property of the JSON schema. |
| ColumnModelResponse | GET /schema/type/columnmodel/async/get/{asyncToken} | JobToken | Get the results of the asynchronous job to create a list of ColumnModels from a JSON schema. |

Object Models

ColumnModel

| Name | Type | Description |
| --- | --- | --- |
| All of the existing fields of ColumnModel | | |
| derivedFrom$id | String | The $id of the JSON schema that was used to define this ColumnModel. Note: The name of the ColumnModel will match the name of the property from the JSON schema. |

ColumnModelResponse

| Name | Type | Description |
| --- | --- | --- |
| columnModels | List<ColumnModel> | The resulting list of ColumnModels |
| $id | String | The $id of the JSON schema used to define the list of ColumnModels |
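The following sketch shows what the proposed service might compute for the Pet schema: one column per property, named after the property, with derivedFrom$id set to the schema's $id. The mapping from JSON schema types to column types, the "columnType" key, and the fallback to STRING for properties without a plain type (for example a $ref) are assumptions made for illustration; the proposal does not define them.

```python
# Assumed mapping from a JSON schema "type" to a view column type.
TYPE_TO_COLUMN_TYPE = {
    "string": "STRING",
    "integer": "INTEGER",
    "number": "DOUBLE",
    "boolean": "BOOLEAN",
}


def column_type_for(definition):
    """Pick a column type for one JSON schema property definition."""
    if definition.get("type") == "string" and definition.get("format") == "date-time":
        return "DATE"
    # Properties without a plain "type" (for example a $ref or a const) fall back to STRING.
    return TYPE_TO_COLUMN_TYPE.get(definition.get("type"), "STRING")


def column_models_from_schema(schema):
    """Return a ColumnModelResponse-shaped dictionary with one column per schema property."""
    schema_id = schema["$id"]
    column_models = [
        {"name": name, "columnType": column_type_for(definition), "derivedFrom$id": schema_id}
        for name, definition in schema.get("properties", {}).items()
    ]
    return {"columnModels": column_models, "$id": schema_id}


# Trimmed copy of the my.organization-pets.Pet-1.0.3 schema.
PET = {
    "$id": "my.organization-pets.Pet-1.0.3",
    "properties": {
        "petName": {"type": "string"},
        "birthday": {"type": "string", "format": "date-time"},
        "petType": {"$ref": "#/definitions/my.organization-pets.PetType"},
    },
}

response = column_models_from_schema(PET)
for column in response["columnModels"]:
    print(column["name"], column["columnType"], column["derivedFrom$id"])
```

This sketch only reads the schema's own properties. Columns inherited through allOf, such as those from FileEntity, would require resolving the referenced schemas first.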