
Schemas for extending Synapse Objects

Introduction

Annotating files with metadata is a critical part of the data curation process. The metadata identifies key aspects of a file, such as its type, its origin, and who and what the data is about. Properly annotated files can be categorized for discovery and navigation. File metadata also serves as a guidepost for data consumers such as data processing pipelines and scientific analyses.

The Synapse platform has had APIs for managing file metadata in the form of key-value pair annotations since its first release. With the addition of Synapse File Views, data managers can set up tabular views of their annotated files. These file views serve as the core of the faceted file navigation in most of the Synapse portals and UIs. Data managers can set the schemas of their views by defining columns with names and types, and by identifying navigation-critical columns as facets. When Synapse builds a view, it attempts to match the annotation key-value pairs on the files to the columns defined in the view. As long as the files are annotated as expected, the view can be built. However, Synapse does not have a system that allows data managers to control how files are annotated. They can set a schema on a view, but they cannot set the schema of the file metadata. (This is a key limitation for a Data Coordination Center, DCC, using Synapse as the data contribution platform, as a DCC cannot today constrain data contributors to correctly annotate the files they upload.) Currently, a data manager must develop external processes for any type of file metadata control. The purpose of this project is to bridge this gap and provide data managers with API services for controlling the file metadata on their projects.

Use cases

The metadata working group created the following document that outlines and prioritizes the main use cases: https://docs.google.com/spreadsheets/d/1UgY3JR6WlHVgHkGTzs1wUpV5E7QFuOpwC3ph1e8-zl4/edit#gid=1371550427

User Types

  • Project Designers - The group of users that will design and maintain the metadata schemas for a project/consortium.

  • Metadata Contributors - The group of users that provide the actual metadata values.

  • Metadata Consumers - The group of users that consume metadata through faceted navigation, views or directly from the object annotations.

Note: Each user type represents a broad category of Metadata Platform Personas:

| User Type | Possible Personas |
|---|---|
| Project Designers | DCC Data Curator / Community Manager, DCC Project Owner, DCC Governance, Data Portal Owner |
| Metadata Contributors | Center Data Liaison, Center Principal Investigator, Wet Lab Scientist, Clinical Research Coordinator, Data Processor (Bioinformatician), DCC Working Group (Participant) |
| Metadata Consumers | Data Engineer, Data Consumer - Programmatic, Data Consumer - Non-Programmatic, Funder. Note: All personas are metadata consumers. |

Phases

While the use cases cover a broad range of desired features and functionality, all of the use cases depend on a common foundation:

  • A set of APIs that project designers will use both to define schemas and to bind the schemas to a project/folder. This phase was completed in June 2020.

  • A life-cycle for metadata contributors to add/update metadata values within a project/folder based on the bound schema. This would include validation based on the schema, managing schema change propagation, and finding and fixing files that do not comply with the schema. This phase was completed in August 2020.

  • Faceted navigation and view creation based on bound schemas. This is the current phase.

All other features/functionality from the use cases will be extensions/additions to these foundational elements. Note: Each subsequent foundational element depends on the previous element. Therefore, the plan is to implement the three foundational elements in phases, starting with phase one: defining and binding schemas. Metadata management/validation will be implemented in phase two, while navigation/view query will be implemented in phase three.

Phase One

What is a schema in this context?

The basic building block of a schema is a single field. The field’s definition includes a name and a data type. For example, the schema of a person’s birthday might have a name=”birthday” and a type of “date”. These two pieces of information could be enough to construct a simple form to ask a user to provide their birthday. These simple fields are analogous to the ColumnModel used to define the schema of Synapse tables/views. The simplest representation of a schema would then be a flat list of these fields, again, like the schemas of Synapse tables/views.

However, a flat list of fields is not a great fit for real world metadata. We will use an example to illustrate the subtle complexity inherent in real world metadata.

In this example a project designer would like to define a schema for annotating photographs of people’s pets. The following diagram shows the basic relationship of our schema elements:

In this example, any pet is expected to have a field for name, birthday, and type. A photo of a cat or dog would be expected to have all of the fields common to any pet (name, birthday, type) but would also have a field for breed (note that breed options differ between cats and dogs). You could also imagine additional fields specific to cats or dogs. There might even be sub-types based on breed, each with their own set of additional fields.

If we do not allow project designers to model class-like relationships in their schemas, and instead only support flat schemas, we create several problems:

  • Flat lists force the designer to duplicate data. For example, a flat list for both cats and dogs would have duplicate definitions for name, birthday, and type. This quickly becomes unwieldy as the number of types grows.

  • How does the metadata contributor choose which schema to use for a given situation when presented with a long list of types? In theory, the more information we have about a type, the easier it will be to guide the contributor to the appropriate types. In the above example, it is clear that we must first ask the contributor for a pet type before presenting any further details. [Comment: I imagine that the schema will initially be not Pet but Cat. Later someone will want Dog and we will have to guide them to refactor Cat into Pet+Cat so Dog can extend Pet.]

Finally, even though types need to be defined and maintained as class-like relationships, the types can still be presented to the contributors and consumers as a flat list. For example, a contributor providing metadata for a cat photo would see the flat list of: name, birthday, type, and breed (cat breed). The contributor does not need to understand that Cat extends Pet in order to provide a flat list of values.
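The flattening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the TYPES dictionary, its "extends"/"fields" keys, and the flat_fields helper are hypothetical and are not part of any Synapse API.

```python
# Hypothetical, simplified type definitions: each type lists its own fields
# and, optionally, the single type it extends.
TYPES = {
    "Pet": {"extends": None, "fields": ["name", "birthday", "type"]},
    "Cat": {"extends": "Pet", "fields": ["breed"]},
    "Dog": {"extends": "Pet", "fields": ["breed"]},
}


def flat_fields(type_name, types=TYPES):
    """Return the flat field list for a type, inherited fields first."""
    definition = types[type_name]
    parent = definition["extends"]
    inherited = flat_fields(parent, types) if parent else []
    # Keep the parent's order and skip any field the parent already contributed.
    own = [field for field in definition["fields"] if field not in inherited]
    return inherited + own


cat_fields = flat_fields("Cat")
print(cat_fields)
```

The contributor only ever sees the resulting flat list; the Cat-extends-Pet relationship stays a concern of the project designer.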

Which schema specification should we adopt?

We selected the JSON schema specification for two main reasons:

Note: We adopted the older draft-07 instead of the latest (draft 2019-09), as most implementations support the older version.

Name spacing

When defining types it is often desirable to give a type a short, human-readable name. The name ‘breed’ from above is an example of a nice, short name. However, in this example we have two different definitions for breed, one for cats and another for dogs. It is also possible that another user will want to use the same short name. Most programming languages solve this type of problem with a name space scheme. The name spacing adopted for Synapse model objects is composed of two parts:

{organization_name}-{schema_name}

For example, here is the full name of a Synapse FileEntity:

org.sagebionetworks-repo.model.FileEntity
  • Organization name : ‘org.sagebionetworks’

  • Schema name : repo.model.FileEntity

Note: The schema name can include additional path information (separated by dots not slashes), and not just the simple name of the file.

Since each full name is prefixed by a globally unique organization name (within Synapse), the full name is also globally unique and is the key used to reference other types. Note: Each type will be visible to everyone. We will cover create/update type permissions in the next section. Type reference will be covered in more detail in a later section.

 

Organizations

The first step in creating new type extensions in Synapse will be to set up and configure an organization. The name of the organization will serve as the root for each type managed by that organization. The organization name ‘org.sagebionetworks’ is reserved for the core Synapse model objects. Each organization will also have an Access Control List (ACL) that will control who can add type definitions to the organization. All types created under an organization will be considered publicly readable and reference-able.

Organization REST APIs

Most of the Organization related APIs have now been implemented, see: JSON Schema Services.

Type Definitions

We will continue to use the pet photo example to illustrate how to define and extend types in Synapse. In this example we assume we have already created our Organization object (see above), with an organization name of ‘my.organization’.

In our photo example, we have three types: Pet, Cat, and Dog. Since both Cat and Dog extend the definition of Pet, we will start by defining the Pet type:

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "$id": "my.organization-pets.Pet-1.0.3",
  "description": "Base type information shared by all pet types.",
  "allOf": [
    { "$ref": "org.sagebionetworks-repo.model.FileEntity-1.0.0" }
  ],
  "properties": {
    "petName": {
      "type": "string",
      "description": "The name of the pet shown in the photo."
    },
    "birthday": {
      "description": "The birthday of the pet shown in the photo.",
      "type": "string",
      "format": "date-time"
    },
    "petType": {
      "$ref": "my.organization-pets.PetType-1.0.1"
    }
  }
}

Pet.json

The first attribute of Pet.json that we will cover is ‘$id’, which is the unique identifier for our pet schema definition. We will need the ‘$id’ value anytime we want to reference our pet schema. According to the specification, the ‘$id’ must be a valid URI. The value of ‘my.organization-pets.Pet-1.0.3’ is a shortcut for the full URL using the following pattern:

Following this pattern the full URL then becomes:

The full pseudo-BNF for the $id can be found here. We will use this $id value to reference this type when we extend it to define both the Cat and Dog types shortly. Note: The $id ends with the semantic version ‘1.0.3’, which we will cover in more detail in the version section.
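As a rough illustration of how the short form of an $id breaks down, the following Python sketch splits it into its parts. The regular expression is an assumption based only on the examples in this document (names made of letters, digits, underscores and dots, never dashes, plus an optional major.minor.patch suffix); it is not the official pseudo-BNF, and parse_schema_id is a hypothetical helper.

```python
import re

# Assumed shape: {organization_name}-{schema_name}[-{major}.{minor}.{patch}]
# Assumption: neither name contains a dash and both start with a letter.
_ID_PATTERN = re.compile(
    r"^(?P<organization>[A-Za-z][\w.]*)"
    r"-(?P<schema>[A-Za-z][\w.]*)"
    r"(?:-(?P<version>\d+\.\d+\.\d+))?$"
)


def parse_schema_id(schema_id):
    """Split a short $id into organization, schema name, and optional version."""
    match = _ID_PATTERN.match(schema_id)
    if match is None:
        raise ValueError(f"not a recognized schema $id: {schema_id!r}")
    return match.groupdict()


versioned = parse_schema_id("my.organization-pets.Pet-1.0.3")
unversioned = parse_schema_id("org.sagebionetworks-repo.model.FileEntity")
print(versioned)
print(unversioned)
```

When the $id carries no semantic version, the "version" entry of the returned dictionary is None.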

The next attribute, ‘title’, is what will be displayed to metadata providers when they work with this type. The title is not part of the identifier and does not need to match any part of the identifier. Unlike the $id, the title can contain any characters including spaces.

The ‘description’ attribute is used to provide helpful information to metadata providers and metadata consumers.

The next section is an array attribute named ‘allOf’, which we use to define the one or more schemas that we wish to extend. In this example, the Pet type extends a Synapse FileEntity identified by the:

Designers will be able to extend types that they define or types defined by others. Initially, all user-defined types must either directly or indirectly extend a Synapse FileEntity, Folder, or Project. In the future, we plan to support the extension of other types of Synapse objects including, but not limited to, Evaluation Submissions and Access Requirements.

The next section of the Pet.json file is the ‘properties’ attribute. This properties object is where each field of a type is defined. Each property has a name, type, and description. The name of a property determines the final key of the resulting annotation key-value pair.

The type of the property controls the allowable values for the field. For example, a type of ‘number’ can only contain characters that represent a number such as digits, dot, signs, and exponential expressions.

For all of the supported types see the API docs: Types.

In the Pet.json file the property ‘petType’ is an example of a reference to another type. The following is the full definition of the PetType enumeration:

PetType.json

Specifically, ‘petType’ is an enumeration that limits values to those listed in its definition, which includes both ‘cat’ and ‘dog’. Since ‘petType’ is a type definition, it also has an ‘$id’ and can be referenced in other types.

Now that we have defined the abstract Pet type we can start to define the concrete types of Cat and Dog. A key part of both Cat and Dog is the ‘breed’ enumeration, each with its own distinct values. We will define both of these separately.

CatBreed.json

 

DogBreed.json

In each case, both enumerations have the same name, ‘Breed’, but different path values, so the ID of each enumeration is still unique. With these enumeration types defined, we can define both the Cat and Dog types:

 

Cat.json

 

Dog.json

Both the definitions for Cat and Dog implement our previously defined Pet type ($ref: my.organization-pets.Pet). This means both Cat and Dog will include all of the attributes defined in Pet. Note: The Dog.json implementation reference includes a semantic version (1.0.3), while the Cat.json reference does not. We will discuss versions in a later section.

Note: Both Cat.json and Dog.json override the “petType” attribute (from Pet.json) as a constant to lock down the type that is appropriate. This means a valid cat must have a “petType”:”cat” and a valid dog must have a “petType”:”dog”.

Finally, both Cat and Dog include an attribute named ‘breed’, with each referencing the enumeration appropriate to its type.

Now that we have defined our supported pet schemas (cat and dog), we are ready to put all of the parts together in a single schema definition:

 

PetPhoto.json

The first new term in the PetPhoto.json is the ‘oneOf’ array. This array definition states that in order for a pet photo to be valid it must be valid against one of the listed sub-schemas.
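To make the ‘oneOf’ rule concrete, here is a minimal Python sketch of a validator that understands only a handful of keywords (const, enum, required, properties, oneOf). It is not a JSON Schema implementation, and the CAT, DOG and PET_PHOTO dictionaries are simplified, hypothetical stand-ins for the real schema files; the dog breed names in particular are invented for the example.

```python
def is_valid(schema, instance):
    """Validate an instance against a tiny subset of JSON Schema keywords."""
    if "const" in schema and instance != schema["const"]:
        return False
    if "enum" in schema and instance not in schema["enum"]:
        return False
    if isinstance(instance, dict):
        if any(key not in instance for key in schema.get("required", [])):
            return False
        for name, sub_schema in schema.get("properties", {}).items():
            if name in instance and not is_valid(sub_schema, instance[name]):
                return False
    if "oneOf" in schema:
        # oneOf: exactly one of the listed sub-schemas must accept the instance.
        matches = sum(is_valid(sub, instance) for sub in schema["oneOf"])
        if matches != 1:
            return False
    return True


# Simplified, hypothetical stand-ins for Cat.json, Dog.json and PetPhoto.json.
CAT = {
    "required": ["petType", "breed"],
    "properties": {
        "petType": {"const": "cat"},
        "breed": {"enum": ["American Shorthair", "Maine Coon"]},
    },
}
DOG = {
    "required": ["petType", "breed"],
    "properties": {
        "petType": {"const": "dog"},
        "breed": {"enum": ["Beagle", "Poodle"]},  # invented breed list
    },
}
PET_PHOTO = {"oneOf": [CAT, DOG]}

cat_photo = {"petType": "cat", "breed": "Maine Coon"}
mislabeled = {"petType": "dog", "breed": "American Shorthair"}
print(is_valid(PET_PHOTO, cat_photo), is_valid(PET_PHOTO, mislabeled))
```

The second object fails because it matches neither sub-schema: the constant ‘petType’ rules out Cat, and the breed is not in the Dog enumeration.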

In summary, the PetPhoto.json schema encapsulates everything we have defined thus far and even outlines which sub-schema should be used under various conditions.

Alternative: Conditional Logic

One alternative to the base types (Pet.json) and extension types (Cat.json & Dog.json) is the use of the JSON schema conditional logic: if/then/else. The following example is semantically equivalent to the combination of Pet.json, Cat.json, Dog.json, and PetPhoto.json, all in a single JSON schema:

ConditionalAlternative.json

 

Schema Versions

A new JSON schema is created by calling: POST /schema/type/create/async/start. The same call is also used to make changes to an existing JSON schema. However, the exact impact of a JSON schema change depends on the inclusion or exclusion of the optional semantic version suffix in the schema $id.

When a semantic version is included in a schema $id, its value serves as a human-readable reference to a specific schema change. The semantic version values must follow the rules of Semantic Versioning, with a major, minor, and patch number. If a semantic version is included in a JSON schema $id, its value must be unique within the schema $id space.

We will use the examples from above to illustrate exactly what it means when a semantic version is included in a schema $id.

We included a semantic version suffix in the $id of Pet.json (1.0.3), but we did not include one in the other types. By including a semantic version in Pet.json we made it possible for other schemas to reference that specific version. For example, the Dog $ref to the Pet schema includes the 1.0.3 semantic version, while the Cat $ref to Pet excludes the semantic version. This means the Dog type is locked to an explicit version (1.0.3) of the Pet type, while the Cat type will always reference the latest version of the Pet schema. For example, if we were to add a new property to the Pet schema and bump its semantic version to 1.0.4, the Cat type would include the new property but the Dog type would not (Dog is locked to Pet version 1.0.3). If we wanted to include the new Pet attribute in Dog, we would need to update the Dog type definition to either reference Pet 1.0.4 or remove the semantic version from the reference.

If a JSON schema is created with a semantic version in its $id, that version is immutable. Any attempt to update that version of a schema will be rejected. However, it is possible to delete a specific semantic version of a schema using DELETE /schema/type/registered/{organizationName}-{schemaName}-{semanticVersion} if there are no references to that version.

Creating a schema without a semantic version means only the latest version is maintained. Each update to a schema that excludes a semantic version will simply replace any existing non-versioned copy of that schema. If you wish to maintain the full history of your schemas, you will need to provide a semantic version in the $id of the schema for all updates.
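The versioning rules in this section can be modeled with a small in-memory registry. The Python sketch below is an illustration only: SchemaRegistry and its methods are hypothetical and do not correspond to the Synapse services; it simply encodes the behavior described in the text (versioned copies are immutable, an unversioned reference always sees the most recent registration).

```python
class SchemaRegistry:
    """Hypothetical in-memory model of the versioning rules described above."""

    def __init__(self):
        self._versioned = {}  # (name, version) -> schema
        self._latest = {}     # name -> most recently registered schema

    def create(self, name, schema, version=None):
        if version is not None:
            if (name, version) in self._versioned:
                # A registered semantic version can never be overwritten.
                raise ValueError(f"{name}-{version} is immutable")
            self._versioned[(name, version)] = schema
        # With or without a version, this registration becomes the latest copy.
        self._latest[name] = schema

    def get(self, name, version=None):
        """Resolve a reference: pinned when a version is given, else latest."""
        if version is None:
            return self._latest[name]
        return self._versioned[(name, version)]


registry = SchemaRegistry()
pet = "my.organization-pets.Pet"
registry.create(pet, {"properties": ["petName"]}, version="1.0.3")
registry.create(pet, {"properties": ["petName", "weight"]}, version="1.0.4")

dog_sees = registry.get(pet, "1.0.3")  # Dog pins Pet 1.0.3
cat_sees = registry.get(pet)           # Cat follows the latest Pet
print(dog_sees, cat_sees)
```

The "weight" property is invented for the example; it plays the role of the new property added in version 1.0.4.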

With the completion of phase one, the basic create, read, update, delete (CRUD) APIs have been implemented. See: JSON Schema Services for more details.

 

Schema to Entity Binding

Once a JSON schema has been created, it can be bound to a Project, Folder or File Entity. The bound schema will be used to validate the Entity's metadata (annotations). Any child Entity will inherit its parent’s bound schema, unless a schema is explicitly bound to that child.

A simple algorithm is used to determine which, if any, schema is bound to an Entity:

  • Has a schema been bound directly to an Entity?

    • Yes - Then that schema applies to the Entity.

    • No - Recursively check the Entity’s parent for a bound schema. The first schema found in this manner will apply to the Entity. If no schema is found in an Entity’s hierarchy, then the Entity does not have a bound schema.

For example, if you bind a schema to a Project, and there are no other schemas bound to any Entity in that Project, then all Folders and Files in that Project will inherit the schema bound to the Project. The project-level binding can be overridden by binding an additional schema to a Folder. The Folder and all of its children (Files and Folders) will then inherit the schema bound to the Folder instead of the schema bound to the parent Project.
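The lookup described above amounts to walking up the parent chain until a binding is found. The following Python sketch shows the idea under stated assumptions: the parents and bindings dictionaries, the entity names, and the Folder schema $id are hypothetical, and the real service does not necessarily store bindings this way.

```python
def find_bound_schema(entity_id, parents, bindings):
    """Walk up the hierarchy and return the first bound schema $id, or None.

    parents:  entity id -> parent id (None for the root)
    bindings: entity id -> $id of the schema bound directly to that entity
    """
    current = entity_id
    while current is not None:
        if current in bindings:
            return bindings[current]
        current = parents.get(current)
    return None


# Hypothetical hierarchy: project > folderA > file1, and project > file2.
parents = {"project": None, "folderA": "project", "file1": "folderA", "file2": "project"}
bindings = {"project": "my.organization-pets.PetPhoto"}

before = find_bound_schema("file1", parents, bindings)

# Binding a schema to the folder overrides the project-level binding below it.
bindings["folderA"] = "my.organization-pets.FolderSchema"  # hypothetical $id
after = find_bound_schema("file1", parents, bindings)
print(before, after, find_bound_schema("file2", parents, bindings))
```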

Only a single schema can be bound to an Entity at a time. If you have more than one schema to bind to an Entity, you will need to create and bind a single composition schema using keywords such as ‘anyOf’, ‘allOf’ or ‘oneOf’ that define how the schemas should be used for validation. The PetPhoto.json from above is an example of a composition schema that could be bound to an Entity. The PetPhoto.json schema states that a valid Entity must match either Cat or Dog (from the ‘oneOf’ designation). In the PetPhoto.json example, both Cat and Dog transitively extend FileEntity (via Pet) and would therefore apply to FileEntities. To declare a schema for Folders, we would create a schema that extends Folder, and add that schema to the ‘oneOf’ definition.

There are three APIs for binding schemas to Entities:

Phase Two

The goal of phase two is to support the validation of Entities based on bound JSON schemas.

Basic validation involves two parts: the JSON schema that defines how an object should be validated, and the actual JSON of the object to be validated. A JSON schema validation library will take both parts as input parameters and will produce validation information. When the JSON object is found to be invalid, additional information will typically be provided to help users understand why it is invalid.

Continuing with the Pets example above, let's assume that the PetPhoto.json schema has been bound to a sample project. Since there are no other bound schemas in the sample project, the PetPhoto schema will apply to all files and folders in the project.

For example, a photo of a cat uploaded to this example project might have the following metadata represented as a JSON object:

 

Charity.json

Validation Schema

Our goal is to determine if the file metadata in Charity.json is valid according to the PetPhoto.json schema, which states that it must match either the Cat or Dog schema. The validation code must consider each schema and all of its direct and indirect dependencies. Before Synapse can validate an object against a schema, it must first compile a ‘validation’ schema. The process of compiling a validation schema involves de-referencing all references ($ref) such that each reference resolves locally. Specifically, a copy of each referenced schema is added to the ‘definitions’ section and all $refs are changed to point to the appropriate local copy. The resulting ‘validation’ schema is completely self-contained, making it easy for any 3rd party validation library to consume. You can fetch the compiled ‘validation’ schema for any JSON schema registered with Synapse using the POST /schema/type/validation/async/start asynchronous API. The following is an example of the compiled ‘validation’ schema for the PetPhoto schema.
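The de-referencing step can be sketched as a recursive rewrite. The Python below is a minimal illustration, not the Synapse implementation: compile_validation_schema and the in-memory registry are hypothetical, the key used under ‘definitions’ is assumed to be the referenced $id, and the schema bodies are trimmed (the PetType enumeration is assumed to hold only ‘cat’ and ‘dog’).

```python
import copy


def compile_validation_schema(root_id, registry):
    """Inline every transitively referenced schema under 'definitions'."""
    definitions = {}

    def localize(node):
        # Recursively rewrite each external $ref to point at a local copy.
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and not ref.startswith("#"):
                if ref not in definitions:
                    definitions[ref] = None  # placeholder guards against cycles
                    definitions[ref] = localize(copy.deepcopy(registry[ref]))
                node["$ref"] = "#/definitions/" + ref
            for key, value in node.items():
                if key != "$ref":
                    localize(value)
        elif isinstance(node, list):
            for item in node:
                localize(item)
        return node

    compiled = localize(copy.deepcopy(registry[root_id]))
    if definitions:
        compiled["definitions"] = definitions
    return compiled


# Trimmed, partly hypothetical schema bodies keyed by $id.
registry = {
    "my.organization-pets.Pet-1.0.3": {
        "$id": "my.organization-pets.Pet-1.0.3",
        "properties": {"petType": {"$ref": "my.organization-pets.PetType-1.0.1"}},
    },
    "my.organization-pets.PetType-1.0.1": {
        "$id": "my.organization-pets.PetType-1.0.1",
        "enum": ["cat", "dog"],
    },
}

compiled = compile_validation_schema("my.organization-pets.Pet-1.0.3", registry)
print(compiled["properties"]["petType"]["$ref"])
```

Working on deep copies keeps the registered schemas untouched, so the same registry can be compiled repeatedly.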

 

PetPhoto.json-Validation Schema

When we provide both the Charity.json and above validation schema to a JSON schema validation library we find that Charity.json is valid.

If we change Charity.json such that “petType” is set to “dog”, and run the same validation check, we find that it is invalid, because the breed “American Shorthair” is only valid for “petType”:”cat”. If Charity were a dog, we would need to select a valid dog breed.

Validation Schema APIs

| Response | URL | Request | Description |
|---|---|---|---|
| JSONObject | GET /entity/<id>/json | | Get the JSON object representing the given entity. For example, see Charity.json from above. |
| JSONObject | POST /entity/<id>/json | JSONObject | Update an Entity providing a JSONObject similar to the Charity.json from above. |
| JobToken | POST /schema/type/validation/async/start | $id of the JSON schema | Start a job to create the self-contained validation schema for a given JSON Schema. For example, the PetPhoto.json-Validation Schema was generated for PetPhoto.json. |
| JsonSchema | GET /schema/type/validation/async/get/{asyncToken} | JobToken | Get the results of the asynchronous job to create a validation schema. |

 

Automatic Eventually Consistent Entity Validation

For any Entity with a bound JSON Schema, Synapse will automatically validate the Entity against the bound schema and record the results. For example, if a new JSON schema is bound to a Project with no other schema binding, Synapse will automatically start an asynchronous process to validate every Entity within the Project against the bound schema. In addition, any Entity added to or updated in the Project will also be automatically validated.

The validation results will be eventually consistent after any schema or Entity change. For example, for a Project that has a schema bound only at the project level, it will take time to re-validate all Entities in the Project when there is a schema change. The validation results of each Entity in the Project will therefore be stale until the re-validation process completes.

Validation Results APIs

| Response | URL | Request | Description |
|---|---|---|---|
| ValidationResults | GET /entity/<id>/schema/validation | | Get the ValidationResults for the provided Entity. |
| ValidationSummaryStatitics | GET /entity/<id>/schema/validation/statistics | | Get the ValidationSummaryStatitics for an Entity container such as a Project or Folder. This method provides statistics about the validation status of the container's children. |
| PaginatedListOfIds | GET /entity/<id>/schema/invalid/children | | Get a single page of Entity IDs that are invalid according to their schema for the given Entity container. |

 

ValidationResults

| Type | Name | Description |
|---|---|---|
| String | objectId | The identifier of the object that was validated. |
| ObjectType | objectType | The type of object that was validated. |
| String | etag | The etag of the object at the time validation was performed. If the etag does not match the current etag, then the validation is out-of-date. |
| Date | validatedOn | The date-time when this object was validated. |
| Boolean | isValid | Is this object valid according to its schema? |
| String | validationErrorMessage | If the object is not valid according to the schema, a simple one-line error message will be provided. |
| List<String> | validationErrorMessageList | If the object is not valid according to the schema, a flat list of error messages will be provided, with one error message per sub-schema. |
| ValidationException | validationException | If the object is not valid according to the schema, a recursive ValidationException will be provided that describes all violations in the sub-schema tree. |
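Because validation is eventually consistent, a client may want to check the etag before trusting isValid. The following Python sketch shows one way to do that; current_validity is a hypothetical helper, and the etag and ID values are made up for the example.

```python
def current_validity(validation_results, current_etag):
    """Return isValid, or None when the results are out-of-date."""
    if validation_results["etag"] != current_etag:
        return None  # results were computed for an older version of the object
    return validation_results["isValid"]


# Made-up ValidationResults payload limited to the fields used here.
results = {"objectId": "syn123", "etag": "etag-1", "isValid": True}

print(current_validity(results, "etag-1"))
print(current_validity(results, "etag-2"))
```

Returning None for stale results keeps "unknown" distinct from "invalid", which matters while a re-validation is still in progress.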

ValidationException

| Type | Name | Description |
|---|---|---|
| String | keyword | The JSON schema keyword which was violated. |
| String | pointerToViolation | A JSON Pointer denoting the path from the input document root to its fragment which caused the validation failure. |
| String | message | The description of the validation failure. |
| String | schemaLocation | A JSON Pointer denoting the path from the schema JSON root to the violated keyword. |
| ValidationException | causingExceptions | An array of sub-exceptions. |
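Since a ValidationException is recursive, a client that wants every violation has to walk the causingExceptions tree. The Python sketch below does a depth-first walk; flatten_messages is a hypothetical helper, and the sample exception is invented to show the shape of the data rather than actual service output.

```python
def flatten_messages(exception):
    """Depth-first list of (pointerToViolation, message) pairs."""
    pairs = [(exception.get("pointerToViolation"), exception.get("message"))]
    for cause in exception.get("causingExceptions", []):
        pairs.extend(flatten_messages(cause))
    return pairs


# Invented sample data, limited to the fields used by the helper.
sample = {
    "keyword": "oneOf",
    "pointerToViolation": "#",
    "message": "0 subschemas matched instead of one",
    "causingExceptions": [
        {
            "keyword": "const",
            "pointerToViolation": "#/petType",
            "message": "value does not match the expected constant",
        },
        {
            "keyword": "enum",
            "pointerToViolation": "#/breed",
            "message": "value is not one of the allowed values",
        },
    ],
}

for pointer, message in flatten_messages(sample):
    print(pointer, "-", message)
```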

With the example from above, where we changed Charity.json such that ‘petType’ equals ‘dog’, the following ValidationResults will be produced:

 

Example ValidationResults

 

ValidationSummaryStatitics

| Type | Name | Description |
|---|---|---|
| String | containerId | The syn ID of the container. |
| Date | updatedOn | The date-time when the statistics were last updated. |
| int | totalNumberOfChildren | The total number of children Entities contained in this container. |
| int | numberOfValidChildren | The number of children that are valid according to their schema. |
| int | numberOfInvalidChildren | The number of children in this container that are invalid according to their schema. |

Automatic Validation Conclusion

We finished the implementation of the automatic validation of Entities in August of 2020. The following set of APIs are available for fetching the automatic validation results and statistics for Entities:

Phase Three

Use Cases

  • Project designers will likely put a lot of details into their JSON schemas. Many of the same details will be required when defining the schemas of views in their project. Therefore, the project designer will need a way to transfer all of the relevant details from their JSON schemas to their view schemas. This includes both the creation of new views and updating relevant view schemas with JSON schema changes. This needs to work for both versioned and non-versioned schemas.

  • If a project designer creates a view of a specific type of object, such as Cats, then they expect all other types of objects, such as Dogs, to be excluded from the view. This assumes that the object type is defined by a JSON schema.

  • Project Designers might choose to use if/then/else conditions in their JSON schemas instead of defining and extending classes/types. For such cases, each branch of the conditional logic might have its own set of unique columns. See: ConditionalAlternative.json above. Just like the previous use cases, project designers need to create a view that only includes rows for a particular branch, such as all cats, with dogs excluded.

  • It is currently possible to “break” a view by adding/updating an annotation on an object in a view that does not conform to the view’s schema. For example, if a view schema has a column named “foo” of type string, with a max length of 50 characters, adding an annotation to an object in the view with a value for “foo” that is 51 characters will currently break the view. It would be nice if the JSON schemas somehow prevented this type of breakage.

  • A project designer wants to use a view to find all objects that do not conform to their bound JSON schema, both to identify and repair the issues.

  • Data consumers can potentially find JSON schemas associated with Files either in views or in any other location where a file might be viewed (such as the file’s Entity page). The data consumer would like to discover similar views or files that are also associated with the same JSON schema.

  • A consumer of view data would like more information about a column of a view, such as its description and its origin. This information should help them better understand what they are looking at. For example, a view with a “Cat Breed” column should provide the consumer with information about the possible cat breeds, the description of the column, and maybe a link to the JSON schema where it was defined. The consumer might also want to discover other views that also use “Cat Breed”.

  • It is possible to create a snapshot of any view, including views created with JSON Schemas. If a JSON schema without a semantic version is used to create a view, then a snapshot of that view should be immutable even if the JSON schema changes in the future.

  • When a base JSON schema, such as Pet.json, is used to create a view, which columns should be included? Should only the columns defined in the base JSON schema be used, or should all columns from everything that extends the base class be included?

Goal

The goal for phase three is to provide services for defining Synapse Views using JSON schemas. We also want to make some of the schema information available for both display and filtering in Views. This could include information about the schema bound to the Entity, the type of the file, and the validation state of the Entity. For example, is the Entity valid according to its bound schema? Users will likely also want to filter by these additional fields.

As a stretch goal, we want to leverage the schema relationships in faceted view navigation. For example, it would be useful to filter all Entities in a project that extend our Pet.json schema, to discover which new pet types have been added. This could also be useful for discovering similar data in other projects. This might require new UI for navigating the tree-like relationships that can be built with JSON schemas.

View Building Blocks

The basic building blocks of any Synapse View are its scope, the type of Entities to include, and its schema. These building blocks determine which columns and rows will be included in the view.

  • Scope - The scope of a View is the first part of defining what rows should be considered for a view. Specifically, the scope of a view is a list of Entity containers (Projects and Folders). The View will only include Entities that are either direct children or indirect children of the scope containers. Note: There is a limit of 20K containers in a view’s scope.

  • Entity Type - The types of Entities to be included in a View is the second part of defining what rows should be included in the view. For example, a File View includes only Files. Projects and Folders are excluded from File Views.

  • Schema - The view schema is a simple flat list of name-type pairs that define the columns of a view. Specifically, a View schema is a list of ColumnModels. The schema defines which attributes/annotations of an Entity should be included in the view. The current view schema is much simpler than a JSON schema.
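The way scope and entity type together select the rows of a view can be sketched as follows. This Python snippet is only an illustration: rows_in_view, the entity dictionaries, and the IDs are hypothetical, and it ignores practical concerns such as the container limit or permissions.

```python
def rows_in_view(entities, scope_ids, entity_type):
    """Return ids of entities of one type that sit under any scope container."""
    by_id = {entity["id"]: entity for entity in entities}

    def in_scope(entity):
        # Direct or indirect child: follow parent links up to a scope container.
        parent = entity.get("parentId")
        while parent is not None:
            if parent in scope_ids:
                return True
            parent = by_id.get(parent, {}).get("parentId")
        return False

    return [e["id"] for e in entities if e["type"] == entity_type and in_scope(e)]


# Hypothetical entities: one project with a nested folder, one unrelated project.
entities = [
    {"id": "syn1", "type": "project", "parentId": None},
    {"id": "syn2", "type": "folder", "parentId": "syn1"},
    {"id": "syn3", "type": "file", "parentId": "syn2"},
    {"id": "syn4", "type": "file", "parentId": "syn1"},
    {"id": "syn5", "type": "project", "parentId": None},
    {"id": "syn6", "type": "file", "parentId": "syn5"},
]

file_view_rows = rows_in_view(entities, scope_ids={"syn1"}, entity_type="file")
print(file_view_rows)
```

The view schema then decides which columns are shown for each selected row; it plays no part in choosing the rows in this simplified model.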

How do we want to define the columns and rows of a view driven by a JSON schema? It seems fairly obvious that we would want to use the JSON schema to drive a View’s columns, but how should we use the schema to define what rows should be included in the view? There are a few details we must consider before we can answer this question. In the next section we will set up an example that builds on all of the main JSON schema features we added in the first two phases.

All Pets Example

The following is a basic example that we will use to throughout the rest of the document. See the following screen shot:

The above screenshot shows a folder named “All Pets” that contains four files: Alpha, Bravo, Charlie, and Delta. Each file represents a pet photo. We have bound the PetPhoto.json schema (defined above) to the “All Pets” folder. Recall that the PetPhoto.json schema states that each entity must be a Synapse FileEntity and that the metadata annotations must be valid against either the Cat.json or Dog.json schema.

Example Views of “All Pets”

The following table represents a possible view of the “All Pets” example folder using the Pet.json schema as a driver:

| name | id | parentId | petName | birthday | petType |
|---|---|---|---|---|---|
| Alpha.png | syn22691957 | syn22691893 | Alpha | 11/16/2013 | cat |
| Bravo.png | syn22691978 | syn22691893 | Bravo | 12/03/2014 | dog |
| Charlie.png | syn22691991 | syn22691893 | Charlie | 05/13/2018 | cat |
| Delta.png | syn22692010 | syn22691893 | Delta | 05/18/2016 | dog |

View of “All Pets” using the Pet.json schema as a driver.

Some of the columns from FileEntity are excluded in the above example for brevity including: etag, createdOn, createdBy, modifiedOn, modifiedBy, versionLabel, versionComment, versionNumber, fileHandleId, and concreteType. In addition to the columns from FileEntity, the three columns defined in Pet.json are included: petName, birthday, and petType. It is important to note that the view does not include any columns that are specific to the Cat.json or Dog.json schema, both of which extend the Pet.json schema.

The following are two example views of the “All Pets” folder, one using Cat.json as a driver and the other using Dog.json as a driver:

| name | id | parentId | petName | birthday | petType | Cat Breed |
|---|---|---|---|---|---|---|
| Alpha.png | syn22691957 | syn22691893 | Alpha | 11/16/2013 | cat | Maine Coon |
| Charlie.png | syn22691991 | syn22691893 | Charlie | 05/13/2018 | cat | American Shorthair |

View of “All Pets” using the Cat.json schema as a driver

 

| name | id | parentId | petName | birthday | petType | Dog Breed |
|---|---|---|---|---|---|---|
| Bravo.png | syn22691978 | syn22691893 | Bravo | 12/03/2014 | dog | Golden Retriever |
| Delta.png | syn22692010 | syn22691893 | Delta | 05/18/2016 | dog | Beagle |

View of “All Pets” using the Dog.json schema as a driver

The Cat.json driven view includes all columns defined in the org.sagebionetworks-repo.model.FileEntity, my.organization-pets.Pet-1.0.3, and my.organization-pets.cat.Cat schemas. It includes rows only for files that represent cats (Alpha and Charlie). The Dog.json driven view includes all columns defined in the org.sagebionetworks-repo.model.FileEntity, my.organization-pets.Pet-1.0.3, and my.organization-pets.cat.Dog schemas. It includes rows only for files that represent dogs (Bravo and Delta).

The final type of view we might want to create would be driven by the PetPhoto.json schema:

| name | id | parentId | petName | birthday | petType | Cat Breed | Dog Breed |
|---|---|---|---|---|---|---|---|
| Alpha.png | syn22691957 | syn22691893 | Alpha | 11/16/2013 | cat | Maine Coon |  |
| Bravo.png | syn22691978 | syn22691893 | Bravo | 12/03/2014 | dog |  | Golden Retriever |
| Charlie.png | syn22691991 | syn22691893 | Charlie | 05/13/2018 | cat | American Shorthair |  |
| Delta.png | syn22692010 | syn22691893 | Delta | 05/18/2016 | dog |  | Beagle |

View of “All Pets” driven by the PetPhoto.json schema

The PetPhoto.json driven view is something of a “sparse matrix”, where some of the columns have no meaning for some of the rows. Do we want to support “sparse matrix” type views?
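To make the difference between the type-specific views and the sparse view concrete, the following sketch builds both from the four example files. It is not a proposed implementation: the annotation keys catBreed and dogBreed are hypothetical stand-ins for the “Cat Breed” and “Dog Breed” columns, and filtering rows on petType is only one possible way to select rows.

```python
# Annotation values taken from the example tables; key names are assumptions.
FILES = [
    {"name": "Alpha.png", "petType": "cat", "catBreed": "Maine Coon"},
    {"name": "Bravo.png", "petType": "dog", "dogBreed": "Golden Retriever"},
    {"name": "Charlie.png", "petType": "cat", "catBreed": "American Shorthair"},
    {"name": "Delta.png", "petType": "dog", "dogBreed": "Beagle"},
]


def build_view(files, columns, pet_type=None):
    """Return one row per file; a missing annotation becomes None (an empty cell)."""
    rows = []
    for f in files:
        if pet_type is not None and f.get("petType") != pet_type:
            continue  # row excluded from a type-specific view
        rows.append([f.get(c) for c in columns])
    return rows


# Cat.json style view: cat rows only, cat columns only.
cat_view = build_view(FILES, ["name", "petType", "catBreed"], pet_type="cat")

# PetPhoto.json style view: every row, the union of all columns, hence sparse.
sparse_view = build_view(FILES, ["name", "petType", "catBreed", "dogBreed"])
```

In sparse_view every row has exactly one empty breed cell, which is the “sparse matrix” shape shown in the table above.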

Here is an example of additional columns we might want to support for views driven by JSON schemas:

| petName | birthday | petType | Validated Against Schema | isValid | Validation Error Message | Matches Schema |
|---|---|---|---|---|---|---|
| Alpha | 11/16/2013 | cat | my.organization-pets.PetPhoto | true |  | my.organization-pets.cat.Cat |
| Bravo | 12/03/2014 | guppy | my.organization-pets.PetPhoto | false | “guppy is not a valid petType, must be one of: [cat, dog, fish]” | ???? |
| Charlie | 05/13/2018 | cat | my.organization-pets.PetPhoto | true |  | my.organization-pets.cat.Cat |
| Delta | 05/18/2016 | dog | my.organization-pets.PetPhoto | true |  | my.organization-pets.cat.Dog |

Additional schema related columns that can be useful for navigation

Missing a Step?

In the “All Pets” folder example above, what can we conclude from the automatic JSON schema validation that Synapse will perform for that folder? If all four files are marked as valid, then we know that each file represents either a Cat or a Dog. The validation process does not tell us which schema was matched, only that it matched one or the other. This means we cannot tell if Alpha represents a Cat or a Dog, only that it matches either Cat or Dog. Even when the validation process marks a file as invalid, we only know that it matches neither Cat nor Dog. How do we determine the value of the “Matches Schema” column for each Entity in the view? Note: We would also need this value to correctly build the Cat.json driven view and the Dog.json driven view, as each view needs to exclude all rows that are not of that type.

Derive the Type from petType constant?

Can we leverage the constant definitions from the Cat.json and Dog.json schemas to determine the type? For example, Cat.json includes:

And Dog.json includes:

A JSON schema might include many constants such as petType. How do we know that petType is special? Maybe a type is defined by combining all constants in the schema? How do we determine the type if the user has not yet provided values for the constants?
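The “combine all constants” idea can be sketched as follows. The schema fragments are hypothetical: they assume each schema declares petType with a JSON Schema const keyword, and the helper names constants_of and matching_schemas are made up for illustration.

```python
# Hypothetical fragments; the real Cat.json and Dog.json may differ.
CAT_SCHEMA = {
    "$id": "my.organization-pets.cat.Cat",
    "properties": {"petType": {"const": "cat"}},
}
DOG_SCHEMA = {
    "$id": "my.organization-pets.cat.Dog",
    "properties": {"petType": {"const": "dog"}},
}


def constants_of(schema):
    """Collect every property that the schema pins to a constant value."""
    return {
        name: prop["const"]
        for name, prop in schema.get("properties", {}).items()
        if "const" in prop
    }


def matching_schemas(annotations, schemas):
    """Return the $id of each schema whose constants all equal the annotations."""
    matches = []
    for schema in schemas:
        consts = constants_of(schema)
        if consts and all(annotations.get(k) == v for k, v in consts.items()):
            matches.append(schema["$id"])
    return matches


alpha = matching_schemas({"petType": "cat"}, [CAT_SCHEMA, DOG_SCHEMA])
guppy = matching_schemas({"petType": "guppy"}, [CAT_SCHEMA, DOG_SCHEMA])
unset = matching_schemas({}, [CAT_SCHEMA, DOG_SCHEMA])
```

The last call illustrates the open question: an Entity with no value for the constant matches nothing, so this approach cannot assign a type until the user has annotated the file.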

Add a special schemaType attribute to an Entity?

Should we require users to explicitly provide a “schemaType” on each Entity within the scope of a bound schema? This would require an extra step on the user’s part, but it would remove all ambiguity.

Using a JSON schema to drive the columns of a View

How do we use a JSON schema to drive the columns of a View? The following is a list of possible requirements:

  • Each property of a JSON schema is a candidate for a column in a View. This includes both direct properties of a schema and all of its inherited properties. For example, ‘petName’, ‘birthday’, and ‘petType’ are all direct properties of the Pet.json schema, while ‘name’, ‘id’, ‘createdOn’, ‘createdBy’, ‘modifiedOn’, ‘modifiedBy’, ‘etag’, etcetera, are properties that are inherited from FileEntity.

  • It is likely that view designers will want to limit the properties included from a JSON schema to a sub-set. They might not want all of the properties from a JSON schema included in the view.

  • It is likely that view designers will want to control the order of the columns in their view. Note: According to the JSON Specification, value-key pairs, such as the properties of a JSON Schema, are “unordered”. While Synapse will attempt to maintain the provided order of the properties, we cannot expect that 3rd party libraries and clients will do the same. Therefore, we should consider the properties of a JSON schema to be unordered.

  • It is likely that view designers will want to include additional columns in their views that are not defined in the schema. The example view that includes schema validation information is an example of this.
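The first three requirements can be sketched together: gather direct and inherited properties, then let the designer pick a sub-set in an explicit order. The registry contents below are hypothetical (in particular, it is an assumption that inheritance is expressed as an allOf entry holding a $ref), and collect_properties and select_columns are illustrative names, not existing services.

```python
FILE_ENTITY = "org.sagebionetworks-repo.model.FileEntity"
PET = "my.organization-pets.Pet-1.0.3"

# Hypothetical, heavily trimmed schemas keyed by $id.
REGISTRY = {
    FILE_ENTITY: {"properties": {"name": {"type": "string"},
                                 "id": {"type": "string"}}},
    PET: {"allOf": [{"$ref": FILE_ENTITY}],
          "properties": {"petName": {"type": "string"},
                         "birthday": {"type": "string"},
                         "petType": {"type": "string"}}},
}


def collect_properties(schema_id, registry, seen=None):
    """Merge inherited properties (via allOf/$ref) with the schema's own."""
    seen = set() if seen is None else seen
    if schema_id in seen:  # guard against reference cycles
        return {}
    seen.add(schema_id)
    schema = registry[schema_id]
    props = {}
    for parent in schema.get("allOf", []):
        if "$ref" in parent:
            props.update(collect_properties(parent["$ref"], registry, seen))
    props.update(schema.get("properties", {}))  # direct properties win
    return props


def select_columns(props, order):
    """Keep only the requested properties, in the designer's explicit order."""
    return [(name, props[name]) for name in order if name in props]


all_props = collect_properties(PET, REGISTRY)
chosen = select_columns(all_props, ["petName", "name", "petType"])
```

Because the designer supplies the order as a list, the result does not depend on the (unordered) property order of the schema itself.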

Proposed Services

Service to Generate ColumnModels from a JSON Schema

We propose adding a new REST API service that, given the $id of a JSON schema, will generate a list of ColumnModels that contain all of the details captured in the JSON schema. Project designers could then use this service to transfer the details of their JSON schemas to their view schemas. The clients could help the project designer apply these new ColumnModels to their views, similar to how the “Add All Annotations” button works in the existing view editor user interface.

It is important to note that this does not change the fundamental nature of views in any way. The new service is simply a tool to help transfer details of a JSON schema into a view schema.

If the project designer changes their JSON schema, the changes will not automatically propagate to their views. Instead, if a project designer wishes to apply JSON schema changes to their existing views, they would need to manually re-run the service to generate new ColumnModels. In theory, the view editor user interface would help the user manage the ColumnModel delta to apply to their view. This means the project designer will be in full control of what changes propagate to their views.

In addition to the service to generate ColumnModels from JSON schemas, we propose adding a new field to ColumnModel to allow the column to be linked to a JSON schema. Specifically, we propose adding a string field to ColumnModel called “derivedFrom$id”. The value of this field would be the $id of the JSON schema from which the column was derived. The service to generate ColumnModels from JSON schemas would automatically populate this new field with the $id passed as the parameter of the service.
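The behaviour described above can be sketched with plain dictionaries. This is a rough illustration, not the service implementation: the JSON-type to column-type mapping, the trimmed Pet schema, and the function names are all assumptions. The delta helper shows one way a client could use derivedFrom$id to limit a comparison to columns that came from the same schema.

```python
# Assumed mapping from JSON schema types to view column types.
TYPE_MAP = {"string": "STRING", "integer": "INTEGER",
            "number": "DOUBLE", "boolean": "BOOLEAN"}

PET_ID = "my.organization-pets.Pet-1.0.3"
PET_SCHEMA = {  # hypothetical, trimmed version of Pet.json
    "$id": PET_ID,
    "properties": {"petName": {"type": "string"},
                   "birthday": {"type": "string"},
                   "petType": {"type": "string"}},
}


def column_models_from_schema(schema):
    """One column per property, each linked back to the schema's $id."""
    models = [
        {"name": name,
         "columnType": TYPE_MAP.get(prop.get("type"), "STRING"),
         "derivedFrom$id": schema["$id"]}
        for name, prop in sorted(schema.get("properties", {}).items())
    ]
    return {"columnModels": models, "$id": schema["$id"]}


def column_delta(view_columns, generated, schema_id):
    """Compare only the view columns that were derived from this schema."""
    linked = {c["name"] for c in view_columns
              if c.get("derivedFrom$id") == schema_id}
    fresh = {c["name"] for c in generated}
    return {"add": sorted(fresh - linked), "remove": sorted(linked - fresh)}


response = column_models_from_schema(PET_SCHEMA)
existing = [{"name": "petName", "derivedFrom$id": PET_ID},
            {"name": "oldProp", "derivedFrom$id": PET_ID},
            {"name": "isValid"}]  # extra column not defined by the schema
delta = column_delta(existing, response["columnModels"], PET_ID)
```

In this sketch the isValid column is never proposed for removal, because it carries no derivedFrom$id and therefore was not derived from the schema. The designer would still decide which parts of the delta to apply.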

REST API Additions

| Response | URL | Request | Description |
|---|---|---|---|
| JobToken | POST /schema/type/columnmodel/async/start | $id of the JSON schema | Start a job to create a list of ColumnModel objects that represent the properties of the given JSON schema. One ColumnModel will be returned for each property of the JSON schema. |
| ColumnModelResponse | GET /schema/type/columnmodel/async/get/{asyncToken} | JobToken | Get the results of the asynchronous job to create a list of ColumnModels from a JSON schema. |
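A client would use the two calls as a start-then-poll pair. The sketch below shows that flow against an injected transport function so it can run without a server. Several details are assumptions and not part of this proposal: the request body shape {"$id": ...}, the token field name on the JobToken, and the convention that a 200 status means the job has finished while any other status means “not ready yet”.

```python
import time

START_PATH = "/schema/type/columnmodel/async/start"
GET_PATH = "/schema/type/columnmodel/async/get/{token}"


def generate_column_models(send, schema_id, poll_seconds=1.0, max_polls=30):
    """send(method, path, body) -> (status_code, json_body); supplied by the caller."""
    # Assumed request body: the proposal only says the request carries the $id.
    _, job = send("POST", START_PATH, {"$id": schema_id})
    token = job["token"]  # assumed field name on the JobToken
    for _ in range(max_polls):
        status, body = send("GET", GET_PATH.format(token=token), None)
        if status == 200:  # assumed: 200 means the job has finished
            return body
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {token} did not finish after {max_polls} polls")


def make_fake_send(ready_after=2):
    """In-memory stand-in for an HTTP client, used only to exercise the flow."""
    state = {"gets": 0}

    def send(method, path, body):
        if method == "POST":
            return 201, {"token": "job-1"}
        state["gets"] += 1
        if state["gets"] < ready_after:
            return 202, {}  # still processing
        return 200, {"columnModels": [], "$id": "my.organization-pets.Pet-1.0.3"}

    return send


result = generate_column_models(
    make_fake_send(), "my.organization-pets.Pet-1.0.3", poll_seconds=0)
```

Passing the transport in as a function keeps the polling logic independent of any particular HTTP library or Synapse client.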

Object Models

ColumnModel

| Name | Type | Description |
|---|---|---|
|  |  | All of the existing fields of ColumnModel |
| derivedFrom$id | String | The $id of the JSON schema that was used to define this ColumnModel. Note: The name of the ColumnModel will match the name of the property from the JSON schema. |

ColumnModelResponse

| Name | Type | Description |
|---|---|---|
| columnModels | List<ColumnModel> | The resulting list of ColumnModels |
| $id | String | The $id of the JSON schema used to define the list of ColumnModels |