Validation JSON Schema Index

We want to maintain an index of validation schemas (https://sagebionetworks.jira.com/browse/PLFM-6870). The intent of this index is to make validation schemas available for quick use. One example is https://sagebionetworks.jira.com/browse/PLFM-6811, where the validation schema guides what the annotations look like on GET /entity/{id}/json. Currently, validation schemas are built asynchronously and a build can take an unbounded amount of time, so building them inside a synchronous API such as the one above is a problem.

We propose two ways to maintain a validation JSON schema index.

Option A

We can update the index lazily. When a JSON schema is created or updated, we do not update the index. Instead, we update it only when the index is accessed and the entry for the target schema is absent or outdated. In that case, we build the validation schema as part of the current call and put it in the index. Any schemas that depend on this schema are written to the index when they are needed.

Example: Suppose we create two schemas, where one (the child) is a dependency of the other (the parent). Creating the schemas does not add their validation schemas to the index. Suppose we then ask for both validation schemas. Because neither is present in the index, we build both and add them to the index, possibly inside a synchronous API call. If we then change the child schema, the index is not updated until the child's validation schema is asked for, at which point we update the child's entry in the index (adding the new version). The parent's entry, however, is not updated until the parent's validation schema is asked for.
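The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Synapse implementation: the Schema and LazyValidationIndex names are hypothetical, the "validation schema" is a placeholder dictionary, and staleness is detected by comparing a fingerprint of the versions of a schema and everything it references, which is one possible way to tell that an entry is outdated.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    # Hypothetical model: an id, a version, and the ids of referenced schemas.
    schema_id: str
    version: int = 1
    refs: list = field(default_factory=list)


class LazyValidationIndex:
    def __init__(self, schemas):
        self.schemas = schemas  # schema id -> Schema (the source of truth)
        self.entries = {}       # schema id -> (fingerprint, validation schema)
        self.builds = 0         # number of expensive builds, for illustration

    def _fingerprint(self, schema_id, seen=None):
        # Versions of the schema and of everything it references, transitively.
        seen = set() if seen is None else seen
        if schema_id in seen:
            return ()
        seen.add(schema_id)
        schema = self.schemas[schema_id]
        parts = [(schema_id, schema.version)]
        for ref in schema.refs:
            parts.extend(self._fingerprint(ref, seen))
        return tuple(sorted(parts))

    def get(self, schema_id):
        current = self._fingerprint(schema_id)
        entry = self.entries.get(schema_id)
        if entry is None or entry[0] != current:
            # Absent or outdated: build now, on the caller's time.
            self.builds += 1
            validation_schema = {"$id": schema_id, "resolved": dict(current)}
            entry = (current, validation_schema)
            self.entries[schema_id] = entry
        return entry[1]


schemas = {"child": Schema("child"), "parent": Schema("parent", refs=["child"])}
index = LazyValidationIndex(schemas)
index.get("parent")
index.get("child")            # two builds so far
schemas["child"].version = 2  # the update itself does not touch the index
index.get("child")            # third build: only the child entry is refreshed
index.get("parent")           # fourth build: the parent is refreshed on access
```

Every build in this sketch happens inside get, which is exactly the cost that Option A pushes onto the caller of a possibly synchronous API.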

Option B

In this option we have two workers. The first worker handles the update of a single schema in the index: it builds the validation schema and indexes it.

The second worker broadcasts to the first worker on behalf of all dependent schemas. We define a dependent schema as a schema that has dependencies, that is, one that references other schemas. Given a schema as input, this worker sends a message for every schema that references it, so that all of those schemas eventually reflect the change to the referenced schema in the index.

The idea is that when a JSON schema is created or updated (an asynchronous job), we do two things:

1. Build the validation JSON schema of the newly created/updated schema and add it to the index.
2. Send a notification to the second worker to broadcast the change. The second worker finds all dependent schemas and sends a notification message for each one to the first worker.

As a result, an access to the index always has a validation schema to return. When a JSON schema is created or updated, its own entry in the index is current as soon as the job completes. The entries of schemas that depend on it, however, are not updated immediately; this is the trade-off. If someone creates a new schema and then accesses the index, the validation schema will be there. If someone updates a schema and then accesses the index for a parent schema that depends on it, the index may return an outdated validation schema until the workers get around to updating it.
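The flow described above can be sketched as follows. This is a toy model under stated assumptions rather than the actual worker design: the SchemaIndexWorkers class and its method names are hypothetical, in-memory deques stand in for the message queues, the "validation schema" is just a dictionary of resolved versions, and the broadcast also follows indirect references, which the text above does not specify.

```python
from collections import deque


class SchemaIndexWorkers:
    """Toy model of Option B; all names here are hypothetical."""

    def __init__(self, refs):
        self.refs = refs                          # schema id -> ids it references
        self.versions = {sid: 0 for sid in refs}  # 0 means "not created yet"
        self.index = {}                           # schema id -> validation schema
        self.index_queue = deque()                # messages for the first worker
        self.broadcast_queue = deque()            # messages for the second worker

    def build(self, schema_id, seen=None):
        # Stand-in for the expensive build: record the version of the schema
        # and of every schema it references, directly or indirectly.
        seen = {} if seen is None else seen
        if schema_id not in seen:
            seen[schema_id] = self.versions[schema_id]
            for ref in self.refs[schema_id]:
                self.build(ref, seen)
        return seen

    def index_one(self, schema_id):
        # First worker: build and index a single validation schema.
        self.index[schema_id] = self.build(schema_id)

    def broadcast(self, changed_id):
        # Second worker: one message to the first worker per dependent schema.
        pending, found = deque([changed_id]), set()
        while pending:
            current = pending.popleft()
            for sid, refs in self.refs.items():
                if current in refs and sid not in found:
                    found.add(sid)
                    pending.append(sid)
                    self.index_queue.append(sid)

    def create_or_update(self, schema_id):
        # Runs inside the existing asynchronous create/update job.
        self.versions[schema_id] += 1
        self.index_one(schema_id)               # step 1: index this schema now
        self.broadcast_queue.append(schema_id)  # step 2: notify the second worker

    def run_workers(self):
        # Drain both queues; real workers would run independently.
        while self.broadcast_queue:
            self.broadcast(self.broadcast_queue.popleft())
        while self.index_queue:
            self.index_one(self.index_queue.popleft())


workers = SchemaIndexWorkers({"child": [], "parent": ["child"]})
workers.create_or_update("child")
workers.create_or_update("parent")
workers.run_workers()
workers.create_or_update("child")         # the child becomes version 2
stale = workers.index["parent"]["child"]  # parent entry still built from version 1
workers.run_workers()
fresh = workers.index["parent"]["child"]  # parent entry rebuilt from version 2
```

Reads from workers.index never trigger a build here; the cost is the window between create_or_update and run_workers during which the parent entry is outdated.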

The asynchronous job that creates or updates the JSON schema is also a natural place to create the validation schema, because the job already builds one as part of its routine to ensure that the schema can be created.

Pros/Cons

	Pros	Cons
Option A	Simple; no additional workers.	We still have to build a validation schema if it is absent from or outdated in the index, which may take longer than 30 seconds.
Option B	An access to the index always has a validation schema to return, with no wait. After a create or update, the index is always consistent for the created or updated schema itself.	The index may not be up to date immediately: it may return an old validation schema if a dependency was recently changed. More complicated; requires two new workers.

Conclusion

Option B is the better option because we want to avoid building validation schemas during a synchronous API call.