
Annotations Query Tables

SQL Tables for Querying

A SubmissionStatus may have an open-ended list of Annotations, of type String, Long, or Double. These are stored as serialized blobs in the JDOSUBMISSION_STATUS table. To support querying, these annotations are copied into the following tables:

SUBSTATUS_ANNOTATIONS_OWNER – relates Submission ID to Evaluation ID for Submissions whose Annotations are in the query tables. It is the target of foreign key constraints from each of the following tables:

SUBSTATUS_ANNOTATIONS_BLOB – contains each SubmissionStatus's Annotations as a blob

SUBSTATUS_STRINGANNOTATION – triple-store of <SubmissionID, Attribute, Value> for string annotations

SUBSTATUS_DOUBLEANNOTATION – triple-store of <SubmissionID, Attribute, Value> for floating point annotations

SUBSTATUS_LONGANNOTATION – triple-store of <SubmissionID, Attribute, Value> for long (e.g. integer, time-stamp) annotations
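The table layout above can be sketched in SQLite (via Python's sqlite3 module). The table names follow the document; the column names, types, and key constraints are assumptions for illustration, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SUBSTATUS_ANNOTATIONS_OWNER (
    SUBMISSION_ID INTEGER PRIMARY KEY,
    EVALUATION_ID INTEGER NOT NULL
);
-- One triple-store table per value type; SUBSTATUS_DOUBLEANNOTATION
-- follows the same shape, with VALUE typed as a float.
CREATE TABLE SUBSTATUS_STRINGANNOTATION (
    SUBMISSION_ID INTEGER NOT NULL
        REFERENCES SUBSTATUS_ANNOTATIONS_OWNER (SUBMISSION_ID),
    ATTRIBUTE TEXT NOT NULL,
    VALUE TEXT
);
CREATE TABLE SUBSTATUS_LONGANNOTATION (
    SUBMISSION_ID INTEGER NOT NULL
        REFERENCES SUBSTATUS_ANNOTATIONS_OWNER (SUBMISSION_ID),
    ATTRIBUTE TEXT NOT NULL,
    VALUE INTEGER
);
""")

# Index one Submission, then query it back by a string annotation.
conn.execute("INSERT INTO SUBSTATUS_ANNOTATIONS_OWNER VALUES (1, 100)")
conn.execute("INSERT INTO SUBSTATUS_STRINGANNOTATION VALUES (1, 'team', 'alpha')")
rows = conn.execute(
    "SELECT o.SUBMISSION_ID FROM SUBSTATUS_ANNOTATIONS_OWNER o "
    "JOIN SUBSTATUS_STRINGANNOTATION s ON s.SUBMISSION_ID = o.SUBMISSION_ID "
    "WHERE s.ATTRIBUTE = 'team' AND s.VALUE = 'alpha'"
).fetchall()
```

The triple-store shape is what lets open-ended annotation names be queried without schema changes: each new attribute is just another row.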

Asynchronous Update of SubmissionStatuses

To copy annotations to the query tables, the system leverages the asynchronous worker mechanism described here. As shown in the diagram below, when creating, updating, or deleting a SubmissionStatus object, the repository services lock the (Submission-specific) SubmissionStatus etag as well as the (Evaluation-specific) Evaluation-SubStatus etag for update, update the SubmissionStatus content, and queue a Change message. The Change message contains the Evaluation ID but not the Submission ID, i.e. it does not broadcast which SubmissionStatus has changed.
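The repository-side flow above can be sketched as follows. This is an in-memory stand-in, not the actual service code; the class and method names are assumptions, and the etag locking is elided:

```python
import uuid

class InMemoryRepo:
    """Minimal stand-in for the repository services (illustrative only)."""

    def __init__(self):
        self.statuses = {}    # submission_id -> SubmissionStatus content
        self.eval_etags = {}  # evaluation_id -> Evaluation-SubStatus etag
        self.queue = []       # outgoing Change messages

    def update_submission_status(self, submission_id, evaluation_id, status):
        # Real code first locks both the SubmissionStatus etag and the
        # Evaluation-SubStatus etag for update; elided in this sketch.
        self.statuses[submission_id] = status
        new_etag = str(uuid.uuid4())
        self.eval_etags[evaluation_id] = new_etag
        # The Change message names only the Evaluation (and its new etag);
        # it does not say which SubmissionStatus changed.
        self.queue.append({"evaluationId": evaluation_id, "etag": new_etag})
        return new_etag
```

Because the message is keyed by Evaluation, many SubmissionStatus changes under the same Evaluation collapse into work for a single worker pass.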

 

The Change message triggers the Annotations worker, which runs asynchronously with respect to the client's request. The worker checks that the etag in the message matches the one in the repository service, to avoid acting on "stale" Change messages. If the etags match, it runs a diff between the true state of the Evaluation's submission annotations (stored in the SubmissionStatus tables) and the state within the annotation query tables. Any updates to the query tables are made in a single transaction.
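The worker's etag check and diff step can be sketched as below. The helper names and message shape are assumptions; the point is the stale-message guard and the three-way diff applied in one transaction:

```python
def diff(truth, indexed):
    """Changes needed to bring `indexed` (the query tables) in line with
    `truth` (annotations read from the SubmissionStatus blobs).  Both map
    (submission_id, attribute) -> value."""
    inserts = {k: v for k, v in truth.items() if k not in indexed}
    updates = {k: v for k, v in truth.items()
               if k in indexed and indexed[k] != v}
    deletes = [k for k in indexed if k not in truth]
    return inserts, updates, deletes

def run_worker(message, current_etag, truth, indexed, apply_in_transaction):
    """Illustrative worker body; parameter names are assumptions."""
    if message["etag"] != current_etag:
        return False  # stale Change message: a later update already occurred
    apply_in_transaction(*diff(truth, indexed))
    return True
```

Diffing against the true state (rather than replaying individual changes) makes the worker idempotent: processing the same message twice, or skipping a stale one, still converges on the correct index.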

Batch SubmissionStatus Updates

Batch SubmissionStatus updates use the same semaphores as individual SubmissionStatus updates. Additionally, we use optimistic concurrency across batches: each batch contains a list of <n> SubmissionStatuses (with embedded annotations), flags for 'isFirstBatch' and 'isLastBatch', and a batch 'token', required on all but the first batch. The system locks the <n> SubmissionStatus etags as well as the Evaluation-SubStatus etag, commits the new SubmissionStatus data, and returns the new Evaluation-SubStatus etag as the 'nextUploadToken' in the client response. The client must include the returned token with the subsequent batch. If, between batches, another client modifies any SubmissionStatus under the same Evaluation, the batch token will no longer match the etag, and the client's request will be rejected with a PRECONDITION_FAILED (412) response. It is then the client's responsibility to restart the upload from the first batch. After the final batch, the Change message is generated, triggering the asynchronous update of the annotations query tables.
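The token handshake can be sketched as an in-memory service. This is illustrative only: the class and method names are assumptions, the per-SubmissionStatus etag locks are elided, and the Change message is reduced to a flag:

```python
import uuid

class PreconditionFailed(Exception):
    """Corresponds to an HTTP 412 response."""

class BatchUploadService:
    """In-memory sketch of the batch-upload optimistic-concurrency scheme."""

    def __init__(self):
        self.eval_etag = str(uuid.uuid4())  # Evaluation-SubStatus etag
        self.statuses = {}                  # submission_id -> annotations
        self.change_queued = False

    def upload_batch(self, statuses, is_first_batch, is_last_batch, token=None):
        # All but the first batch must carry the token returned previously.
        if not is_first_batch and token != self.eval_etag:
            raise PreconditionFailed("batch token does not match current etag")
        self.statuses.update(statuses)
        self.eval_etag = str(uuid.uuid4())  # committing bumps the etag
        if is_last_batch:
            self.change_queued = True       # Change message sent only at the end
        return self.eval_etag               # the 'nextUploadToken'
```

Any intervening write bumps the Evaluation-SubStatus etag, so the next batch's token no longer matches and the 412 falls out of a single comparison.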

 

Design Considerations and Trade-Offs

The design described above is conservative in handling concurrent access during a multi-batch upload: any single creation, update, or deletion invalidates the entire upload process, and it is up to the client to restart from the first batch. This should be acceptable if there is just one Evaluation scoring process and if it performs its upload quickly, reducing the possibility of collision with the creation of a new Submission. If the second assumption is too restrictive, we can introduce a separate semaphore for Creation versus Update or Delete.

This concurrency approach favors Evaluations in which coherence between all Submissions is required. (Batch upload was created to support rank-based scoring, in which the score of each Submission depends on the content of all other Submissions in the Evaluation.) For Evaluations in which Submissions are considered in isolation, scoring can be done one at a time, as Submissions come in, obviating the need for batch status upload.

The triggering of the asynchronous worker is also meant to ensure data coherence across Submissions, since (1) in a multi-batch upload no annotation update occurs until the last batch is successfully uploaded, (2) any create/update/delete activity occurring before the asynchronous worker runs will invalidate the Change message, and (3) the query table insertions occur in a single transaction.

Query results do not necessarily reflect the current state of the SubmissionStatuses: if a batch update has occurred but the Change message has not yet been processed, query results will reflect the most recent update to the query tables. This is acceptable if reports that are internally consistent, but slightly old, are considered valid. One alternative is, for each submitted query, to check for differences between the versions of the two copies of the data, and return an alternate response (e.g. a 202 status code) if any discrepancy is found. Under this alternative, query results are never out of date, but there are brief "outages" during which no query results can be retrieved.
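The alternative just described reduces to a version comparison at query time. A minimal sketch, with hypothetical function and parameter names:

```python
def handle_query(index_version, source_version, run_query):
    """Refuse to answer while the query tables lag the SubmissionStatus
    source of truth.  Returns an (HTTP status, body) pair."""
    if index_version != source_version:
        return 202, None  # index still catching up; client should retry later
    return 200, run_query()
```

The trade-off in the text is visible here: the 202 branch is exactly the "outage" window between a SubmissionStatus commit and the worker's index update.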

It is possible for the query tables to be updated while a client is in the midst of retrieving a series of pages of results. In this case the series of received pages, taken as a whole, may be invalid. One solution is to return with each query result the "version" of the query tables (e.g. the Evaluation-SubStatus etag described above). When a client sees the version change, it is its responsibility to restart retrieval from the first page.
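The client-side restart logic can be sketched as follows. The `fetch_page` contract is an assumption: it takes an offset and returns a (version, rows) pair, with an empty rows list marking the end of the result set:

```python
def fetch_all_pages(fetch_page):
    """Collect every page of a query result, restarting from the first page
    whenever the query-table "version" changes mid-retrieval."""
    while True:
        pages, version, offset = [], None, 0
        while True:
            v, rows = fetch_page(offset)
            if version is None:
                version = v
            elif v != version:
                break  # version changed mid-scan: discard pages and restart
            if not rows:
                return [row for page in pages for row in page]
            pages.append(rows)
            offset += len(rows)

# Usage: a fetcher whose underlying tables change once, mid-scan.
_responses = iter([
    ("v1", [1, 2]),  # first pass, page 1
    ("v2", [3, 4]),  # version changed -> client restarts
    ("v2", [1, 2]),  # second pass, page 1
    ("v2", [3, 4]),  # second pass, page 2
    ("v2", []),      # end of result set
])
result = fetch_all_pages(lambda offset: next(_responses))
```

Taken together, the pages of the second pass are internally consistent, which is the property the restart is buying.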

PLFM-2741 addresses the last two issues described here.