Annotations Query Tables

SQL Tables for Querying

SubmissionStatuses may have open-ended lists of Annotations, of type String, Long, or Double. These are stored in serialized blobs in the JDOSUBMISSION_STATUS table. To support querying, these annotations are copied into the following tables:

SUBSTATUS_ANNOTATIONS_OWNER – relates Submission ID to Evaluation ID for Submissions whose Annotations are in the query tables. It is the target of foreign key constraints on each of the following tables.

SUBSTATUS_ANNOTATIONS_BLOB – contains each SubmissionStatus's Annotations as a blob

SUBSTATUS_STRINGANNOTATION – triple-store of <SubmissionID, Attribute, Value> for string annotations

SUBSTATUS_DOUBLEANNOTATION – triple-store of <SubmissionID, Attribute, Value> for floating point annotations

SUBSTATUS_LONGANNOTATION – triple-store of <SubmissionID, Attribute, Value> for long (e.g. integer, time-stamp) annotations
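
To illustrate how a triple-store layout like this can be queried, here is a minimal sketch using an in-memory SQLite database. The column names (SUBMISSION_ID, EVALUATION_ID, ATTRIBUTE, VALUE), the key definitions, the sample attributes 'score' and 'team', and the helper function are assumptions made for the example, not the actual schema; only the table names come from the list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical column layout; only the table names are taken from the document.
conn.executescript("""
CREATE TABLE SUBSTATUS_ANNOTATIONS_OWNER (
    SUBMISSION_ID INTEGER PRIMARY KEY,
    EVALUATION_ID INTEGER NOT NULL
);
CREATE TABLE SUBSTATUS_STRINGANNOTATION (
    SUBMISSION_ID INTEGER NOT NULL
        REFERENCES SUBSTATUS_ANNOTATIONS_OWNER(SUBMISSION_ID) ON DELETE CASCADE,
    ATTRIBUTE TEXT NOT NULL,
    VALUE TEXT,
    PRIMARY KEY (SUBMISSION_ID, ATTRIBUTE)
);
CREATE TABLE SUBSTATUS_DOUBLEANNOTATION (
    SUBMISSION_ID INTEGER NOT NULL
        REFERENCES SUBSTATUS_ANNOTATIONS_OWNER(SUBMISSION_ID) ON DELETE CASCADE,
    ATTRIBUTE TEXT NOT NULL,
    VALUE REAL,
    PRIMARY KEY (SUBMISSION_ID, ATTRIBUTE)
);
""")

# Sample data: two submissions in evaluation 1, one in evaluation 2.
conn.executemany("INSERT INTO SUBSTATUS_ANNOTATIONS_OWNER VALUES (?, ?)",
                 [(101, 1), (102, 1), (201, 2)])
conn.executemany("INSERT INTO SUBSTATUS_STRINGANNOTATION VALUES (?, ?, ?)",
                 [(101, "team", "alpha"), (102, "team", "beta"), (201, "team", "alpha")])
conn.executemany("INSERT INTO SUBSTATUS_DOUBLEANNOTATION VALUES (?, ?, ?)",
                 [(101, "score", 0.91), (102, "score", 0.42), (201, "score", 0.99)])


def submissions_with_score_above(conn, evaluation_id, threshold):
    """Return (submission id, team, score) rows for one evaluation, best score first."""
    return conn.execute("""
        SELECT o.SUBMISSION_ID, s.VALUE, d.VALUE
        FROM SUBSTATUS_ANNOTATIONS_OWNER o
        JOIN SUBSTATUS_DOUBLEANNOTATION d
          ON d.SUBMISSION_ID = o.SUBMISSION_ID AND d.ATTRIBUTE = 'score'
        LEFT JOIN SUBSTATUS_STRINGANNOTATION s
          ON s.SUBMISSION_ID = o.SUBMISSION_ID AND s.ATTRIBUTE = 'team'
        WHERE o.EVALUATION_ID = ? AND d.VALUE > ?
        ORDER BY d.VALUE DESC
    """, (evaluation_id, threshold)).fetchall()


result = submissions_with_score_above(conn, 1, 0.5)
```

Each annotation attribute referenced in a query costs one join against the typed table that holds it, which is the usual price of a triple-store layout in exchange for open-ended attribute names.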

Asynchronous Update of SubmissionStatuses

To copy annotations to the query tables, the system leverages the asynchronous worker mechanism described here. As shown in the diagram below, when creating, updating or deleting a SubmissionStatus object, the repository service locks the SubmissionStatus etag as well as the Evaluation-SubStatus etag for update, updates the SubmissionStatus content, and queues a Change message. The Change message contains the Evaluation ID, but not the Submission ID, i.e. it does not broadcast which SubmissionStatus has been changed.

The Change message triggers the Annotations worker, which runs asynchronously with respect to the client's request. It checks that the etag in the message matches the one in the repository service, to guard against acting on "stale" Change messages. If the etags match, it runs a diff between the true state of the Evaluation's submission annotations (stored in the SubmissionStatus tables) and the state within the annotation query tables. Any differences are corrected in a single transaction.
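
The worker logic described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual worker: plain dictionaries stand in for the repository tables and the query tables, the message field names ('evaluation_id', 'etag') and both function names are hypothetical, and the single transaction is approximated by building the corrected state and swapping it in with one assignment.

```python
def diff_annotations(truth, query_copy):
    """Compare two {submission_id: {attribute: value}} mappings.

    Returns the entries that must be written to the query copy and the
    submission ids that must be removed from it.
    """
    to_upsert = {sid: annos for sid, annos in truth.items()
                 if query_copy.get(sid) != annos}
    to_delete = {sid for sid in query_copy if sid not in truth}
    return to_upsert, to_delete


def process_change_message(message, current_etags, truth, query_tables):
    """Handle one Change message; return False if it was stale and ignored."""
    eval_id = message["evaluation_id"]

    # A newer change has happened since this message was queued: skip it.
    # The message queued by that newer change will trigger the sync instead.
    if current_etags.get(eval_id) != message["etag"]:
        return False

    true_state = truth.get(eval_id, {})
    copy = query_tables.get(eval_id, {})
    to_upsert, to_delete = diff_annotations(true_state, copy)

    # Build the corrected state separately, then swap it in with a single
    # assignment (a stand-in for committing one database transaction).
    updated = {sid: annos for sid, annos in copy.items() if sid not in to_delete}
    updated.update({sid: dict(annos) for sid, annos in to_upsert.items()})
    query_tables[eval_id] = updated
    return True
```

Because the message carries only the Evaluation ID, the diff covers every Submission in the Evaluation rather than a single changed one.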

Batch SubmissionStatus Updates

Batch SubmissionStatus updates use the same semaphores as individual SubmissionStatus updates. Additionally, we use optimistic concurrency across batches: each batch contains a list of <n> SubmissionStatuses (with embedded annotations), along with flags for 'isFirstBatch' and 'isLastBatch', plus a batch 'token', required on all but the first batch. The system locks on the <n> SubmissionStatus etags as well as the Evaluation-SubStatus etag, commits the new SubmissionStatus data, and returns the new Evaluation-SubStatus etag as the 'nextUploadToken' in the client response. The client is required to include the returned token with the next batch. If, between batches, another client modifies any SubmissionStatus under the same Evaluation, then the batch token will not match the etag, and the client's request will be rejected with a PRECONDITION_FAILED (412) response. It is then the client's responsibility to restart the upload with the first batch. After the final batch, the Change message is generated, triggering the asynchronous update of the Annotations query tables.
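
The token handshake can be sketched as a small in-memory model. The class, method, and parameter names here are hypothetical, per-SubmissionStatus etag locking is omitted, and a random hex string stands in for the Evaluation-SubStatus etag; the sketch only aims to show how a mismatched token leads to a 412 and a restart from the first batch.

```python
import uuid

OK = 200
PRECONDITION_FAILED = 412


class EvaluationStore:
    """In-memory stand-in for one Evaluation's SubmissionStatuses."""

    def __init__(self):
        self.etag = uuid.uuid4().hex   # plays the Evaluation-SubStatus etag
        self.statuses = {}
        self.change_messages = []      # etags carried by queued Change messages

    def update_single(self, submission_id, status):
        # An individual update changes the etag and queues a Change message.
        self.statuses[submission_id] = status
        self.etag = uuid.uuid4().hex
        self.change_messages.append(self.etag)

    def upload_batch(self, statuses, is_first_batch, is_last_batch, batch_token=None):
        # All but the first batch must present the token from the previous batch.
        if not is_first_batch and batch_token != self.etag:
            return PRECONDITION_FAILED, None
        self.statuses.update(statuses)
        self.etag = uuid.uuid4().hex
        if is_last_batch:
            # Only the final batch triggers the asynchronous worker.
            self.change_messages.append(self.etag)
        return OK, self.etag           # returned to the client as nextUploadToken


def upload_all(store, batches, max_attempts=3):
    """Client side: send batches in order, restarting from the first on a 412."""
    for _ in range(max_attempts):
        token = None
        for i, batch in enumerate(batches):
            code, token = store.upload_batch(
                batch, i == 0, i == len(batches) - 1, token)
            if code == PRECONDITION_FAILED:
                break                  # another client intervened; start over
        else:
            return True                # every batch was accepted
    return False
```

In this model a rejected batch writes nothing, but batches accepted before the collision stay committed, so the restarted upload must resend them.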

Design Trade-Offs

The design described above is conservative in handling concurrent access during multi-batch upload: any single creation, update or delete invalidates the entire upload process, and it's up to the client to restart from the first batch. This should be acceptable if there is just one Evaluation scoring process and if it performs its upload quickly, reducing the possibility of collision with the creation of a new Submission. If the second assumption is too restrictive, we can introduce a separate semaphore for Creation versus Update or Delete.

This concurrency approach favors Evaluations in which coherence between all Submissions is required. (Batch upload was created to support rank-based scoring, in which the score of each Submission is based on the content of all other Submissions in the Evaluation.)

The triggering of the asynchronous worker is also meant to ensure data coherence, since (1) in a multi-batch upload no annotation update occurs until the last batch is successfully uploaded, (2) any create, update or delete activity occurring before the asynchronous worker runs will invalidate the Change message, and (3) the query table insertions occur in a single transaction.

Query results do not necessarily reflect the current state of the SubmissionStatuses: if a batch update has occurred but the Change message has not yet been processed, then query results will reflect the most recent update to the query tables. This is acceptable if reports that are internally consistent, but slightly old, are considered valid. An alternative is, with each submitted query, to check for differences between the versions of the two copies of the data, and return an alternate response (e.g. a 202 status code) if any discrepancies are seen.
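
The alternative could look roughly like the sketch below. It assumes, hypothetically, that the query tables record the Evaluation-SubStatus etag they were last synchronized to, so that the version check is a single comparison; the function and parameter names are invented for illustration.

```python
def run_query(evaluation_etag, query_tables_etag, execute_query):
    """Return (status_code, body) for a query against the annotation tables.

    evaluation_etag:   current etag of the authoritative SubmissionStatus data
    query_tables_etag: etag the query tables were last synchronized to (assumed)
    execute_query:     zero-argument callable that runs the actual query
    """
    if evaluation_etag != query_tables_etag:
        # The worker has not caught up yet; tell the client to retry later.
        return 202, None
    return 200, execute_query()
```

A client receiving the 202 would retry after a short delay, trading some latency for results that match the current SubmissionStatuses.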