Eventing model to Execute External Workflows

1 Front Matter
- 1.1 Jira Epic
- 1.2 Glossary
- 1.3 Document Authors
2 Background
3 Problem Statement
- 3.1 Objective
4 Polling vs. Pushing
- 4.1 Polling
- 4.2 Pushing
- 4.3 Comparison of the Two
5 Security Considerations
6 Recommended Solution
- 6.1 Overview
  - 6.1.1 Handling Retries
  - 6.1.2 Recommendations for Event-Receivers on Handling Events
- 6.2 Event Registration
  - 6.2.1 Webhook Endpoints
  - 6.2.2 WebhookEvent Endpoints
- 6.3 Webhook Invocation
7 Note on Persistence of Events
8 Metrics of Success
9 Appendix
- 9.1 Other Discussed Potential Solutions
  - 9.1.1 Strictly Polling
  - 9.1.2 Long-Polling
  - 9.1.3 Server-Sent Events
  - 9.1.4 WebSockets
- 9.2 Webhook Schemas
  - 9.2.1 Webhook.json
  - 9.2.2 WebhookObjectType.json
  - 9.2.3 CreateOrUpdateWebhookRequest.json
  - 9.2.4 ValidateWebhookRequest.json
  - 9.2.5 ValidateWebhookResponse.json
  - 9.2.6 ListUserWebhooksRequest.json
  - 9.2.7 ListUserWebhooksResponse.json
- 9.3 Event Schemas
  - 9.3.1 WebhookEvent.json
  - 9.3.2 EventType.json
  - 9.3.3 GetWebhookEventsRequest.json
  - 9.3.4 GetWebhookEventsResponse.json

Front Matter

Jira Epic

PLFM-7081 - Getting issue details... STATUS

Glossary

CUD: Create / Update / Delete.

Event-Initiator: A Synapse user who initiates an event by performing a CUD operation in Synapse.

Event-Receiver: The Synapse user who would like to receive the event. An event-receiver will be a technical user who will almost certainly be writing code to handle the event.

Note: The Event-Initiator and the Event-Receiver can be the same Synapse user.

Event-Generator: The system that generates the events: Synapse.

End-User: A “non-technical” Synapse user, e.g. a scientific researcher.

Document Authors

@Luke Moenning

Background

Currently, Synapse users, particularly those who handle data workflows, have to rely on a polling mechanism to detect CUD operations. This approach requires users to repeatedly make API requests to determine if changes have occurred. While this method is straightforward and fault-tolerant, it suffers from significant drawbacks, including highly delayed notification and increased computational overhead, making it inefficient and unsuitable for real-time data management needs.

Problem Statement

The primary limitation of the current polling system within Synapse is its inefficiency. Event-Receivers are required to continuously query the database to detect changes, which consumes substantial computational resources for both the receiver and the event-generator. In addition, this method introduces significant delays between the occurrence of an event and its detection by the receiver. Such delays can severely hinder the ability to manage data dynamically when it is crucial to be able to do so.

The delays in notification not only affect the timeliness of data management but also translate directly into increased cost of operation. These costs stem from the increased server load necessary to handle frequent and repetitive polling (API requests). The delay in receiving updates can hinder timely decision-making in workflows, negatively affecting the event-receivers' Synapse experience and potentially delaying critical research outcomes. In the worst case, this may even push receivers away from Synapse and to use another platform that supports their need for timely notifications more easily.

Due to the issues discussed above, there is a need for a real-time event notification system in Synapse that can alert event-receivers immediately when CUD operations occur. Such a system would eliminate the need for repetitive polling, reduce server load, and provide real-time updates.

Objective

The objective is to design and implement a solution that enables real-time notifications of CUD operations in Synapse. This solution should be capable of directly informing event-receivers without the need for active monitoring. The initial delivery of the feature should allow for the eventing of Entities within Synapse, with a future goal of expanding to support further objects, e.g. webhooks for changes to Organizations/teams.

One key thing to note: this feature is not for end-users, and it is not a “push notification” service in the general sense. Instead, it will allow event-receivers who work directly with the Synapse REST API to more efficiently integrate their processes with Synapse. A prime example of an expected event-receiver would be someone on Sage’s Data & Tooling Team.

Polling vs. Pushing

Polling

Polling is a data synchronization strategy where the client repeatedly sends requests to a server to check for updates. This method is currently the only option in Synapse to detect CUDs in projects, folders, and files.

To give a specific use case for the problem statement - say we have a Synapse user (event-receiver) who would like to perform automated analysis on new data every time that data is added to a project, P, that already has N files. To determine whether there are any new files to analyze, in a polling approach, this user would have to intermittently ask Synapse for all of the files and compute if a change as occured. This would result in a runtime of O(N), as it would be necessary to iterate through all N files to determine if any new files have been added. In addition, this Synapse user would not know when the perfect time to ask for changes is, meaning there will be some amount of time, ∆t_polling, between when the change occurs and when this user asks for it.

Pushing

In a pushing approach, Synapse would send event directly to the event-receiver at the time of the event, without having to wait for a request. This allows for the event-receiver to see near real-time events.

Using the same example as in polling: to determine whether any files in a project with N files have changed, a pushing approach would have runtime of O(1) as the change would be noted directly and there would be no need to iterate through the project. In addition, it would also not be up to the user to try to “guess” when to ask for updates as Synapse would tell them directly. This would result in a delay, ∆t_pushing, between when the event occurs and when the receiver discovers it such that, on average, ∆t_pushing << ∆t_polling.As this is the case, a polling approach for this user will, on average, result in a longer delay between when an event occurs and when the event-receiver discovers it than using a pushing approach.

Comparison of the Two

	Pros of Polling	Cons of Polling	Pros of Pushing	Cons of Pushing

	Pros of Polling	Cons of Polling	Pros of Pushing	Cons of Pushing
Event-Generator	Simpler Implementation (nothing to implement if the APIs already exist) Don’t have to worry about fault tolerance issues from the receiver as they will simply re-ask for events when they are available	Doesn’t provide a way to efficiently and cost-effectively provide real-time events	Lower amortized load as there won’t be repeated, unnecessary calls on the server	Security Considerations More complex implementation Infrastructure needs to be able to handle spikes in computational requirements due to events that trigger hierarchical updates or events with a substantial number of receivers
Event-Receiver	Less overhead for utilization (intermittent API calls) Useful when the receiver wants data such that real time updates aren’t necessary	O(N) runtime to determine a change	Substantially less average delay between the time of an event and when the receiver learns about it O(1) runtime to determine a change	Security Considerations Needs to be able to properly receive a massive burst of events due to registration for an event that causes a hierarchical update, e.g. 100 events may occur in 1 second, the receiver need to be able to receive all 100 events

Security Considerations

Data security is a top priority for Synapse as we host sensitive data. This will require a little more vigilance when implementing an eventing model in order to maintain this data integrity. At a broad scope, our eventing model must consider the following:

The event-generator needs some way to validate the method of reception provided by the event-receiver
The event-generator needs some way to validate the identity of each event-receiver
Conversely, each event-receiver needs a way to validate the events originates from the generator*
Event-receivers should only be able to register for events they are authorized to read
Event-receivers should only receive events they are authorized to read†
When the generator sends an event, only the event receiver should be able to see it‡

_*_{How can the generator best set up the receiver to prevent Distributed Denial-of-Service (DDoS) attacks? What if an}_{event-initiator}_{set ups a project structure (intentionally or unintentionally) such that when they update the root, each receiver receives millions of events?}

_†_{We need to be careful with the hierarchy that Synapse has. For example, we could have an event-receiver with read access to a project P that contains folders A, B, C. If they only have read permission for A, they should only see events for A. They should still be able to register for events in P.}

_‡_{This one is quite hard to achieve in any pushing system (spoiler: the recommended solution will involve pushing🤫). A trivial utilization by the receiver that would make this impossible would be the usage of a 3rd party tool to receive the events. One solution would be we strictly forbid doing so (only allow whitelisted endpoints). Or, how can we best minimize the risks associated with the information sent in the initial event being publicly available? If the event contains information about an entity and what update occurred on it, this can be problematic at scale, e.g. the public being able to deduce large projects are deleted/added.}

Recommended Solution

The recommended solution is to implement webhooks to create an eventing model in Synapse. Webhooks are event-driven, “reverse” APIs that would allow for real-time, one-directional communication between Synapse and event-receivers.

A webhook design is not without its own set of complications, specifically centered around security and fault tolerance. In addition to the Security Considerations discussed above, our implementation needs to consider the following:

Message failures. What happens when the event-receiver is:
- Unavailable
- Overwhelmed
- Corrupted
There is no guarantee of message ordering
There is no easy way to prevent duplicate events
A lack of authentication upon sending events results in no way to guarantee the event-receiver and the event-receiver only received the event

As webhooks are common standard for solving the eventing problem, they will be the primary focus of this design. If you would like to read into other potential solutions that were discussed, they can be found in the appendix.

Overview

The webhook implementation on Synapse’s end will consist of two main stages:

Stage 1: Event Registration
Stage 2: Webhook Invocation

Updates to the registration can occur after initial registration, however, the receiver must first register to receive an event before they can actually receive it.

At a very high level, during registration event-receivers select an object in Synapse they would like to track as well as provide an endpoint they would like the event sent to. In order to prevent probing attacks on Synapse by bad actors, we will only allow whitelisted endpoints to be valid invokeEndpoints. As part of the registration process, receivers will also be prompted to validate that the endpoint they provided belongs to them. This will be through the form of a validation code sent to the invokeEndpoint of the webhook. The user must then go and tell Synapse this code their endpoint received by hitting the PUT /webhook/{webhookId}/valid registration service. It will be up to Synapse's discretion to decide if at any point that a provided endpoint is no longer valid and requires to be updated by the event-receiver and then re-validated by Synapse before it can receive events again. These checks will be performed periodically by Synapse in order to reconfirm validity of the invokeEndpoints. If a Webhook is set to invalid at any point, a notification with detailed reasoning will be sent to the email of the Synapse user associated with the Webhook.

Note: only the Synapse user who creates a Webhook has permission to modify it in any way. This may be changed in future releases of the feature but will be the initial design.

A sample request containing the validation code sent to the event-receiver’s invokeEndpoint is as follows:

POST /path HTTP/1.1
Host: event-receiver-apigateway.com
Content-Type: application/json
Authorization: Bearer JWT_TOKEN

{
  "verificationCode": "SOME_VERIFICATION_CODE",
}

As for the webhook invocation process, upon an event occurring, Synapse will bundle information about said event into a WebhookEvent. This WebhookEvent will persist in Synapse for 2 days before being deleted. Synapse will send the ID associated with the event through an HTTP POST request to the provided endpoint associated with the webhook upon the event occurring. It will then be up to the user to make a request through the GET /webhook/events service to retrieve the events by their IDs.

Handling Retries

Synapse will be adamant about preventing slow downs on the event-receivers side from slowing down Synapse as well. One of the actions we will take to ensure this is we will not have automatic retries upon message failures, similar to how GitHub's webhooks handle failure.

Note: for a similar reason, Synapse needs to be strict with how long receivers have to respond to events with a successful status code. Receivers will have 5 seconds to respond to an event with a status code of 2XX, otherwise the message will be considered to have failed.

Recommendations for Event-Receivers on Handling Events

As Synapse will not be attempting retries or be throttling the rates of events in any way, it is important for event-receivers to be able to receive any number of events at any time. We will leave it up to receivers as to how they would like to accomplish that specifically, however, some recommendations we have for doing so are as follows:

Follow a similar reception model using AWS SQS and AWS API Gateway as explained well here https://medium.com/@andy.tarpley/webhook-processing-with-api-gateway-and-sqs-f8309411913a. A basic overview is:

INITIATION: webhook invocation

→ event-receiver’s API Gateway endpoint

→ event-receiver’s SQS Queue

→ event-receiver polls events from their queue and makes Synapse RESTful API requests for the events using the provided eventIds

→ event-receiver implements their specific use case

As also recommended by GitHub - using “services like Hookdeck or libraries like Resque (Ruby), RQ (Python), or RabbitMQ (Java)”.

Event Registration

Event-receivers will register for events they wish to receive through a Synapse RESTful API. The endpoints as well as their descriptions + requests/responses are provided in the tables below.

Webhook Endpoints

The following are the resources that are available on any given Webhook:

Resource	Description	Request Object	Response Object

Resource	Description	Request Object	Response Object
POST /webhook	Create a new Webhook object. This object serves as registration for a Synapse user to receive events for the specified objectId. The combination of the objectId and invokeEndpoint must be unique for each Webhook.	CreateOrUpdateWebhookRequest	Webhook
GET /webhook/{webhookId}	Get the Webhook corresponding to the provided webhookId.	None	Webhook
PUT /webhook/{webhookId}	Update the Webhook corresponding to the provided webhookId. Note: if the invokeEndpoint is changed upon update or the webhook is re-enabled by the user, the user will be required to re-validate the webhook. If Synapse disables the webhook due to an invalid endpoint, update the endpoint using this service, then re-validate with `PUT /webhook/{webhookId}/valid`. The combination of the objectId and invokeEndpoint must be unique for each Webhook.	CreateOrUpdateWebhookRequest	Webhook
DELETE /webhook/{webhookId}	Delete the Webhook corresponding to the provided webhookId.	None	None
PUT /webhook/{webhookId}/valid	Validate the Webhook of the corresponding ID by providing the validation code received by invokeEndpoint upon creation/updating. After successful validation, Synapse will set `isInvokeEndpointValid` to `true`.	ValidateWebhookRequest	ValidateWebhookResponse
GET /webhook/{webhookId}/test	Test the Webhook of the given webhookId. This will attempt to send the eventId of a generic event to the invokeEndpoint if the invokeEndpoint is valid. The user can then try retrieving the generic event through `GET /webhook/events`.	None	WebhookEvent
GET /webhook/{userId}/list	List all webhookIds for a Synapse user. Each call will return a single page of WebhookRegistrations. Forward the provided nextPageToken to get the next page.	ListUserWebhooksRequest	ListUserWebhooksResponse

WebhookEvent Endpoints

The following are the resources that are available on any given WebhookEvent:

Resource	Description	Request Object	Response Object

Resource	Description	Request Object	Response Object
GET /webhook/events	Retrieve a batch of events by their eventIds. A common use of this service is after the initial event invocation, a user can retrieve one/many of the events they were given IDs for.	GetWebhookEventsRequest	GetWebhookEventsResponse

Webhook Invocation

At the time of an event the webhook should be invoked. The process should be multi-step. Once Synapse notes the event initiation, a service should create an WebhookEvent to persist in the database layer, again, this object will be permanently deleted after 2 days.

An event-receiver will receive a create event when an object is created, an update event when an update occurs, and a delete event when a delete occurs. The receiver must still have access to the object after the event in order to receive the event. If the receiver no longer has access after the event occurs, they will not receive an event notification. Move events are a somewhat special case: the receiver must have access to the destination of the move in order to receive an update event.

Once the event is created, Synapse will then distribute each eventId to the corresponding event-receiver’s endpoint found in their corresponding webhook object. The eventId will be sent as an HTTP POST request with the eventId contained in the request body. The event-receiver will have the option to require the requests to be authorized with a JWT token. A new token will be exchanged anytime an event occurs and isAuthorizationEnabled is set to true. The token will contain information about the Webhook (i.e. the ID of the webhook, who the intended receiver is, expiration time, and Synapse’s signature) in order for the receiver to verify the request is uniquely for them. The receiver will use a public key provided by Synapse to verify the signature. This authorization would mitigate some of the security concerns with using webhooks as it will provide a way for the event-receiver to verify the event originated from the event-generator if the event-receiver so chooses.

A sample webhook invocation with isAuthorizationEnabled=true sent to the event-receiver is as follows:

POST /path HTTP/1.1
Host: event-receiver-apigateway.com
Content-Type: application/json
Authorization: Bearer JWT_TOKEN

{
  "eventId": "SOME_ID",
}

Note on aggregation: as of now, we will not be implementing aggregate or “summary events”. If the use case arises, this is a possible future extension of the feature.

Note on Persistence of Events

The current recommended design involves persisting events in Synapse for 2 days in order to provide an abstraction on the events. This prevents the information contained, specially the objectId and the type of event that occur, in the event from possibly being publicly available. This was discussed slightly more detailed in the footnotes of Security Considerations.

Is it necessary to provide this abstraction in order to protect the information contained in the events? The other option is to remove the abstraction, have zero state persisted, and directly send the event information upon invocation rather than the eventId. This drastically improves scalability as we will not write a row in the database for every event that occurs in Synapse and the receivers will not be required to make an API request to receive the event information upon every invocation. Again, these events could be in bursts in the millions depending on which project has events occur.

Metrics of Success

Number of event-receivers using the feature.
Direct feedback from event-receivers. How does the feature optimize the performance of their tasks?
Latency of the system - time between when the event occurs and when the receiver learns about it.
Reliability of the system - can Synapse successfully distribute every event that occurs, especially when large batches occur instantaneously. Are any events missed and not distributed to receivers?
Cost of persisting events?

Appendix

Webhook Schemas

Webhook.json

{
    "description": "An object that serves as registration for a Synapse user to receive events for the specified event.",
    "properties": {
        "webhookId": {
            "type": "string",
            "description": "The ID associated with the Webhook."
        },
        "objectId": {
            "type": "string",
            "description": "The ID of the Synapse object to receive events of."
        },
        "objectType": {
			"$ref": "org.sagebionetworks.repo.model.webhook.WebhookObjectType",
			"description": "Which type is Synapse object is associated with the Webhook."
		},
        "userId": {
            "type": "string",
            "description": "The ID of the Synapse user who has registered to receive this webhook event."
        },
        "invokeEndpoint": {
            "type": "string",
            "description": "The endpoint the Synapse user would like the webhook events sent to on invocation. Must be 256 Characters or less."
        },
        "isValid": {
            "type": "boolean",
            "description": "True if Synapse has confirmed the validity of the Webhook invokeEndpoint. False if Synapse has determined the invokeEndpoint is invalid."
        },
        "isWebhookEnabled": {
            "type": "boolean",
            "description": "True if the Synapse user has selected to receive events. If the user sets to false, events will be temporalily paused."
        }
        "isAuthorizationEnabled": {
            "type": "boolean",
            "description": "True if the Synapse user has opted in to require authorization to receive the webhook invocation POST request. In this case, a JWT token will be included in the invocation. False otherwise: no authorization will be required and no JWT token included in the invocation."
        },
        "etag": {
			"type": "string",
			"description": "Synapse employs an Optimistic Concurrency Control (OCC) scheme to handle concurrent updates. Since the E-Tag changes every time an Webhook is updated it is used to detect when a client's current representation of an Webhook is out-of-date."
		},
        "createdOn": {
			"type": "string",
			"format": "date-time",
			"description": "The date this webhook was created."
		},
		"modifiedOn": {
			"type": "string",
			"format": "date-time",
			"description": "The date this webhook was last modified."
		},
		"createdBy": {
			"type": "string",
			"description": "The ID of the user that created this webhook."
		},
		"modifiedBy": {
			"type": "string",
			"description": "The ID of the user that last modified this webhook."
		}
    }
}

Front Matter

Jira Epic

Glossary

Document Authors

Background

Problem Statement

Objective

Polling vs. Pushing

Polling

Pushing

Comparison of the Two

Security Considerations

Recommended Solution

Overview

Handling Retries

Recommendations for Event-Receivers on Handling Events

Event Registration

Webhook Endpoints

WebhookEvent Endpoints

Webhook Invocation

Note on Persistence of Events

Metrics of Success

Appendix

Other Discussed Potential Solutions

Strictly Polling

Long-Polling

Server-Sent Events

WebSockets

Webhook Schemas

Webhook.json

WebhookObjectType.json

CreateOrUpdateWebhookRequest.json

ValidateWebhookRequest.json

ValidateWebhookResponse.json

ListUserWebhooksRequest.json

ListUserWebhooksResponse.json

Event Schemas

WebhookEvent.json

EventType.json

GetWebhookEventsRequest.json

GetWebhookEventsResponse.json