Eventing model to Execute External Workflows
- 1 Front Matter
- 1.1 Jira Epic
- 1.2 Glossary
- 1.3 Document Authors
- 2 Background
- 3 Problem Statement
- 3.1 Objective
- 4 Polling vs. Pushing
- 4.1 Polling
- 4.2 Pushing
- 4.3 Comparison of the Two
- 5 Security Considerations
- 6 Recommended Solution
- 7 Note on Persistence of Events
- 8 Metrics of Success
- 9 Appendix
- 9.1 Other Discussed Potential Solutions
- 9.1.1 Strictly Polling
- 9.1.2 Long-Polling
- 9.1.3 Server-Sent Events
- 9.1.4 WebSockets
- 9.2 Webhook Schemas
- 9.3 Event Schemas
- 9.3.1 WebhookEvent.json
- 9.3.2 EventType.json
Front Matter
Jira Epic
- PLFM-7081
Glossary
CUD: Create / Update / Delete.
Event-Initiator: A Synapse user who initiates an event by performing a CUD operation in Synapse.
Event-Receiver: The Synapse user who would like to receive the event. An event-receiver will be a technical user who will almost certainly be writing code to handle the event.
Note: The Event-Initiator and the Event-Receiver can be the same Synapse user.
Event-Generator: The system that generates the events: Synapse.
End-User: A “non-technical” Synapse user, e.g. a scientific researcher.
Document Authors
@Luke Moenning
Background
Currently, Synapse users, particularly those who handle data workflows, have to rely on a polling mechanism to detect CUD operations. This approach requires users to repeatedly make API requests to determine if changes have occurred. While this method is straightforward and fault-tolerant, it suffers from significant drawbacks, including highly delayed notifications and increased computational overhead, making it inefficient and unsuitable for real-time data management needs.
Problem Statement
The primary limitation of the current polling system within Synapse is its inefficiency. Event-Receivers are required to continuously query the database to detect changes, which consumes substantial computational resources for both the receiver and the event-generator. In addition, this method introduces significant delays between the occurrence of an event and its detection by the receiver. Such delays can severely hinder dynamic data management precisely when timeliness matters most.
The delays in notification not only affect the timeliness of data management but also translate directly into increased operating costs, which stem from the increased server load needed to handle frequent, repetitive polling (API requests). The delay in receiving updates can hinder timely decision-making in workflows, negatively affecting the event-receivers' Synapse experience and potentially delaying critical research outcomes. In the worst case, this may even push receivers away from Synapse toward another platform that better supports their need for timely notifications.
Due to the issues discussed above, there is a need for a real-time event notification system in Synapse that can alert event-receivers immediately when CUD operations occur. Such a system would eliminate the need for repetitive polling, reduce server load, and provide real-time updates.
Objective
The objective is to design and implement a solution that enables real-time notifications of CUD operations in Synapse. This solution should be capable of directly informing event-receivers without the need for active monitoring. The initial delivery of the feature should allow for the eventing of Entities within Synapse, with a future goal of expanding to support further objects, e.g. webhooks for changes to Organizations/teams.
One key thing to note: this feature is not for end-users, and it is not a “push notification” service in the general sense. Instead, it will allow event-receivers who work directly with the Synapse REST API to more efficiently integrate their processes with Synapse. A prime example of an expected event-receiver would be someone on Sage’s Data & Tooling Team.
Polling vs. Pushing
Polling
Polling is a data synchronization strategy where the client repeatedly sends requests to a server to check for updates. This method is currently the only option in Synapse to detect CUDs in projects, folders, and files.
To give a specific use case for the problem statement: say we have a Synapse user (event-receiver) who would like to perform automated analysis on new data every time data is added to a project, P, that already has N files. To determine whether there are any new files to analyze, in a polling approach this user would have to intermittently ask Synapse for all of the files and compute whether a change has occurred. This results in a runtime of O(N), as it is necessary to iterate through all N files to determine if any new files have been added. In addition, this Synapse user would not know the perfect time to ask for changes, meaning there will be some amount of time, Δt_polling, between when the change occurs and when the user asks for it.
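To make the O(N) cost and the polling delay concrete, below is a minimal sketch of this pattern in Python. The helper for listing a project's files and the analysis step are hypothetical stand-ins, not the actual Synapse client API.

```python
# Minimal polling sketch: every cycle lists all N files in project P and diffs
# against the previous snapshot, so each check is O(N) and a change is only
# noticed up to POLL_INTERVAL_SECONDS after it happens (the delta-t_polling above).
import time

POLL_INTERVAL_SECONDS = 300  # the receiver has to guess an acceptable interval


def list_project_files(project_id: str) -> dict[str, str]:
    """Hypothetical helper: returns {file_id: etag} for every file in the project."""
    raise NotImplementedError


def run_automated_analysis(file_ids: set[str]) -> None:
    """Hypothetical downstream step: the receiver's actual analysis."""
    print(f"analyzing {len(file_ids)} new/changed files")


def poll_for_changes(project_id: str) -> None:
    previous = list_project_files(project_id)            # O(N) baseline snapshot
    while True:
        time.sleep(POLL_INTERVAL_SECONDS)
        current = list_project_files(project_id)         # O(N) on every cycle
        added = current.keys() - previous.keys()
        changed = {f for f in current if f in previous and current[f] != previous[f]}
        if added or changed:
            run_automated_analysis(set(added) | changed)
        previous = current
```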
Pushing
In a pushing approach, Synapse would send an event directly to the event-receiver at the time of the event, without having to wait for a request. This allows for the event-receiver to see near real-time events.
Using the same example as in polling: to determine whether any files in a project with N files have changed, a pushing approach would have a runtime of O(1), as the change would be reported directly and there would be no need to iterate through the project. In addition, it would not be up to the user to “guess” when to ask for updates, as Synapse would tell them directly. There is still a delay, Δt_pushing, between when the event occurs and when the receiver discovers it, but on average Δt_pushing ≪ Δt_polling. In other words, for this user a polling approach will, on average, result in a longer delay between when an event occurs and when the event-receiver discovers it than a pushing approach.
Comparison of the Two
| | Pros of Polling | Cons of Polling | Pros of Pushing | Cons of Pushing |
|---|---|---|---|---|
| Event-Generator | Simple to support; no per-receiver endpoints, validation, or delivery infrastructure to maintain | High server load from frequent, repetitive API requests | Each event is sent once, when it occurs, eliminating repetitive polling traffic | Must validate receivers and endpoints, handle message failures, and absorb bursts of events |
| Event-Receiver | Straightforward and fault-tolerant; the receiver controls when requests are made | O(N) work per check and a delay Δt_polling between an event and its discovery; must guess when to ask for updates | Near real-time, O(1) notification with no need to guess when to ask | Must expose a reachable endpoint able to handle any volume of events at any time; no ordering or duplicate-delivery guarantees |
Security Considerations
Data security is a top priority for Synapse, as we host sensitive data. Implementing an eventing model will therefore require extra vigilance to maintain data integrity. At a broad scope, our eventing model must consider the following:
- The event-generator needs some way to validate the method of reception provided by the event-receiver
- The event-generator needs some way to validate the identity of each event-receiver
- Conversely, each event-receiver needs a way to validate that events originate from the generator*
- Event-receivers should only be able to register for events they are authorized to read
- Event-receivers should only receive events they are authorized to read†
- When the generator sends an event, only the event-receiver should be able to see it‡
* How can the generator best set up the receiver to prevent Distributed Denial-of-Service (DDoS) attacks? What if an event-initiator sets up a project structure (intentionally or unintentionally) such that when they update the root, each receiver receives millions of events?
† We need to be careful with the hierarchy that Synapse has. For example, we could have an event-receiver with read access to a project P that contains folders A, B, C. If they only have read permission for A, they should only see events for A. They should still be able to register for events in P.
‡ This one is quite hard to achieve in any pushing system (spoiler: the recommended solution will involve pushing🤫). A trivial case that makes it impossible is a receiver using a third-party tool to receive the events. One solution would be to strictly forbid doing so (only allow whitelisted endpoints). Alternatively, how can we best minimize the risks associated with the information sent in the initial event being publicly available? If the event contains information about an entity and what update occurred on it, this can be problematic at scale, e.g. the public being able to deduce that large projects are deleted/added.
Recommended Solution
The recommended solution is to implement webhooks to create an eventing model in Synapse. Webhooks are event-driven, “reverse” APIs that would allow for real-time, one-directional communication between Synapse and event-receivers.
A webhook design is not without its own set of complications, specifically centered around security and fault tolerance. In addition to the Security Considerations discussed above, our implementation needs to consider the following:
- Message failures. What happens when the event-receiver is:
  - Unavailable
  - Overwhelmed
  - Corrupted
- There is no guarantee of message ordering
- There is no easy way to prevent duplicate events
- There is no easy way to guarantee that the event-receiver, and only the event-receiver, received the event
As webhooks are a common standard for solving the eventing problem, they will be the primary focus of this design. Other potential solutions that were discussed can be found in the appendix.
Overview
The webhook implementation on Synapse’s end will consist of two main stages:
- Stage 1: Event Registration*
- Stage 2: Webhook Invocation
At a very high level, during registration event-receivers select an object in Synapse they would like to track and provide an endpoint to which the events should be sent. In order to prevent probing attacks on Synapse by bad actors, we will initially only allow whitelisted endpoints to be valid invokeEndpoints. This will also help to mitigate the risks of the metadata contained in the WebhookEvent(s) being exposed to the public. The discussion for this can be found in Note on Persistence of Events below.
As part of the registration process, receivers will also be prompted to validate that the endpoint they provided belongs to them. This will take the form of a validation code sent to the invokeEndpoint of the webhook. The user must then report the code their endpoint received back to Synapse by calling the PUT /webhook/{webhookId}/verify registration service.†
A sample request containing the validation code, sent to the event-receiver’s invokeEndpoint, is as follows:
POST /path HTTP/1.1
Host: event-receiver-apigateway.com
Content-Type: application/json
Authorization: Bearer JWT_TOKEN
{
  "verificationCode": "SOME_VERIFICATION_CODE"
}
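For illustration, a receiver that has captured the verification code could complete the loop with a call like the sketch below, assuming the standard Synapse REST base URL and a personal access token; the exact request body of the verify service is an assumption here.

```python
# Sketch: forward the verification code received at the invokeEndpoint back to
# Synapse via the verify service. Base URL, token handling, and the request
# body shape are illustrative assumptions.
import requests

SYNAPSE_BASE = "https://repo-prod.prod.sagebase.org/repo/v1"  # assumed base URL


def verify_webhook(webhook_id: str, verification_code: str, synapse_token: str) -> None:
    resp = requests.put(
        f"{SYNAPSE_BASE}/webhook/{webhook_id}/verify",
        json={"verificationCode": verification_code},
        headers={"Authorization": f"Bearer {synapse_token}"},
        timeout=10,
    )
    resp.raise_for_status()
```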
As for the webhook invocation process, upon an event occurring, Synapse will bundle information about said event into a WebhookEvent. The WebhookEvent will then be distributed to receivers who have registered to receive it according to the rules in the next section.
* Updates to the registration can occur after initial registration; however, the receiver must first register to receive an event before they can actually receive it. Also, only the Synapse user who creates a registration has permission to modify it in any way. This may change in future releases of the feature but will be the initial design.
† It will be at Synapse's discretion to decide, at any point, that a provided endpoint is no longer valid and needs to be updated by the event-receiver and then re-validated by Synapse before it can receive events again. These checks will be performed periodically by Synapse in order to reconfirm the validity of the invokeEndpoints. If a Webhook is set to invalid at any point, a notification with detailed reasoning will be sent to the email of the Synapse user associated with the Webhook.
Events and EventTypes
When a create, update, or delete web service is called on an Entity, E_event, Synapse will walk up the hierarchy tree starting with E_event and find all ancestor entities that have been registered to receive webhooks. For each webhook registration found on any ancestor Entity, if the receiver of that webhook is authorized to read E_event after the event occurs, they will receive a corresponding event. This traversal algorithm will visit the original Entity, E_event, and all entities above it in the hierarchy.*
For example, if a project contains a folder which contains a file, an update of the file will result in Synapse checking for webhooks registered on the file, the folder, and the project. If a receiver who registered for at least one of the three (file, folder, or project) has read permission on the file after the update, they will receive a WebhookEvent.
* A move is considered an update where a new parentId is set. After a move, the algorithm will be followed using the updated parentId.
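The following is a minimal, self-contained sketch of that distribution rule. The in-memory maps stand in for Synapse's real hierarchy, registration, and permission services; the names used here are illustrative, not actual Synapse internals.

```python
# Sketch of the ancestor-walk distribution rule: visit E_event and every
# ancestor, and for each registration found, deliver the event only if the
# registrant can read E_event after the change.
from dataclasses import dataclass


@dataclass
class Webhook:
    receiver_id: str
    object_id: str
    invoke_endpoint: str


PARENT = {"file1": "folderA", "folderA": "projectP", "projectP": None}            # hierarchy
REGISTRATIONS = {"projectP": [Webhook("user42", "projectP", "https://example.org/hook")]}
READABLE = {("user42", "file1")}                                                  # (user, entity) pairs


def can_read(user_id: str, entity_id: str) -> bool:
    return (user_id, entity_id) in READABLE


def distribute_event(changed_entity_id: str, event_id: str) -> None:
    node = changed_entity_id
    while node is not None:                                     # walk E_event and all ancestors
        for hook in REGISTRATIONS.get(node, []):
            if can_read(hook.receiver_id, changed_entity_id):   # checked *after* the change
                print(f"POST {hook.invoke_endpoint} eventId={event_id}")  # stand-in for the HTTP POST
        node = PARENT.get(node)


distribute_event("file1", "evt-123")   # user42, registered on projectP, receives the file event
```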
Handling Retries
Synapse must prevent slowdowns on the event-receiver's side from slowing down Synapse as well. One action we will take to ensure this is to not perform automatic retries upon message failures, similar to how GitHub's webhooks handle failure.
Note: for a similar reason, Synapse needs to be strict with how long receivers have to respond to events with a successful status code. Receivers will have 5 seconds to respond to an event with a status code of 2XX, otherwise the message will be considered to have failed.
Recommendations for Event-Receivers on Handling Events
As Synapse will not be attempting retries or throttling the rate of events in any way, it is important for event-receivers to be able to receive any number of events at any time. We will leave it up to receivers how they would like to accomplish that specifically; however, some recommendations for doing so are as follows:
- Follow a reception model using AWS SQS and AWS API Gateway, as explained well in https://medium.com/@andy.tarpley/webhook-processing-with-api-gateway-and-sqs-f8309411913a. A basic overview (see the sketch after this list) is:
  - INITIATION: webhook invocation
  - → event-receiver’s API Gateway endpoint
  - → event-receiver’s SQS queue
  - → event-receiver polls events from their queue and makes Synapse RESTful API requests for the events using the provided eventIds
  - → event-receiver implements their specific use case
- As also recommended by GitHub: use “services like Hookdeck or libraries like Resque (Ruby), RQ (Python), or RabbitMQ (Java)”.*
* As we will only allow for whitelisted invokeEndpoints to be valid, if a receiver would like to request a service to be whitelisted they should submit a Jira ticket to do so.
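As a rough sketch of the consumer half of the SQS pattern above: drain the queue that API Gateway writes invocations into, then fetch the full event details from Synapse using the provided eventId. The queue URL, the event-detail route, and the token handling are assumptions made for illustration.

```python
# Sketch of an event-receiver draining their SQS queue and resolving eventIds
# against the Synapse REST API. Queue URL, event-detail route, and token are
# illustrative assumptions, not confirmed Synapse services.
import json

import boto3
import requests

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/synapse-webhook-events"  # assumed
SYNAPSE_BASE = "https://repo-prod.prod.sagebase.org/repo/v1"                            # assumed base URL
SYNAPSE_TOKEN = "..."  # the receiver's personal access token

sqs = boto3.client("sqs")


def handle_event(event: dict) -> None:
    """The receiver's specific use case goes here."""
    print("received", event)


def consume_forever() -> None:
    while True:
        batch = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in batch.get("Messages", []):
            event_id = json.loads(msg["Body"])["eventId"]
            # assumed route for retrieving a persisted WebhookEvent by its eventId
            event = requests.get(
                f"{SYNAPSE_BASE}/webhook/event/{event_id}",
                headers={"Authorization": f"Bearer {SYNAPSE_TOKEN}"},
                timeout=10,
            ).json()
            handle_event(event)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```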
Event Registration
Event-receivers will register for events they wish to receive through a Synapse RESTful API. The endpoints, along with their descriptions and request/response objects, are provided in the table below.
Webhook Endpoints
The following are the resources that are available on any given Webhook:
| Resource | Description | Request Object | Response Object |
|---|---|---|---|
| POST /webhook | Create a new Webhook object. This object serves as registration for a Synapse user to receive events for the specified objectId. The combination of the objectId and invokeEndpoint must be unique for each Webhook. | | |
| GET /webhook/{webhookId} | Get the Webhook corresponding to the provided webhookId. | None | |
| PUT /webhook/{webhookId} | Update the Webhook corresponding to the provided webhookId. Note: if the invokeEndpoint is changed upon update or the webhook is re-enabled by the user, the user will be required to re-validate the webhook. If Synapse disables the webhook due to an invalid endpoint, update the endpoint using this service, then re-validate with PUT /webhook/{webhookId}/verify. | | |
| DELETE /webhook/{webhookId} | Delete the Webhook corresponding to the provided webhookId. | None | None |
| PUT /webhook/{webhookId}/verify | Verify the Webhook of the corresponding webhookId by providing the verification code received by the invokeEndpoint upon creation/updating. After successful verification, Synapse will mark the Webhook as valid. | | |
| GET /webhook/{webhookId}/test | Test the Webhook of the given webhookId. This will attempt to send the eventId of a generic event to the invokeEndpoint if the invokeEndpoint is valid. The user can then try retrieving the generic event through the Synapse API using that eventId. | None | |
| GET /webhook/{userId}/list | List all webhookIds for a Synapse user. Each call will return a single page of WebhookRegistrations. Forward the provided nextPageToken to get the next page. | | |
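For illustration, registering a webhook through the POST /webhook service might look like the sketch below. The field names (objectId, invokeEndpoint, isAuthenticationEnabled) come from this document, but the exact request and response schemas are still to be defined, and the base URL and token handling are assumptions.

```python
# Sketch of creating a Webhook registration. Field names follow this design
# document; the concrete schema, base URL, and auth are illustrative assumptions.
import requests

SYNAPSE_BASE = "https://repo-prod.prod.sagebase.org/repo/v1"  # assumed base URL


def create_webhook(object_id: str, invoke_endpoint: str, synapse_token: str) -> dict:
    resp = requests.post(
        f"{SYNAPSE_BASE}/webhook",
        json={
            "objectId": object_id,              # e.g. the Synapse ID of the project to track
            "invokeEndpoint": invoke_endpoint,  # must be a whitelisted endpoint
            "isAuthenticationEnabled": True,    # require JWT-signed invocations
        },
        headers={"Authorization": f"Bearer {synapse_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # the created Webhook, including its webhookId
```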
Webhook Invocation
At the time of an event, the webhook will be invoked. The process will be multi-step. Once Synapse notes the event initiation, a service will create a WebhookEvent to be distributed to each receiver who has registered for it. Synapse will then distribute each event to the event-receiver’s endpoint found in their corresponding webhook object. The event will be sent as an HTTP POST request with the eventId contained in the request body. The event-receiver will have the option to require the requests to be authorized with a JWT token. A new token will be exchanged anytime an event occurs and isAuthenticationEnabled is set to true.
The token will contain information about the Webhook (i.e. the ID of the webhook, who the intended receiver is, expiration time, and Synapse’s signature) in order for the receiver to verify the request is uniquely for them. The receiver will use a public key provided by Synapse to verify the signature. This authorization would mitigate some of the security concerns with using webhooks as it will provide a way for the event-receiver to verify the event originated from the event-generator if the event-receiver so chooses.
A sample webhook invocation with isAuthorizationEnabled=true sent to the event-receiver is as follows:
POST /path HTTP/1.1
Host: event-receiver-apigateway.com
Content-Type: application/json
Authorization: Bearer JWT_TOKEN
{
  "eventId": "SOME_ID"
}
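On the receiving side, verifying that such a request really originated from Synapse could look like the following sketch, using the PyJWT library and a Synapse-published public key. The signing algorithm and the claim carrying the webhook identity are assumptions for illustration.

```python
# Sketch: verify the invocation JWT against Synapse's public key and confirm
# the token was issued for this specific webhook. Algorithm and claim names
# are assumptions, not a confirmed Synapse token format.
import jwt  # pip install pyjwt[crypto]


def verify_invocation_token(token: str, synapse_public_key_pem: str, my_webhook_id: str) -> dict:
    claims = jwt.decode(
        token,
        synapse_public_key_pem,
        algorithms=["RS256"],           # assumed signing algorithm
        options={"require": ["exp"]},   # reject tokens without an expiration
    )
    if claims.get("webhookId") != my_webhook_id:   # assumed claim name
        raise ValueError("Token was not issued for this webhook")
    return claims
```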
Note on aggregation: as of now, we will not be implementing aggregate or “summary events”. If the use case arises, this is a possible future extension of the feature.
Note on Persistence of Events
The current recommended design involves persisting events in Synapse for 2 days in order to provide an abstraction over the events. This prevents the information contained in the event, especially the objectId and the type of event that occurred, from possibly being publicly available. This was discussed in slightly more detail in the footnotes of Security Considerations.
Is it necessary to provide this abstraction in order to protect the information contained in the events? The other option is to remove the abstraction, persist zero state, and directly send the event information upon invocation rather than the eventId. This drastically improves scalability, as we would not write a row to the database for every event that occurs in Synapse, and receivers would not be required to make an API request to retrieve the event information upon every invocation. Again, these events could arrive in bursts of millions, depending on which project the events occur in.
Metrics of Success
- Number of event-receivers using the feature.
- Direct feedback from event-receivers: how does the feature optimize the performance of their tasks?
- Latency of the system: the time between when the event occurs and when the receiver learns about it.
- Reliability of the system: can Synapse successfully distribute every event that occurs, especially when large batches occur instantaneously? Are any events missed and not distributed to receivers?
- Cost of persisting events?
Appendix
Other Discussed Potential Solutions
Strictly Polling
One potential “solution” to this problem is to not add any new features and tell event-receivers to continue using the existing polling mechanisms to accomplish their tasks. This is not recommended, as the upside of implementing a new solution outweighs the cost of doing so.
Long-Polling
Long-polling is a technique where the client makes a request to the server, and the server holds the request open until it has data to send. In Synapse, implementing long-polling could reduce the delay between an event occurrence and its discovery by a client compared to traditional polling.
Why It Wasn't Chosen:
- Resource Intensive: Long-polling requires the server to keep connections open for an extended period, which can significantly consume server resources, especially under heavy load or with a large number of concurrent clients.
- Limited Scalability: It could introduce scalability issues as the number of Synapse users grows, requiring a more robust infrastructure to manage many persistent connections.
- Complex Client Logic: Clients must handle various edge cases such as timeouts and connection drops, which adds complexity to the client-side implementation.
- Unnecessary Speed: The near-instantaneous speed of long-polling is overkill for our problem statement.
Server-Sent Events
Server-Sent Events (SSE) is a standard technology enabling servers to push data to the web browser (or other clients) over a single, long-lived HTTP connection. It is particularly well-suited for scenarios where the server needs to send real-time updates to clients, such as notifications, live status updates, or, more recently, AI chatbots, e.g. ChatGPT.
Why It Wasn't Chosen:
- Connection Persistence Overhead: SSE requires the server to maintain an open connection for each client. This can be resource-intensive, leading to increased server load and potential scalability issues as the number of simultaneous connections grows. For a platform like Synapse, which may serve a large user base, managing these connections could become problematic.
- Handling High Volumes of Messages: SSE can struggle with efficiently managing large volumes of messages sent in quick succession. In environments where event updates are frequent and voluminous, SSE might not be able to maintain performance, leading to potential delays in message delivery or issues with the chronological order of messages.
- Infrastructure Complexity for High Availability: Implementing SSE in a highly available and fault-tolerant manner requires sophisticated infrastructure, such as load balancers capable of handling long-lived connections and gracefully managing connection persistence across server restarts or failures.
WebSockets
WebSockets provide full-duplex communication channels over a single long-lived connection, making them an ideal solution for interactive applications requiring real-time data flow in both directions.
Why It Wasn't Chosen:
- Bi-Directional Overkill: Synapse's requirements center around one-way communication from server to client. WebSockets, while powerful, offer more than what's needed and could thus be considered over-engineered for this context.
- Infrastructure Demands: Maintaining WebSocket connections requires persistent state on the server and could potentially lead to high memory usage, especially if the volume of events is high.
- Security Complexity: Implementing secure WebSocket connections requires careful management, including proper authentication and encryption, which adds to the complexity of the system.