
Synapse AI Integration

Related JIRA

  • PLFM-8485

  • PLFM-8484

  • DESIGN-1463

  • IBCDPE-1014

Introduction

To support searching across the content of Synapse, the backend employs Amazon CloudSearch as its index and query engine. CloudSearch is a fully managed service that supports both unstructured text and structured data. Synapse currently uses both types of data for its search capabilities, and additionally uses row-level filtering to exclude results that the user does not have read access to.

Over the years, a recurring theme has been that the current search functionality does not meet user needs. CloudSearch has not been updated in several years and has not kept up with recent search advancements (e.g. semantic search), while alternative solutions have become available within AWS itself (e.g. Amazon OpenSearch Service).

Most recently, with the increased interest and advancement in AI, new technologies have become available that could better help users find content of interest. In particular, large language models (LLMs) have emerged as the best available technique to process and generate natural language text.

Large Language Models

LLMs are deep learning models based on the transformer architecture that are pre-trained on vast amounts of data. The last few years have seen several iterations of these models, driven by their general-purpose language generation capabilities, and more recently multimodal models have become available that extend input and output to other types of media (e.g. images, video).

Several companies have started providing API access to foundation models (FMs) that are pre-trained on vast amounts of data and work well for general-purpose knowledge. LLMs are a specific type of FM used for language generation; for simplicity, this document refers to LLMs throughout. Because these models provide a human-like interaction, further research has gone into improving their performance on domain-specific knowledge.

LLMs face several challenges when used as is with unseen data. Due to the nature of their architecture they tend to “hallucinate”, and they are limited to the static knowledge they were trained on. Since training this kind of model can take months or years, this is a problem when the models are used to reason about “recent” or “unseen” data.

The power of LLMs comes from their ability to adapt and take context into consideration, which makes it possible to improve performance on domain-specific knowledge. A few techniques are available today to give the models up-to-date information that they can use to answer questions more precisely:

  • Zero-shot learning: This is the “base” capability of an LLM. By interacting with the LLM in a context window, a user can give it “instructions” without providing any examples.

  • Few-shot learning: The LLM is provided a few concrete examples as part of the input, which allows it to reason based on the provided examples.

  • Fine-tuning: An LLM can undergo a further training session (called fine-tuning) on domain-specific knowledge. While this is usually effective and faster than the pre-training phase, it is still a very expensive process.

Both zero-shot and few-shot learning are based on the concept of prompt engineering.
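The difference between zero-shot and few-shot prompting can be illustrated with a minimal sketch. The helper names, the instruction text and the example questions below are all hypothetical and only show how the prompt text changes; no specific model or API is assumed.

```python
from typing import List, Tuple


def zero_shot_prompt(instruction: str, question: str) -> str:
    """Zero-shot: only an instruction and the question, no examples."""
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"


def few_shot_prompt(
    instruction: str, examples: List[Tuple[str, str]], question: str
) -> str:
    """Few-shot: the same instruction plus worked question/answer examples."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{instruction}\n\n{shots}\n\nQuestion: {question}\nAnswer:"


# Hypothetical instruction and examples, for illustration only.
INSTRUCTION = "Classify the Synapse entity type the user is looking for."
EXAMPLES = [
    ("Where can I find RNA-seq files for this study?", "file"),
    ("Is there a wiki describing the project?", "wiki"),
]
QUESTION = "Show me tables with clinical data"

zero = zero_shot_prompt(INSTRUCTION, QUESTION)
few = few_shot_prompt(INSTRUCTION, EXAMPLES, QUESTION)
```

In both cases the model itself is unchanged; only the text placed in the context window differs, which is why both techniques fall under prompt engineering.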

Retrieval-Augmented Generation

RAG techniques were introduced to further improve the performance of LLMs on domain-specific knowledge while lowering the chance of hallucinations. The idea is to provide the LLM with a user query, engineered prompts and an enhanced context built from the results of an external authoritative source.

This is important in the context of finding documents: a system can use a search index to first find relevant documents and then feed that information to the LLM to enable a domain-specific interaction.

Technically, the search index is typically built using a vector database of embeddings for the documents, so that the results can be effectively injected into the context of an LLM. Note that computing embeddings is a crucial part of these systems, and different techniques are often employed to extract embeddings from a user query and from the text of the indexed documents.
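The retrieve-then-augment flow can be sketched in a few lines. This is a toy illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the in-memory dictionary stands in for a vector database, and the document IDs and texts are made up. The final prompt would be sent to an LLM, which is not shown.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, index: Dict[str, Counter], k: int = 2) -> List[str]:
    """Return the IDs of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda doc_id: cosine(q, index[doc_id]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, doc_ids: List[str], docs: Dict[str, str]) -> str:
    """Inject the retrieved documents into the LLM context ahead of the question."""
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical documents, for illustration only.
DOCS = {
    "syn1": "RNA-seq dataset from Alzheimer's disease brain tissue samples",
    "syn2": "Wiki page describing the data access request process",
    "syn3": "Clinical table for a breast cancer cohort",
}
INDEX = {doc_id: embed(text) for doc_id, text in DOCS.items()}

query = "Which dataset has RNA-seq data for Alzheimer's disease?"
top = retrieve(query, INDEX, k=2)
prompt = build_prompt(query, top, DOCS)
```

A real system would replace `embed` with a model that captures meaning rather than exact word overlap, which is what makes semantic search possible.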

A Synapse Chatbot

We could envision a feature in Synapse where the user can search for datasets by chatting with Synapse itself and ask questions about the results, potentially also chatting with the data itself (e.g. several systems allow users to “chat” with documents such as Word documents, PDFs, images, etc.). This is exactly what a RAG system enables. Building a RAG system requires a complex architecture with many different types of components and choices. Companies such as AWS, Microsoft and Google provide managed solutions to build such systems in a simplified form. There is a plethora of companies providing solutions and frameworks (e.g. LangChain) to implement such an architecture, but we will focus on a few options from companies that have the ability and resources to support them in the long term.

There are various options we could experiment with. Here is a brief overview that can aid in the choice of a potential platform:

| Solution | Integration Complexity | Fully Managed | Comment |
|---|---|---|---|
| OpenAI | Very High | No | Provides access to separate APIs that would allow us to build a RAG system, but it is not a managed solution and would require a considerable amount of effort and maintenance to build. Custom GPTs would allow us to build an application that integrates external knowledge, for example from a search API. This might be worth investigating for “chatting with a project/folder”. |
| Azure AI Search | High | No | See RAG and generative AI - Azure AI Search. Similar to the OpenAI solution, Azure provides all the tools needed (plus the search index management) to build a RAG solution, but it would nevertheless be a big effort. |
| AWS Bedrock | Medium | Yes | Provides a fully managed solution that is integrated with AWS, provides access to several foundation models and does most of the heavy lifting. The knowledge base component is what enables the RAG system, managing the storage and retrieval of external data. The default vector database is an AWS OpenSearch index. |
| AWS Kendra | High | No | A managed semantic search index (starts from $2k/month for 100K documents) that can potentially be used as part of a RAG system. Kendra can potentially be integrated into Bedrock as the backing document index. |
| Vertex AI Search | High | Yes | Provides a (presumably) fully managed solution to deploy a RAG system using their own models. |

Recommendation

From the options above, our recommendation is to build a prototype based on AWS Bedrock, creating a knowledge base with at least public content. One idea is to initially search only through projects and then, in the context of a project/folder, further customize the search/chat experience to the files/wikis/discussions contained in the project. The prototype would be released under a feature flag for internal users to try out, with a feedback collection mechanism (e.g. question/answer plus satisfaction score).

The choice is driven by the fact that:

  1. We are already on AWS, we know how to use it, and we do not have to learn another cloud platform/API.

  2. There is currently no easy way to quickly test this out and compare across solutions. Ideally, a first prototype would let us build a static knowledge base that can be reused, including the indexed content and the questions/answers plus feedback.

  3. Bedrock seems to be the most comprehensive and easiest to integrate solution available, with several tools that include model evaluation (https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation.html), prompt management (https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-management.html), content filtering (see https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html), etc.

  4. Bedrock provides access to several different foundation models that can be tried (see https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).

  5. Bedrock has native support for metadata filtering (see “Knowledge Bases for Amazon Bedrock now supports metadata filtering”) that could be used for access control, in a similar way to what we do with CloudSearch today.

  6. We could potentially reuse OpenSearch for both normal search and the backing vector database.
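As an illustration of the metadata filtering point above, here is a minimal sketch of how a knowledge base query could be restricted to content the caller can read. The request shape follows our understanding of the Bedrock knowledge base Retrieve API and should be checked against the current AWS documentation. The `project_id` metadata attribute, the knowledge base ID and the list of readable projects are hypothetical; the real access control model in Synapse may need a different attribute.

```python
from typing import Any, Dict, List


def build_retrieve_request(
    knowledge_base_id: str,
    query: str,
    readable_project_ids: List[str],
    max_results: int = 5,
) -> Dict[str, Any]:
    """Build the arguments for a knowledge base retrieval limited to readable projects.

    Assumption: each indexed document carries a `project_id` metadata
    attribute (hypothetical name) that identifies its parent project.
    """
    if not readable_project_ids:
        # With nothing readable there is nothing to search; avoid an empty filter.
        raise ValueError("caller has no readable projects; skip the search")
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": max_results,
                # Only return chunks whose project_id is in the readable set.
                "filter": {
                    "in": {"key": "project_id", "value": list(readable_project_ids)}
                },
            }
        },
    }


# Placeholder values, for illustration only.
request = build_retrieve_request(
    knowledge_base_id="KB_ID_PLACEHOLDER",
    query="RNA-seq datasets for Alzheimer's disease",
    readable_project_ids=["syn100", "syn200"],
)
```

Under these assumptions the resulting dictionary would be passed as keyword arguments to the `retrieve` operation of the boto3 `bedrock-agent-runtime` client. Applying the filter at retrieval time keeps unreadable content out of the LLM context entirely, rather than trying to filter the generated answer afterwards.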

Pricing is a complex topic, as all the above solutions follow a pay-for-what-you-use model that depends on the user input, number of tokens, type of model used and amount of data. Given this recommendation, we estimate the cost of Bedrock to be around $1k-$2k/month depending on usage (the default vector database is AWS OpenSearch, which should cost between $400 and $800/month).

Note that Thomas Yu has already set up a similar experiment with public wikis, see https://sagebionetworks.jira.com/wiki/spaces/DPE/pages/3492282370