Related JIRA
Introduction
To support searching across the content of Synapse, the backend employs Amazon CloudSearch as its index and query engine. CloudSearch is a fully managed solution that supports both unstructured text and structured data. Synapse uses both types of data for its search capabilities, and additionally uses row-level filtering to exclude results that users might not have read access to.
Over the years, a common theme has emerged: the current search functionality does not seem to fulfill user needs. CloudSearch has not been updated in several years and has not kept up with the latest search advancements (e.g. semantic search), while alternative solutions have become available in AWS itself (e.g. https://aws.amazon.com/opensearch-service/ ).
Most recently, with the increased interest and advancement in AI, new technologies have become available that could help users find content of interest. In particular, large language models (LLMs) have emerged as the best available technique to process and generate natural language text.
Large Language Models
LLMs are deep learning models that adopt a transformer architecture and are pre-trained on vast amounts of data. In the last few years there have been several iterations of these models, driven by their general-purpose language generation capabilities, and more recently multimodal models have become available that extend input and output to different types of media (e.g. images, video).
Several companies started providing API access to foundation models (FMs) that are pre-trained on vast amounts of data and are good for general-purpose knowledge. LLMs are a specific type of FM used for language generation; in this document we refer to LLMs for simplicity. Because of their ability to provide a human-like interaction, further research went into improving their performance on domain-specific knowledge.
LLMs face several challenges when used as is with unseen data. Due to the nature of their architecture they tend to “hallucinate”, and they are limited to the static knowledge they were trained on. Since training these models can take months or years, this is a problem when the models are used to reason about “recent” or “unseen” data.
The power of LLMs comes from their ability to adapt and take context into consideration, which makes it possible to improve their performance on domain-specific knowledge. There are a few techniques available today to provide the models with up-to-date information that they can use to answer questions more precisely:
Zero-shot learning: This is the “base” capability of an LLM. By interacting with the LLM in a context window, a user can give it “instructions”.
Few-shot learning: The LLM is provided with a few concrete examples as part of the input, which allows it to reason based on the provided examples.
Fine-tuning: An LLM can undergo a further training session (called fine-tuning) on domain-specific knowledge. While this is usually effective and faster than the pre-training phase, it is still a very expensive process.
Both zero-shot and few-shot learning are based on the concept of prompt engineering.
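To make the difference between zero-shot and few-shot prompting concrete, here is a minimal sketch that only builds the prompt text. The function names, the instruction and the example question/answer pairs are all hypothetical and chosen for illustration; no model is called.

```python
from typing import Sequence


def zero_shot_prompt(instruction: str, question: str) -> str:
    """Zero-shot: only an instruction and the question, no worked examples."""
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"


def few_shot_prompt(
    instruction: str,
    examples: Sequence[tuple[str, str]],
    question: str,
) -> str:
    """Few-shot: the same instruction, followed by a few question/answer pairs."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{instruction}\n\n{shots}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # Hypothetical instruction and examples, for illustration only.
    instruction = "Name the kind of Synapse content the user is looking for."
    examples = [
        ("Where can I find the RNA-seq counts table?", "table"),
        ("Is there a page describing the study design?", "wiki"),
    ]
    question = "Which folder contains the imaging files?"
    print(zero_shot_prompt(instruction, question))
    print("---")
    print(few_shot_prompt(instruction, examples, question))
```

In both cases the model itself is unchanged; only the text placed in the context window differs, which is why both techniques fall under prompt engineering.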
Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) techniques have been introduced to further optimize the performance of LLMs on domain-specific knowledge with a lower chance of hallucinations. The idea is to provide the LLM with three inputs: the user query, engineered prompts, and an enhanced context made of results retrieved from an external authoritative source.
This is important in the context of finding documents: a system can use a search index to first find relevant documents and then feed that information to the LLM to enable a domain-specific interaction.
Technically, the search index is built using a vector database of embeddings for the documents, so that the results can be effectively injected into the context of an LLM. Note that computing embeddings is a crucial part of these systems, and different techniques are often employed to extract embeddings from a user query and from the text of the indexed documents.
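The retrieval step can be sketched as follows. This is a toy illustration under loud assumptions: the "embedding" is a plain word count rather than the output of an embedding model, the vector database is an in-memory dictionary, and the document ids and texts are made up. It only shows the shape of the flow: embed the query, rank documents by similarity, and inject the best matches into the prompt.

```python
import math
import re
from collections import Counter

# Hypothetical documents keyed by made-up ids, for illustration only.
DOCUMENTS = {
    "doc-1": "RNA-seq gene expression data from Alzheimer's disease brain tissue",
    "doc-2": "Whole genome sequencing of pediatric tumor samples",
    "doc-3": "Wiki page describing how to upload files with the Python client",
}


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, index: dict[str, Counter], k: int = 1) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    return sorted(index, key=lambda doc_id: cosine(q, index[doc_id]), reverse=True)[:k]


def build_prompt(query: str, doc_ids: list[str]) -> str:
    """Inject the retrieved documents into the prompt as context."""
    context = "\n".join(f"[{doc_id}] {DOCUMENTS[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    index = {doc_id: embed(text) for doc_id, text in DOCUMENTS.items()}
    query = "gene expression in Alzheimer's disease"
    print(build_prompt(query, retrieve(query, index)))
```

A managed RAG solution replaces `embed` with a real embedding model and the dictionary with a vector database, but the query, retrieve and inject steps stay the same.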
A Synapse Chatbot
We could envision a feature in Synapse where the user searches for datasets by chatting with Synapse itself and asks questions about the results, potentially going further and chatting with the data itself (e.g. several systems allow users to “chat” with documents such as Word documents, PDFs, images, etc.). This is exactly what a RAG system enables. Building a RAG system requires a complex architecture with many different components and choices. Companies such as AWS, Microsoft and Google provide managed solutions to build such systems in a simplified form. There is a plethora of companies providing solutions and frameworks (e.g. langchain) to implement such an architecture, but we will focus on a few options from companies that have the ability and resources to back them in the long term.
In particular there are various options we could experiment with:
OpenAI: The leading company, providing access to the most common models available (ChatGPT). They provide API access to various tools that would enable us to implement a RAG system, with separate APIs to generate embeddings and to call ChatGPT, and potentially to chat with “files” through their Assistants API (in beta) and file search.
OpenAI has a program for non-profits: https://openai.com/index/introducing-openai-for-nonprofits/ .
Custom GPTs could also be used, integrated with an external API over a search index.
Azure AI Search: Microsoft provides managed components to build a RAG system based on OpenAI models. Note that this is different from Azure OpenAI, which provides access to the OpenAI models, and tools to fine-tune them, on the Microsoft Azure computing platform.
Knowledge Bases for AWS Bedrock: The AWS Bedrock platform provides access to different types of proprietary and open source foundation models for different use cases, and provides a fully managed RAG solution.
AWS Kendra: Provides a semantic search index that can potentially be used to build a RAG solution using their retrieve API.
Vertex AI Search: Google provides a managed RAG solution based on their own models.
Here is a brief overview that can aid in the choice of a potential platform:
Solution | Integration Complexity | Fully Managed | Comment |
---|---|---|---|
OpenAI | Very High | No | Provides access to separate APIs that would allow us to build a RAG system, but it is not a managed solution and would require a considerable amount of effort and maintenance to build. Custom GPTs would allow us to build an application that integrates external knowledge, for example from a search API. This might be worth investigating for “chatting with a project/folder”. |
Azure AI Search | High | No | See https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview. Similar to the OpenAI solution, Azure provides all the tools needed (plus the search index management) to build a RAG solution, but it would nevertheless be a big effort. |
AWS Bedrock | Medium | Yes | Provides a fully managed solution that is integrated with AWS, provides access to several open source models and does most of the heavy lifting. The knowledge base component is what enables the RAG system, managing the storage and retrieval of external data. The default vector database is an AWS OpenSearch index. |
AWS Kendra | High | No | It’s a managed semantic search index (starts from $2k/month for 100K documents) that can potentially be used as part of a RAG system. Kendra can potentially be integrated into Bedrock as the backing document index. |
Vertex AI Search | High | Yes | Provides a (presumably) fully managed solution to deploy a RAG system using their own models. |
Recommendation
From the options above, our recommendation is to build a prototype based on AWS Bedrock, creating a knowledge base with at least public content. One idea might be to initially search only through projects and, in the context of a project/folder, further customize the search/chat experience to the files/wiki/discussions contained in the project. The prototype would be released under a feature flag for internal users to try out, with a feedback collection mechanism (e.g. question/answer plus satisfaction score).
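As a rough illustration of the feedback collection mechanism, a record could be as small as the sketch below. The class name, field names and the 1 to 5 satisfaction scale are assumptions made for this example, not an agreed design.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ChatFeedback:
    """One question/answer exchange plus the user's rating (hypothetical shape)."""

    question: str
    answer: str
    satisfaction: int  # assumed scale: 1 (poor) to 5 (excellent)

    def __post_init__(self) -> None:
        if not 1 <= self.satisfaction <= 5:
            raise ValueError("satisfaction must be between 1 and 5")


def average_satisfaction(records: list[ChatFeedback]) -> float:
    """Mean rating across the collected records, or 0.0 when there are none."""
    return float(mean(r.satisfaction for r in records)) if records else 0.0


if __name__ == "__main__":
    records = [
        ChatFeedback("Which projects have RNA-seq data?", "Example answer A", 4),
        ChatFeedback("Where is the study wiki?", "Example answer B", 2),
    ]
    print(average_satisfaction(records))
```

Keeping the question and answer alongside the score means the same records could later be replayed against another solution for comparison.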
The choice is driven by the fact that:
We are already on AWS, we know how to use it, and we do not have to learn another cloud platform/API.
There is no easy way today to quickly test this out and compare across solutions. Ideally, with a first prototype we could build a static knowledge base that can be reused, including the indexed content and the questions/answers plus feedback.
Bedrock seems to be the most comprehensive and easiest to integrate solution out there, even though it might not have the best models available.
Bedrock has native support for metadata filtering (https://aws.amazon.com/about-aws/whats-new/2024/03/knowledge-bases-amazon-bedrock-metadata-filtering/) that could be used for access control, similar to what we do with CloudSearch today.
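As a sketch of how metadata filtering might express access control, the code below builds the request for a knowledge base retrieval. It assumes the request shape of the `retrieve` operation on the boto3 `bedrock-agent-runtime` client and the `listContains` / `orAll` filter operators as I understand them from the Bedrock documentation; these should be verified before use. The `read_principals` metadata key, the knowledge base id and the principal ids are hypothetical. The code only builds the dictionary and does not call AWS.

```python
from typing import Any


def read_access_filter(
    principal_ids: list[int], key: str = "read_principals"
) -> dict[str, Any]:
    """Match documents whose (hypothetical) ACL metadata lists any given principal."""
    if not principal_ids:
        raise ValueError("at least one principal id is required")
    clauses = [
        {"listContains": {"key": key, "value": str(pid)}} for pid in principal_ids
    ]
    # Assumption: orAll needs at least two clauses, so a single clause is used as is.
    return clauses[0] if len(clauses) == 1 else {"orAll": clauses}


def retrieve_request(
    kb_id: str, query: str, principal_ids: list[int], top_k: int = 5
) -> dict[str, Any]:
    """Build keyword arguments for a filtered knowledge base retrieval."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "filter": read_access_filter(principal_ids),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical knowledge base id and principal ids.
    request = retrieve_request("KB_ID_PLACEHOLDER", "Alzheimer's RNA-seq", [111, 222])
    print(request)
```

Under these assumptions the dictionary would be passed as `client.retrieve(**request)`, with each indexed document carrying the list of principals that can read it in its metadata.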
Pricing is a complex topic, as all the above solutions follow a pay-for-what-you-use model that depends on the user input, number of tokens, type of model used and amount of data. Given our recommendation, we estimate the cost of Bedrock to be around $1k-$2k/month depending on usage (the default vector database is AWS OpenSearch, which should cost between $400 and $800/month).
Note that Thomas Yu already set up a similar experiment with public wikis: https://sagebionetworks.slack.com/archives/C04JWS00LJ1/p1716958351028619