Picture of a futuristic library generated by Adobe Firefly

Secrets to Optimizing RAG LLM Apps for Better Performance, Accuracy and Lower Costs!

Madhukar Kumar
7 min read · Sep 23, 2023


It has been a few months since Retrieval Augmented Generation (RAG) was introduced as a pattern for building Large Language Model (LLM) apps. If you are unfamiliar with this pattern, I suggest you first read this article, which describes it as one of the steps in building an enterprise LLM app.

In short, RAG, also known as in-context or real-time learning, allows querying a corpus of data (for instance, a corpus of enterprise data behind a firewall), finding matches relevant to a user query, and using these results to enrich the context passed to the LLM.

Retrieval Augmented Generation (RAG) Pattern

As an increasing number of enterprises and developers begin to build apps using this pattern, several best practices have emerged, helping RAG evolve into a more mature framework for building production apps. This article discusses these best practices and tools, drawing primarily from conversations with other developers and teams, as well as my own experience in building a production-ready application using the RAG pattern.

Building an app to prove a concept is one thing, but developing a service or an app for production use by multiple users is an entirely different endeavor. When creating an application for production, considerations range from quality, scalability, and reliability to cost.

In this context, for a RAG pattern, the questions to consider include:
1. How fast is the overall application? In other words, how long does it take from the user typing the input query to the app replying with the response?
2. How can the cost and complexity of creating and storing vectors be reduced?
3. How accurate are the results from the RAG pipeline that are being fed to the LLM? Is there a way to further improve this quality?

Most teams and developers I speak to use some sort of vector database to create and store embeddings (aka vectors). They typically have an asynchronous job that reads data from multiple data sources, calls an embedding model to convert the data into vectors, and then inserts/updates vector indexes in a vector store or library. When a user query comes in, it is converted to an embedding in real time and then matched against the vector indexes using similarity (aka semantic) search. Let's break this process into its three major steps (vectorization, storage, and retrieval) and explore optimizations for each.
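
As a point of reference, here is a minimal, library-agnostic sketch of that flow in Python. The embed, vector_store, and llm objects are hypothetical placeholders for whichever embedding model, vector database, and LLM you use; they are not from any specific library.

```python
# Minimal, library-agnostic sketch of the RAG flow described above.
# embed(), vector_store, and llm() are hypothetical placeholders for your
# embedding model, vector database, and LLM of choice.

def chunk_text(text, size=1000, overlap=200):
    """Split a document into overlapping chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def build_index(documents, embed, vector_store):
    """Batch/async job: chunk the source data, embed it, and upsert vectors."""
    for doc_id, text in documents:
        for n, chunk in enumerate(chunk_text(text)):
            vector = embed(chunk)  # call the embedding model
            vector_store.upsert(id=f"{doc_id}-{n}", vector=vector,
                                metadata={"text": chunk})

def answer(query, embed, vector_store, llm):
    """Real-time path: embed the query, run semantic search, enrich the prompt."""
    query_vector = embed(query)
    matches = vector_store.search(query_vector, top_k=5)  # similarity search
    context = "\n\n".join(m.metadata["text"] for m in matches)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)
```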

Vector/Embeddings Creation
This step offers many opportunities for optimization, depending on your overall goals. Adding metadata to your embedding chunks is a good practice, but bear in mind that some vector-only databases allow only up to 40 KB of metadata. Therefore, consider using a full-stack real-time database, for example SingleStore, that lets you store vectors alongside structured data, giving you the added advantage of joining different types of data with vectors to retrieve results.
There are also other methodologies for creating embeddings. For instance, if you're optimizing for space, you can chunk the data, summarize each chunk, concatenate all the summaries, then create an embedding for the final summary.

Creating Embeddings Optimized for Storage
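
A minimal sketch of this space-optimized approach, assuming hypothetical summarize() and embed() callables (any summarization model and embedding model will do):

```python
def storage_optimized_embedding(document_text, summarize, embed, chunk_size=2000):
    """Chunk the document, summarize each chunk, concatenate the summaries,
    and create a single embedding for the combined summary."""
    chunks = [document_text[i:i + chunk_size]
              for i in range(0, len(document_text), chunk_size)]
    summaries = [summarize(chunk) for chunk in chunks]    # one summary per chunk
    combined_summary = " ".join(summaries)                # concatenate summaries
    return combined_summary, embed(combined_summary)      # one vector per document
```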

If you’re optimizing for accuracy, a good practice is to first summarize the entire document, then store the summary text and the embedding together. For the rest of the document, you can simply create overlapping chunks and store the embedding and the chunk text together. The richer your metadata, the faster you can pre-filter the record set before running your vector function.

Creating Embeddings Optimized for Accuracy
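
And a sketch of the accuracy-optimized variant, again with summarize() and embed() as stand-ins; each document yields one summary record plus one record per overlapping chunk:

```python
def accuracy_optimized_embeddings(document_text, summarize, embed,
                                  chunk_size=1000, overlap=200):
    """Store (text, vector) pairs for a whole-document summary plus every
    overlapping chunk, so retrieval can match at both granularities."""
    records = []

    summary = summarize(document_text)                    # whole-document summary
    records.append({"text": summary, "vector": embed(summary), "kind": "summary"})

    step = chunk_size - overlap
    for i in range(0, len(document_text), step):          # overlapping chunks
        chunk = document_text[i:i + chunk_size]
        records.append({"text": chunk, "vector": embed(chunk), "kind": "chunk"})

    return records
```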

Batch Jobs and Upserts — Consider building vector indexes as a batch job and performing upserts only when you have new relevant data. In SingleStore, you can do this by adding a timestamp column that can also serve as a metadata filter for downstream retrieval. If you’re using a vector-only database, remember that when you perform an upsert, the vector isn’t immediately available for reading and retrieval.
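
A rough sketch of what such a batch upsert could look like in SingleStore, using a timestamp column and an ON DUPLICATE KEY UPDATE clause. The table and column names are made up for illustration, and JSON_ARRAY_PACK converts a JSON array of floats into SingleStore's packed binary vector format:

```python
import datetime
import json

# Hypothetical table for illustration:
#   CREATE TABLE docs (
#       id BIGINT PRIMARY KEY,
#       body TEXT,
#       embedding BLOB,       -- packed vector
#       updated_at DATETIME   -- doubles as a metadata filter downstream
#   );

UPSERT_SQL = """
    INSERT INTO docs (id, body, embedding, updated_at)
    VALUES (%s, %s, JSON_ARRAY_PACK(%s), %s)
    ON DUPLICATE KEY UPDATE
        body = VALUES(body),
        embedding = VALUES(embedding),
        updated_at = VALUES(updated_at)
"""

def upsert_batch(conn, rows, embed):
    """Batch job: upsert only rows that are new or have changed."""
    now = datetime.datetime.utcnow()
    with conn.cursor() as cur:
        for row in rows:
            vector = embed(row["body"])
            cur.execute(UPSERT_SQL,
                        (row["id"], row["body"], json.dumps(vector), now))
    conn.commit()
```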

Fine-Tuning an Embedding Model — Similar to LLMs, embedding models are trained on datasets that may not include the vocabulary or concepts specific to your company. For instance, if your company has a project named “Bluebird” with a unique taxonomy, an off-the-shelf embedding model may not understand the context. To overcome this, you can fine-tune an embedding model on your own data. This is a complex topic with various fine-tuning methods, but for now, you can refer to the LlamaIndex library and its steps for generating synthetic queries and responses from your data, which you then use to fine-tune a model.
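
A rough sketch of that LlamaIndex flow is below. The exact import paths and function names have moved between LlamaIndex versions, so treat this as a direction to explore rather than copy-paste code; the corpus path and model IDs are placeholders.

```python
# Sketch of LlamaIndex's embedding fine-tuning flow (version-dependent APIs).
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    SentenceTransformersFinetuneEngine,
)

# 1. Load and chunk your own documents.
docs = SimpleDirectoryReader("./corpus").load_data()
nodes = SimpleNodeParser.from_defaults().get_nodes_from_documents(docs)

# 2. Use an LLM to generate synthetic (query, relevant chunk) training pairs.
train_dataset = generate_qa_embedding_pairs(nodes)

# 3. Fine-tune an off-the-shelf sentence-transformers embedding model on them.
engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",              # base model (placeholder choice)
    model_output_path="finetuned_embeddings",  # where the tuned model is saved
)
engine.finetune()
embed_model = engine.get_finetuned_model()
```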

Vector Storage
Most vector-only databases have compute and storage tightly coupled, which means that as your number of dimensions or vectors grows, you have to upgrade to increasingly larger pods, which can become extremely expensive very quickly. Consider solutions where compute is disaggregated from storage, and explore options for storing indexes in a compressed format so they occupy less storage without affecting performance. For example, in SingleStore, you can store vectors as binary data in BLOB columns and use native functions to pack the data at insertion and unpack it into JSON at retrieval.
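
A minimal sketch of that round trip; the docs table and its columns are hypothetical, while JSON_ARRAY_PACK and JSON_ARRAY_UNPACK are the native SingleStore functions for converting between a JSON array and the packed binary format:

```python
# Hypothetical schema and read path for blob-packed vectors in SingleStore.

CREATE_TABLE_SQL = """
    CREATE TABLE IF NOT EXISTS docs (
        id BIGINT PRIMARY KEY,
        body TEXT,
        metadata JSON,
        embedding BLOB    -- packed float vector, far smaller than a JSON string
    )
"""

READ_SQL = """
    SELECT id, body, JSON_ARRAY_UNPACK(embedding) AS embedding_json
    FROM docs
    WHERE id = %s
"""

def read_vector(conn, doc_id):
    """Fetch a row and unpack its vector back into a JSON array."""
    with conn.cursor() as cur:
        cur.execute(READ_SQL, (doc_id,))
        return cur.fetchone()
```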

Resources — You can use LlamaIndex for incremental updates to the indexes. If you’re using SingleStore, a simple update statement makes the data immediately available to everyone (within a few milliseconds).

Vector/Context Retrieval
The retrieval part is arguably the most crucial step of the RAG pipeline and can be optimized for both performance and accuracy.

Performance
Traditional caching systems use various techniques to store data or queries so that when another user asks the same or similar query, you don’t have to make a full round trip to generate the same context. However, traditional caching systems use an exact keyword match, which doesn’t work with LLMs where the queries are in natural language. So, how do you ensure you’re not performing a full retrieval each time when the queries are similar?

This is where GPTCache comes in. GPTCache uses semantic search to match new queries against previously asked ones, and if there’s a match, you simply return the cached context instead of performing a full retrieval. GPTCache is an open-source library, and you can refer to its documentation to configure it for your requirements.
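
Here is a rough sketch of initializing GPTCache with semantic (rather than exact-match) lookups, adapted from its documented usage; module paths and options vary by GPTCache version, and the SQLite/FAISS backends are just one example configuration:

```python
from gptcache import cache
from gptcache.adapter import openai                 # drop-in OpenAI wrapper
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Embed cached queries so lookups are semantic, not exact keyword matches.
onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),                            # scalar store for cached answers
    VectorBase("faiss", dimension=onnx.dimension),  # vector index for query embeddings
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# Subsequent calls through the adapter check the semantic cache first.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "How do I reset my VPN password?"}],
)
```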

Another factor to consider in a RAG pattern is how often you call the RAG pipeline versus directly sending the result to the LLM. For instance, if a user is simply saying, “Hi there,” or “What is the weather like in Seattle right now?” you may not want to trigger a context retrieval chain. Instead, you can optimize by either responding directly or handing over the query directly to the LLM. An efficient way to do this is using a library called NeMo-Guardrails. You can use this library to configure what kinds of queries should trigger a RAG function and reply to all other queries in a canned manner or by handing the query over to the LLM. Using this library also helps ensure you’re filtering out any offensive or undesirable queries going to the LLMs (for example, hateful or any other objectionable content defined by your corporate policies).
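
A minimal sketch of wiring up NeMo-Guardrails; the config directory (YAML model settings plus Colang flows defining which queries get canned replies, which trigger retrieval, and which are blocked) is assumed to exist and is not shown here:

```python
from nemoguardrails import LLMRails, RailsConfig

# Load guardrail definitions (YAML + Colang flows) from a local config folder.
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

# Simple greetings or off-topic/offensive queries are handled by the rails;
# only queries your flows route to retrieval will trigger the RAG chain.
response = rails.generate(messages=[
    {"role": "user", "content": "Hi there"}
])
print(response["content"])
```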

Finally, instead of using a vector-only store or library, consider using an enterprise database that can do hybrid search (keyword match + vector based) and allows you to join different kinds of data in one SQL query, rather than wasting valuable compute cycles and time moving data around. For example, the SQL below demonstrates how you can mix and match metadata, semantic search, and re-ranking in one single query that executes in a few milliseconds.

Single SQL Statement to do JOINS and Semantic Search in one Query in SingleStore
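
As a textual stand-in for the screenshot, here is a hypothetical hybrid query of that shape. Table names, columns, and score weights are invented for illustration; DOT_PRODUCT, JSON_ARRAY_PACK, and MATCH ... AGAINST are the native SingleStore constructs involved, and a full-text index on docs.body is assumed:

```python
import json

HYBRID_SEARCH_SQL = """
    SELECT d.id,
           d.body,
           p.name AS project,
           DOT_PRODUCT(d.embedding, JSON_ARRAY_PACK(%s)) AS semantic_score,
           MATCH(d.body) AGAINST (%s) AS keyword_score
    FROM docs d
    JOIN projects p ON p.id = d.project_id            -- join structured data
    WHERE p.name = %s                                  -- metadata pre-filter
      AND d.updated_at > NOW() - INTERVAL 90 DAY       -- recency filter
    ORDER BY 0.7 * DOT_PRODUCT(d.embedding, JSON_ARRAY_PACK(%s))
           + 0.3 * MATCH(d.body) AGAINST (%s) DESC     -- simple weighted re-rank
    LIMIT 5
"""

def hybrid_search(conn, query_text, query_vector, project_name):
    """Keyword + semantic search with a metadata JOIN, all in one query."""
    packed = json.dumps(query_vector)
    with conn.cursor() as cur:
        cur.execute(HYBRID_SEARCH_SQL,
                    (packed, query_text, project_name, packed, query_text))
        return cur.fetchall()
```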

Accuracy
Your RAG pipeline doesn’t have to be a black box when it comes to retrieving context before handing it over to LLMs. So, how do you know if the context being retrieved is accurate and that users aren’t getting frustrated by receiving incorrect or incomplete answers?

You can use RAGAS, another open-source library, to evaluate the context retrieved from your enterprise corpus of data.

The steps are straightforward. You create a dataset of queries, responses, retrieved contexts, and ground truths (what the answer should have been), feed it to the RAGAS library, and it returns scores for metrics such as faithfulness and relevance. Once you evaluate the scores, you can continue to tweak and iterate on the pipeline to improve its context retrieval accuracy.
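
A rough sketch of such an evaluation run is below. The metric names and dataset field names have shifted between RAGAS versions (for example, ground_truth vs. ground_truths), so check the documentation for your installed version; the sample row is fabricated purely for illustration:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One fabricated example row: query, app response, retrieved chunks, ground truth.
eval_data = Dataset.from_dict({
    "question":     ["What is project Bluebird?"],
    "answer":       ["Bluebird is our internal data platform."],
    "contexts":     [["Bluebird is the internal data platform used by..."]],
    "ground_truth": ["Bluebird is the company's internal data platform."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)   # per-metric scores to track as you iterate on the pipeline
```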

Finally, it’s worth mentioning that LangSmith, introduced by LangChain, is a highly effective tool for monitoring and examining the responses between the app and the LLM. Similar to the developer console on a web browser, once you embed LangSmith in your app, you can view the query, the response, and other useful metrics associated with the query and response that can help you further optimize the application.
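
If your app is built on LangChain, enabling LangSmith is mostly a matter of setting a few environment variables before the app starts; the project name below is an arbitrary example:

```python
import os

# Turn on LangSmith tracing for an existing LangChain app.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "rag-app-debugging"   # example project name

# Run your chains as usual; each run's query, response, latency, and token
# usage then shows up in the LangSmith UI for inspection.
```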

Screenshot of LangSmith Showing Debug Information

Conclusion
As RAG evolves into an architectural framework for creating production-grade LLM apps, there are several ways to improve the accuracy and performance of creating, storing, and retrieving context for LLMs. Open-source libraries like GPTCache, NeMo-Guardrails, LlamaIndex, and RAGAS can help developers and teams build highly performant, accurate, and cost-efficient applications. Ultimately, what matters is how you manage the corpus of data and retrieve accurate results in a few milliseconds to contextualize the LLMs.
