Agent-based RAG (A-RAG) without Vector Store

Pablo Guzmán
6 min read · Jun 7, 2024


Disclaimer: I’m an AWS Sr Solution Architect and this post represents my own opinions.

This blog presents a different strategy for Retrieval Augmented Generation (RAG), one that shifts away from the prevailing trend of building it on top of a vector database.

Introduction: why would you not use vector-based RAG?

Much has been said about RAG lately, so I'm going to presume you have already heard or read about what RAG is and how it's being used; otherwise you may want to stop reading here and get some context before coming back.

The most popular approach is to back RAG with a vector store. There are certainly use cases where it has been successful; this usually happens when:

  1. The question semantically matches the answer.
  2. The answer fits within the chunk size.

However, those two premises aren't that common. You can probably find hundreds of blogs on this site about how to improve RAG, why it doesn't work, or how to do it differently. That's a very strong signal that something is not working, and my personal experience matches it. The chunking + embedding + vector store pipeline has been a major pain in RAG implementations, where the accuracy of finding the right chunks for a given user query has stayed below 60% even in some very optimized scenarios. There are also cases where simply splitting the text into chunks destroys the needed context, and chunking isn't viable at all.

But how do we fix the vector/embedding/chunking issue? Well, we just go another route.

Cost per token is decreasing in LLMs

The cost per token of LLMs is consistently decreasing while their reasoning power keeps increasing. For example, Claude 3 Haiku cut the token cost by 3.2x while improving capability, and GPT-4o is roughly 50% cheaper than GPT-4 Turbo. With every step into the future, it is becoming increasingly more viable to just feed the LLMs the whole damn pie instead of chunking it and dealing with the pieces. I'm convinced that the future will shift away from chunking in favor of this approach.

it is becoming increasingly more viable to just feed the LLMs the whole damn pie instead of chunking it and dealing with the pieces.

But you may ask: hey, what if I have 100 documents? That's way too many tokens to feed to the LLM to answer a simple question that only requires reading one of them. Do you want me to burn my money on hundreds of thousands of tokens per query? No. Feeding all the documents to answer one simple query makes no sense and isn't cost effective. The whole point of RAG is to feed only the text needed to answer the user's question, and that's why I propose an agent-based approach.

Agent-Based RAG with a document catalog

When you are an employee and have to find the answer to some question, what do you do? I usually search my bookmarks for the right manual, open it, and read through it to answer my question. Why should LLMs do it differently?

A-RAG flow
  1. The user queries the agent
  2. The agent looks at the document catalog and decides which document has the answer
  3. The agent retrieves that specific document
  4. Using the retrieved document, the agent answers the user's question

Even better, AI agents have the ability to ask the user questions to clarify their intent. If it isn't clear to the agent which document it needs to pull, it can ask the user to be more specific or to disambiguate between the different manuals.
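To make the flow concrete, here's a minimal, framework-agnostic sketch of the loop. The helpers (llm, load_catalog, load_document_text) and the prompts are placeholders I made up for illustration, not the actual implementation described below:

```python
# Minimal sketch of the A-RAG loop, outside of any agent framework.
# llm, load_catalog, and load_document_text are hypothetical helpers: the
# catalog is a list of {"document_id", "title", "summary"} entries and the
# LLM call is whatever client you use.

def answer_query(user_query: str, llm, load_catalog, load_document_text) -> str:
    catalog = load_catalog()  # small: one id + title + summary per document

    # Steps 1-2: let the model pick the document (or ask for clarification).
    catalog_text = "\n".join(
        f"{d['document_id']}: {d['title']} - {d['summary']}" for d in catalog
    )
    choice = llm(
        "Given this document catalog:\n" + catalog_text +
        f"\n\nWhich single document_id best answers: '{user_query}'? "
        "If you cannot tell, reply with a clarifying question prefixed by 'ASK:'."
    ).strip()

    if choice.startswith("ASK:"):
        return choice  # surface the clarifying question to the user

    # Steps 3-4: pull only that document and answer from it.
    document_text = load_document_text(choice)
    return llm(
        f"Answer the question using only this document:\n{document_text}\n\n"
        f"Question: {user_query}"
    )
```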

Implementation with Amazon Bedrock

In the following sections I will walk through a practical sample using Amazon Bedrock Agents and Claude 3 Sonnet: how to implement it, the results obtained, and how much it costs (at the end).

AWS Architecture

The first step is to create the Bedrock agent and give it its instructions. My prompt looks like this:

You are an agent that will receive a user’s query and you will use the available tools to execute these tasks in order
1. You will call the list-documents tool to query all the available documents, to select the most appropriate document_id that could have the answer to the user’s query. If you can’t determine the document with certainty, ask the user clarifying questions in order to be certain which document should have the answer.
2. Having chosen the document, you must use the get-document-text tool to retrieve the specific document
3. Having retrieved the specific document, you must answer the user’s question using only the information available in the document.
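I created the agent through the console, but if you prefer to script it, a boto3 sketch would look roughly like this. The role ARN is a placeholder and AGENT_INSTRUCTIONS is the prompt above; double-check the parameters against the current Bedrock Agents API:

```python
import boto3

# Sketch only: creating the agent with boto3 instead of the console.
# The role ARN below is a placeholder for your own Bedrock Agents service role.
AGENT_INSTRUCTIONS = """You are an agent that will receive a user's query ..."""  # the prompt shown above

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

response = bedrock_agent.create_agent(
    agentName="a-rag-document-agent",
    foundationModel="anthropic.claude-3-sonnet-20240229-v1:0",
    agentResourceRoleArn="arn:aws:iam::123456789012:role/BedrockAgentRole",
    instruction=AGENT_INSTRUCTIONS,
)
print(response["agent"]["agentId"])
```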

We then create two action groups, each backed by a Lambda that implements one of the two functions we want to call: list-document-catalog and get-document-text.

Action group configuration
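The Lambda behind the action groups is tiny. Here's a sketch of what it could look like, assuming a single function handling both tools and the DynamoDB table described next; the event/response envelope follows the Bedrock Agents function-calling format, so verify it against the current documentation:

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("document-catalog")  # table name is an assumption


def lambda_handler(event, context):
    # Bedrock Agents (function-details style) pass the tool name in "function".
    function = event["function"]

    if function == "list-document-catalog":
        # Return only id + title + summary so the catalog stays small.
        items = table.scan(
            ProjectionExpression="document_id, title, summary"
        )["Items"]
        body = json.dumps(items, default=str)
    elif function == "get-document-text":
        params = {p["name"]: p["value"] for p in event.get("parameters", [])}
        item = table.get_item(Key={"document_id": params["document_id"]})["Item"]
        body = item["document_text"]
    else:
        body = f"Unknown function: {function}"

    # Response envelope expected by Bedrock Agents for function-based action groups.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": function,
            "functionResponse": {"responseBody": {"TEXT": {"body": body}}},
        },
    }
```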

Lastly, we add the document data to DynamoDB. I used a simple structure for this proof of concept

DynamoDB Table Structure
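Seeding the table is just a few put_item calls. The attribute names below are the ones the Lambda sketch above reads; yours may differ depending on how you structure the table:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("document-catalog")  # same hypothetical table as above

# One item per document: an id, a short summary for the catalog,
# and the full text the agent will read to answer questions.
table.put_item(
    Item={
        "document_id": "vacation-policy",
        "title": "Vacation and PTO policy",
        "summary": "How vacation days accrue, how to request time off, carry-over rules.",
        "document_text": open("vacation_policy.txt").read(),
    }
)
```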

How does it work?

It's awesome: it works just as described.

Step 1: The agent decides to list the documents to find the correct one
Step 2: The agent decides which document has the information
Step 3: The agent answers the user’s query

In my empirical tests, compared to vector-based RAG over the same documents, this approach produces more comprehensive answers without missing crucial information that's spread throughout the manuals.

What about the cost?

In my last proof of concept, I had a document catalog worth about 5k tokens (I may have been lazy and used Claude to generate not-so-small summaries) and about 3k tokens per document. The cost came out to roughly 10 USD per 100 queries using Sonnet, and dropped to about 1 USD per 100 queries with Haiku.
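If you want to sanity-check that number against your own catalog, a rough back-of-envelope calculation looks like this. The prices are the on-demand Bedrock prices for Claude 3 Sonnet and Haiku at the time of writing, and the token counts and number of agent turns are assumptions you should replace with your own:

```python
# Back-of-envelope cost estimate per query. All numbers are assumptions.
CATALOG_TOKENS = 5_000    # document catalog returned by the list tool
DOCUMENT_TOKENS = 3_000   # one retrieved document
OVERHEAD_TOKENS = 2_000   # agent instructions, tool schemas, chat history
OUTPUT_TOKENS = 500       # generated answer
TURNS = 3                 # list catalog -> get document -> final answer

# Claude 3 on-demand prices per 1k tokens (input, output) at the time of writing.
PRICES = {"sonnet": (0.003, 0.015), "haiku": (0.00025, 0.00125)}

for model, (price_in, price_out) in PRICES.items():
    # The catalog is counted twice: once as a tool result, once resent as
    # history on the following turn. Each turn also resends the overhead.
    input_tokens = TURNS * OVERHEAD_TOKENS + 2 * CATALOG_TOKENS + DOCUMENT_TOKENS
    cost = (input_tokens / 1_000) * price_in + (OUTPUT_TOKENS / 1_000) * price_out
    print(f"{model}: ~${cost:.4f} per query, ~${cost * 100:.2f} per 100 queries")
```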

Another awesome part of this implementation is that it's a pure pay-per-use solution, with no fixed costs like you would have running an OpenSearch or Kendra implementation. No queries? No cost!

Are there any caveats?

One caveat I noticed is that you have to be very careful about the agent implementation. If the user asks many questions in a single session, some agent implementations retain the outputs of every interaction they have made. This means that if I first ask a question answered with document#1 and then ask a question about document#2, the prompt sent to the LLM to answer the second question will include both document#1 and document#2. This can get costly real fast. It's an agent-history management problem that needs to be taken into account and fixed or worked around for production-grade applications.

A simple workaround would be to show two buttons after answering the user, "Do you want to ask a new question or clarify the current answer?", and if they choose the first option, start a new session.
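With Bedrock Agents, starting a new session just means invoking the agent with a fresh sessionId. A sketch of the workaround (the agent id and alias are placeholders):

```python
import uuid
import boto3

# Sketch only: rotating the Bedrock Agents session id per "new question" so
# the previous document isn't dragged along in the prompt.
runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")


def ask(question: str, session_id: str) -> str:
    response = runtime.invoke_agent(
        agentId="AGENT_ID",
        agentAliasId="AGENT_ALIAS_ID",
        sessionId=session_id,
        inputText=question,
    )
    # invoke_agent streams chunks; concatenate them into the final answer.
    return "".join(
        event["chunk"]["bytes"].decode()
        for event in response["completion"]
        if "chunk" in event
    )


session_id = str(uuid.uuid4())
print(ask("How many vacation days do I get?", session_id))

# User clicked "new question": drop the old history by rotating the session id.
session_id = str(uuid.uuid4())
print(ask("How do I file an expense report?", session_id))
```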

Summary

Thanks for staying with me till the end! I hope this blog post helps you try a different approach to RAG, one that isn't based on the highly controversial vector approaches out there. Any thoughts? Leave them in the comments below!


Pablo Guzmán

AWS Senior Solutions Architect with 12+ years of financial services industry experience