The RAG System
As I am reading it out loud, “the RAG System”, it somehow gives me the same intense vibe as something like “The Terminator” — sounds powerful 🤣. But don’t worry, it’s not a rogue AI on a mission (Yet!) 🤖
RAG simply stands for Retrieval-Augmented Generation, and while it might sound like something super high-tech, it just extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model.
In simple words, RAG systems provide LLMs with additional context and direction that they don’t have on their own. Let’s use the GIF below to explain its purpose: imagine your LLM is like one of those robot vacuum cleaners, good at cleaning but sometimes confused about where to go and only covering limited areas; and the cat is the RAG system guiding the vacuum cleaner through the new area with the right context. Simple as that!
And why is it needed?
I’m sure you’ve already heard of another famous term often used alongside AI: hallucination 😵💫. This happens when LLMs start spitting out irrelevant or, even worse, incorrect information. Yes, exactly — one of the main reasons why RAG is crucial is to prevent these hallucinations.
The second reason? To enhance and complement the information the LLM already has by pulling in the most relevant, real-time data. This way, LLMs can deliver responses that are not only accurate but also enriched with up-to-date insights.
How does it work?
Given a user query or prompt, RAG systems retrieve relevant and additional information based on that query. This extra information is then added to the LLM’s context window to help provide better responses. Hopefully the answers are better than the one below 😉
At its core, the system serves two main purposes:
Retrieve relevant documents based on the user’s prompt.
Enhance the user prompt by adding these relevant documents to the LLM’s context window (a rough sketch of this flow follows below).
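To make that concrete, here’s a minimal sketch of the two-step flow in Python. The `retrieve_documents` helper and the model name are just placeholders for illustration (any of the retrieval approaches discussed later could sit behind it), and the example assumes an OpenAI-style chat client:

```python
# Minimal RAG flow: retrieve documents, then stuff them into the LLM's context.
# `retrieve_documents` and the model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment


def retrieve_documents(query: str) -> list[str]:
    """Placeholder: return the documents most relevant to the query."""
    return ["<relevant document 1>", "<relevant document 2>"]


def answer_with_rag(query: str) -> str:
    # Step 1: retrieve relevant documents based on the user's prompt
    docs = retrieve_documents(query)

    # Step 2: enhance the prompt by adding those documents to the context window
    context = "\n\n".join(docs)
    messages = [
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ]
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content


print(answer_with_rag("What are the shipping policies for fragile items?"))
```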
I hope that gives you a clear picture! As you can see, the most important task of the RAG system is to fetch relevant documents — and that’s all thanks to the Retrieval system! So now let’s discuss a bit about Retrieval systems.
Retrieval system
With the rise of RAG systems and Vector Databases, many people mistakenly think RAG systems are all about vector databases and embeddings. But no — RAG systems are not just about Vector DBs! It depends entirely on the size and type of the knowledge base, as well as the system’s performance and latency requirements. Let’s walk through the different ways we can implement a retrieval system, from the simplest to the more complex.
First, the simplest one, just to give you some context. Think about it… What could be the simplest RAG system? Humans, of course!
When you enter a question into ChatGPT, don’t you sometimes copy-paste text or attach a document/photo and ask (well, maybe demand) ChatGPT to answer based on that context? In that moment, YOU, yes YOU, are forming the simplest RAG system! You’re the one providing the relevant context for retrieval. 🧠
Now say you want to automate all this copy-pasting and bring some AI into it. Let’s dive into the different possible approaches for retrieval:
Intent Recognition/Classification based retrieval
Now, imagine you have 50 or more departments or services, each with corresponding documents that your chatbot powered with LLMs needs to handle. How do you efficiently direct the user’s query to the right context?
Here’s a simple solution: add all the relevant documents to a basic SQL or NoSQL database, or even store them in an S3 bucket, each tagged with a distinct label based on its content or service type. Then, you can either train a basic classification model or leverage the power of an LLM to classify incoming user prompts into those tags.
In real time, when a user prompt comes in, the classification model (or LLM) predicts which tag best matches the query. Based on this tag, the relevant documents are quickly retrieved from the most cost-effective storage (like SQL, NoSQL, or S3) → added to the context → and BAM! The response is ready to go, complete with the right information.
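If you’re curious what that looks like in code, here’s a rough sketch (not a production recipe): the tags, the SQLite table, and the model name are all assumptions made for the example, with an LLM acting as a zero-shot classifier:

```python
# Intent-classification based retrieval: classify the prompt into a tag,
# then pull that tag's documents from cheap storage (SQLite here; could be S3 or NoSQL).
import sqlite3

from openai import OpenAI

client = OpenAI()
TAGS = ["shipping", "returns", "billing", "technical_support"]  # one tag per department/service


def classify_intent(query: str) -> str:
    """Use an LLM as a zero-shot classifier to map the prompt to a tag."""
    prompt = (
        f"Classify the user question into exactly one of these tags: {', '.join(TAGS)}.\n"
        f"Question: {query}\nAnswer with the tag only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    tag = resp.choices[0].message.content.strip().lower()
    return tag if tag in TAGS else TAGS[0]  # crude fallback if the LLM goes off-script


def fetch_documents(tag: str) -> list[str]:
    """Retrieve all documents stored under that tag."""
    conn = sqlite3.connect("knowledge_base.db")  # hypothetical database file
    rows = conn.execute("SELECT content FROM documents WHERE tag = ?", (tag,)).fetchall()
    conn.close()
    return [content for (content,) in rows]


docs = fetch_documents(classify_intent("Can I still cancel my order from yesterday?"))
```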
BM25 based retrieval
Now, let’s say the number of documents grows beyond what a simple classification system can handle, and you’re happy to retrieve documents purely based on the words used in the user’s prompt.
For example, imagine a user asks for “shipping policies for fragile items.” Instead of relying on classification, the system can retrieve documents that best match these specific words. BM25, or even TF-IDF, excels at ranking documents based on the frequency and relevance of the words in the user’s query. For this scenario, a straightforward lexical search engine like OpenSearch or Elasticsearch could be helpful.
This method is fast, relatively simple, and scales well with growing document numbers. No need to overcomplicate things!
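Here’s what a toy version might look like with the rank-bm25 package (in production you’d let OpenSearch or Elasticsearch do this for you); the corpus and tokenization are deliberately simplistic:

```python
# BM25-based retrieval sketch (pip install rank-bm25).
from rank_bm25 import BM25Okapi

corpus = [
    "Shipping policies for fragile items: bubble wrap and double boxing required.",
    "Return policy: items can be returned within 30 days of delivery.",
    "Billing FAQ: invoices are generated on the first day of each month.",
]

# BM25 works on tokens; a naive lowercase split is enough for a demo
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "shipping policies for fragile items"
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=2)  # rank by keyword relevance
print(top_docs)
```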
Dense Vector Similarity based retrieval
When you need to move beyond basic keyword matching and retrieve documents based on the deeper context or meaning of a user’s prompt, dense vector similarity comes into play. This approach leverages embeddings—high-dimensional vector representations of data—allowing for the retrieval of semantically similar content. Vector DBs like Qdrant, Pinecone, or hybrid search engines such as OpenSearch can be used to store and query these embeddings efficiently.
Unlike traditional keyword-based methods, dense vector similarity enables context-aware search, retrieving semantically relevant documents even when the exact words don’t match. It’s particularly useful for complex queries or when working with large unstructured datasets.
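As a toy example of what “context-aware” means here, this sketch uses sentence-transformers to embed a few documents and pick the closest one by cosine similarity; the model name is just a common lightweight choice, and in a real system the embeddings would live in a vector DB like Qdrant or Pinecone:

```python
# Dense vector similarity sketch (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, general-purpose embedding model

corpus = [
    "How to pack delicate glassware so it survives transit.",
    "Monthly invoicing schedule and accepted payment methods.",
    "Steps to reset your account password.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# "fragile" and "shipping" never appear verbatim, yet semantic search finds the first doc
query_embedding = model.encode("shipping policies for fragile items", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```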
One of the key strengths of using dense vector similarity is its ability to handle multi-modal data. Vector databases can seamlessly store and retrieve embeddings not only for text but also for other data types like images, enabling cross-modal search capabilities. For instance, a query could include both textual and visual components, and the system would be able to retrieve results that align with the overall semantic meaning, whether it’s text, images, or a combination of both.
Hybrid methods
Relying on just one retrieval method may not always give you perfect results, especially as the complexity of the data and user queries grows. To get the best of all worlds, you can combine multiple methods in a hybrid approach. Here are some combinations that work well:
Intent Classification + BM25: First, classify the user’s intent to narrow down the relevant documents, then use BM25 to rank them based on keyword relevance.
BM25 + Vector Similarity: Start with BM25 to retrieve keyword-matching documents, then apply dense vector similarity to re-rank the results based on context and semantic similarity (see the sketch after this list).
Intent Classification + BM25 + Vector Similarity: For high precision, you can stack all three methods in stages — intent classification to filter, BM25 for keyword relevance, and vector similarity for deep semantic understanding.
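To show the idea behind the second combination, here’s a rough BM25-then-re-rank sketch reusing the same packages as the earlier examples; the shortlist size, top-k, and model are all arbitrary choices for illustration:

```python
# Hybrid retrieval sketch: BM25 shortlists candidates, embeddings re-rank them.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def hybrid_search(query: str, corpus: list[str], shortlist: int = 10, top_k: int = 3) -> list[str]:
    # Stage 1: cheap keyword shortlist with BM25
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    candidates = bm25.get_top_n(query.lower().split(), corpus, n=min(shortlist, len(corpus)))

    # Stage 2: re-rank the shortlist by semantic similarity to the query
    candidate_embeddings = model.encode(candidates, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, candidate_embeddings)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [candidates[int(i)] for i in ranked]
```

Stacking intent classification on top is just another filter applied to the corpus before stage 1.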
Final thoughts
Listen, don’t get caught up in the marketing hype around RAG systems and Vector Databases. Start by understanding the problem you’re solving: what are your constraints, and what type and scale of data are you dealing with? Vector DBs won’t magically fix everything and can actually complicate things, especially when it comes to accurately ranking semantically relevant items.
It’s not just about the cost of the system, but also about keeping it simple. As the context windows of LLMs keep growing, you might find it easier to simplify your RAG system, although that won’t necessarily help with hallucinations.