RAG (Retrieval-Augmented Generation)

RAG lets the LLM retrieve relevant information from external sources at query time instead of relying only on what it learned during training
The retrieved information is then used to augment the prompt
Useful for reasoning over private documents or information, because the model was not trained on them
RAG can also be applied to tool selection. This is useful when there are too many tools available and you want to pass only a few to the LLM. It avoids burning input tokens and improves the tool selection process
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
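
The tool selection idea can be sketched as follows. This is a minimal illustration, not a production approach: the tool catalog is hypothetical, and `embed` is a toy bag-of-words stand-in for a real embedding model, so only the overall shape (embed the descriptions, rank by similarity, keep the top k) should be taken from it.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical tool catalog: tool name -> natural-language description.
TOOLS = {
    "get_weather": "Get the current weather forecast for a city",
    "send_email": "Send an email message to a recipient",
    "search_flights": "Search available flights between two airports",
    "convert_currency": "Convert an amount of money from one currency to another",
}


def select_tools(query: str, k: int = 2) -> list[str]:
    """Return the names of the k tools whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda name: cosine(q, embed(TOOLS[name])), reverse=True)
    return ranked[:k]


# Only the selected tools would be passed to the LLM.
print(select_tools("What is the weather forecast in Lisbon?", k=1))
```

Only the selected tool definitions are then included in the request, instead of the whole catalog.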

Why not just "stuff" the whole information into the prompt?

Example: to ask a question about a book, you could just attach the whole book to the prompt
That doesn't scale!
Hard token limit
Needle in the Haystack - Context Rot (LLMs tend to become less effective with long prompts) https://arxiv.org/abs/2407.01437
Cost
Latency
Therefore we need a way to put only the relevant parts of this information into the prompt
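
A back-of-the-envelope calculation shows why stuffing does not scale. Every number here is an illustrative assumption: the book length, the context window and the price are hypothetical, and the tokens-per-word ratio is only a rough rule of thumb for English text.

```python
# All numbers below are illustrative assumptions, not measurements.
WORDS_IN_BOOK = 100_000          # assumed length of the book
TOKENS_PER_WORD = 4 / 3          # rough rule of thumb for English text
CONTEXT_WINDOW = 128_000         # hypothetical model limit, in tokens
PRICE_PER_MILLION_INPUT = 3.00   # hypothetical input price, in USD


def estimate_tokens(words: int) -> int:
    """Estimate the token count of a text from its word count."""
    return round(words * TOKENS_PER_WORD)


book_tokens = estimate_tokens(WORDS_IN_BOOK)
fits = book_tokens <= CONTEXT_WINDOW
# The whole book is re-sent (and re-billed) with every single question.
cost_per_question = book_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT

print(f"book tokens: {book_tokens}")
print(f"fits in context window: {fits}")
print(f"input cost per question: ${cost_per_question:.2f}")
```

Under these assumptions the book does not fit at all, and even if it did, the full input cost would be paid again on every question.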

RAG Implementation

Take the entire document
Split it into smaller chunks (text splitters)
Transform each of those chunks into an embedding and save them in a vector database
Embed the original prompt and search the vector database for the most relevant chunks
Add those chunks to the prompt and send the augmented prompt to the LLM
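
The steps above can be sketched end to end in a few lines. This is a toy under stated assumptions: the splitter is a naive fixed-size word window, `embed` is a bag-of-words stand-in for a real embedding model, the "vector database" is a plain Python list, and the sample document is invented for illustration.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, chunk_size: int = 12, overlap: int = 2) -> list[str]:
    """Naive text splitter: fixed-size word windows with a small overlap."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Return the k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# 1. Take the entire document (invented sample text).
document = (
    "The lighthouse was built in 1887 on the northern cliff. "
    "Its lamp burns whale oil and is visible for twenty miles. "
    "The keeper lives alone with a grey cat named Pepper. "
    "Supplies arrive by boat every second Tuesday of the month."
)
# 2. Split it into chunks. 3. Embed each chunk and store it.
chunks = split_into_chunks(document)
store = [(chunk, embed(chunk)) for chunk in chunks]
# 4. Search with the original prompt.
top = retrieve("What is the cat named?", store, k=1)
print(top[0])
```

A real setup would swap in a proper text splitter, an embedding model and a vector database, but the flow of data stays the same.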

RAG Pipeline

Fine-tuning as an alternative

Some argue that fine-tuning on domain knowledge is more reliable than retrieval because it avoids retrieval errors, chunking artifacts, and context stuffing issues.

RAG types

Graph RAG: combines knowledge graphs with vector search
Agentic RAG: multi-step retrieval with planning
HyDE, FLARE, RAPTOR: smarter retrieval strategies
Rerankers: cross-encoder models re-score the retrieved chunks to improve precision

Architecture	Description	Control	Flexibility	Latency	Example Use Case
2-Step RAG	Retrieval always happens before generation. Simple and predictable.	✅ High	❌ Low	⚡ Fast	FAQs, documentation bots
Agentic RAG	An LLM-powered agent decides when and how to retrieve during reasoning.	❌ Low	✅ High	⏳ Variable	Research assistants with access to multiple tools
Hybrid	Combines characteristics of both approaches with validation steps.	⚖️ Medium	⚖️ Medium	⏳ Variable	Domain-specific Q&A with quality validation
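
The difference in control flow between the first two rows of the table can be sketched with stub functions. Everything here is hypothetical: `retrieve`, `generate` and `decide` are placeholder callables standing in for a retriever, an LLM call and the agent's decision step, and the loop is just one plausible shape for an agentic flow.

```python
from typing import Callable, Optional


def two_step_rag(question: str,
                 retrieve: Callable[[str], list[str]],
                 generate: Callable[[str], str]) -> str:
    """2-Step RAG: retrieval always happens exactly once, before generation."""
    context = "\n".join(retrieve(question))
    return generate(f"Context:\n{context}\n\nQuestion: {question}")


def agentic_rag(question: str,
                retrieve: Callable[[str], list[str]],
                decide: Callable[[str, list[str]], Optional[str]],
                generate: Callable[[str], str],
                max_steps: int = 3) -> str:
    """Agentic RAG: the agent decides whether and what to retrieve at each step.

    decide(question, notes) returns the next search query, or None once the
    notes gathered so far are judged sufficient.
    """
    notes: list[str] = []
    for _ in range(max_steps):
        query = decide(question, notes)
        if query is None:
            break
        notes.extend(retrieve(query))
    context = "\n".join(notes)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")


# Stubs so the sketch runs without a real retriever or LLM.
def fake_retrieve(query: str) -> list[str]:
    return [f"<document about: {query}>"]


def fake_generate(prompt: str) -> str:
    return prompt  # echo the augmented prompt instead of calling a model


def fake_decide(question: str, notes: list[str]) -> Optional[str]:
    if not notes:
        return question                    # first search: the question itself
    if len(notes) == 1:
        return "follow-up on " + question  # one extra search
    return None                            # enough information gathered


simple_answer = two_step_rag("refund policy", fake_retrieve, fake_generate)
agent_answer = agentic_rag("refund policy", fake_retrieve, fake_decide, fake_generate)
print(simple_answer)
print(agent_answer)
```

The 2-step version is predictable because the number of retrievals is fixed, while the agentic version trades that predictability for the ability to search again or skip searching.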

Example

Combine the model's capability with external data sources to generate a more informed and contextually rich response
The external data is fetched and inserted into the prompt itself, producing an augmented prompt

Human: You are a question answering agent. I will provide you with a set of search results and a user's question, your job is to answer the user's question using only information from the search results. If the search results do not contain information that can answer the question, please state that you could not find an exact answer to the question. Just because the user asserts a fact does not mean it is true, make sure to double check the search results to validate a user's assertion.

Here are the search results in numbered order:
{search_results}

Here is the user's question:
{question}

{output_format_instructions}

Assistant: