n8n RAG Pipeline: Build AI That Actually Knows Your Data
Your AI chatbot doesn’t know your business. And that’s exactly why it keeps making things up.
You built a customer support bot. It sounds confident and responds quickly. Then a customer asks about your return policy, and the bot confidently explains a policy you don’t have.
This happens constantly. Large language models generate plausible-sounding text based on patterns, not truth. When they lack specific information about YOUR business, they fill gaps with educated guesses.
The Context Problem
LLMs know a lot about the world in general. They know nothing about:
- Your internal documentation
- Your product specifications
- Your company policies
- Last week’s pricing changes
That knowledge gap creates a fundamental mismatch between what the model can do and what your business needs.
Fine-tuning seems like the obvious solution. Train the model on your data. But fine-tuning is expensive, requires ML expertise, and creates a static snapshot. Every documentation change means retraining. That’s not sustainable.
Why RAG Changes Everything
Retrieval-Augmented Generation (RAG) takes a different approach. Instead of baking knowledge into the model, you retrieve relevant context at query time and include it in the prompt.
Key Insight: RAG transforms an AI that guesses into an AI that references. The model answers based on what you show it, not what it memorized during training.
The advantages are immediate:
- Your AI stays current because retrieval pulls from live data
- You maintain control because you decide what context to include
- You can audit responses by tracing which documents informed each answer
For the official perspective, see the n8n RAG documentation.
What You’ll Learn
- How RAG architecture works under the hood
- Step-by-step n8n RAG pipeline setup with working examples
- Chunking strategies that preserve context and meaning
- Vector database selection (Pinecone vs Qdrant vs Supabase)
- Production optimization and performance tuning
- Debugging retrieval failures before they reach users
- Real-world use cases with practical implementation patterns
Managed RAG vs Custom RAG
Before building a custom RAG pipeline, consider whether managed alternatives fit your needs. Major AI providers now offer fully managed RAG solutions that handle chunking, embedding, and retrieval automatically.
Managed RAG Solutions
Cloud Provider File Search APIs handle the entire RAG stack for you:
- Upload documents to the API
- Automatic chunking and embedding
- Built-in vector storage and retrieval
- Pay only for what you use
Check the documentation for Gemini, OpenAI, and similar providers for current offerings. These evolve rapidly.
Pros of managed RAG:
- Minutes to set up, not hours
- No vector database to maintain
- Automatic updates and optimization
- Great for prototyping and validation
Cons of managed RAG:
- Limited control over chunking strategies
- No metadata filtering or hybrid search
- Data leaves your infrastructure
- API rate limits and costs at scale
When to Choose Custom RAG (n8n)
Build custom RAG pipelines when you need:
| Requirement | Why Custom |
|---|---|
| Custom chunking | Your documents need domain-specific splitting |
| Metadata filtering | Filter by date, category, access level |
| Hybrid search | Combine vector and keyword search |
| Large collections | 1000+ documents with complex relationships |
| Data sovereignty | Documents cannot leave your infrastructure |
| Full control | Fine-tune every aspect of retrieval |
Practical approach: Start with managed RAG to validate your use case quickly. If you hit limitations, migrate to custom. The knowledge you gain transfers directly.
For the rest of this guide, we focus on building custom RAG with n8n, where you control the entire pipeline.
How RAG Actually Works
Understanding the architecture helps you make better decisions when things break. RAG pipelines have two distinct phases that work together.
The Two-Phase Architecture
Phase 1: Ingestion (Offline)
Before your chatbot can answer questions, you prepare your knowledge base:
- Load documents from various sources (files, databases, APIs)
- Split content into manageable chunks
- Generate embeddings for each chunk using an embedding model
- Store vectors in a database optimized for similarity search
This happens once per document (or when documents update). The result is a searchable index of your knowledge base.
Phase 2: Retrieval + Generation (Runtime)
When a user asks a question:
- Embed the query using the same embedding model
- Search the vector database for similar chunks
- Retrieve the top matching documents
- Augment the prompt with retrieved context
- Generate a response grounded in that context
The query and documents become vectors in the same mathematical space. Similar concepts cluster together, enabling semantic search that goes beyond keyword matching.
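To make "same mathematical space" concrete, here is a minimal TypeScript sketch of cosine similarity, the metric most vector databases use to rank chunks. The function names are illustrative, not an n8n API.

// Cosine similarity: 1.0 means identical direction, ~0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunk vectors against a query vector and keep the top K.
function topK(query: number[], chunks: { text: string; vector: number[] }[], k = 5) {
  return chunks
    .map((c) => ({ text: c.text, score: cosineSimilarity(query, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

A vector database performs the same ranking, just with indexes that avoid comparing the query against every stored vector.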
Why This Beats Fine-Tuning
| Aspect | Fine-Tuning | RAG |
|---|---|---|
| Cost | High (compute, expertise) | Low (API calls) |
| Update speed | Slow (requires retraining) | Instant (update documents) |
| Data control | Baked into model weights | External, auditable |
| Transparency | Black box | Traceable to source |
| Flexibility | Fixed after training | Dynamic retrieval |
Fine-tuning has its place for teaching models new behaviors or styles. But for grounding responses in factual, changing data, RAG is more practical for most use cases.
The Retrieval Flow
User Query → Embed Query → Vector Search → Retrieve Chunks → Augment Prompt → LLM → Response
Each step introduces potential failure points. The embedding model might not capture semantic meaning well. The vector search might return irrelevant chunks. The prompt augmentation might include too much or too little context. Understanding this flow helps you debug issues systematically.
Building Your First n8n RAG Pipeline
Let’s build a working RAG system from scratch. We’ll create two workflows: one for ingesting documents, another for handling queries.
Prerequisites
Before starting, you need:
- n8n instance (self-hosted or n8n Cloud)
- OpenAI API key (or Ollama for local inference)
- Vector database (we’ll use Supabase with pgvector, but Pinecone and Qdrant work similarly)
If you’re self-hosting n8n, our self-hosting guide covers the infrastructure considerations.
Part 1: Document Ingestion Workflow
This workflow loads documents, chunks them, generates embeddings, and stores everything in your vector database.
Step 1: Trigger and Load Documents
Start with a Manual Trigger for testing. In production, you might use a Schedule Trigger or webhook when documents update.
Add a Read Binary File node or Google Drive node to load your documents. For PDFs, the Extract from File node handles text extraction.
Step 2: Chunk the Content
Add the Recursive Character Text Splitter node (found under AI > Document Loaders). Configure it:
Chunk Size: 1000
Chunk Overlap: 200
This splits your documents into overlapping segments. The overlap ensures context isn’t lost at chunk boundaries.
Step 3: Generate Embeddings
Add the Embeddings OpenAI node. Select your embedding model:
- Smaller model for cost-effective production (fewer dimensions, faster)
- Larger model for maximum accuracy (more dimensions, higher quality)
Each chunk becomes a vector with dimensions typically ranging from 384 to 3072 depending on the model. Check the MTEB Leaderboard for current model benchmarks and select based on your accuracy vs. cost requirements.
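If you prefer to call the embeddings API directly (for example from an HTTP Request or Code node) instead of the Embeddings OpenAI node, a minimal sketch looks like this, assuming OpenAI's /v1/embeddings endpoint and the text-embedding-3-small model; adjust for your provider.

// Minimal sketch: turn one chunk of text into an embedding vector.
// Assumes OPENAI_API_KEY is set in the environment.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
  });
  const data = await res.json();
  return data.data[0].embedding; // 1536 numbers for this particular model
}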
Step 4: Store in Vector Database
Add the Supabase Vector Store node (or Pinecone/Qdrant equivalent). Configure:
- Mode: Insert
- Table Name: Your embeddings table
- Embedding Column: The column storing vectors
- Content Column: The column storing original text
Important: Use the same embedding model for ingestion and queries. Mismatched models produce incompatible vectors and break retrieval.
Part 2: Query Workflow
This workflow receives user questions, retrieves relevant context, and generates responses.
Step 1: Chat Trigger
Use the Chat Trigger node to create an interactive chat interface. This provides a built-in UI for testing and can be embedded in applications.
Step 2: Configure the AI Agent
Add the AI Agent node. This is the brain of your RAG system.
Under Tools, add your vector store as a retrieval tool:
- Add Vector Store Tool
- Select your vector store node (Supabase, Pinecone, etc.)
- Configure the tool description: “Search the knowledge base for relevant information about [your domain]”
- Set Top K to 5 (retrieve top 5 matching chunks)
Vector Stores as Tools: Recent n8n versions support adding vector stores directly as AI Agent tools. The agent decides when to search (not every query), reducing latency for simple questions. This is the recommended approach for agentic RAG.
Supported Vector Store Nodes:
| Vector Store | Hosting | Best For |
|---|---|---|
| Pinecone | Managed cloud | Zero-ops enterprise scale |
| Qdrant | Self-host or cloud | Cost-effective, flexible |
| Supabase | Managed | Postgres users, SQL + vectors |
| MongoDB Atlas | Managed | Existing MongoDB users |
| Weaviate | Self-host or cloud | Schema-based, hybrid search |
| Azure AI Search | Azure | Microsoft ecosystem |
| In-Memory | Local | Prototyping only |
Step 3: Connect the LLM
Add an LLM node (OpenAI, Anthropic, or Ollama for local) and connect it to the AI Agent.
Configure the system prompt:
You are a helpful assistant that answers questions based on the provided context.
Only answer based on information from the knowledge base.
If you don't find relevant information, say so clearly.
Always cite which document your answer comes from.
Step 4: Add Memory (Optional)
For multi-turn conversations, add a Window Buffer Memory node. This keeps recent conversation history in context.
Configure:
- Context Window Length: 10 (last 10 messages)
- Session ID: Use a unique identifier per user/conversation
Testing Your Pipeline
- Run the ingestion workflow to populate your vector database
- Open the Chat Trigger URL
- Ask questions about your documents
- Check that responses cite actual content
If responses seem off, check the AI Agent troubleshooting guide for common issues.
Chunking Strategies That Actually Work
Chunking is where most RAG pipelines fail silently. Bad chunking produces bad retrieval. Bad retrieval produces hallucinations. Users blame the AI when they should blame the preprocessing.
Why Chunking Matters
Consider a document explaining your return policy. The policy spans two paragraphs: conditions for returns in paragraph one, the process in paragraph two. If your chunker splits between those paragraphs, neither chunk contains the complete policy. When users ask about returns, retrieval might find only half the answer.
A discussion in the RAG community captured this frustration: “Once chunked, vector lookups lose adjacent chunks. Automated chunking is adhoc, cutoffs are abrupt. Chunking loses level 2 and level 3 insights present in the document.”
The problem is real. The solution requires understanding your content.
Chunking Methods Compared
| Strategy | Best For | Typical Size | Overlap | n8n Node |
|---|---|---|---|---|
| Fixed-size | Quick prototyping | 500-1000 chars | 100-200 | Character Text Splitter |
| Recursive | Most documents | 1000-2000 chars | 200-400 | Recursive Character Text Splitter |
| Markdown-aware | Technical docs | By headers | N/A | Recursive with separators |
| Token-based | LLM context limits | 256-512 tokens | 50-100 | Token Text Splitter |
Practical Recommendations
Start with Recursive Character Text Splitter. It’s the most versatile option and handles most document types well. Configure it with:
Chunk Size: 1000
Chunk Overlap: 200
Separators: ["\n\n", "\n", " ", ""]
The separators tell the splitter to prefer breaking at paragraph boundaries, then sentences, then words. This preserves semantic units better than arbitrary character cuts.
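To see what that preference order means in practice, here is a simplified TypeScript sketch of the recursive idea. The real n8n/LangChain splitter also handles overlap and merging; this is an illustration, not the library's implementation.

// Simplified recursive split: try the coarsest separator first; if a piece
// is still too large, fall back to the next, finer separator.
function recursiveSplit(text: string, separators: string[], chunkSize: number): string[] {
  if (text.length <= chunkSize || separators.length === 0) return [text];
  const [sep, ...rest] = separators;
  const parts = sep === "" ? text.split("") : text.split(sep);
  const chunks: string[] = [];
  let current = "";
  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= chunkSize) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      if (part.length > chunkSize) {
        chunks.push(...recursiveSplit(part, rest, chunkSize)); // recurse with finer separators
        current = "";
      } else {
        current = part;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Usage with the settings above: recursiveSplit(documentText, ["\n\n", "\n", " ", ""], 1000);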
For technical documentation, increase chunk size and use markdown-aware separators:
Chunk Size: 2000
Chunk Overlap: 400
Separators: ["## ", "### ", "\n\n", "\n"]
This keeps entire sections together when possible.
For conversational content (chat logs, support tickets), smaller chunks work better:
Chunk Size: 500
Chunk Overlap: 100
Testing Chunk Quality
Don’t guess whether your chunking works. Test it:
- Ingest a sample of your documents
- Ask questions you know the answers to
- Log the retrieved chunks (not just the final answer)
- Check if the right chunks appear for each query
If retrieval returns incomplete or irrelevant chunks, adjust your strategy. Chunking is empirical, not theoretical.
Late Chunking (Advanced)
Traditional chunking has a fundamental problem: each chunk is embedded in isolation. When a chunk starts with “It also requires…” the embedding model has no idea what “it” refers to.
Late chunking flips the order:
- Embed the entire document first (requires long-context embedding models)
- Then pool the token-level embeddings into chunk vectors
This preserves context across chunk boundaries. The embedding for “It also requires…” knows what “it” refers to because the full document was processed together.
When to use late chunking:
- Documents with many pronouns and references
- Legal contracts with cross-references
- Technical specifications with dependencies
- Any document where context flows across sections
Requirements:
- Embedding models with 8K+ token context windows
- More compute during ingestion (offset by better retrieval)
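As a rough sketch of the pooling step, assuming a hypothetical embedTokens() helper that embeds the whole document with a long-context model and returns one vector per token (this is not an n8n node; real late chunking lives in the embedding model's tooling):

// Hypothetical helper: returns a vector per token, where each token "saw" the whole document.
type TokenEmbedding = { token: string; vector: number[] };
declare function embedTokens(document: string): Promise<TokenEmbedding[]>;

// Late chunking: average the token vectors that fall inside each chunk's token span.
async function lateChunk(document: string, spans: { start: number; end: number }[]) {
  const tokens = await embedTokens(document);
  return spans.map(({ start, end }) => {
    const slice = tokens.slice(start, end);
    const dim = slice[0].vector.length;
    const pooled = new Array(dim).fill(0);
    for (const t of slice) for (let i = 0; i < dim; i++) pooled[i] += t.vector[i];
    return pooled.map((v) => v / slice.length); // mean pooling per chunk
  });
}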
Check the Weaviate chunking guide for implementation details. Late chunking requires embedding model support, so verify your chosen model handles long contexts.
Choosing Your Vector Database
Vector databases store embeddings and enable similarity search. Your choice affects cost, performance, and operational complexity.
Vector Database Comparison for n8n
| Database | Hosting | Pricing | Best For | n8n Support |
|---|---|---|---|---|
| Pinecone | Managed | $$ | Enterprise scale, zero ops | Native node |
| Qdrant | Self-host or Cloud | $ | Cost-effective, flexible | Native node |
| Supabase | Managed | $ | Postgres users, SQL + vectors | Native node |
| In-Memory | Local | Free | Prototyping, small datasets | Native node |
Pinecone Setup
Pinecone is fully managed. No infrastructure to maintain.
- Create an account at pinecone.io
- Create an index:
- Dimensions: Match your embedding model (commonly 384, 768, 1024, or 1536)
- Metric: Cosine
- Pod type: s1.x1 for starting
- Copy your API key
- In n8n, create Pinecone credentials with your API key
- Configure the Pinecone Vector Store node with your index name
Pros: Zero maintenance, scales automatically, excellent documentation
Cons: Higher cost at scale, data leaves your infrastructure
Qdrant Setup
Qdrant can run locally or in their cloud. Great for cost-conscious deployments.
Self-hosted with Docker:
docker run -p 6333:6333 qdrant/qdrant
Create a collection:
# Set "size" to match your embedding model's dimensions
curl -X PUT 'http://localhost:6333/collections/my_collection' \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    }
  }'
In n8n, configure the Qdrant node with your URL and collection name.
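To sanity-check retrieval outside the n8n node, here is a hedged sketch against Qdrant's REST search endpoint; verify the path and body against your Qdrant version.

// Query the collection for the 5 nearest vectors to `queryVector`.
async function searchQdrant(queryVector: number[]) {
  const res = await fetch("http://localhost:6333/collections/my_collection/points/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ vector: queryVector, limit: 5, with_payload: true }),
  });
  const { result } = await res.json();
  return result; // each hit includes an id, a similarity score, and the stored payload
}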
Pros: Self-host option for data control, generous free tier on cloud, rich filtering
Cons: Requires ops knowledge for self-hosting
Supabase Setup
If you already use Supabase, adding vectors is straightforward.
- Enable the pgvector extension:
create extension if not exists vector;
- Create your embeddings table:
-- Adjust vector dimensions to match your embedding model
create table documents (
id bigserial primary key,
content text,
embedding vector(768),
metadata jsonb
);
- Create a similarity search function:
-- Adjust vector dimensions to match your embedding model
create or replace function match_documents (
query_embedding vector(768),
match_count int default 5
) returns table (
id bigint,
content text,
similarity float
)
language plpgsql
as $$
begin
return query
select
documents.id,
documents.content,
1 - (documents.embedding <=> query_embedding) as similarity
from documents
order by documents.embedding <=> query_embedding
limit match_count;
end;
$$;
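Outside the n8n Supabase Vector Store node, the same function can be called from supabase-js. A minimal sketch, assuming the table and function above and a query embedding you have already generated:

import { createClient } from "@supabase/supabase-js";

// Assumes SUPABASE_URL and SUPABASE_SERVICE_KEY are set in the environment.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Call the match_documents function defined above with a query embedding.
async function matchDocuments(queryEmbedding: number[]) {
  const { data, error } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_count: 5,
  });
  if (error) throw error;
  return data; // [{ id, content, similarity }, ...]
}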
Pros: Unified backend (database + vectors), SQL familiarity, cost-effective
Cons: Not optimized purely for vector search, requires more setup
Making the Choice
- Prototyping: Start with In-Memory or Supabase
- Production with budget: Qdrant Cloud or self-hosted
- Enterprise with zero ops: Pinecone
- Existing Postgres stack: Supabase with pgvector
For infrastructure setup guidance, see our self-hosted setup service.
Production Optimization
A working demo isn’t production-ready. Production RAG needs optimization for accuracy, speed, and reliability.
Improving Retrieval Quality
Hybrid Search
Pure vector search excels at semantic similarity but misses exact matches. “What’s the SKU for product X?” might fail if “SKU” and “product code” have different embeddings.
Hybrid search combines vector similarity with keyword matching. Some vector databases (Pinecone, Qdrant) support this natively. For Supabase, you can combine pgvector with full-text search:
-- Add full-text search
alter table documents add column fts tsvector
generated always as (to_tsvector('english', content)) stored;
create index on documents using gin(fts);
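One common way to merge the vector and keyword result lists is reciprocal rank fusion (RRF). A minimal sketch, independent of which database produced the lists:

// Reciprocal rank fusion: score each document by 1 / (k + rank) in every
// list it appears in, then sort by the combined score. k = 60 is conventional.
function reciprocalRankFusion(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// Usage: reciprocalRankFusion([vectorResultIds, keywordResultIds]);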
Metadata Filtering
Not all documents are equally relevant. A question about current pricing shouldn’t retrieve archived policies.
Add metadata during ingestion:
{
"source": "pricing_guide",
"category": "sales",
"updated": "current",
"status": "active"
}
Filter during retrieval to scope results:
metadata.status = "active" AND metadata.category = "sales"
This reduces noise and improves relevance without changing your embedding strategy.
Reranking
Initial retrieval casts a wide net. Reranking narrows it with a more accurate model.
Two-stage retrieval architecture:
| Stage | Model Type | Speed | Accuracy | Purpose |
|---|---|---|---|---|
| First | Bi-encoder (embeddings) | Fast | Good | Retrieve 50-100 candidates |
| Second | Cross-encoder (reranker) | Slower | Excellent | Filter to top 5-10 |
Why cross-encoders work better:
Bi-encoders (used for embedding) encode queries and documents separately. Cross-encoders process the query AND document together, seeing how they relate. This joint processing catches relevance that separate embeddings miss.
Practical impact:
- 20-35% improvement in retrieval accuracy
- Adds 200-500ms latency (worth it for accuracy-critical applications)
- Most valuable for ambiguous queries
Implementation in n8n:
- Retrieve top-50 chunks from your vector store
- Use an HTTP Request node to call a reranking API
- Parse the reranked results
- Pass only top-5 to the LLM
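A hedged sketch of step 2 (the reranking call), assuming a Cohere-style /v1/rerank endpoint; check your reranking provider's docs for the exact URL, model names, and response shape.

// Rerank candidate chunks against the query and keep the top 5.
async function rerank(query: string, documents: string[]) {
  const res = await fetch("https://api.cohere.ai/v1/rerank", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.COHERE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "rerank-english-v3.0", query, documents, top_n: 5 }),
  });
  const { results } = await res.json();
  // Each result carries the index of the original document plus a relevance score.
  return results.map((r: { index: number }) => documents[r.index]);
}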
Check the Pinecone reranking guide for API options and benchmarks.
When to skip reranking:
- Simple, unambiguous queries
- Latency-critical applications
- When retrieval accuracy is already high
A/B test reranking with your actual queries. Poor reranking can hurt more than help if the reranker doesn’t understand your domain.
Performance Tuning
Caching Frequent Queries
Many users ask similar questions. Cache embeddings for common queries and their results.
In n8n, use Redis or the built-in caching to store:
- Query embeddings (avoid re-embedding identical questions)
- Top results for frequent queries
- Session context for returning users
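As a minimal sketch of the first two items, here is an in-memory cache keyed by the normalized query. Swap the Map for Redis in production; the function names are illustrative.

// Cache embeddings and results for repeated queries.
const queryCache = new Map<string, { embedding: number[]; results: string[] }>();

function normalize(query: string): string {
  return query.trim().toLowerCase().replace(/\s+/g, " ");
}

async function cachedRetrieve(
  query: string,
  embed: (q: string) => Promise<number[]>,
  search: (v: number[]) => Promise<string[]>,
) {
  const key = normalize(query);
  const hit = queryCache.get(key);
  if (hit) return hit.results; // skip both the embedding call and the vector search
  const embedding = await embed(query);
  const results = await search(embedding);
  queryCache.set(key, { embedding, results });
  return results;
}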
Batch Processing for Ingestion
When ingesting large document sets, process in parallel:
- Split documents into batches
- Use sub-workflows for parallel embedding generation
- Batch inserts to your vector database
The Execute Sub-Workflow node enables this pattern cleanly.
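The batching itself is a few lines in a Code node. A sketch that splits incoming documents into groups of 50 before handing them to a sub-workflow (the batch size is arbitrary):

// Split an array of documents into fixed-size batches.
function toBatches<T>(items: T[], size = 50): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}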
Right-size Your Top-K
Retrieving too many chunks wastes tokens and can confuse the LLM. Retrieving too few risks missing critical information.
Start with top-5. If answers feel incomplete, increase to 10. If responses include irrelevant tangents, decrease to 3.
Monitor your token usage and response quality to find the sweet spot.
Debugging RAG Failures
A common frustration from the community: “RAG regressions were impossible to debug until we separated retrieval from generation.”
The Debug Checklist
- Log retrieved chunks for every query
- Score the chunks manually for relevance
- Check if the right information exists in your knowledge base
- Verify embedding alignment (same model for ingestion and query)
- Test the prompt with manually selected context
Most RAG failures are retrieval failures. If the right context never reaches the LLM, the LLM can’t give the right answer.
Use our workflow debugger to trace execution and identify where things break.
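A small sketch of the first checklist item, written as an n8n Code node placed after the retrieval step. The field names depend on your vector store node's output, so adjust them accordingly.

// n8n Code node ("Run Once for All Items"): log what retrieval returned
// before it reaches the LLM, so failures can be traced to the right phase.
const items = $input.all();

for (const item of items) {
  const doc = item.json;
  // Adjust these field names to match your vector store node's output.
  console.log(JSON.stringify({
    score: doc.score,
    source: doc.metadata?.source,
    preview: String(doc.content ?? doc.pageContent ?? "").slice(0, 120),
  }));
}

return items; // pass the chunks through unchanged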
Advanced Patterns
Once basic RAG works, these patterns handle more complex scenarios. The field evolves quickly. Check the Pinecone RAG guide for current best practices.
Agentic RAG
Instead of always retrieving, let the AI Agent decide when retrieval is necessary.
Configure your AI Agent node with the vector store as a tool, not a fixed step. The agent can:
- Answer simple questions from its training
- Retrieve context for domain-specific questions
- Combine multiple tool calls for complex queries
- Decide HOW to retrieve (which store, what filters)
Multi-step retrieval takes this further. The agent:
- Retrieves initial context
- Evaluates if it has enough information
- Refines the query and retrieves again if needed
- Synthesizes the final answer
This reduces unnecessary retrievals and latency for straightforward interactions while improving accuracy for complex questions.
For deeper coverage of agent architectures, see our AI Agent vs LLM Chain comparison. You can also build simpler RAG patterns using the Basic LLM Chain node for straightforward question-answering without full agent capabilities.
Self-RAG and Corrective RAG
Traditional RAG retrieves context for every query. Self-RAG adds intelligence:
Self-RAG lets the model decide:
- Does this query need retrieval at all?
- Is the retrieved context sufficient?
- Should I retrieve again with a different query?
Corrective RAG adds self-critique:
- Generate initial answer
- Evaluate if the answer is grounded in retrieved context
- If not, retrieve additional context and regenerate
- Return only verified answers
Both patterns reduce hallucinations by adding reflection loops. In n8n, implement this with the AI Agent’s ability to call tools conditionally and evaluate outputs.
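A minimal sketch of the corrective loop, with retrieval, generation, and the grounding check injected as functions. The grounding check would typically be another LLM call that answers yes or no.

// Corrective RAG: regenerate with more context until the answer is grounded,
// or give up after a bounded number of attempts.
async function correctiveAnswer(
  question: string,
  retrieve: (q: string, k: number) => Promise<string[]>,
  generate: (q: string, context: string[]) => Promise<string>,
  isGrounded: (answer: string, context: string[]) => Promise<boolean>,
  maxAttempts = 2,
): Promise<string> {
  let k = 5;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const context = await retrieve(question, k);
    const answer = await generate(question, context);
    if (await isGrounded(answer, context)) return answer;
    k += 5; // widen retrieval and try again
  }
  return "I couldn't find enough information in the knowledge base to answer that.";
}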
GraphRAG
Vector search finds similar chunks. GraphRAG finds connected concepts.
How it works:
- Build a knowledge graph from your documents (entities and relationships)
- Combine graph traversal with vector search
- Answer questions that span multiple documents
Best for:
- “What themes appear across all quarterly reports?”
- “How are these legal cases related?”
- Research synthesis across large document sets
GraphRAG requires more infrastructure (a graph database alongside your vector store) but excels at questions that need connecting dots across your corpus.
Multimodal RAG
Standard RAG handles text. Multimodal RAG handles images, charts, tables, and diagrams.
Two approaches:
Approach 1: Vision model extraction
- Extract images from documents
- Send images to a vision-capable model
- Get text descriptions of visual content
- Embed the descriptions alongside document text
In n8n: Extract from File → vision model API → Embeddings → Vector Store
Approach 2: Multimodal embeddings
- Use embedding models that handle both text and images
- Store everything in the same vector space
- Retrieve relevant content regardless of modality
Use cases:
- Technical documentation with diagrams
- Financial reports with charts
- Product catalogs with images
- Medical records with scans
For documents with significant visual content, multimodal RAG prevents losing critical information that text extraction misses.
Multi-Document RAG
Different document types need different treatment. Product specs require precision. Marketing content allows summarization. Legal documents need exact quotes.
Create separate vector stores for each document category:
- products_store for product documentation
- support_store for support tickets and FAQs
- policies_store for legal and compliance
Route queries to the appropriate store based on intent classification:
User asks about product → Search products_store
User asks about returns → Search policies_store
User asks about past issues → Search support_store
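A simple keyword-based router is often enough to start. Here is a sketch; the categories and keywords are illustrative, and an LLM classification step can replace the keyword match later.

// Route a query to the most likely vector store based on keywords.
const routes: { store: string; keywords: string[] }[] = [
  { store: "products_store", keywords: ["product", "spec", "feature", "compatibility"] },
  { store: "policies_store", keywords: ["return", "refund", "policy", "warranty", "legal"] },
  { store: "support_store", keywords: ["error", "issue", "ticket", "troubleshoot"] },
];

function routeQuery(query: string): string {
  const q = query.toLowerCase();
  for (const route of routes) {
    if (route.keywords.some((kw) => q.includes(kw))) return route.store;
  }
  return "products_store"; // sensible default for this example
}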
For complex routing logic, multi-agent orchestration patterns help.
RAG with Conversation Memory
Single-turn RAG answers questions in isolation. Multi-turn RAG maintains context across a conversation.
Combine the Window Buffer Memory node with your retrieval:
- Store conversation history
- Include relevant history in the retrieval query
- Let the agent reference both retrieved documents and prior conversation
Configure memory conservatively. Too much history bloats the context window and increases costs. Keep the last 5-10 exchanges maximum.
Local/Private RAG
Sending data to external APIs raises privacy concerns. For sensitive data, run everything locally.
Local Stack:
- LLM: Ollama with any supported open-source model
- Embeddings: Ollama embeddings or open-source alternatives (check the MTEB Leaderboard for current top performers)
- Vector Store: Qdrant self-hosted
No data leaves your infrastructure. Responses might be slower and slightly less capable than frontier models, but you maintain complete data control.
This matters for GDPR compliance, healthcare data, financial records, and any scenario where data residency is non-negotiable.
Real-World Use Cases
Abstract architecture becomes concrete through examples.
Customer Support Bot
The Problem: Support agents answer the same questions repeatedly. Documentation exists but customers don’t read it.
The RAG Solution:
- Ingest knowledge base articles, FAQs, product docs
- Connect to your support widget or chat interface
- Retrieve relevant docs for each customer query
- Generate responses with links to full documentation
Expected Outcomes:
- 40-60% reduction in tier-1 tickets
- Faster response times (instant vs waiting for agents)
- Consistent answers across all interactions
For related automation, see our support automation workflows.
Internal Documentation Search
The Problem: Employees can’t find information across scattered wikis, shared drives, and outdated docs.
The RAG Solution:
- Ingest content from Confluence, Notion, SharePoint, Google Drive
- Create a unified search interface
- Return answers with source links for verification
Key Considerations:
- Implement access controls (not everyone should see everything)
- Handle document versioning (retrieve latest, not archived)
- Schedule regular re-ingestion to catch updates
Sales Enablement
The Problem: Sales reps need quick access to product specs, competitive intel, and pricing during calls.
The RAG Solution:
- Ingest product documentation, battle cards, pricing guides
- Build a chat interface for real-time lookups
- Include competitor comparisons and objection handling
Enhancement: Connect to your CRM to personalize responses based on the prospect’s industry and use case.
Code Documentation Assistant
The Problem: Developers waste time searching through repositories and outdated READMEs.
The RAG Solution:
- Ingest README files, API documentation, code comments
- Answer “how do I…” questions with actual code examples
- Link to relevant files in the repository
Technical Note: For code, smaller chunks (300-500 chars) often work better. Code snippets need to be complete enough to be useful.
Common Pitfalls and Solutions
Learning from others’ failures saves you time.
| Problem | Likely Cause | Solution |
|---|---|---|
| Hallucinations | Poor retrieval returning wrong context | Improve chunking, add metadata filtering |
| Missing context | Chunks too small, losing information | Increase chunk size, add overlap |
| Slow responses | Too much context, token bloat | Reduce top-K, implement reranking |
| Outdated answers | Stale embeddings | Build document update pipeline |
| Wrong documents | No source filtering | Add metadata, filter by recency/status |
| Inconsistent quality | Mixed document types | Separate stores, custom chunking per type |
When NOT to Use RAG
RAG isn’t always the answer.
Skip RAG for:
- Simple Q&A where the LLM’s training covers the topic
- Real-time data (stock prices, weather) where APIs are better
- Highly structured queries where SQL is more reliable
- Tasks requiring computation, not retrieval
Use RAG for:
- Domain-specific knowledge not in public training data
- Frequently changing information
- Scenarios requiring auditability and source attribution
- Reducing hallucinations about your specific content
Evaluating RAG Quality
Building RAG is straightforward. Knowing if it works is hard. Without measurement, you’re guessing.
Key Metrics
| Metric | What It Measures | Target | How to Measure |
|---|---|---|---|
| Context Precision | % of retrieved chunks that are relevant | >80% | Manual review or LLM scoring |
| Context Recall | % of needed info actually retrieved | >90% | Compare against known answers |
| Faithfulness | Does answer match retrieved context? | >95% | Check for unsupported claims |
| Answer Relevancy | Does answer address the question? | >90% | User satisfaction or LLM scoring |
Evaluation Frameworks
RAGAS (Retrieval-Augmented Generation Assessment) is an open-source framework that automates these evaluations using LLMs. Check the RAGAS documentation for setup.
Manual spot-checks remain valuable:
- Create a golden dataset of 50-100 queries with known answers
- Run your RAG pipeline on these queries
- Score both retrieval (were right chunks found?) and generation (was answer correct?)
- Identify patterns in failures
A/B testing with real users provides ground truth. Track:
- User satisfaction ratings
- Follow-up question rates
- Task completion success
Practical Evaluation Workflow
- Build a test set of 50-100 queries covering your key use cases
- Include edge cases (ambiguous queries, multi-hop questions)
- Log everything during retrieval (chunks retrieved, scores, final answer)
- Score retrieval first (right chunks found?)
- Score generation second (answer correct given chunks?)
- Identify failure patterns (chunking issue? embedding issue? prompt issue?)
- Fix one thing at a time and re-evaluate
Key insight: Most RAG failures are retrieval failures. If the right context never reaches the LLM, fix retrieval first. Only debug generation after confirming good retrieval.
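A sketch of scoring retrieval against a golden dataset, assuming each test case lists which source document should be retrieved; the structure of the dataset is up to you.

type GoldenCase = { query: string; expectedSource: string };

// Retrieval hit rate: how often the expected source appears in the top-K chunks.
async function retrievalHitRate(
  cases: GoldenCase[],
  retrieve: (q: string) => Promise<{ source: string }[]>,
): Promise<number> {
  let hits = 0;
  for (const c of cases) {
    const chunks = await retrieve(c.query);
    if (chunks.some((chunk) => chunk.source === c.expectedSource)) hits++;
  }
  return hits / cases.length; // e.g. 0.92 means 92% of queries found the right document
}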
Frequently Asked Questions
How do I handle table data in my RAG pipeline?
Tables are tricky because flattening rows into text loses structure. Two approaches work:
Option 1: Structured text conversion
Convert tables to markdown format preserving headers and alignment. Chunk by table or section, not by character count.
Option 2: Hybrid storage
Keep tabular data in a SQL database. Use RAG for unstructured content and SQL queries for structured lookups. Your AI Agent can access both tools.
For data-heavy workflows, option 2 typically performs better because SQL queries are deterministic while vector search is probabilistic.
How many chunks should I retrieve (top-K)?
Start with 3-5 chunks for focused answers. This provides enough context without overwhelming the prompt.
Increase to 10+ for comprehensive responses where completeness matters more than brevity. Decrease to 1-3 when precision is critical and you need the single best match.
Consider your LLM’s context window. Frontier models handle 100K+ tokens; smaller models may choke on large contexts. Check your model’s documentation and balance retrieval breadth against token limits.
Can I run RAG completely locally without external APIs?
Yes. Use this stack:
- LLM: Ollama running any supported open-source model
- Embeddings: Ollama embeddings or open-source alternatives from the MTEB Leaderboard
- Vector Store: Qdrant via Docker
Trade-offs: Local models are slower and slightly less capable than frontier API models. But you get complete data privacy, no API costs, and no rate limits.
For self-hosted infrastructure guidance, check our n8n self-hosted setup service.
How do I update documents without re-embedding everything?
Implement incremental updates:
- Track document IDs in your vector store metadata
- Detect changes (file hash, modified date, or content diff)
- Delete old vectors for changed documents
- Ingest new versions only
Most vector databases support upsert operations. Use the document ID as the key. Changed documents get their old vectors replaced; unchanged documents stay untouched.
Schedule this as a recurring n8n workflow that checks for updates daily or hourly depending on your change frequency.
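A sketch of the change-detection step using a content hash as the signal, with the re-ingestion call left as a stub for whichever vector store you use; the function names are illustrative.

import { createHash } from "node:crypto";

type Doc = { id: string; content: string };

// previousHashes holds the hash from the last run (e.g. kept in vector store metadata).
async function syncDocuments(
  docs: Doc[],
  previousHashes: Map<string, string>,
  reingest: (doc: Doc) => Promise<void>, // delete old vectors + insert new chunks
) {
  for (const doc of docs) {
    const hash = createHash("sha256").update(doc.content).digest("hex");
    if (previousHashes.get(doc.id) === hash) continue; // unchanged: skip
    await reingest(doc);                               // changed or new: re-embed
    previousHashes.set(doc.id, hash);
  }
}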
Why does my RAG sometimes retrieve irrelevant chunks?
Common causes:
- Chunking too aggressive: Important context split across chunks
- Missing metadata: No way to filter irrelevant sources
- Embedding mismatch: Different models for ingestion vs query
- Semantic gap: Query phrasing differs from document language
Debug by logging retrieved chunks for failing queries. If the right chunk exists but isn’t retrieved, the issue is likely the embedding or query formulation. If the right chunk doesn’t exist, the issue is ingestion or chunking.
Use techniques like query expansion (rephrasing) or hypothetical document embeddings (HyDE) to bridge semantic gaps between how users ask and how documents are written.
Should I use managed RAG APIs instead of building custom?
Managed RAG APIs (like those from major AI providers) handle chunking, embedding, and retrieval automatically. They’re great for getting started quickly with small document sets.
Choose custom RAG (n8n + vector database) when you need:
- Control over chunking strategies
- Metadata filtering for scoped retrieval
- Hybrid search (vector + keyword)
- Large document collections (1000+ documents)
- Self-hosted/private deployment
- Full visibility into retrieval behavior
Practical approach: Start with managed to validate your use case. If you hit limitations (usually around control, scale, or privacy), migrate to custom. The concepts transfer directly.
How do I handle PDFs with images, charts, and tables?
Standard text extraction loses visual information. Charts become empty space. Diagrams disappear.
Two approaches work:
Approach 1: Vision model extraction
- Extract images from PDFs using the Extract from File node
- Send images to a vision-capable model API
- Get text descriptions of charts, diagrams, and tables
- Embed those descriptions alongside the document text
This captures information that text extraction misses.
Approach 2: Multimodal embeddings
Use embedding models that handle both text and images in the same vector space. Your query can match either text content or visual content.
For most use cases, approach 1 is simpler to implement in n8n and works well. Approach 2 requires specialized embedding infrastructure but provides more seamless multimodal retrieval.
Next Steps
You now have the knowledge to build production RAG pipelines in n8n. Start simple:
- Pick a focused use case (one document type, one question category)
- Build the ingestion and query workflows
- Test with real questions
- Iterate on chunking and retrieval based on results
For complex implementations or enterprise deployments, our n8n consulting services can accelerate your path to production.
The difference between a demo that impresses and a system that works is iteration. Build, measure, improve. Your AI will only be as good as the context you give it.