What is RAG, and do you actually need it?
Reviewed by the Helix Stax team — IT consultants serving Hampton Roads, VA.
RAG stands for retrieval-augmented generation. It is a pattern where an AI model answers your question by first looking up relevant facts from your own documents — your contracts, your knowledge base, your invoices, your SOPs — and then writing the answer using what it found. The model does not memorize your data. It searches a database every time you ask a question, hands the relevant snippets to a language model, and the language model writes an answer grounded in those snippets. Most of the “AI assistants” you have read about that “know your company’s data” are RAG systems under the hood. This guide explains what RAG is, how it works, when it earns its keep, and the honest cases where a small business does not need it.
We run RAG in our own back office. Helix Stax keeps internal documents in Postgres with the pgvector extension, retrieves them through a small Python service, and lets us ask questions across project notes, client engagement records, and runbooks. The numbers below are not theoretical. They come from building the same plumbing for ourselves and for clients.
What does RAG stand for?
RAG stands for Retrieval-Augmented Generation. Each word maps to a part of the pattern. Retrieval is the lookup — searching your documents for the snippets most relevant to a question. Augmented is the step where those snippets get pasted into the prompt the language model sees. Generation is the language model writing the final answer in plain English, grounded in the retrieved text.
The term was coined in a 2020 Facebook AI Research paper that introduced the architecture. The shorthand stuck, and by 2024 RAG had become the default way to give a language model access to data it was not trained on.
How RAG actually works
The mechanics are simpler than the marketing makes them sound. At query time, a RAG system has three moving parts: an embedding model, a vector database, and a language model. The flow is the same every time.
- You ask a question. “What did we agree to in the Acme master services agreement about termination?” The system does not send this question to a language model yet.
- The system retrieves. Your question gets converted into a vector — a long list of numbers that represents the question’s meaning. The vector database compares your question vector against the vectors of every paragraph in your document corpus and hands back the top five or ten paragraphs that look semantically similar. Those are the candidate snippets.
- The language model writes. The retrieved snippets get pasted into a prompt that looks roughly like this: “Here are five paragraphs from the Acme contract. Using only these paragraphs, answer the user’s question.” The language model — Claude, GPT-4, Llama, whichever — writes the answer and cites the source paragraphs.
The whole round trip takes one to three seconds for a modest corpus. The cost per query runs a fraction of a cent on the embedding side and a few cents on the language model side. The system is stateless — it does not memorize the answer, it does not learn from the question, and tomorrow’s identical question gets the same fresh lookup.
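The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production design: `embed` is a bag-of-words counter standing in for a real embedding model, the corpus is three made-up contract sentences, the function names are hypothetical, and the final call to a language model is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, paragraphs: list[str], k: int = 2) -> list[str]:
    """Step 2: rank every paragraph against the question and keep the top k."""
    q = embed(question)
    ranked = sorted(paragraphs, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, snippets: list[str]) -> str:
    """Step 3 (first half): paste the retrieved snippets into the prompt."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return (
        "Using only the numbered paragraphs below, answer the question "
        "and cite the paragraph numbers you relied on.\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )


# Made-up example corpus; a real system would load chunks from a vector database.
corpus = [
    "Either party may terminate this agreement with 30 days written notice.",
    "Invoices are due within 45 days of receipt.",
    "The provider will maintain commercially reasonable uptime.",
]
question = "What notice is required to terminate the agreement?"
snippets = retrieve(question, corpus, k=2)
prompt = build_prompt(question, snippets)  # this string is what the language model would see
```

With this toy corpus the termination sentence ranks first because it shares the most words with the question. A real embedding model matches on meaning rather than shared words, which is the whole point of using one.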
The hard parts are not in the diagram. The hard parts are chunking documents intelligently so the retrieval returns coherent paragraphs instead of half-sentences, picking an embedding model that matches your domain, and building an evaluation set so you can tell whether the system actually got better when you changed something. We come back to those in the mistakes section.
Why businesses use RAG
The reason RAG exists is that language models do not know things they were not trained on, and you cannot retrain a general-purpose language model every time your operations team writes a new SOP. RAG is the bridge. Here is where it lands for SMB operators.
- Company knowledge base. “What is our PTO policy for part-time staff in Virginia?” The answer is in the employee handbook nobody opens. A RAG system over your HR documents turns the handbook into a thing your team actually queries. The savings are not in the AI bill — they are in the twenty minutes per week your operations lead used to spend hunting through Google Drive.
- Customer support. “How do I configure single sign-on with our product?” If your product has documentation, a RAG system over the docs becomes a first-line answer machine for tier-1 tickets. The good ones cite the source page. The bad ones make things up — which is why the eval set matters.
- Sales enablement. “Show me every case study where we did a Microsoft 365 migration for a manufacturing client under 50 employees.” Your sales team has been searching the shared drive by filename for years. A RAG system over your case studies, win/loss notes, and pricing history surfaces the right artifact in five seconds.
- Internal Q&A across knowledge silos. Most SMBs run a fractured stack — Notion for project notes, Confluence for engineering docs, a SharePoint site for HR, and a shared drive for contracts. RAG can pull from all of them into one query interface. The pattern is what makes “ChatGPT but it knows our company” a real product instead of a demo.
- Document search across legal and contracts. Master services agreements, NDAs, vendor terms — searchable by meaning instead of keyword. “Which of our vendor contracts auto-renew in Q3?” gets a real answer.
- Compliance and policy lookup. For regulated SMBs (healthcare, financial services, defense contractors hitting CMMC), having a RAG system over your policies and procedures means the right answer to “What is our incident response policy?” surfaces in seconds, with the source paragraph attached for the auditor.
The pattern that runs underneath every one of these: the data is yours, it lives somewhere unsearchable, and there is more of it every quarter. RAG turns the pile into a queryable surface.
RAG vs fine-tuning vs prompt engineering
These three are not competing approaches. They solve different problems. Most operators get this wrong because the vendor pitches blur the boundaries.
Prompt engineering is writing better instructions for a language model. You already pay for the model; writing a better prompt costs nothing beyond your time, and you get a better answer. It works for tasks the model can already do: summarizing, rewriting, classifying, drafting. It does not give the model access to data it was not trained on. If your question requires knowing what your company did last Tuesday, prompt engineering will not help.
Fine-tuning is taking a base model and continuing its training on your data so the model’s weights themselves change. You end up with a model that has internalized your domain — your tone, your terminology, your patterns. Fine-tuning is the right tool when you need to teach the model a style or a structured output format the base model is bad at. It is the wrong tool for teaching the model facts. Facts go stale, fine-tunes cost real money to redo, and the model still might hallucinate.
RAG gives the model access to facts at query time. Your data lives in a database. The model never memorizes it. When a document changes, you re-embed the document and the system is current. RAG is the right tool when the question is “what did our company decide” and the answer needs to be grounded in a specific source paragraph.
The honest combination: most production AI assistants for business use prompt engineering as the foundation, RAG as the data layer, and reserve fine-tuning for the narrow case where the model has to output a specific structured format. The choice is not either-or. The choice is which layer to lean on for which problem.
| Approach | Solves | Cost | Right when |
|---|---|---|---|
| Prompt engineering | Better answers from a model that already has the knowledge | No added cost beyond model usage and your time | The model can do it, you just need to ask better |
| Fine-tuning | Style, tone, structured output format | $500 to $50,000 depending on model and data volume | The model gets the format or tone wrong consistently |
| RAG | Grounded answers from your own data | $50 to $5,000 per month depending on scale | The model needs to know what is in your documents |
When you need RAG
The decision should be evidence-based, not aspirational. Here are five scenarios where RAG earns its keep in an SMB context.
- Your team spends real time hunting for documents. If your operations lead, customer support, or sales team is searching shared drives or wikis more than thirty minutes per day in aggregate, RAG over that corpus has a payback window measured in weeks. The break-even is roughly two hours of reclaimed labor per day to justify a modest RAG stack.
- You have more than a few thousand documents. Below a thousand pages, a well-prompted language model with the documents pasted in can carry the load. Above that, the context window arithmetic stops working and you need retrieval to narrow the field before the model sees the data.
- The documents change frequently. Fine-tuning ages badly. If your knowledge base, contracts, or product documentation gets meaningful updates monthly or weekly, RAG handles updates by re-embedding the changed documents — a five-minute job. Fine-tuning would require retraining the model every time.
- You need source attribution. Compliance, legal, and customer-support use cases often require “where did this answer come from.” RAG cites the source paragraph natively. A model answering from its training data cannot.
- The questions are open-ended. If users ask the same five questions over and over, you do not need RAG — you need an FAQ page. RAG is for the case where users ask questions you could not have anticipated, and the system has to find the answer dynamically.
When you do NOT need RAG
This is the section most vendors skip. We will not.
- Your corpus is small enough to fit in a prompt. Modern language models accept context windows of 100,000 to 1,000,000 tokens. If your entire knowledge base is a 30-page employee handbook, paste it into the system prompt and call it done. No vector database, no embedding pipeline, no eval set. The whole RAG apparatus is overkill for corpora that fit in a single prompt.
- The same questions get asked over and over. A traditional FAQ page or a deterministic chatbot answers repeat questions with zero risk of hallucination, no infrastructure, and no AI bill. If 80% of your support volume is the same dozen questions, a decision-tree bot beats RAG on cost and reliability.
- Your data is mostly structured. If the answer to most questions is “look up this customer in the CRM and return their balance,” you do not need a language model. You need a SQL query and a small dashboard. RAG over a database of rows and columns is a category mistake — the structured query wins every time.
- You have not first solved your search problem. If your team cannot find documents because nothing is tagged, named consistently, or filed in a coherent folder structure, RAG will not save you. It will just hallucinate convincingly over a chaotic corpus. Fix the document hygiene first, then evaluate whether you still need RAG.
The pattern: RAG is the right answer when the question space is open, the corpus is too large to fit in a prompt, the data is mostly unstructured, and source attribution matters. Anything else, look at the simpler tool first.
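The "does it fit in a prompt" question is simple arithmetic, and a rough check might look like the sketch below. The conversion factors are assumptions, not measurements: about 500 words per page, about 4/3 tokens per English word, and a 25 percent reserve of the context window for instructions and the answer. Count tokens with your model's own tokenizer before relying on the result.

```python
def estimated_tokens(pages: int, words_per_page: int = 500,
                     tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate for a corpus; both ratios are rule-of-thumb assumptions."""
    return round(pages * words_per_page * tokens_per_word)


def fits_in_prompt(pages: int, context_window: int = 100_000,
                   reserve: float = 0.25) -> bool:
    """True if the whole corpus fits while leaving `reserve` of the window free."""
    budget = context_window * (1 - reserve)
    return estimated_tokens(pages) <= budget


handbook_tokens = estimated_tokens(30)           # the 30-page handbook case
handbook_fits = fits_in_prompt(30)               # against a 100,000-token window
large_corpus_fits = fits_in_prompt(1000)         # a thousand pages, same window
```

Under these assumptions the 30-page handbook comes to roughly 20,000 tokens and fits comfortably, while a thousand pages does not fit a 100,000-token window. Fitting is also not the same as being economical: every query pays for the full pasted corpus.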
The RAG stack — what is under the hood
A RAG system has four components, each of which is a buying decision.
Document store. Where the source documents live. PDFs, Word files, HTML pages, Markdown notes. This is the same Google Drive, SharePoint, or Notion you already have. RAG does not replace it — it reads from it.
Embedding model. Converts each document chunk (and each query) into a vector of numbers. Common choices: OpenAI’s text-embedding-3-small (cheap, good general performance), Cohere’s embed-v3 (strong on enterprise queries), or open-source models like BGE-large-en that you can run yourself. The embedding model is the cheapest thing in the stack and the easiest to swap later.
Vector database. Stores the vectors and runs the similarity search that retrieves candidate chunks. Choices range from “just use Postgres with the pgvector extension” to specialized vector databases like Pinecone, Weaviate, Chroma, and Qdrant. We come back to this in the next section.
Language model. The model that takes the retrieved chunks and writes the final answer. Claude, GPT-4, Gemini, Llama 3 — any of them work. The choice depends on cost, latency, hosting model, and whether you need the answer to read like a lawyer or a customer service rep.
A glue layer ties these together — usually a small Python service or a framework like LangChain or LlamaIndex. Both frameworks have a learning curve and both leak abstractions when the system gets non-trivial. We have shipped systems with both and without either.
Vector database options
The vector database market has a lot of voices and not many real differences for SMB-scale workloads. Here is the honest read.
- pgvector (Postgres extension). If you already run Postgres, install the pgvector extension and store your vectors in the same database that runs the rest of your application. Performance is fine up to a few million vectors. The operational story is “you already know how to back up Postgres.” This is what we run in Helix Stax’s own back office, and it handles every internal RAG workload we have thrown at it. Recommended starting point for almost every SMB.
- Pinecone. Managed, fast, scales effortlessly to billions of vectors. Costs $70 per month for the entry tier, climbing with usage. Right when you need a fully managed service and your team does not want to operate a database. Wrong if you already run Postgres and have ops capability — you are paying a premium for someone else to do something you can do.
- Weaviate. Open-source, self-hostable, with a strong query API and good docs. Sits between pgvector and Pinecone — more capable than the Postgres extension, more work to operate than the managed service. Right when your RAG workload has outgrown pgvector but you want to stay self-hosted.
- Chroma. Lightweight, embeddable, designed to be the easiest possible way to prototype a RAG system on a laptop. Right for proof-of-concept and local development. Wrong for production at meaningful scale.
- Qdrant. Open-source, written in Rust, fast, with a polished operator experience. Strong choice for self-hosted production deployments when you want the performance ceiling above pgvector without the cost of Pinecone.
The default recommendation for an SMB starting a RAG project: start on pgvector. The migration to Pinecone or Qdrant later is straightforward, and you avoid paying for managed infrastructure you may not need.
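To make the pgvector starting point concrete, here is a minimal sketch of the schema and the similarity query, written as Python that only builds the SQL and its parameters. The table name `doc_chunks`, its columns, and the helper names are hypothetical; the 1536 dimension assumes OpenAI's text-embedding-3-small; and the `%s` placeholders assume a driver such as psycopg executes the statement. `<=>` is pgvector's cosine-distance operator.

```python
# Hypothetical schema: one row per document chunk, with its embedding alongside.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS doc_chunks (
    id bigserial PRIMARY KEY,
    source text NOT NULL,
    body text NOT NULL,
    embedding vector(1536)
);
"""


def to_vector_literal(values: list[float]) -> str:
    """Format a Python list as pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def top_k_query(query_embedding: list[float], k: int = 5) -> tuple[str, tuple]:
    """Build the nearest-neighbour query and its parameters (nothing is executed here)."""
    sql = (
        "SELECT id, source, body, embedding <=> %s::vector AS distance "
        "FROM doc_chunks "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    literal = to_vector_literal(query_embedding)
    return sql, (literal, literal, k)


# Tiny 3-dimensional vector purely for illustration; real embeddings are much longer.
sql, params = top_k_query([0.1, 0.2, 0.3], k=5)
```

Keeping the chunks in the same Postgres instance as the rest of the application is what makes the backup and access-control story familiar; the query above is ordinary SQL with one extra operator.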
Embedding model options
The embedding model is the cheapest component to swap and the one most teams overthink. Three credible paths.
- OpenAI embeddings (text-embedding-3-small or text-embedding-3-large). Cheap (about $0.02 per million tokens for the small model and about $0.13 per million for the large one), strong baseline performance, no operational burden. The default for most production RAG systems. You send text, you get vectors back.
- Cohere embed-v3. Comparable price and quality to OpenAI, with slightly stronger performance on enterprise search benchmarks. Right if your queries lean toward enterprise documents — contracts, technical specifications, internal documentation.
- Open-source models (BGE, E5, GTE). Run on your own GPU or via a hosted service like Together AI or Hugging Face. The cost shifts from per-token to per-GPU-hour. Right when data residency, cost at scale, or hosting model matters more than the marginal quality of the proprietary models.
The pattern most teams converge on: start with OpenAI embeddings because they get out of your way, measure performance on your eval set, swap to Cohere or an open-source model only if the eval set shows the swap improves retrieval quality on the queries you actually care about.
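To see why the embedding model is the cheap part, the arithmetic can be written out. The sketch below uses the roughly $0.02 per million tokens figure quoted above; the corpus size of a thousand documents at about 2,000 tokens each is an assumption chosen for illustration, and the function name is hypothetical.

```python
def embedding_cost_usd(total_tokens: int, price_per_million_usd: float = 0.02) -> float:
    """One-off cost to embed a corpus at a given per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million_usd


# Assumed corpus: 1,000 documents averaging 2,000 tokens each.
corpus_tokens = 1_000 * 2_000
one_time_cost = embedding_cost_usd(corpus_tokens)
```

At that price, embedding the whole assumed corpus costs a few cents, so re-embedding after a model swap is rarely the obstacle. The engineering time to re-run the eval set is.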
RAG implementation cost
Honest ranges, with the caveat that scope drives cost more than any single line item.
- Proof of concept. $5,000 to $15,000 of consulting time, two to four weeks. Ingestion pipeline for a small corpus (say, a thousand documents), pgvector store, OpenAI embeddings, Claude or GPT-4 as the language model, a basic eval set, and a simple chat interface. Right for proving the pattern works on your data before committing to a production system.
- Production system, small SMB. $25,000 to $75,000 to build, $200 to $2,000 per month to run. Ingestion pipeline that handles document updates, eval set with at least a hundred curated queries, observability so you can see what the system is doing, a real interface (chat widget, Slack bot, internal web app), and operational documentation so your team can maintain it without the consultant in the room.
- Production system, mid-market SMB. $75,000 to $250,000 to build, $1,000 to $10,000 per month to run. Multi-source ingestion, role-based access control so users only retrieve documents they are entitled to see, audit logging for compliance, an evaluation pipeline that runs nightly, and integration with the rest of your business stack.
The running cost breakdown for a typical small production system: maybe $50 per month in embedding API calls, $200 to $1,500 per month in language model calls depending on query volume, $20 per month for the database if you self-host on a small VPS or $70+ for managed Pinecone, and your operator’s time to monitor the eval set. The AI bill is rarely the biggest line item. The biggest line item is the engineering hours to keep the ingestion pipeline current and the eval set honest.
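Adding up the line items in the paragraph above gives a quick sanity check against the quoted monthly range. This is a sketch using only the figures already stated; operator time is excluded because the text gives no dollar figure for it, and the "$70+" for managed Pinecone is treated as a floor.

```python
# (low, high) monthly estimates in USD, taken from the breakdown above.
LINE_ITEMS = {
    "embedding_api_calls": (50, 50),
    "language_model_calls": (200, 1500),
    "database": (20, 70),  # self-hosted VPS at the low end, managed Pinecone floor at the high end
}

monthly_low = sum(low for low, _ in LINE_ITEMS.values())
monthly_high = sum(high for _, high in LINE_ITEMS.values())
```

The totals land at about $270 to $1,620 per month, which sits inside the $200 to $2,000 range quoted for a small production system and shows that language model calls dominate the infrastructure bill.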
Common RAG mistakes Helix Stax sees
Most of the RAG projects we see in operations advisory engagements have the same three or four problems. None of them are about the model.
- Chunking the documents wrong. The most common failure mode. Operators split documents on fixed character counts (“every 1000 characters becomes a chunk”), which means contracts get cut mid-sentence, policies get split between the title and the rule, and the retrieval returns paragraphs that read like word salad. Chunk on semantic boundaries — paragraphs, sections, list items — and overlap the chunks by 10 to 20 percent so context survives the cut.
- No evaluation set. Teams ship a RAG system, look at five sample queries, declare victory, and have no way to tell whether changes made the system better or worse. Build an eval set of fifty to two hundred curated query-answer pairs that represent real user questions. Run it after every change. The eval set is the only honest signal of whether the system is improving.
- No observability. A RAG system that you cannot inspect is a black box that hallucinates silently. Log every query, every retrieved chunk, every language model response, and every user thumbs-up or thumbs-down. Without that telemetry, you cannot debug a single bad answer, let alone improve the system over time.
- Treating retrieval as a solved problem. The retrieval step is harder than the generation step in production RAG. If retrieval returns the wrong paragraphs, the language model writes a confident wrong answer. Most quality problems in RAG systems are retrieval problems, not language model problems. Spend the engineering time on the retrieval layer — chunking, reranking, hybrid search combining keyword and vector — before you touch the prompt or swap the model.
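As an illustration of the chunking advice in the list above, the sketch below packs whole paragraphs into chunks and repeats the trailing paragraph of each chunk at the start of the next. It is one simple approach under stated assumptions, not a recommended library: paragraphs are assumed to be separated by blank lines, the overlap is measured in paragraphs rather than as an exact 10 to 20 percent, and the function name and sample text are hypothetical.

```python
import re


def chunk_paragraphs(text: str, max_chars: int = 1000,
                     overlap_paragraphs: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks, carrying trailing paragraphs over as overlap."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk and seed the next one with its last paragraphs.
            chunks.append("\n\n".join(current))
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        # A paragraph is never split, so a chunk can exceed max_chars
        # when the overlap plus one paragraph is already too long.
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Made-up sample text with three short paragraphs.
policy = (
    "Section 4. Termination.\n\n"
    "Either party may terminate with 30 days written notice.\n\n"
    "Fees accrued before termination remain payable."
)
chunks = chunk_paragraphs(policy, max_chars=90)
```

Here the section title stays attached to its rule in the first chunk, and the notice sentence appears in both chunks so the second one still carries its context.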
The pattern: the model is the cheap part to get right. The data plumbing is the hard part, and the eval set is what separates a RAG system that improves from one that drifts.
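A minimal eval harness can be very small. The sketch below scores retrieval only: each case pairs a question with the ID of the chunk that should come back, and the score is the fraction of questions where that chunk appears in the top k. This is a simplified variant of the query-answer pairs described above; the chunk IDs, the `fake_retrieve` stand-in, and the function names are all hypothetical.

```python
from typing import Callable


def hit_rate_at_k(cases: list[tuple[str, str]],
                  retrieve: Callable[[str, int], list[str]],
                  k: int = 5) -> float:
    """Fraction of eval questions whose expected chunk ID is in the top-k results."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(1 for question, expected_id in cases
               if expected_id in retrieve(question, k))
    return hits / len(cases)


# Two illustrative cases; a real eval set would hold fifty to two hundred.
EVAL_SET = [
    ("What is our PTO policy for part-time staff?", "handbook-pto"),
    ("What is our incident response policy?", "policy-incident-response"),
]


def fake_retrieve(question: str, k: int) -> list[str]:
    """Stand-in for the real retriever: returns canned chunk IDs per question."""
    canned = {
        "What is our PTO policy for part-time staff?": ["handbook-pto", "handbook-benefits"],
        "What is our incident response policy?": ["policy-acceptable-use"],
    }
    return canned.get(question, [])[:k]


score = hit_rate_at_k(EVAL_SET, fake_retrieve, k=5)
```

Swapping `fake_retrieve` for the real retrieval function and running this after every change to chunking, embeddings, or search gives a single number to compare before and after.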
Frequently asked questions
What is RAG in simple terms? RAG is a way to give an AI model access to your own documents at query time. The model does not memorize the documents. When you ask a question, the system searches your documents for the relevant parts, hands those parts to the language model, and the language model writes an answer based on what it found. The acronym stands for retrieval-augmented generation.
What is the difference between RAG and fine-tuning? Fine-tuning changes the model itself by continuing its training on your data. The model internalizes your style or your domain, but the changes are baked in and go stale the moment your data changes. RAG leaves the model alone and gives it access to your data at query time through a search step. RAG handles changing data well; fine-tuning handles changing style well. Most production AI assistants use RAG for data and reserve fine-tuning for narrow format problems.
Do I need RAG for my small business? Probably not yet. If your corpus is small enough to paste into a prompt — say, under thirty pages — skip RAG and use a long-context model directly. If you ask the same questions repeatedly, build an FAQ. If your data is structured, use SQL. RAG earns its keep when the corpus is large, the questions are open-ended, the data changes, and source attribution matters. Most SMBs get there eventually; few get there in the first six months of working with AI.
How much does RAG cost? A proof of concept runs $5,000 to $15,000 in build time. A small production system runs $25,000 to $75,000 to build and $200 to $2,000 per month to run. A mid-market system can run into the hundreds of thousands to build and several thousand a month to operate. The model bill is rarely the biggest line item; the operational cost of keeping the ingestion pipeline current and the eval set honest is.
What is the best vector database for small business? For most SMBs, pgvector running inside your existing Postgres is the right answer. You already know how to back up Postgres, you do not pay for managed infrastructure, and the performance ceiling is high enough that you will not outgrow it for years. Move to Pinecone, Weaviate, or Qdrant when you have a specific reason — usually scale beyond a few million vectors or a need for managed operations.
Is pgvector good enough or do I need Pinecone? For SMB workloads — under a few million vectors, under a few thousand queries per minute, single-tenant deployment — pgvector is good enough. Pinecone is the right call when you want a fully managed service and the engineering hours saved are worth the monthly bill, or when you are operating at a scale that genuinely strains a Postgres extension. Start on pgvector. Migrate later if the data tells you to.
Can I run RAG locally? Yes. Open-source embedding models (BGE, E5), open-source language models (Llama 3.1, Mistral, Qwen), pgvector or Chroma as the database, and your own hardware. The trade-off is the engineering time to operate the stack and the lower ceiling on language model quality compared to Claude or GPT-4. Right when data residency, cost at scale, or air-gap requirements drive the choice. Wrong when convenience matters more than control.
How long does it take to build a RAG system? A working proof of concept on a small corpus takes two to four weeks of focused engineering time. A small production system takes two to four months. A mid-market production system takes six to twelve months when you factor in role-based access control, audit logging, multi-source ingestion, and the evaluation pipeline. The build calendar is dominated by the data plumbing and the eval set, not the model integration.
Is RAG HIPAA-compliant? RAG itself is an architecture pattern, not a product, so compliance depends on which components you use and how they are configured. Self-hosted RAG on infrastructure under your control, with appropriate access controls and audit logging, can support HIPAA workflows. Using a managed language model like Claude or GPT-4 requires the vendor to sign a Business Associate Agreement and the architecture to keep PHI handling inside the BAA boundary. The compliance story depends on the BAA chain end-to-end, not on the RAG label.
Do you build RAG systems? We do. Helix Stax builds production RAG systems as part of Operations Advisory and Digital Strategy engagements. We default to pgvector on Postgres because it is what we run ourselves, but we have shipped systems on Pinecone, Weaviate, and Qdrant when the scale or operational model demanded it. The work is rarely “build a RAG system” in isolation — it is usually “your team cannot find documents and your shared drive is a mess; here is the workflow fix and the AI layer that sits on top of it.”
What is “agentic RAG”? Agentic RAG is the pattern where the language model decides what to retrieve, rather than retrieving once at the start and answering. The model gets a search tool, plans which queries to issue, runs multiple retrievals, and refines the answer iteratively. It handles complex multi-part questions better than single-shot RAG, at the cost of higher latency and higher language-model spend. Right for research and analytical workloads. Overkill for “what is our PTO policy.”
What is hybrid search and why does it matter for RAG? Hybrid search combines vector similarity search with traditional keyword search and merges the results. Vector search finds documents that are semantically similar; keyword search finds documents that contain the exact terms in the query. Hybrid handles edge cases that pure vector search fumbles — proper nouns, product names, internal codes — and is one of the highest-impact upgrades for a struggling RAG system. Most production systems use hybrid search by their second iteration.
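The "merges the results" step can be done in several ways. One common and simple method is reciprocal rank fusion, sketched below: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. Choosing this method, the constant k = 60, and the document IDs are assumptions for illustration; the text above does not say which merging method any particular system uses.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]    # hypothetical IDs from vector similarity search
keyword_hits = ["c", "a", "d"]   # hypothetical IDs from keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that rank well in both lists rise to the top, while a document found by only one method (such as an exact product code matched by keyword search) still makes the fused list instead of being dropped.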
Need help deciding if RAG is right for you?
The honest answer for most SMBs is “not yet” — and the path to a real answer starts with naming what your team actually needs, not which AI pattern is fashionable. Helix Stax runs RAG in our own back office on pgvector and Postgres, and we build production RAG systems for clients as part of Operations Advisory and Digital Strategy engagements. We will tell you when RAG is the right tool and when a simpler answer wins.
For more context on the AI vendor landscape, see ChatGPT vs Claude vs Gemini for business and top self-hosted AI tools for business. Or book a free Helix Pulse — 60 minutes with the founder, your top three gaps named in plain English, and an honest read on whether the AI layer is what you need or whether the workflow underneath needs fixing first.