RAG Architecture for JavaScript Developers
Large Language Models are impressive, but they have a fundamental problem: they only know what they were trained on. Ask GPT-4 about your internal product documentation, your company's support tickets, or a PDF uploaded last week, and it will either hallucinate or politely tell you it doesn't know. Retrieval-Augmented Generation (RAG) solves this by grounding LLM responses in your own data. As a JavaScript developer, you can build a production-ready RAG pipeline with LangChain.js and OpenAI.
What Is RAG and Why Does It Matter?
RAG is a pattern where you retrieve relevant chunks of text from a knowledge base and inject them into the LLM prompt as context before generating a response. The model isn't "learning" new information — it's reading it in real time, much like you'd hand someone a document and say "answer this question based on what's in here."
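To make the "inject them into the prompt" step concrete, here is a minimal sketch of how retrieved chunks might be folded into a prompt. The helper name `buildPrompt`, the instruction wording, and the sample chunks are all made up for illustration; LangChain.js builds its own prompt internally, so treat this as a picture of the idea rather than what the library does.

```javascript
// Hypothetical helper: builds a grounded prompt from retrieved chunks.
function buildPrompt(question, chunks) {
  // Number each chunk so an answer can be traced back to its source text.
  const context = chunks
    .map((chunk, i) => `[${i + 1}] ${chunk}`)
    .join('\n\n');

  return [
    'Answer the question using only the context below.',
    'If the context does not contain the answer, say you do not know.',
    '',
    'Context:',
    context,
    '',
    `Question: ${question}`,
  ].join('\n');
}

// Sample data invented for this sketch.
const prompt = buildPrompt('How long do refunds take?', [
  'Refunds are processed within 5 business days.',
  'Annual plans can be cancelled at any time.',
]);
console.log(prompt);
```

The model never sees your whole knowledge base, only the handful of chunks placed in the context section for this one question.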
This approach gives you three key benefits:
- Accuracy — responses are grounded in real, verifiable source text
- Freshness — your knowledge base can be updated without retraining the model
- Control — you decide what the model can and cannot access
The RAG Pipeline: 5 Stages
- Load — ingest source documents (PDFs, markdown, web pages, databases)
- Chunk — split documents into smaller, semantically meaningful pieces
- Embed — convert each chunk into a vector using an embedding model
- Store — save vectors in a vector database (Chroma, Pinecone, pgvector, etc.)
- Retrieve & Generate — at query time, embed the user's question, find the closest chunks, and pass them to the LLM
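The embed, store, and retrieve stages above can be sketched in a few lines of dependency-free JavaScript. Everything here is a stand-in: `toyEmbed` counts words from a tiny fixed vocabulary instead of calling a real embedding model, the "store" is an in-memory array, and the sample chunks are invented. The part that carries over to a real pipeline is the shape of the flow: embed the question, compare it with stored vectors, keep the closest matches.

```javascript
// Toy stand-in for an embedding model: counts words from a tiny fixed vocabulary.
const VOCAB = ['refund', 'invoice', 'password', 'login', 'annual'];

function toyEmbed(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return VOCAB.map(term => words.filter(w => w === term).length);
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// "Store": keep each chunk next to its vector in memory.
const chunks = [
  'Refund requests for annual plans are reviewed by the billing team.',
  'Reset your password from the login page.',
  'Each invoice is emailed at the start of the billing period.',
];
const store = chunks.map(text => ({ text, vector: toyEmbed(text) }));

// "Retrieve": embed the question and rank chunks by similarity, highest first.
function retrieve(question, k) {
  const queryVector = toyEmbed(question);
  return store
    .map(entry => ({
      text: entry.text,
      score: cosineSimilarity(queryVector, entry.vector),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

const top = retrieve('Can I get a refund on my annual plan?', 2);
console.log(top);
```

In this sketch the refund chunk ranks first because it shares the words "refund" and "annual" with the question. A real embedding model captures meaning rather than exact word matches, which is what lets it retrieve relevant text even when the wording differs.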
Setting Up LangChain.js
Install the required packages:
npm install langchain @langchain/openai @langchain/community chromadb
Set your OpenAI API key:
export OPENAI_API_KEY=sk-...
Loading and Chunking Documents
import { TextLoader } from 'langchain/document_loaders/fs/text';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

const loader = new TextLoader('./docs/product-guide.txt');
const rawDocs = await loader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
});

const docs = await splitter.splitDocuments(rawDocs);
console.log(`Created ${docs.length} chunks`);
The chunkOverlap parameter helps preserve context at chunk boundaries: with a 50-character overlap, the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary is less likely to lose its meaning.
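To see what overlap does, here is a deliberately simplified fixed-size chunker. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter also tries to break on paragraph and sentence separators); the function name `chunkWithOverlap` and the alphabet sample are assumptions made for this sketch.

```javascript
// Simplified fixed-size chunker, for illustration only (not LangChain's algorithm).
function chunkWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  // Each new chunk starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const sample = 'abcdefghijklmnopqrstuvwxyz';
const pieces = chunkWithOverlap(sample, 10, 3);
console.log(pieces);
// → [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk begins with the last three characters of the previous one ("hij", "opq", "vwx"). With chunkSize 500 and chunkOverlap 50, the same thing happens at a larger scale.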
Embedding and Storing in a Vector Store
For development, you can use Chroma running locally via Docker:
docker run -p 8000:8000 chromadb/chroma
Then store your chunks:
import { Chroma } from '@langchain/community/vectorstores/chroma';
import { OpenAIEmbeddings } from '@langchain/openai';

const embeddings = new OpenAIEmbeddings({
  model: 'text-embedding-3-small', // cheaper than ada-002, better quality
});

const vectorStore = await Chroma.fromDocuments(docs, embeddings, {
  collectionName: 'product-docs',
  url: 'http://localhost:8000',
});

console.log('Documents embedded and stored.');
The Full Query Pipeline
import { ChatOpenAI, OpenAIEmbeddings } from '@langchain/openai';
import { RetrievalQAChain } from 'langchain/chains';
import { Chroma } from '@langchain/community/vectorstores/chroma';

async function answerQuestion(question) {
  // Reconnect to the collection created in the previous step.
  const embeddings = new OpenAIEmbeddings({ model: 'text-embedding-3-small' });
  const vectorStore = await Chroma.fromExistingCollection(embeddings, {
    collectionName: 'product-docs',
    url: 'http://localhost:8000',
  });

  const retriever = vectorStore.asRetriever({
    k: 4, // return top 4 most relevant chunks
  });

  const llm = new ChatOpenAI({
    model: 'gpt-4o',
    temperature: 0,
  });

  // RetrievalQAChain may be marked deprecated in recent LangChain.js releases;
  // check the docs for the version you have installed.
  const chain = RetrievalQAChain.fromLLM(llm, retriever, {
    returnSourceDocuments: true,
  });

  const result = await chain.invoke({ query: question });
  console.log('Answer:', result.text);
  console.log('Sources:', result.sourceDocuments.map(d => d.metadata.source));
  return result;
}

// Top-level await requires an ES module (for example, a .mjs file).
await answerQuestion('What is the refund policy for annual subscriptions?');
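Setting temperature: 0 is a sensible default for RAG: it makes output more deterministic and keeps the model closer to the retrieved facts instead of getting creative. It does not prevent hallucination on its own, which is why returning the source documents alongside the answer is worth the extra line.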
RAG vs Fine-Tuning: When to Use Which
| Scenario | RAG | Fine-Tuning |
|---|---|---|
| Knowledge changes frequently | Yes | No |
| Need source attribution | Yes | No |
| Small budget | Yes | No (expensive) |
| Need to change response style/format | No | Yes |
| Domain-specific reasoning patterns | No | Yes |
| Low latency, offline inference | No | Yes |
The general rule: use RAG for knowledge, use fine-tuning for behavior. In most production SaaS applications, RAG is the right starting point — and choosing the right LLM provider matters more than most teams realise before they hit scale.
Production Tips
Chunking Strategy
Chunk size dramatically affects quality. Too large (>1000 chars) and retrieved chunks contain irrelevant noise. Too small (<100 chars) and individual chunks lose context.
- For FAQ documents: 300–500 characters with 50 overlap
- For technical documentation: 600–800 characters with 100 overlap
- For legal or dense prose: use semantic chunking instead of fixed-size
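One way to keep these guidelines in one place is a small preset lookup. This is a hypothetical sketch: the names `CHUNK_PRESETS` and `chunkOptionsFor` are invented, and the sizes 400 and 700 are arbitrary picks from inside the ranges above, not recommendations in their own right.

```javascript
// Hypothetical presets drawn from the ranges above; tune them against your own data.
const CHUNK_PRESETS = new Map([
  ['faq', { chunkSize: 400, chunkOverlap: 50 }],
  ['technical', { chunkSize: 700, chunkOverlap: 100 }],
]);

// Returns fixed-size splitter options, or null when no fixed-size preset applies
// (for example legal or dense prose, where semantic chunking is the better fit).
function chunkOptionsFor(docType) {
  const preset = CHUNK_PRESETS.get(docType);
  return preset ? { ...preset } : null;
}

console.log(chunkOptionsFor('faq'));
console.log(chunkOptionsFor('legal'));
```

The returned object has the same shape as the options passed to RecursiveCharacterTextSplitter earlier, so it could be handed to the constructor directly.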
Metadata Filtering
Always attach metadata to your chunks so you can filter at retrieval time:
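// Attach metadata before splitting so every chunk inherits it.
// Values are hard-coded here for illustration; derive them per document in practice.
const docsWithMetadata = rawDocs.map(doc => ({
  ...doc,
  metadata: {
    ...doc.metadata,
    category: 'billing',
    version: 'v2',
    lastUpdated: '2026-01',
  },
}));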
Then filter during retrieval:
const retriever = vectorStore.asRetriever({
  k: 4,
  filter: { category: 'billing' },
});
This prevents a question about billing from retrieving chunks about unrelated topics like onboarding or feature announcements.
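Conceptually, a metadata filter narrows the candidate set before similarity ranking happens. The sketch below shows that narrowing step in plain JavaScript. The helper `matchesFilter`, the exact-match semantics, and the sample chunks are assumptions for illustration; real vector stores each have their own filter syntax and apply it inside the database.

```javascript
// Hypothetical exact-match filter: every key in the filter must equal the chunk's metadata value.
function matchesFilter(metadata, filter) {
  return Object.entries(filter).every(([key, value]) => metadata[key] === value);
}

// Sample chunks invented for this sketch.
const chunks = [
  { pageContent: 'Refunds go back to the original payment method.', metadata: { category: 'billing', version: 'v2' } },
  { pageContent: 'Invite teammates from the workspace settings page.', metadata: { category: 'onboarding', version: 'v2' } },
  { pageContent: 'Invoices are issued on the first day of each period.', metadata: { category: 'billing', version: 'v1' } },
];

// Only chunks that pass the filter would go on to be ranked by similarity.
const billingOnly = chunks.filter(c => matchesFilter(c.metadata, { category: 'billing' }));
const billingV2 = chunks.filter(c => matchesFilter(c.metadata, { category: 'billing', version: 'v2' }));

console.log(billingOnly.length, billingV2.length);
// → 2 1
```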
Hybrid Search
For production, consider combining vector search (semantic similarity) with keyword search (BM25). This handles edge cases where the user's query uses exact technical terms, such as error codes or product names, that semantic search might miss. Some search backends, such as Weaviate and Elasticsearch, support this natively.
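If your store does not offer hybrid search, one common way to merge a vector ranking with a keyword ranking is Reciprocal Rank Fusion (RRF). The article does not prescribe a fusion method, so RRF is an assumption here, as are the function name and the document IDs. The constant 60 is the value conventionally used with RRF.

```javascript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) to a document's score.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score, highest first, and return just the IDs.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical result lists, best match first.
const vectorHits = ['doc-a', 'doc-b', 'doc-c'];
const keywordHits = ['doc-c', 'doc-a', 'doc-d'];

const fused = reciprocalRankFusion([vectorHits, keywordHits]);
console.log(fused);
// → [ 'doc-a', 'doc-c', 'doc-b', 'doc-d' ]
```

Documents that appear in both lists rise to the top, while a document found by only one method still makes the final list. RRF works on ranks rather than raw scores, so it sidesteps the problem that BM25 scores and cosine similarities are not on the same scale.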
RAG is one of the most practical AI patterns available today. Once you have the pipeline running, you can power internal knowledge bases, customer support bots, document Q&A tools, and much more — all without retraining a single model.
Once your RAG pipeline is live, the next challenge is knowing when it's working well — and when it isn't. Read How to Monitor AI Pipelines in Production to instrument latency, token costs, and hallucination signals from day one.