How to Cut Your AI API Costs Without Sacrificing Quality

Savan Padaliya · April 24, 2026 · 8 min read

LLM API costs spike when teams use overpowered models for simple tasks, send oversized prompts on every request, and skip caching for repeated queries. The fix is not cutting features — it is routing each task to the right model, trimming prompt tokens, and adding a semantic cache layer. Most production teams can reduce API spend by 40–70% with these strategies, and none of them require changing what users experience.

Understand Where Your Costs Come From First

Before optimizing, measure. The split is almost never what teams assume.

Run this query against your logs to see your actual cost drivers:

SELECT
  feature,
  COUNT(*) AS requests,
  SUM(prompt_tokens) AS total_prompt_tokens,
  SUM(completion_tokens) AS total_completion_tokens,
  SUM(cost_usd) AS total_cost,
  AVG(cost_usd) AS avg_cost_per_request
FROM ai_events
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY feature
ORDER BY total_cost DESC;

In many production apps, roughly 20% of features drive about 80% of token spend. Fix those features first.

The monitoring infrastructure you need for this — structured logs with token counts and cost per request — is a prerequisite. If you're not logging that data yet, start there before optimizing.
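If you are not logging this yet, a minimal sketch of the per-request record might look like the following. The `buildAiEvent` helper and the `PRICES_PER_1M` table are hypothetical names, the prices are assumptions to replace with your provider's current rates, and the field names simply mirror the columns used in the query above.

```javascript
// Assumed per-1M-token prices in USD. Replace with your provider's current rates.
const PRICES_PER_1M = {
  'gpt-4o':      { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
};

// Builds one row for the ai_events table from the `usage` object of an API response.
function buildAiEvent(feature, model, usage) {
  const price = PRICES_PER_1M[model];
  if (!price) throw new Error(`No price configured for model: ${model}`);

  const costUsd =
    (usage.prompt_tokens * price.input + usage.completion_tokens * price.output) / 1_000_000;

  return {
    feature,
    model,
    prompt_tokens: usage.prompt_tokens,
    completion_tokens: usage.completion_tokens,
    cost_usd: costUsd,
    created_at: new Date().toISOString(),
  };
}
```

Call it once per completion with `response.usage` and insert the returned object into whatever store backs your `ai_events` table.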

Strategy 1: Downgrade to a Cheaper Model Tier

The biggest cost lever is model selection. The quality gap between a frontier model and a fast model is smaller than most teams assume — and the cost difference is large.

From the OpenAI vs Vertex AI comparison:

Model	Input cost per 1M tokens	Relative cost
GPT-4o	$2.50	baseline
GPT-4o mini	$0.15	17x cheaper
Gemini 1.5 Pro	$1.25	2x cheaper
Gemini 1.5 Flash	$0.075	33x cheaper

The approach: run your production prompts through both models, evaluate output quality on 50–100 representative inputs, and use the cheaper model everywhere quality is acceptable.
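One way to turn that evaluation into a decision is sketched below. It assumes you have already graded each output as pass or fail (by human review or an automated grader, which is outside this sketch); `passRate` and `pickCheapestAcceptable` are hypothetical helper names, and the acceptance bar is yours to choose.

```javascript
// Fraction of eval cases judged acceptable. Each result is { pass: boolean }.
function passRate(results) {
  if (results.length === 0) return 0;
  return results.filter((r) => r.pass).length / results.length;
}

// `candidates` must be ordered cheapest first. Returns the first model whose
// pass rate meets the bar, or `fallbackModel` if none do.
function pickCheapestAcceptable(candidates, minPassRate, fallbackModel) {
  for (const { model, results } of candidates) {
    if (passRate(results) >= minPassRate) return model;
  }
  return fallbackModel;
}
```

Run it per feature, not once globally: a model that clears the bar for FAQ answers may not clear it for complex analysis.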

const MODEL_CONFIG = {
  'complex-analysis': { model: 'gpt-4o', temperature: 0 },
  'document-summary': { model: 'gpt-4o-mini', temperature: 0.2 },
  'faq-answer':        { model: 'gpt-4o-mini', temperature: 0 },
  'email-draft':       { model: 'gpt-4o-mini', temperature: 0.4 },
};

function getModel(feature) {
  return MODEL_CONFIG[feature] || { model: 'gpt-4o-mini', temperature: 0 };
}

Strategy 2: Cache Repeated Prompts

If the same (or very similar) prompt is sent multiple times, you're paying for identical computation repeatedly.

Exact Prompt Caching

import crypto from 'crypto';
import OpenAI from 'openai';
import { createClient } from 'redis';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

async function cachedCompletion(messages, options = {}) {
  // Hash the full request (messages plus every option), so calls that differ
  // in temperature or max_tokens never share a cache entry.
  const cacheKey = crypto
    .createHash('sha256')
    .update(JSON.stringify({ messages, options }))
    .digest('hex');

  const cached = await redis.get(`llm:${cacheKey}`);
  if (cached) return JSON.parse(cached);

  const response = await openai.chat.completions.create({ messages, ...options });

  await redis.setEx(`llm:${cacheKey}`, 3600, JSON.stringify(response)); // 1-hour TTL
  return response;
}

Use exact caching for: FAQ answers, static content generation, any prompt where system prompt + user input does not change.
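Exact caching only hits when the hashed input is byte-identical, so a stray trailing space or double space causes a miss. A small normalization step before hashing can raise the hit rate. The sketch below is an assumption layered on top of the approach above, not part of it: `normalizeText` and `normalizeMessages` are hypothetical helpers, and they assume every message `content` is a plain string.

```javascript
// Collapse runs of whitespace and trim, so "What is  X? " and "What is X?" match.
function normalizeText(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Returns a copy of the messages array with normalized content, suitable for hashing.
// The original messages are left untouched and should still be sent to the API.
function normalizeMessages(messages) {
  return messages.map((m) => ({ role: m.role, content: normalizeText(m.content) }));
}
```

Skip this for prompts where whitespace carries meaning, such as code or preformatted text.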

Semantic Caching

For queries that ask the same thing in different words, embed the user query and check for a similar cached response:

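// Assumes `embeddings`, `vectorStore` (LangChain-style), and `openai` are already
// initialized, and that the store returns cosine similarity as the score
// (higher = more similar). Some stores return a distance instead; check yours.
async function semanticallyCachedCompletion(userQuery, systemPrompt) {
  const queryEmbedding = await embeddings.embedQuery(userQuery);

  const similar = await vectorStore.similaritySearchVectorWithScore(queryEmbedding, 1);

  if (similar.length > 0) {
    const [doc, score] = similar[0];
    // Only reuse an answer that was generated under the same system prompt.
    if (score > 0.97 && doc.metadata.systemPrompt === systemPrompt) {
      return doc.metadata.response; // cache hit
    }
  }

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuery },
    ],
  });

  const text = response.choices[0].message.content;

  // Store the query with its response so later rephrasings can hit the cache.
  await vectorStore.addDocuments([{
    pageContent: userQuery,
    metadata: { response: text, systemPrompt },
  }]);

  return text;
}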

Semantic caching can cut 30–60% of API calls for FAQ-style use cases where users rephrase the same underlying question.
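To make the 0.97 threshold concrete, here is a dependency-free sketch of the comparison a vector store performs for you. `cosineSimilarity` and `findCacheHit` are hypothetical helpers, and the in-memory `entries` array (objects with `embedding` and `response`) stands in for a real vector store.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the cached response of the most similar entry if it clears the
// threshold, otherwise null (a cache miss).
function findCacheHit(queryEmbedding, entries, threshold = 0.97) {
  let best = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best !== null && bestScore > threshold ? best.response : null;
}
```

A lower threshold raises the hit rate but also the risk of answering a different question, so tune it against real query pairs rather than guessing.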

Strategy 3: Optimize Prompt Length

Every token in your prompt costs money on every request. Long system prompts that repeat on every call are a significant, often invisible cost.

import { get_encoding } from 'tiktoken';

// GPT-4o models use the o200k_base encoding; cl100k_base is for GPT-4 and GPT-3.5.
const enc = get_encoding('o200k_base');
const systemPromptTokens = enc.encode(SYSTEM_PROMPT).length;
enc.free(); // release the WASM-backed encoder
console.log(`System prompt: ${systemPromptTokens} tokens`);
// At $2.50/1M tokens, a 1,000-token system prompt at 500K requests/month = $1,250/month
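
The arithmetic in that last comment generalizes to any prompt size, volume, and price. A minimal sketch, with `monthlyPromptCostUsd` as a hypothetical helper name:

```javascript
// Monthly cost in USD of sending `promptTokens` on every one of `requestsPerMonth` requests.
function monthlyPromptCostUsd(promptTokens, requestsPerMonth, pricePer1MTokens) {
  return ((promptTokens * requestsPerMonth) / 1_000_000) * pricePer1MTokens;
}

const before = monthlyPromptCostUsd(1000, 500_000, 2.5); // 1250
const after = monthlyPromptCostUsd(600, 500_000, 2.5);   // 750
console.log(`Trimming 400 tokens saves $${before - after}/month`);
```

The 600-token figure is only an illustration of what trimming might achieve, not a measured result.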

Tactics to reduce system prompt token count:

Remove redundant instructions: if your whole user base is English, drop "always respond in English"
Compress few-shot examples: 2 tight examples often work as well as 5 verbose ones
Use role framing: "You are an expert X" sets the domain without enumerating background knowledge
Remove formatting instructions that don't apply to your use case

Strategy 4: Set Tight `max_tokens` Per Feature

If you don't set `max_tokens`, a response can run up to the model's maximum output length. A prompt bug that triggers verbose output then costs far more than it would with a cap.

const FEATURE_TOKEN_LIMITS = {
  'faq-answer':       300,
  'document-summary': 500,
  'code-review':      1000,
  'email-draft':      400,
};

async function boundedCompletion(feature, messages) {
  const { model, temperature } = getModel(feature);
  return openai.chat.completions.create({
    model,
    messages,
    temperature,
    max_tokens: FEATURE_TOKEN_LIMITS[feature] || 500,
  });
}
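
A cap that is too tight cuts answers off mid-sentence, so it helps to track how often each feature hits its limit. The Chat Completions response reports `finish_reason: 'length'` when output stopped at the token cap. The sketch below uses hypothetical helper names, `wasTruncated` and `truncationRate`:

```javascript
// True when the first choice stopped because it hit the max_tokens cap.
function wasTruncated(response) {
  return response.choices[0].finish_reason === 'length';
}

// Share of responses that were cut off. A high rate suggests raising the
// limit or tightening the prompt for that feature.
function truncationRate(responses) {
  if (responses.length === 0) return 0;
  return responses.filter((r) => wasTruncated(r)).length / responses.length;
}
```

Logging `wasTruncated` alongside token counts lets you tune each entry in `FEATURE_TOKEN_LIMITS` from data instead of guesswork.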

Strategy 5: Batch Requests for Background Jobs

For background processing (document ingestion, content generation, report creation), don't call the API one item at a time. OpenAI's Batch API processes requests at 50% of the standard price, with results returned within a 24-hour completion window:

import fs from 'fs';

const batchRequests = documents.map((doc, i) => ({
  custom_id: `doc-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Summarize this document in 2 sentences.' },
      { role: 'user', content: doc.content },
    ],
    max_tokens: 100,
  },
}));

fs.writeFileSync('batch_input.jsonl', batchRequests.map(r => JSON.stringify(r)).join('\n'));

const file = await openai.files.create({
  file: fs.createReadStream('batch_input.jsonl'),
  purpose: 'batch',
});

const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

console.log('Batch submitted:', batch.id);

For large-scale document processing, this alone halves your cost.
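Once the batch completes, the results arrive as a JSONL output file whose lines are not guaranteed to be in input order, so they need to be matched back by `custom_id`. The sketch below covers only the parsing step and assumes each line carries `custom_id`, `response.status_code`, `response.body`, and `error`; `parseBatchOutput` is a hypothetical helper, and downloading the file is left out.

```javascript
// Parses Batch API output JSONL into a Map of custom_id -> { ok, text } or { ok, error }.
function parseBatchOutput(jsonl) {
  const results = new Map();
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue; // skip blank lines, such as a trailing newline
    const record = JSON.parse(line);
    const body = record.response && record.response.body;
    if (record.error || !body || record.response.status_code !== 200) {
      results.set(record.custom_id, { ok: false, error: record.error || body });
    } else {
      results.set(record.custom_id, { ok: true, text: body.choices[0].message.content });
    }
  }
  return results;
}
```

With the `doc-${i}` IDs used above, `results.get('doc-0')` returns the summary for the first document or the error that prevented it.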

Strategy 6: Tighten RAG Chunk Sizes

In RAG pipelines, retrieved chunks are injected into every prompt. Chunk size directly affects token cost per request.

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

// Too large: wastes tokens on irrelevant content
const largeSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 2000, // 4 chunks × 2000 chars = 8000 chars ≈ 2000 context tokens per query (~4 chars/token)
  chunkOverlap: 200,
});

// More efficient
const compactSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,  // 4 chunks × 400 chars = 1600 chars ≈ 400 context tokens per query
  chunkOverlap: 40,
});

Halving chunk size can cut RAG prompt costs by 40–50% when retrieved context dominates the prompt. Test retrieval quality on a representative sample before deploying, because smaller chunks can miss context that spans chunk boundaries.
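
Because `chunkSize` is measured in characters while cost is measured in tokens, a quick conversion helps when comparing settings. The sketch below assumes roughly 4 characters per token, a common rule of thumb for English text that varies by language and tokenizer; `estimateContextTokens` is a hypothetical helper.

```javascript
// Rough estimate of context tokens injected per query.
// charsPerToken = 4 is an approximation; measure with a tokenizer for exact counts.
function estimateContextTokens(chunksPerQuery, chunkSizeChars, charsPerToken = 4) {
  return Math.ceil((chunksPerQuery * chunkSizeChars) / charsPerToken);
}

console.log(estimateContextTokens(4, 2000)); // 2000
console.log(estimateContextTokens(4, 400));  // 400
```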

What to Optimize First

Impact	Effort	Action
Very High	Low	Downgrade model for low-complexity features
Very High	Medium	Add exact caching for repeated prompts
High	Low	Set `max_tokens` per feature
High	Medium	Reduce system prompt length
Medium	Medium	Implement semantic caching
Medium	High	Batch API for background jobs
Medium	High	Tighten RAG chunk sizes

Start with the top two rows. Model downgrade plus exact caching together typically cut monthly bills by 40–60% with minimal engineering effort and negligible quality impact.

The key discipline is to instrument first, optimize second. Without token-level cost data per feature, you'll optimize the wrong things. The monitoring setup is the prerequisite — once you have the data, these strategies are straightforward to apply.

Frequently Asked Questions

What are the biggest sources of LLM API cost in production applications?
Prompt tokens are usually the largest cost — every call includes the system prompt, retrieved context, and conversation history. Using an overpowered model (GPT-4o) for simple tasks like classification or short summaries is the most common avoidable cost. Completion tokens cost more per unit than prompt tokens on most models.

How much can I save by using a smaller model for simple tasks?
Significant savings: at the input prices listed above, GPT-4o mini costs roughly 17x less per token than GPT-4o, and Gemini 1.5 Flash costs roughly 17x less than Gemini 1.5 Pro. For classification, extraction, or short summaries, smaller models often match larger ones. Route by task complexity, not by default choice.

What is semantic caching for LLM APIs and how much does it save?
Semantic caching stores LLM responses indexed by the embedding of the input query. When a new query is semantically similar to a cached one above a cosine similarity threshold, the cached response is returned instead of calling the API. For high-repetition use cases like FAQ bots or search, semantic caching can cut API costs by 30–60%.

Should I use the OpenAI Batch API to reduce costs?
Yes, for asynchronous workloads like bulk document processing, nightly analysis jobs, or background classification tasks. The OpenAI Batch API offers a 50% discount on API costs with up to 24-hour completion time. It is not suitable for real-time user-facing features that need immediate responses.

How do I reduce prompt token costs in a RAG application?
Shorten your system prompt and remove filler text. Reduce retrieved chunks from 5 to 3 if recall allows. Use a tighter similarity threshold to retrieve only the most relevant context. Summarise long retrieved documents before injection. Use prompt caching if your provider supports it to avoid re-processing identical system prompts across requests.
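
The chunk-count and threshold advice can be combined in one selection step that runs between retrieval and prompt assembly. This is a minimal sketch: `selectContext` is a hypothetical helper, it assumes each retrieved chunk carries a `score` where higher means more relevant, and the default `minScore` of 0.8 is a placeholder to tune against your own retrieval quality.

```javascript
// Keeps at most `maxChunks` chunks whose score clears `minScore`, best first.
// `scoredChunks` is an array of { text, score }; the input array is not mutated.
function selectContext(scoredChunks, { maxChunks = 3, minScore = 0.8 } = {}) {
  return scoredChunks
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxChunks);
}
```

If this often returns fewer than `maxChunks` chunks, that is the threshold doing its job: the prompt shrinks on queries where little of the retrieved material is relevant.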

