How to Reduce LLM API Costs in Production

Savan Padaliya · April 27, 2026 · 6 min read

Your first month of Vertex AI or OpenAI bills lands and the number is higher than expected. It always is. Here is the systematic approach to cutting it without cutting features.

Understand Where Your Costs Come From First

Before optimizing, measure. The split is almost never what teams assume.

Run this query against your logs to see your actual cost drivers:

SELECT
  feature,
  COUNT(*) AS requests,
  SUM(prompt_tokens) AS total_prompt_tokens,
  SUM(completion_tokens) AS total_completion_tokens,
  SUM(cost_usd) AS total_cost,
  AVG(cost_usd) AS avg_cost_per_request
FROM ai_events
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY feature
ORDER BY total_cost DESC;

In most production apps, 20% of features drive 80% of token spend. Fix those 20% first.

The monitoring infrastructure you need for this — structured logs with token counts and cost per request — is a prerequisite. If you're not logging that data yet, start there before optimizing.
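
If you're adding that instrumentation now, here is a minimal sketch of the logging step. It assumes a node-postgres pool as `db`; the `PRICING` map and `logAiEvent` helper are illustrative names, and the prices should be checked against your provider's current list:

// Illustrative per-1M-token prices — verify against your provider's current price list.
const PRICING = {
  'gpt-4o':      { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
};

// Writes one row per request into the ai_events table queried above.
async function logAiEvent(db, feature, model, usage) {
  const price = PRICING[model];
  const costUsd =
    (usage.prompt_tokens / 1e6) * price.input +
    (usage.completion_tokens / 1e6) * price.output;

  await db.query(
    `INSERT INTO ai_events (feature, model, prompt_tokens, completion_tokens, cost_usd, created_at)
     VALUES ($1, $2, $3, $4, $5, NOW())`,
    [feature, model, usage.prompt_tokens, usage.completion_tokens, costUsd]
  );
}

Call it with `response.usage` right after each completion so every request lands in the table the query above reads.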

Strategy 1: Downgrade to a Cheaper Model Tier

The biggest cost lever is model selection. The quality gap between a frontier model and a fast model is smaller than most teams assume — and the cost difference is large.

From the OpenAI vs Vertex AI comparison:

Model               Input cost per 1M tokens    Relative cost
GPT-4o              $2.50                       baseline
GPT-4o mini         $0.15                       17x cheaper
Gemini 1.5 Pro      $1.25                       2x cheaper
Gemini 1.5 Flash    $0.075                      33x cheaper

The approach: run your production prompts through both models, evaluate output quality on 50–100 representative inputs, and use the cheaper model everywhere quality is acceptable.

const MODEL_CONFIG = {
  'complex-analysis': { model: 'gpt-4o', temperature: 0 },
  'document-summary': { model: 'gpt-4o-mini', temperature: 0.2 },
  'faq-answer':        { model: 'gpt-4o-mini', temperature: 0 },
  'email-draft':       { model: 'gpt-4o-mini', temperature: 0.4 },
};

function getModel(feature) {
  return MODEL_CONFIG[feature] || { model: 'gpt-4o-mini', temperature: 0 };
}
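
To decide which features can take the cheaper tier, a quick side-by-side run is usually enough. A minimal sketch, assuming a `testCases` array of representative user inputs and that you (or an LLM grader) review the pairs by hand:

import OpenAI from 'openai';

const openai = new OpenAI();

// Run the same prompt through both tiers and collect the outputs for review.
async function compareModels(systemPrompt, testCases) {
  const results = [];
  for (const input of testCases) {
    const [premium, cheap] = await Promise.all(
      ['gpt-4o', 'gpt-4o-mini'].map((model) =>
        openai.chat.completions.create({
          model,
          temperature: 0,
          messages: [
            { role: 'system', content: systemPrompt },
            { role: 'user', content: input },
          ],
        })
      )
    );
    results.push({
      input,
      premium: premium.choices[0].message.content,
      cheap: cheap.choices[0].message.content,
    });
  }
  return results; // downgrade the feature wherever the cheap output is acceptable
}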

Strategy 2: Cache Repeated Prompts

If the same (or very similar) prompt is sent multiple times, you're paying for identical computation repeatedly.

Exact Prompt Caching

import crypto from 'crypto';
import OpenAI from 'openai';
import { createClient } from 'redis';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

async function cachedCompletion(messages, options = {}) {
  const cacheKey = crypto
    .createHash('sha256')
    .update(JSON.stringify({ messages, model: options.model }))
    .digest('hex');

  const cached = await redis.get(`llm:${cacheKey}`);
  if (cached) return JSON.parse(cached);

  const response = await openai.chat.completions.create({ messages, ...options });

  await redis.setEx(`llm:${cacheKey}`, 3600, JSON.stringify(response)); // 1-hour TTL
  return response;
}

Use exact caching for: FAQ answers, static content generation, any prompt where system prompt + user input does not change.

Semantic Caching

For queries that ask the same thing in different words, embed the user query and check for a similar cached response:

import { OpenAIEmbeddings } from '@langchain/openai';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';

// In-memory store as the semantic cache (swap in pgvector, Pinecone, etc. for production).
const embeddings = new OpenAIEmbeddings();
const vectorStore = new MemoryVectorStore(embeddings);

async function semanticallyCachedCompletion(userQuery, systemPrompt) {
  const queryEmbedding = await embeddings.embedQuery(userQuery);

  const similar = await vectorStore.similaritySearchVectorWithScore(queryEmbedding, 1);

  if (similar.length > 0 && similar[0][1] > 0.97) {
    return similar[0][0].metadata.response; // cache hit
  }

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuery },
    ],
  });

  const text = response.choices[0].message.content;

  await vectorStore.addDocuments([{
    pageContent: userQuery,
    metadata: { response: text },
  }]);

  return text;
}

Semantic caching can cut 30–60% of API calls for FAQ-style use cases where users rephrase the same underlying question.

Strategy 3: Optimize Prompt Length

Every token in your prompt costs money on every request. Long system prompts that repeat on every call are a significant, often invisible cost.

import { get_encoding } from 'tiktoken';

const enc = get_encoding('cl100k_base');
const systemPromptTokens = enc.encode(SYSTEM_PROMPT).length;
console.log(`System prompt: ${systemPromptTokens} tokens`);
// At $2.50/1M tokens, 1000-token system prompt at 500K requests/month = $1,250/month

Tactics to reduce system prompt token count:

  • Remove redundant instructions — if your whole user base is English, drop "always respond in English"
  • Compress few-shot examples — 2 tight examples work as well as 5 verbose ones
  • Use role framing — "You are an expert X" triggers knowledge without enumerating it
  • Remove formatting instructions that don't apply to your use case
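
As a rough illustration of the first two tactics (both prompt strings here are made up; `enc` is the tokenizer from the snippet above):

// Verbose: repeats defaults and pads examples — pure token overhead.
const SYSTEM_PROMPT_VERBOSE = `You are a helpful assistant. You must always respond in English.
Always be polite and professional. When the user asks a question about our product,
please answer the question about our product accurately. Here are five examples of
good answers: ...`;

// Compressed: role framing plus only the instructions that change model behavior.
const SYSTEM_PROMPT_TIGHT = `You are a support expert for our product.
Answer accurately and concisely. Two examples: ...`;

console.log(enc.encode(SYSTEM_PROMPT_VERBOSE).length, enc.encode(SYSTEM_PROMPT_TIGHT).length);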

Strategy 4: Set Tight max_tokens Per Feature

If you don't set max_tokens, completions can run all the way to the model's output limit. A prompt bug that makes responses verbose costs far more than it should when there's no cap.

const FEATURE_TOKEN_LIMITS = {
  'faq-answer':       300,
  'document-summary': 500,
  'code-review':      1000,
  'email-draft':      400,
};

async function boundedCompletion(feature, messages) {
  const { model, temperature } = getModel(feature);
  return openai.chat.completions.create({
    model,
    messages,
    temperature,
    max_tokens: FEATURE_TOKEN_LIMITS[feature] || 500,
  });
}

Strategy 5: Batch Requests for Background Jobs

For background processing — document ingestion, content generation, report creation — don't call the API one item at a time. OpenAI's Batch API processes requests at 50% of the standard price with a 24-hour turnaround:

import fs from 'fs';

// `documents` is your array of items to process, e.g. [{ content: '...' }, ...]
const batchRequests = documents.map((doc, i) => ({
  custom_id: `doc-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Summarize this document in 2 sentences.' },
      { role: 'user', content: doc.content },
    ],
    max_tokens: 100,
  },
}));

fs.writeFileSync('batch_input.jsonl', batchRequests.map(r => JSON.stringify(r)).join('\n'));

const file = await openai.files.create({
  file: fs.createReadStream('batch_input.jsonl'),
  purpose: 'batch',
});

const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

console.log('Batch submitted:', batch.id);
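
Results aren't returned inline. A minimal retrieval sketch, run later (for example from a cron job) once the batch has completed:

// Check status later — batches finish any time within the 24-hour window.
const status = await openai.batches.retrieve(batch.id);

if (status.status === 'completed') {
  const output = await openai.files.content(status.output_file_id);
  const results = (await output.text())
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line));

  // Each line carries the custom_id you set, so results map back to their documents.
  console.log(results[0].custom_id, results[0].response.body.choices[0].message.content);
}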

For large-scale document processing, this alone halves your cost.

Strategy 6: Tighten RAG Chunk Sizes

In RAG pipelines, retrieved chunks are injected into every prompt. Chunk size directly affects token cost per request.

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

// Too large — wastes tokens on irrelevant content
const looseSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 2000, // 4 chunks × 2000 chars ≈ 8000 chars ≈ 2000 context tokens per query
  chunkOverlap: 200,
});

// More efficient
const tightSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400, // 4 chunks × 400 chars ≈ 1600 chars ≈ 400 context tokens per query
  chunkOverlap: 40,
});

Halving chunk size can cut RAG prompt costs by 40–50%. Test retrieval quality on a representative sample before deploying — tighter chunks occasionally miss context that spans chunk boundaries.
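
One way to run that test, as a rough sketch: index your corpus with each splitter and measure how often the expected source document lands in the top-k results. `rawDocs`, `evalSet`, and the `docId` metadata field are illustrative assumptions; `embeddings`, `looseSplitter`, and `tightSplitter` come from the earlier snippets.

// evalSet: [{ question: '...', expectedDocId: '...' }, ...] built from real user queries.
async function retrievalHitRate(splitter, rawDocs, evalSet) {
  const chunks = await splitter.splitDocuments(rawDocs); // metadata (including docId) is preserved
  const store = await MemoryVectorStore.fromDocuments(chunks, embeddings);

  let hits = 0;
  for (const { question, expectedDocId } of evalSet) {
    const results = await store.similaritySearch(question, 4);
    if (results.some((r) => r.metadata.docId === expectedDocId)) hits++;
  }
  return hits / evalSet.length;
}

console.log('2000-char chunks:', await retrievalHitRate(looseSplitter, rawDocs, evalSet));
console.log('400-char chunks: ', await retrievalHitRate(tightSplitter, rawDocs, evalSet));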

What to Optimize First

Impact       Effort    Action
Very High    Low       Downgrade model for low-complexity features
Very High    Medium    Add exact caching for repeated prompts
High         Low       Set max_tokens per feature
High         Medium    Reduce system prompt length
Medium       Medium    Implement semantic caching
Medium       High      Batch API for background jobs
Medium       High      Tighten RAG chunk sizes

Start with the top two rows. Model downgrade plus exact caching together typically cut monthly bills by 40–60% with minimal engineering effort and negligible quality impact.

The key discipline is to instrument first, optimize second. Without token-level cost data per feature, you'll optimize the wrong things. The monitoring setup is the prerequisite — once you have the data, these strategies are straightforward to apply.


Savan Padaliya

Senior Full Stack Developer who ships faster with AI. Available for freelance, consulting, and project work.