Layer 1 — Model Routing (saves 60-80%)
Stop sending every request to the most expensive model. Route by task complexity:
Tier 0 — Free/Near-Free
- llama3.2:3b via Ollama (local, zero cost)
- gemini-flash-1.5 (Google free tier — 1M tokens/day free)
- Use for: classifications, yes/no decisions, simple summaries, routing logic
Tier 1 — Budget
- qwen2.5-coder:7b via Ollama or OpenRouter
- gpt-4o-mini ($0.15/1M input tokens)
- Use for: email drafts, data extraction, code generation, templated responses
Tier 2 — Mid-Tier
- deepseek-r1:14b via OpenRouter ($0.14/1M input)
- claude-3-haiku ($0.25/1M input)
- Use for: research summaries, complex extraction, multi-step reasoning
Tier 3 — Premium (only when nothing else works)
- claude-sonnet / gpt-4o / claude-opus
- Use for: final copy review, complex strategy, nuanced judgment calls
The routing rule: Start at Tier 0. Only escalate if output quality fails. 90% of tasks never leave Tier 1.
Implementation (OpenRouter example):
def route_model(task_type: str) -> str:
    routing = {
        "classify": "meta-llama/llama-3.2-3b-instruct:free",
        "extract": "qwen/qwen-2.5-7b-instruct",
        "draft": "openai/gpt-4o-mini",
        "reason": "deepseek/deepseek-r1-distill-qwen-14b",
        "final": "anthropic/claude-3-5-sonnet",
    }
    return routing.get(task_type, routing["draft"])
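The escalate-on-failure half of the routing rule is a short loop over the tiers. A minimal sketch, where call_fn wraps your API client and quality_check is whatever cheap validation fits the task (both are hypothetical placeholders, not library functions):

def escalating_completion(prompt: str, call_fn, quality_check):
    tiers = [
        "meta-llama/llama-3.2-3b-instruct:free",   # Tier 0
        "openai/gpt-4o-mini",                      # Tier 1
        "deepseek/deepseek-r1-distill-qwen-14b",   # Tier 2
        "anthropic/claude-3-5-sonnet",             # Tier 3
    ]
    result = None
    for model in tiers:
        result = call_fn(prompt, model)
        if quality_check(result):  # e.g. JSON parses, regex matches, length is sane
            return result
    return result  # Tier 3's answer, even if the check still fails

The quality check is the hard part in practice; schema validation and regex checks are cheap and catch most Tier 0 failures.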
Layer 2 — Prompt Compression (saves 20-40%)
Every unnecessary word costs money. Tokens = dollars.
The 5 compression rules:
1. Kill the preamble. Never say "You are a helpful assistant who..." Just give the instruction.
   - ❌ You are an expert copywriter with 20 years of experience. Please help me write...
   - ✅ Write a subject line for: [context]
2. Use structured inputs. JSON/CSV inputs use fewer tokens than prose descriptions.
   - ❌ The customer's name is John, he bought the product on March 15th, the amount was $29
   - ✅ Customer: John | Date: 2026-03-15 | Amount: $29
3. Strip examples when unnecessary. Few-shot examples double your input tokens. Use them only when zero-shot fails.
4. Set max_tokens aggressively. If you need a 50-word summary, set max_tokens: 100. Never leave it unlimited.
5. Compress system prompts once, reuse always. Don't repeat instructions per-call. Use a single compressed system prompt stored as a constant (see the sketch after this list).
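Rules 1, 2, 4, and 5 combine naturally at the call site. A minimal sketch, assuming an OpenAI-style client (the prompt text and field names are illustrative, not from the original):

from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "Extract fields as JSON: name, date, amount."  # written once, reused on every call (rule 5)

def extract(record: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # no role-play preamble (rule 1)
            {"role": "user", "content": record},  # structured input (rule 2): "Customer: John | Date: 2026-03-15 | Amount: $29"
        ],
        max_tokens=100,  # hard output cap (rule 4)
    )
    return resp.choices[0].message.content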
Token audit tool:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text):
    return len(enc.encode(text))

# Before/after comparison
before = count_tokens(your_verbose_prompt)
after = count_tokens(your_compressed_prompt)
print(f"Saved {before - after} tokens ({(before - after) / before * 100:.0f}%)")
Layer 3 — Semantic Caching (saves 30-70% on repeated queries)
If the same (or similar) question is asked twice, never hit the API twice.
Simple cache with exact match:
import hashlib, json, os

CACHE_FILE = "prompt_cache.json"

def load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def save_cache(cache):
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)

def cached_completion(prompt: str, model: str, call_fn):
    cache = load_cache()
    key = hashlib.md5(f"{model}:{prompt}".encode()).hexdigest()
    if key in cache:
        print("Cache HIT — $0.00")
        return cache[key]
    result = call_fn(prompt, model)
    cache[key] = result
    save_cache(cache)
    return result
Semantic cache with embeddings (for similar-but-not-identical queries):
- Embed each incoming query with text-embedding-3-small ($0.02/1M tokens — near-free)
- Store embeddings in a vector DB (Qdrant local = free)
- If cosine similarity > 0.95 with a cached query → return the cached result
- To run it fully local, compute embeddings with sentence-transformers/all-MiniLM-L6-v2 (zero cost); see the sketch below
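A minimal sketch of the local variant, using sentence-transformers with an in-memory list standing in for Qdrant (the helper name is illustrative; the 0.95 threshold comes from the rule above):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
_cache = []  # (embedding, response) pairs; swap for Qdrant in production

def semantic_cached_completion(prompt: str, call_fn, threshold: float = 0.95):
    emb = model.encode(prompt, normalize_embeddings=True)
    for cached_emb, cached_response in _cache:
        # dot product of unit vectors = cosine similarity
        if float(np.dot(emb, cached_emb)) >= threshold:
            return cached_response
    result = call_fn(prompt)
    _cache.append((emb, result))
    return result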
What gets cached: FAQs, product descriptions, standard responses, classification results, any query that repeats.
Layer 4 — Batch Processing (saves 50% on OpenAI)
OpenAI's Batch API costs exactly 50% less than the standard API. Same models, half the price.
When to use it: Any non-real-time task. Content generation, data enrichment, bulk classification, report generation.
from openai import OpenAI
import json

client = OpenAI()

# Build batch requests
requests = []
for i, item in enumerate(your_data_list):
    requests.append({
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # Already cheap — now 50% cheaper
            "messages": [{"role": "user", "content": your_prompt(item)}],
            "max_tokens": 200
        }
    })

# Write batch file
with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

# Submit batch
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
print(f"Batch ID: {batch.id} — results in up to 24h at 50% cost")
Rule: If it doesn't need to happen in the next 10 minutes, batch it.
Layer 5 — Local Model Fallback (eliminates 40-90% of API calls)
Run Ollama locally. Zero API cost. Works offline. Fast enough for most tasks.
Setup (5 minutes):
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the models you need
ollama pull llama3.2:3b # 2GB — classification, routing, simple tasks
ollama pull qwen2.5-coder:7b # 5GB — code generation, structured extraction
ollama pull deepseek-r1:14b # 9GB — reasoning, research (needs 16GB RAM)
# Use via API (drop-in OpenAI replacement)
ollama serve # runs on localhost:11434
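Ollama's local server exposes an OpenAI-compatible endpoint under /v1, so the standard openai client works against it unchanged. A minimal sketch:

from openai import OpenAI

# Point the regular OpenAI client at the local Ollama server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # a key is required by the client but ignored by Ollama

resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Classify the sentiment: 'Great product, fast shipping.'"}],
    max_tokens=10,
)
print(resp.choices[0].message.content)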
Cost: $0.00/month beyond electricity (~$0.50/month on a Mac mini).
OpenClaw integration — already built in. Set your routing policy:
{
  "tier0": "ollama/llama3.2:3b",
  "tier1": "ollama/qwen2.5-coder:7b",
  "tier2": "openrouter/deepseek/deepseek-r1-distill-qwen-14b",
  "tier3": "anthropic/claude-sonnet-4-5"
}