How we squeezed $0/month out of Claude Code in production: a 5-tier cascade, 374 skills, six sub-agents

Claude Code · AI Engineering · OpenClaw · Automation

WebCoreLab studio · AI Engineering · 14 May 2026 · 8 min read

Most teams I talk to pay $400-$1,200/month for one developer’s Claude API bill. Ours sits at $0.18/month for a Content Factory pushing 30+ posts a day. Same models. Same Anthropic. The difference is routing, and it isn’t clever — it’s boring.

This is a write-up of what’s actually running, not a talk-track. We’ve been on this stack for fourteen months. I’ll show the cascade, the 374 skill files, the six sub-agents, the hook chain, and the memory layer. Skip to the numbers section at the bottom if you only want proof.

Why we built this instead of just paying the bill

The honest reason: pay-per-call API destroys experimentation. When every retry costs four cents, you stop trying things. You start gating yourself, batching everything, refusing to test the second hypothesis. The infrastructure becomes a damper on engineering, not a multiplier.

The Anthropic Max plan ($200 + tax, ~$240/month) gives us roughly 900 messages per 5-hour rolling window on Opus 4.7. The Codex Plus seat ($20/month, separate vendor) gives us another 160 messages per 3-hour window on GPT-5.5 with native web search. Two flat subscriptions, no per-call anxiety.

That covers maybe 85% of our load. The rest goes to a cheap pay-per-call tier where we route mass-classify and embedding jobs to OpenRouter — Qwen 3.5 Flash at $0.065 input / $0.260 output per million tokens, DeepSeek V3.1 for code, Gemini 2.5 Pro for 2M-context translation. Twenty cents a month, give or take, depending on btcnews wire volume.

Honestly, the saved money matters less than the saved decision-making. We stopped asking “can I afford to test this?”

The router: nebo_ask.sh and a 5-tier cascade

Every LLM call from every script, every cron job, every agent funnels through one shell wrapper: nebo_ask.sh. It picks a tier, executes, falls back if the tier is rate-limited, and logs.

| Tier | Model | Use | Cost / call |
|------|-------|-----|-------------|
| T1 | Claude Opus 4.7 (Max) | Primary internal — analysis, code, design | $0 |
| T2 | GPT-5.5 (Codex Plus OAuth) | Native web search, 200K+ context, client research | $0 |
| T3 | OpenRouter mix | Mass classify, dedup, humaniser passes | ~$0.0002 |
| T4 | OpenAI API | Fallback when T1/T2 rate-limited and we need GPT | pay-per-call |
| T5 | Gemini Flash (AI Studio) | Public data only — RSS, price feeds, classify | $0 (250/day) |

Routing isn’t magical. It’s an 18-task-type lookup table in JSON: code_kimi goes to T3, web_search to T2, geo_audit to T1, classify_bulk to T5 first. The cascade kicks in only on rate-limit or quota-exceeded errors. Average cascade depth: 1.04 — almost every call lands on the first try.
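
To make the shape concrete, here is a stripped-down sketch of the cascade loop. This is not our production nebo_ask.sh: the routing.json path, its schema, and the openrouter_chat helper are invented for illustration, and the tier stubs assume the stock Claude Code and Codex CLIs.

```bash
#!/usr/bin/env bash
# Cascade sketch: try tiers in the order the routing table gives for this
# task type; fall through on any non-zero exit (rate limit, quota, error).
set -uo pipefail

TASK_TYPE="$1"; PROMPT="$2"
ROUTING_JSON="${ROUTING_JSON:-routing.json}"   # hypothetical path

call_tier_t1() { claude -p "$1"; }             # Claude Code CLI, Max sub
call_tier_t2() { codex exec "$1"; }            # Codex CLI, Plus seat
call_tier_t3() { openrouter_chat "$1"; }       # hypothetical helper

# routing.json shape (illustrative):
#   { "web_search": ["t2", "t1"], "code": ["t1", "t3"] }
for tier in $(jq -r --arg t "$TASK_TYPE" '.[$t][]?' "$ROUTING_JSON"); do
  if out=$("call_tier_${tier}" "$PROMPT"); then
    echo "$out"; exit 0                        # first healthy tier wins
  fi                                           # rate-limited: fall through
done

echo "all tiers exhausted for ${TASK_TYPE}" >&2; exit 1
```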

The piece that took us longest to get right wasn’t the routing — it was the OAuth refresh race. When four Claude Code sessions start at once, they all try to refresh the shared .credentials.json with rotating tokens. Race condition, 401 cascade, everyone’s locked out for a minute. We patched it with a 12-second flock on a startup mutex. Sounds dumb. It cost us a Saturday.
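
For anyone hitting the same race: the fix really is just serializing startup. A minimal sketch, with the lock path and the refresh_oauth_token helper as stand-ins:

```bash
# Hold a startup mutex for up to 12 seconds before touching the shared
# credentials file; if the lock is busy, another session is mid-refresh.
(
  flock -w 12 9 || { echo "refresh lock busy, using cached token" >&2; exit 0; }
  refresh_oauth_token   # hypothetical helper that rewrites .credentials.json
) 9>/tmp/claude-oauth.lock
```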

374 skills — the part that actually changes how Claude thinks


Skills are Markdown files with YAML frontmatter. We have 374 of them in /opt/nebo/openclaw/skills/. Forty-seven are our own (nebo-*), the rest came from OpenClaw and various imports we then rewrote.
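
For readers who haven’t opened one: a skill file is tiny. The skeleton below follows the Markdown-plus-frontmatter shape; the name and content are invented for illustration, not one of our nebo-* skills.

```markdown
---
name: nebo-example-skill
description: One sentence the classifier can match against to decide when to load this.
---

# What this skill enforces

- A hard rule the model must satisfy before delivery
- A second rule, phrased as a checkable assertion
- A pointer to the etalon/reference file the rule comes from
```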

Claude Code’s SessionStart hook loads a classifier that, on each user prompt, picks 3-5 skills matching the task type and injects them into context. SEO audit? Loads aaron-on-page-seo-auditor + aaron-technical-seo-checker + nebo-seo-mastery. Proposal? Loads humaniser, anti-AI checklist, our two etalon templates. Design? Loads nebo-web-design-mastery plus the Lazyweb references for 257k real-app screenshots.

Here’s the trick that took a while to land: reverse skill transformation. The naive way to use a skill is to read it. The brain skims, says “I know this”, and ignores the unique bits. We force the model to re-write the skill as a fake brief from the client — “Nebo wants: parallax via scroll-driven animations with Lenis + GSAP ticker, hero CTA above the fold, 8-point spacing scale, OKLch palette only.” That fake brief becomes a hard checklist the model has to satisfy before delivery. Difference between “generic good output” and “WCL signature output”. Boring, mechanical, works.

Our four house skills

We wrote a few skills ourselves because the generic ones miss our voice and our checks:

  • nebo-seo-mastery — the 200-item audit checklist we actually use on client sites, with GSC + GA4 + Semrush hooks.
  • nebo-content-writing — 22-point anti-AI checklist, ban-word list, burstiness rules.
  • nebo-web-design-mastery — five direction systems (Editorial, Modern Minimal, Human, Tech Utility, Brutalist) with prepared OKLch palettes.
  • nebo-content-factory — the pipeline that runs btcnews.biz daily.

Six sub-agents — and why one of them is just QA

Claude Code lets you define agents as separate context windows with their own system prompts. We run six. Each has a job and refuses scope-creep, which is more important than it sounds.

| Agent | Job | Why it exists separately |
|-------|-----|--------------------------|
| wcl-designer | Edits the webcorelab.com theme files | Design DNA pre-loaded; refuses without proxy-routing check |
| wcl-qa | Visual + cache QA after any theme edit | Forces Playwright screenshot of dev + prod before “done” |
| proposal-writer | Client docs (cover letters, audits, decks) | Anti-AI preflight + price-free rule baked in |
| semantic-harvester | Semrush + GSC keyword pulls | UI scraper, not API — Semrush API tokens churn |
| osint-researcher | Maigret + 3,000+ public site lookups | Ethical guard built in; refuses harassment requests |
| task-onboarder | Discovery walkthrough before complex tasks | Prevents “starting work without scoping” failures |
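
For reference, Claude Code reads sub-agent definitions as Markdown files with YAML frontmatter, typically under .claude/agents/. A sketch of what a wcl-qa-style definition can look like; the body here is illustrative, not our actual prompt:

```markdown
---
name: wcl-qa
description: Visual + cache QA after any theme edit. Use after every change to theme files.
tools: Bash, Read, Grep
---

You are the QA gate for theme changes. Before reporting "done":
1. Take Playwright screenshots of the affected page on dev AND prod.
2. Compare against the pre-change state; flag any layout shift or stale cache.
3. Refuse anything outside visual/cache QA. That is another agent's job.
```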

The one that paid for itself the fastest is wcl-qa. We had three incidents in two weeks where the main agent reported “done” on a theme change without screenshotting the page in a browser. Once wcl-qa got added as a forced step, those incidents went to zero. Worth saying out loud: an eyes-on visual check is not optional, and curl is not a UI test.

Hooks — the part most people skip

Claude Code has a hooks chain you can configure in settings.json. Hooks run before/after tool calls, on session start, on user prompts. This is where 90% of our reliability lives.

```
SessionStart     → 4 hooks  (boot indicator, extended memory inject, OpenClaw sync, personal layer)
UserPromptSubmit → 4 hooks  (skill loader, memsearch inject, GPT-5.5 routing advisor, model persist)
PreToolUse       → 6+ hooks (design guard, audit preflight, antipattern blocker, agent kickoff injector, secret scanner, catastrophic command blocker)
Stop             → 2 hooks  (auto-capture facts to drafts, cleanup)
SessionEnd       → 1 hook   (snapshot writer)
```
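
The registration side is plain JSON in settings.json. A minimal sketch of one PreToolUse entry (the matcher and script path are illustrative, not our actual config):

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "/opt/nebo/hooks/design-guard.sh", "timeout": 10 }
        ]
      }
    ]
  }
}
```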

A few examples of what these catch in practice:

  • design-guard — blocks Edit/Write on theme CSS/PHP unless the agent has read the 56KB Design DNA file in this session. Stops “let me just bump padding real quick” disasters.
  • secret-scanner — 50+ vendor regex on every git commit. Caught two Stripe keys and one AWS pair in the last quarter that would have shipped publicly.
  • catastrophic-cmd-blocker — refuses rm /, mkfs, fork bombs. Belt and braces, but cheap to install (minimal sketch after this list).
  • audit-preflight — when the user types “audit X.com”, forces a two-stage AskUserQuestion (“which sections?” → “which atoms within each?”) instead of charging straight in.
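
Here’s roughly what the simplest of these looks like as a script. Claude Code hands the tool call to the hook as JSON on stdin, and exit code 2 blocks the call and surfaces stderr to the model. The patterns below are illustrative, not our full list:

```bash
#!/usr/bin/env bash
# PreToolUse hook: block obviously destructive shell commands.
input=$(cat)                                      # tool call as JSON on stdin
cmd=$(jq -r '.tool_input.command // empty' <<<"$input")

case "$cmd" in
  *"rm -rf /"*|*mkfs*|*":(){ :|:& };:"*)
    echo "blocked: catastrophic command pattern" >&2
    exit 2 ;;                                     # exit 2 = block the call
esac
exit 0
```
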
A thing I’d warn about: hooks are a foot-gun. Every hook adds latency to every tool call. We had a phase where SessionStart took 11 seconds because someone wrote a hook that did a full repo grep. Profile your hooks. Budget them. Kill the slow ones.

Memory: 481 files, semantic search, hard-inject on hit


The memory layer is what makes the agents stop forgetting. We have 481 Markdown files split into three flavours:

  • 138 feedback files — lessons from incidents. Format: “we did X, it broke Y, here’s the rule now.”
  • 129 reference files — etalons, methodologies, API registries, design DNAs.
  • 115 session snapshots — point-in-time state at the end of each working session. New sessions read the latest snapshot before doing anything.

Searching them with grep is too slow and too dumb. We index with sqlite-vec + a MiniLM embedding, refresh every 2 minutes, and expose a CLI: memsearch "rate limit on Claude Max". Top-K semantic matches come back in 200ms.

The trick that surprised me: hard-inject on close match. If the top result has cosine distance under 1.05, a hook pastes the entire file content into the prompt instead of just the title. The model doesn’t have to choose to read it — it’s already there. This single change reduced “agent reinvents the wheel” incidents to near zero.
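
Put together, the query side of memsearch is small. A sketch, with the extension path, table layout, and the minilm_embed helper all assumed rather than copied from our setup:

```bash
#!/usr/bin/env bash
# memsearch sketch: embed the query, KNN-search the sqlite-vec index.
DB=/opt/nebo/memory/index.db
qvec=$(minilm_embed "$*")        # hypothetical: prints a JSON array vector

sqlite3 "$DB" <<SQL
.load /usr/lib/sqlite-vec/vec0
SELECT f.path, v.distance
FROM (SELECT rowid, distance FROM mem_vec
      WHERE embedding MATCH '$qvec' AND k = 5) v
JOIN mem_files f ON f.id = v.rowid
ORDER BY v.distance;
SQL
```

The hard-inject hook is the same query with k = 1, plus a check that pastes the whole file body into the prompt when the distance comes back under the threshold.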

Real production numbers — btcnews.biz Content Factory

The proof everyone actually wants. We run a daily crypto news pipeline on a separate domain (btcnews.biz) as our showcase tenant. Here’s what it does and what it costs:

| Metric | Value |
|--------|-------|
| Posts published per day | 30+ |
| LLM spend / month | $0.18 |
| Average rewrite latency / post | 2.8s |
| 24h smoke test (after v2 upgrade) | 21/21 PASS |
| Anti-AI detection score (RoBERTa) | 12-13% (target ≤15%) |
| Image cost / post (DALL-E) | $0.04 |

That $0.18 is for the rewrite-classify-dedup chain alone. T3 OpenRouter, Gemini Flash for the brief extraction step, Qwen Flash for the dedup pass. Image generation is a separate line and that’s where the actual money sits — but DALL-E at $0.04/post for a daily news factory is comfortable.

The point is not that this is cheap. The point is that the per-unit cost is small enough that we stopped negotiating with ourselves about whether to ship.

WebCoreLab studio

Building AI infrastructure that doesn’t bleed money

We’ve shipped this stack for our own ops and for clients. If your team is staring at a Claude/OpenAI bill that doesn’t match the value you’re getting back, we’ll do a 30-minute audit at no charge — show you where the leaks are, what to consolidate, what’s worth a sub-agent vs a prompt.

Talk to us about your AI infra

Tradeoffs and what I’d warn against

This stack is not free. The setup cost is real. Here’s where it bites:

  • Two subscriptions, not one. Anthropic Max + Codex Plus = ~$260/month flat. Below 300 dev-hours/month it’s overkill — just pay per call.
  • Subscription accounts get rate-limited. Burst hard on a Sunday afternoon and the quota monitor will pause Agent/Task tools at 95%. We have a Telegram alert. You’ll need one.
  • Hook chain has a learning curve. First three weeks you’ll add hooks that fight each other. Profile, log, kill the duplicates.
  • 374 skills is too many. We added them faster than we audited them. Roughly 30% are stale or redundant. Pruning is on the list.
  • Memory bloat. 481 files means search is fast but human review is hard. We’re moving toward auto-archive at 90 days unless cited.

If you’re starting from zero, I’d build in this order: 1) one cascade router, 2) one Markdown memory file with hard-inject, 3) one sub-agent for the thing you do most often, 4) one PreToolUse hook that blocks something obviously dangerous. Stop there for two weeks. Add the next thing only when you can name the incident it would have prevented.

FAQ

Are you using OpenClaw? Is this the same thing Neeraj sells?

Partially. The OpenClaw skill format is open — we use it, we contribute skills back. We are not affiliated with anyone selling courses on it. Their content is fine. The point of this post is that the value isn’t in the skill files alone — it’s in the routing, the hooks, and the memory that surrounds them. The skills are 20% of the lift.

How do you handle Claude Max rate limits?

A cron job parses the JSONL session files every 5 minutes, calculates the rolling 5-hour usage, and writes a pause flag at 95% threshold. A PreToolUse hook blocks Agent/Task tools while the flag is up. We resume below 80%. Manual calibration after the first real 429: claude_quota_set_limit.sh <msgs> <tokens>.
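
A sketch of that monitor, with the transcript glob and JSONL field names as assumptions about the session file format:

```bash
#!/usr/bin/env bash
# Every 5 minutes via cron: count assistant messages in the rolling 5-hour
# window and manage the pause flag the PreToolUse hook checks.
LIMIT="${LIMIT:-900}"                    # calibrated after the first real 429
FLAG=/tmp/claude-quota.pause
since=$(date -u -d '5 hours ago' +%Y-%m-%dT%H:%M:%S)

used=$(cat ~/.claude/projects/*/*.jsonl 2>/dev/null \
  | jq -r --arg since "$since" \
      'select(.type? == "assistant" and (.timestamp? // "") >= $since) | 1' \
  | wc -l)

if   [ "$used" -ge $((LIMIT * 95 / 100)) ]; then touch "$FLAG"   # pause at 95%
elif [ "$used" -le $((LIMIT * 80 / 100)) ]; then rm -f "$FLAG"   # resume below 80%
fi
```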

What if I don’t want to write skills in Markdown?

You don’t have to. A skill is just a structured prompt you load conditionally. JSON works, YAML works, an inline string in your router works. Markdown won because it’s diff-friendly and humans read it during code review. The format is not the point.

Why six sub-agents and not one big one?

Context separation. A 200K-token context full of design DNA is awful for SEO work and vice versa. Sub-agents let us pre-load the right context for the right job without bloating every other call. They also let us refuse scope-creep at the agent level — wcl-designer won’t do SEO, period.

What’s the single thing you’d tell a team starting out?

Build the memsearch + hard-inject loop before anything else. Most teams I’ve seen fail at AI infra fail at the same point: the model keeps forgetting what they decided yesterday. Solve that with a 200-line script and a vector DB, and the rest of the stack becomes optional.

Can I see the code?

Parts of it. The cascade router, the agent kickoff injector, and a couple of hooks are open. The proprietary skills and our memory corpus are not. If you’re a working dev team and want a guided walk-through, that’s what the studio call link above is for.