Caveman: Cut Your LLM Token Costs by 65–87% Without Losing Accuracy

Token costs are invisible until they aren't. You open the billing dashboard, see the number, and realize that half of what you paid for was Claude saying "Certainly! I'd be happy to help you with that."

Caveman is a Claude Code skill that fixes this at the output level. The pitch is simple: compress everything the model writes back to you into terse, accurate fragments — no preamble, no summaries, no verbal padding. The name is the documentation.

Why Output Tokens Are the Hidden Cost Center

Most developers obsess over prompt engineering: trimming system prompts, compressing context windows, batching requests. All of that is worth doing. But output tokens are quietly expensive too, and most of the waste is purely stylistic.

A standard Claude response to "how do I fix this PostgreSQL query?" might run 2,347 tokens. It'll open with a pleasantry, walk you through the reasoning in full prose, provide a code block, then close with a summary of what it just told you.

The code block itself? Maybe 380 tokens.

That roughly 6:1 ratio (2,347 tokens against 380) is the problem Caveman solves. The benchmarks from the project's eval suite are worth seeing in full:

| Task | Normal | Caveman | Saved |
|------|--------|---------|-------|
| React re-render explanation | 1,180 tokens | 159 tokens | 87% |
| Auth middleware fix | 704 tokens | 121 tokens | 83% |
| PostgreSQL setup | 2,347 tokens | 380 tokens | 84% |
| Average | 1,214 tokens | 294 tokens | 65–75% |

These aren't cherry-picked. The repo includes formal evals with controls comparing caveman output against standard verbose baselines — not against artificially inflated ones.
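If you want to check the "Saved" column yourself, the arithmetic is one line. The sketch below only recomputes the percentages from the token counts in the table; the function and variable names are made up for illustration and are not part of Caveman.

```python
# Token counts copied from the benchmark table: (normal, caveman).
BENCHMARKS = {
    "React re-render explanation": (1180, 159),
    "Auth middleware fix": (704, 121),
    "PostgreSQL setup": (2347, 380),
}


def percent_saved(normal: int, caveman: int) -> int:
    """Share of output tokens removed, rounded to a whole percent."""
    return round(100 * (1 - caveman / normal))


savings = {task: percent_saved(n, c) for task, (n, c) in BENCHMARKS.items()}

for task, pct in savings.items():
    print(f"{task}: {pct}% saved")
```

The three task rows come out at 87%, 83%, and 84%, matching the table. Applying the same formula to the Average row's token counts (1,214 and 294) gives about 76%, slightly above the 65–75% range printed there; the table does not say how that range was derived.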

Context Management: The Bigger Win

Token savings on individual responses are nice. The compounding effect on context is where it actually matters.

Every AI coding session has a budget — the context window. In a long session, the model's own previous responses eat a significant chunk of that budget. Verbose output early in a session means you hit compression or truncation faster, which means the model loses track of earlier decisions, earlier code, earlier context.
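A back-of-the-envelope model makes the budget argument concrete. This is a deliberately simplified sketch: the window size and the per-request size are hypothetical numbers, the reply sizes are the table averages, and real sessions also spend context on file reads and tool output that this ignores.

```python
def turns_until_full(window: int, user_tokens: int, reply_tokens: int) -> int:
    """Count the turns that fit before accumulated history exceeds the window.

    Simplified model: every turn appends one user message and one reply,
    and nothing is ever compacted or dropped.
    """
    per_turn = user_tokens + reply_tokens
    return window // per_turn


WINDOW = 200_000   # hypothetical context window size, in tokens
USER_TOKENS = 300  # hypothetical average size of each request

verbose_turns = turns_until_full(WINDOW, USER_TOKENS, 1214)  # table average, normal
terse_turns = turns_until_full(WINDOW, USER_TOKENS, 294)     # table average, caveman

print(f"verbose: {verbose_turns} turns, terse: {terse_turns} turns")
```

Under these assumptions the verbose session fills the window after 132 turns and the terse one after 336. The exact numbers matter less than the shape: reply length sits in the denominator, so shorter replies buy more turns before compaction or truncation kicks in.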

Caveman's caveman-compress command attacks this directly. It rewrites your memory files — the markdown files that carry context between sessions — to reduce their input token footprint by approximately 46%. Less input, more room for the actual work.

The mechanics:

Removes filler words and transitional prose
Converts flowing paragraphs to fragments and lists
Strips redundant context that can be inferred
Preserves technical accuracy (this is the hard part — the evals verify it)

For anyone running a multi-session project with accumulated CLAUDE.md files, session notes, and memory entries, this isn't a marginal improvement. It can be the difference between a context window that lasts two hours and one that lasts five.
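To get a feel for the first mechanic in the list, here is a toy filler stripper. It is not how caveman-compress works internally; the filler list, the function name, and the rule of skipping fenced code are all assumptions made for illustration.

```python
import re

# Hypothetical filler list for illustration; not Caveman's actual rules.
FILLER = ("basically", "simply", "just", "really", "actually", "very")
FILLER_RE = re.compile(r"\b(?:" + "|".join(FILLER) + r")\b\s*", re.IGNORECASE)
FENCE = "`" * 3  # Markdown code-fence marker


def strip_filler(markdown: str) -> str:
    """Drop filler words from prose lines; leave fenced code lines untouched."""
    out, in_code = [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # entering or leaving a code block
            out.append(line)
        elif in_code:
            out.append(line)       # never rewrite code
        else:
            out.append(FILLER_RE.sub("", line).rstrip())
    return "\n".join(out)


note = "\n".join([
    "You basically just need to run the migration.",
    FENCE,
    "just migrate",
    FENCE,
])
print(strip_filler(note))
```

The prose line shrinks to "You need to run the migration." while the command inside the fence keeps its "just". A word list this blunt will sometimes change meaning ("just in time" loses a word it needs), which is why the last item in the list, preserving technical accuracy, is the hard part.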

What It Actually Produces

Three modes, each with a different compression ceiling:

Lite — strips filler, keeps grammar intact. Good if you're sharing outputs with non-technical stakeholders who'd find fragments confusing.

Full (default) — uses fragments, drops articles and connectives. This is the sweet spot for solo developer work. Reads like a senior engineer's Slack messages.

Ultra — telegraphic. Maximum compression. Use this for anything you're logging, storing, or passing between agents where human readability isn't the priority.

The specialized commands are where it gets practical:

caveman-commit — generates conventional commit messages capped at 50 characters. No more three-paragraph commit messages that say nothing.
caveman-review — one-line PR comments. No preamble, no "Great work on this PR!"
caveman-compress — rewrites memory/context files to cut input tokens by ~46%
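If you want to enforce the same commit constraint in a hook or CI step, a check is easy to sketch. The code below is a hypothetical validator, not something Caveman ships: it assumes the 50-character cap applies to the subject line, and its regex is a loose approximation of the Conventional Commits subject format rather than a full implementation of that spec.

```python
import re

MAX_SUBJECT = 50  # the cap the article cites for caveman-commit

# Loose pattern for a Conventional Commits subject: type, optional scope,
# optional "!" for breaking changes, then ": description".
SUBJECT_RE = re.compile(r"^[a-z]+(\([\w./-]+\))?!?: \S.*$")


def check_subject(subject: str) -> list[str]:
    """Return a list of problems; an empty list means the subject passes."""
    problems = []
    if len(subject) > MAX_SUBJECT:
        problems.append(f"{len(subject)} chars, limit is {MAX_SUBJECT}")
    if not SUBJECT_RE.match(subject):
        problems.append("not in 'type(scope): description' form")
    return problems


print(check_subject("fix(auth): reject expired tokens"))
print(check_subject("Updated some stuff in the authentication middleware layer"))
```

The first subject passes with an empty list. The second fails both checks: it runs to 57 characters and has no type prefix.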

Installation

Claude Code:

claude plugin marketplace add JuliusBrussee/caveman
claude plugin install caveman@caveman

Gemini CLI:

gemini extensions install https://github.com/JuliusBrussee/caveman

Other agents:

npx skills add JuliusBrussee/caveman -a [agent-name]

Once installed, the modes are available as skills. Default to full and adjust from there.

When It Matters Most

Caveman pays off most in three scenarios:

High-frequency tasks. Commit messages, PR reviews, quick fixes — things you do dozens of times a day. The overhead of verbose output accumulates fast.

Long sessions. If you're running a multi-hour build session, output verbosity compounds against your context window. Caveman slows the burn.

Multi-agent pipelines. When agents pass outputs to other agents, each reply is billed once as output and again as the next agent's input. Compressed outputs mean cheaper API calls at every hop.

It matters less for one-off explanations where you actually want the full reasoning, or when sharing outputs with clients who need readable prose. The modes exist for this reason — lite covers the middle ground.
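The multi-agent case is the easiest of the three to put numbers on. The sketch below models a simple linear chain; the prices are hypothetical placeholders, the reply sizes are the table averages, and the function name is invented for this example.

```python
def pipeline_cost(reply_tokens: int, hops: int,
                  in_price: float, out_price: float) -> float:
    """Cost of a linear chain where each agent's reply feeds the next agent.

    Every reply is billed once as output; every reply except the last is
    billed again as input to the following agent. Prices are per token.
    """
    output_cost = hops * reply_tokens * out_price
    handoff_cost = (hops - 1) * reply_tokens * in_price
    return output_cost + handoff_cost


# Hypothetical prices, in dollars per million tokens; substitute your own.
IN_PRICE = 3.00 / 1_000_000
OUT_PRICE = 15.00 / 1_000_000

verbose = pipeline_cost(1214, hops=4, in_price=IN_PRICE, out_price=OUT_PRICE)
terse = pipeline_cost(294, hops=4, in_price=IN_PRICE, out_price=OUT_PRICE)

print(f"verbose ${verbose:.4f}, terse ${terse:.4f} per run")
```

In this linear model the percentage saved is the same at any hop count, because cost scales directly with reply length. What grows with each hop is the absolute number of tokens you stop paying for, and that adds up quickly when the pipeline runs thousands of times a day.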

The Honest Tradeoff

Caveman changes how Claude communicates with you, not what it knows or how it reasons. The tradeoff is readability for cost and context headroom.

Some developers find ultra-compressed output disorienting at first. The fragments can read as brusque. You adjust faster than you'd expect — the same way you adjust to a colleague who texts in bullets instead of paragraphs.

The benchmarks show accuracy is preserved. The evals in the repo are the receipts. For a tool that touches output format, that verification matters — "terse but wrong" is worse than verbose.

If you're spending meaningfully on Claude API calls, or running into context limits on long sessions, this is a one-install fix that doesn't require you to rethink your prompts or your stack.

View Caveman on GitHub


Written by McKlaud AI. Want to know which AI tools actually fit your business? Get a free AI audit.
