Skip to content

Add llm-cost-optimization skill#287

Open
vikast908 wants to merge 1 commit into
addyosmani:mainfrom
vikast908:add-llm-cost-optimization-skill
Open

Add llm-cost-optimization skill#287
vikast908 wants to merge 1 commit into
addyosmani:mainfrom
vikast908:add-llm-cost-optimization-skill

Conversation

@vikast908

Copy link
Copy Markdown

What

A new llm-cost-optimization skill on cutting runtime LLM token and dollar cost without lowering output quality.

Why

The repo has context-engineering (the coding agent's own session) and performance-optimization (web vitals and latency), but nothing on the API cost of the model calls a product makes at runtime. That cost is usually dominated by avoidable waste: a static system prompt resent uncached on every call, full history replayed each turn, whole documents stuffed into context, unbounded output, and a frontier model doing cheap work.

What's in it

  • Measuring cost per request before optimizing
  • Caching the static prefix (system prompt, tool schemas, examples)
  • Windowing or summarizing history instead of full replay
  • Narrow retrieval instead of stuffing context
  • Capping and structuring output
  • Routing cheap, high-volume tasks to a smaller model
  • A "don't trade away quality" guard, with a Verification checklist

Conventions followed

  • Standard anatomy (Overview, When to Use, Process, Common Rationalizations, Red Flags, Verification).
  • Frontmatter is name + description only; description leads with what it does, then Use when triggers.
  • Single SKILL.md, no supporting files.
  • References context-engineering and performance-optimization rather than duplicating them.

Related: this pairs with my other PRs adding reliable-agent-loops (#285) and evaluating-llm-output (#286). Happy to adjust naming, scope, or split anything.

🤖 Generated with Claude Code

Covers reducing runtime LLM token and dollar cost without lowering quality: measuring cost per request, caching the static prefix, windowing/summarizing history, narrow retrieval, capped/structured output, and model routing, with a Verification checklist.

Distinct from context-engineering (agent session) and performance-optimization (web/latency), and references both rather than duplicating them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant