← June 2026
App Idea Cards 2026-06-03

ToolCull

ToolCull

ToolCull

A toolbelt auditor for AI agents — points at your MCP servers and agent skills, replays real session logs to see which tools actually fire, embedding-clusters the descriptions to expose silent overlap, and emits a pruned per-task manifest that keeps each agent run inside the 5–7-tool sweet spot where selection accuracy stops cratering.

Problem

A team wires up eight MCP servers and a folder of agent skills over two months, and the agent quietly gets worse. The cause is well-measured now: past roughly 20 exposed tools, tool-selection accuracy degrades sharply, and in head-to-head tests an agent handed 50+ tools lands the right call ~60% of the time versus ~92% when given only the 5–7 tools relevant to the task. The extra tools don't just waste context tokens — they spread the model's attention thin, invite parameter hallucination, and push it to invent plausible-sounding tools or decline to act at all. Nobody decides to over-equip an agent; it accretes one npm install, one "just add the Notion server," one copied skill at a time, and there's no tool today that tells you which half of the toolbelt is dead weight or actively dragging accuracy down.

Target user

The AI-engineering lead or indie builder running production agents on Claude Code, Cursor agent mode, Windsurf, or a custom Agents-SDK stack with a sprawling MCP + skills inventory. They've already felt the "it used to pick the right tool, now it flails" regression and suspect tool bloat, but their observability stack shows them token spend and traces — not which tools are redundant, never-invoked, or semantically colliding. They want a usage-grounded verdict and a concrete pruned config, not another dashboard to read.

MVP scope

  • Ingest the active tool surface from local agent configs (Claude Code / Cursor / Windsurf / Claude Desktop MCP blocks + skill manifests) into one normalized inventory of tool name, description, schema, and source server.
  • Replay N recent session transcripts/trace logs and compute per-tool invocation stats: fire count, success/error rate, last-used timestamp, and "never invoked across the window" flags.
  • Embedding-cluster tool + skill descriptions to surface silent overlap (two filesystem-write tools, three "search" tools) and rank each cluster's members by real usage so duplicates are obvious.
  • Score each tool: keep / merge-candidate / dead (never fired) / overflow (pushes the active set past the 20-tool danger line), with the 5–7-tool target called out per task profile.
  • Emit a pruned lean manifest — a ready-to-paste reduced MCP/skills config (or a task-scoped allowlist) plus a one-page rationale of what was cut and why.
  • Local-first: all transcripts and configs stay on the machine; no source or tool traffic leaves.

Monetization

Freemium. Free single-machine CLI: full inventory, usage replay, overlap clustering, and a lean-manifest export for one agent. Paid tier ($15/dev/mo): cross-machine and team-wide tool inventories, scheduled drift reports ("3 new tools added this week, active set now 27 — over the line"), task-profile manifests (a lean set per workflow rather than one global cut), and CI hooks that fail a PR when a config change pushes the exposed toolset past a set threshold. Team/enterprise adds shared overlap policies and an export pipeline for AI-ops reporting.

Why now

MCP servers are now openly described as "the new shadow IT" — they accrete fast and unmanaged, and the 2026 wave of "MCP tool overload" write-ups (jenova's 60%-vs-92% accuracy comparison, lunar.dev, jentic's "MCP tool trap") moved the problem from anecdote to measured failure mode within the last 30 days. The research answer — dynamic tool selection that loads only a relevant 5–7-tool subset (RAG-MCP / MCP-Zero report it more than triples selection accuracy while cutting prompt tokens ~50%) — assumes you already know which tools matter; almost nobody has that inventory. ToolCull is the missing first step: measure the sprawl before you route around it. With 79% of orgs now running agents and tool counts climbing per agent, the audit layer is overdue.

Risks & open questions

  • Demand-side: teams may treat tool bloat as a "later" problem and never pay until an agent visibly breaks — the wedge has to lead with a free, alarming-but-true inventory report that sells itself.
  • Trace access varies: Claude Code, Cursor, and custom SDK stacks log differently, and some agent runners don't expose per-tool invocation cleanly; the replay layer may need per-host adapters and graceful degradation to config-only analysis.
  • Embedding overlap clustering will produce false "duplicates" (two search tools that are genuinely different scopes); the merge suggestions must be advisory with an easy reject, never auto-applied.
  • The upstream fix is dynamic tool discovery built into the agent runtimes (MCP-Zero-style) — if Anthropic/Cursor ship native relevance filtering, the pruning value shrinks; the durable moat is the measurement + governance layer, not the cut itself.
  • Build-side: keeping config/transcript parsing current across four-plus fast-moving agent hosts is real maintenance surface for a solo build.

Next step

Promote to a weekly prototype: build a CLI that ingests a real Claude Code MCP config plus a captured transcript, prints the keep/merge/dead/overflow scorecard, and exports a lean manifest — then measure tool-selection accuracy on a fixed task set before vs. after the cull.

Sources

More from App Idea Cards