Guide
Scope an AI Agent Project
By Spencer Hill, Founder · Last updated: May 24, 2026
What you'll learn
Most agent projects fail in scoping, not in code. This guide is the checklist we run with every client before a single prompt is written: how to pick a success metric, inventory the tools the agent needs, choose between single-shot, ReAct, and planner-executor architectures, set up evals from day one, model the unit economics, and cut a defensible MVP. The output is a short scoping doc that gives engineering a real plan and gives stakeholders a real number to grade the project on.
Prerequisites
- Access to the workflow or process the agent will automate
- A stakeholder who can name the business outcome in one sentence
- API access to the systems the agent will read from and write to
- A sample of 30-100 real inputs (tickets, emails, PRs, etc.)
- Budget authority for model spend and engineering time
Steps
1. Define the single success metric
Pick one number that, if it moves, means the project worked. Examples: resolution rate on support tickets, deals enriched per hour, or percent of PRs auto-triaged correctly. Vague goals like 'improve productivity' kill agent projects because there's no way to know when you're done.
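As an illustration, the resolution-rate example above can be reduced to one function and one pass/fail check. This is a minimal sketch: `TicketOutcome`, `resolution_rate`, and `project_worked` are hypothetical names, and the definition of "resolved by the agent" is an assumption you would settle with the stakeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TicketOutcome:
    """One handled ticket (hypothetical record shape)."""
    ticket_id: str
    resolved_by_agent: bool  # assumption: True means closed with no human touch


def resolution_rate(outcomes: list[TicketOutcome]) -> float:
    """Fraction of tickets the agent resolved on its own; 0.0 for an empty list."""
    if not outcomes:
        return 0.0
    return sum(o.resolved_by_agent for o in outcomes) / len(outcomes)


def project_worked(baseline: float, current: float, target_lift: float) -> bool:
    """True when the metric moved by at least the lift agreed up front."""
    return current - baseline >= target_lift
```

Writing the target lift down before any prompt is written is what lets stakeholders grade the project on a number later.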
2. Map the tools the agent actually needs
List every external action: read a CRM record, write a calendar event, query a warehouse, call an internal API, post to Slack. Each tool needs an owner, a stable schema, idempotency guarantees, and a permission model. As a rough rule of thumb, the tool surface is about 70% of the engineering work and model choice is about 5%.
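One way to keep the tool inventory honest is to record it as data and audit it for gaps. The sketch below is illustrative only: `ToolSpec`, its fields, and `audit_tools` are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """One row in the tool inventory (hypothetical shape)."""
    name: str
    owner: str                      # team or person accountable for the tool
    input_schema: dict              # stable argument schema, e.g. JSON Schema
    writes: bool                    # True if the tool changes external state
    idempotent: bool                # safe to retry without duplicate effects
    required_scopes: frozenset = frozenset()  # permissions the agent must hold


def audit_tools(tools: list[ToolSpec]) -> list[str]:
    """Return a human-readable list of gaps in the inventory."""
    problems = []
    for tool in tools:
        if not tool.owner:
            problems.append(f"{tool.name}: no owner")
        if not tool.input_schema:
            problems.append(f"{tool.name}: no schema")
        if tool.writes and not tool.idempotent:
            problems.append(f"{tool.name}: write tool is not idempotent")
        if not tool.required_scopes:
            problems.append(f"{tool.name}: no permission model")
    return problems
```

An empty result from `audit_tools` does not prove the tools are safe. It only shows that each of the four properties has been written down.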
3. Choose a loop architecture
There are three viable shapes. (a) Single-shot: one prompt in, one structured output out. Use it when the task is bounded and tools aren't needed. (b) ReAct / tool-calling loop: the model decides at each step whether to act or finish. Use it for most agent work. (c) Planner-executor: a planning pass produces a typed plan and a deterministic executor runs it. Use it when steps are expensive, audit logs matter, or actions touch money.
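To make shape (b) concrete, here is a minimal sketch of a tool-calling loop with a hard step cap. The model is replaced by a scripted `decide` function so the example runs without any API. `Act`, `Finish`, `run_loop`, and `lookup_order` are hypothetical names, and a real loop would pass the history to a model instead.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Act:
    """The model asks to call a tool."""
    tool: str
    args: dict


@dataclass
class Finish:
    """The model declares the task done."""
    answer: str


Decision = Union[Act, Finish]


def run_loop(decide: Callable[[list], Decision],
             tools: dict,
             task: str,
             max_steps: int = 5) -> str:
    """Act-or-finish loop: stop on Finish, or fail loudly at the step cap."""
    history = [("task", task)]
    for _ in range(max_steps):
        decision = decide(history)
        if isinstance(decision, Finish):
            return decision.answer
        result = tools[decision.tool](**decision.args)
        history.append((decision.tool, result))
    raise RuntimeError("step budget exhausted; escalate to a human")


def lookup_order(order_id: str) -> str:
    """Stub tool standing in for a real order-system call."""
    return "shipped"


def scripted_decide(history: list) -> Decision:
    """Stand-in for the model: call the tool once, then finish."""
    if len(history) == 1:
        return Act("lookup_order", {"order_id": "A1"})
    return Finish(f"Order status: {history[-1][1]}")


TOOLS = {"lookup_order": lookup_order}
```

Calling `run_loop(scripted_decide, TOOLS, "Where is order A1?")` returns `"Order status: shipped"`. A planner-executor variant would instead produce the full list of `Act` steps first and run them in a plain `for` loop with no model in between.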
4. Stand up evals before writing the agent
Build a labeled dataset of 30-100 representative inputs with expected outputs or rubric scores. Wire it into CI so every prompt change is measured. Without evals you're flying blind, and every 'this feels better' becomes a religious debate.
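A labeled dataset plus a pass-rate function is enough to start. The sketch below assumes exact-match scoring (after trimming whitespace and ignoring case) and a stub agent. `pass_rate`, `ci_gate`, the tolerance value, and the four sample rows are all illustrative, and a rubric scorer would replace the equality check.

```python
from typing import Callable

# (input, expected_output) pairs; a real set would hold 30-100 labeled examples.
DATASET = [
    ("I want a refund", "billing"),
    ("App crashes on login", "bug"),
    ("Refund my order", "billing"),
    ("How do I export data?", "other"),
]


def pass_rate(agent: Callable[[str], str], dataset: list) -> float:
    """Share of examples where the agent output matches the expected label."""
    if not dataset:
        raise ValueError("eval dataset is empty")
    passed = sum(
        1 for text, expected in dataset
        if agent(text).strip().lower() == expected.strip().lower()
    )
    return passed / len(dataset)


def ci_gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """CI passes only if the new pass rate has not regressed beyond tolerance."""
    return current >= baseline - tolerance


def stub_agent(text: str) -> str:
    """Placeholder for the real agent: a keyword rule that misses bug reports."""
    return "billing" if "refund" in text.lower() else "other"
```

In CI, the job would compute `pass_rate` for the changed prompt and fail the build when `ci_gate` returns `False` against the stored baseline.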
5. Estimate the unit economics and budget
Calculate tokens-per-task across input, output, and tool round-trips. Multiply by model price and expected task volume. Cap max_tokens, max_steps, and total spend per task. Agents without budgets routinely spend $10 on $0.50 jobs because nobody set the guardrails.
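The arithmetic is simple enough to keep in a small script next to the scoping doc. This is a sketch under stated assumptions: the per-million-token prices are parameters you fill in from your provider's price list, the token counts are per-step averages measured on your sample inputs, and `TaskProfile`, `cost_per_task`, `monthly_spend`, and `within_budget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Average token usage for one task (measured, not guessed)."""
    input_tokens_per_step: int   # prompt + history + tool results per model call
    output_tokens_per_step: int  # model output per call
    steps: int                   # model calls per task, including tool round-trips


def cost_per_task(profile: TaskProfile,
                  price_in_per_mtok: float,
                  price_out_per_mtok: float) -> float:
    """Cost of one task given prices per million input and output tokens."""
    tokens_in = profile.input_tokens_per_step * profile.steps
    tokens_out = profile.output_tokens_per_step * profile.steps
    return (tokens_in * price_in_per_mtok + tokens_out * price_out_per_mtok) / 1_000_000


def monthly_spend(task_cost: float, tasks_per_month: int) -> float:
    """Projected model spend at the expected task volume."""
    return task_cost * tasks_per_month


def within_budget(task_cost: float, cap_per_task: float) -> bool:
    """Guardrail check: is the average task under the per-task spend cap?"""
    return task_cost <= cap_per_task
```

For example, with placeholder prices of 3.00 per million input tokens and 15.00 per million output tokens, a task averaging 4,000 input and 500 output tokens over 6 steps costs (24,000 × 3 + 3,000 × 15) / 1,000,000 = 0.117 per task, or about 1,170 per month at 10,000 tasks.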
6. Design the human-in-the-loop fallback
Decide which actions require human approval, which can run autonomously, and what the escalation path looks like when confidence is low. The agent should know how to say 'I don't know' and where to send the task when it does.
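The approval policy can be written down as a small routing function before any agent exists. The sketch below assumes the caller supplies a confidence score between 0 and 1 (how that score is produced is out of scope here). The action names, the 0.8 threshold, and `route` itself are hypothetical.

```python
from enum import Enum


class Route(Enum):
    AUTO = "auto"                    # agent may act on its own
    APPROVAL = "needs_approval"      # agent proposes, a human approves
    ESCALATE = "escalate"            # agent says "I don't know" and hands off


# Assumption: actions that always need sign-off, e.g. anything touching money.
APPROVAL_REQUIRED = frozenset({"issue_refund", "delete_record"})


def route(action: str, confidence: float, min_confidence: float = 0.8) -> Route:
    """Decide who handles a proposed action."""
    if confidence < min_confidence:
        return Route.ESCALATE        # low confidence wins over everything else
    if action in APPROVAL_REQUIRED:
        return Route.APPROVAL
    return Route.AUTO
```

Checking confidence first means a shaky refund proposal is escalated rather than queued for approval, so reviewers only see proposals the agent is reasonably sure about.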
7. Cut the MVP scope ruthlessly
From the full feature list, pick the smallest slice that moves the success metric on a defined subset of inputs. One workflow, one tool surface, one user cohort. Ship in 4-6 weeks. Everything else is roadmap.
8. Plan rollout, monitoring, and a kill switch
Run shadow mode (agent runs, humans still act), then assisted mode (agent suggests, humans approve), then autonomous mode for the safest task subset. Log every step. Wire a feature flag to stop the agent globally in under 60 seconds.
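The three rollout modes and the kill switch can share one code path, which keeps logging identical across them. This is a minimal sketch: the `KillSwitch` class is an in-memory stand-in for whatever feature-flag service you use, and `Mode`, `handle`, and the return strings are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"          # agent runs, humans still act
    ASSISTED = "assisted"      # agent suggests, humans approve
    AUTONOMOUS = "autonomous"  # agent acts on the safest task subset


class KillSwitch:
    """In-memory stand-in for a global feature flag."""

    def __init__(self) -> None:
        self.engaged = False

    def engage(self) -> None:
        self.engaged = True


def handle(mode: Mode, kill_switch: KillSwitch, proposed_action: str, log: list) -> str:
    """Log every step, then decide what happens to the proposed action."""
    if kill_switch.engaged:
        log.append(("halted", proposed_action))
        return "halted"
    log.append((mode.value, proposed_action))
    if mode is Mode.SHADOW:
        return "logged_only"
    if mode is Mode.ASSISTED:
        return "awaiting_approval"
    return "executed"
```

Because the kill switch is checked before anything else on every call, flipping the flag stops all three modes at the next step without a deploy.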
Common pitfalls
- Skipping evals. Without a labeled dataset you cannot tell if a prompt change made the agent better or worse. Build evals before the agent.
- Treating the LLM as the product. The LLM is the engine. The product is the tools, the eval harness, the rollout strategy, and the kill switch.
- No step or token cap. An agent without max_steps and max_tokens will eventually do something expensive and embarrassing.
- Scope creep before the MVP ships. Every new tool added pre-launch adds its own success and failure cases to the eval surface. Resist.
- Skipping shadow mode. Going straight to autonomous is how you discover, in production, that the agent confidently mislabels 8% of inputs.
Next steps
If you want a partner to run this scoping with you and ship the MVP:
- AI & agent development services — agent design, evals, and production rollout.
- Web development — internal tooling and dashboards for human-in-the-loop review.
- Book a scoping call.