Part 3
Hiring, Guardrails & Measuring
We've explored what agents are, where they add value, and how teams are using them across real GTM workflows.
Now we come to the operational reality: actually making AI agents work. Not on LinkedIn, not in a demo or a sandbox, but in production where workflows are messy, stakes are high, and brand risk is real.
This section is about hiring agents with the right scope, deploying them with the right supervision, and measuring performance in a way that reflects real value. We'll cover:
- The Build vs. Buy decision, and when it makes sense to do either (or both)
- The role of human-in-the-loop (HITL) design in reducing risk and drift
- How to define guardrails and ownership for the agent
- What to measure, and what performance metrics actually mean in agent workflows
Build vs. Buy: The First Fork in the Road
Before you can measure performance or enforce guardrails, you have to answer the first operational question:
Are you building your own agent, or buying one off the shelf?
This question is less about speed or cost and more about ownership, flexibility, and accountability. And it has downstream consequences for how your agent performs, how it scales, and how you supervise it.
What You're Really Deciding
So how do you decide? Nina Butler says it starts with understanding the problem's complexity. If the workflow is simple and well-understood, like pulling account research, you might build it yourself. But when the task involves multiple inputs, market nuance, or subjective value framing, it's often better to rely on teams who've already done the hard work.
"The build vs. buy comes down to how nuanced of a problem you're trying to solve. If it's straightforward, like account research, you might build it. But if it's complex, with market nuance and value framing, you're better off with someone who's already figured it out."
She also cautions against underestimating the investment required to build something reliable, especially when AI literacy is still low across most teams.
"You might want to go with a vendor that's already had a leg up on this innovation, versus stumbling around in the dark trying to do it yourself."
The more strategic or sensitive the workflow, the more careful you need to be.
Sometimes the better move isn't building or buying: it's scoping more clearly. Because no agent can succeed without a clearly defined job. And no decision — build, buy, or blend — works without that first.
Whether you build or buy, one thing doesn't change: your responsibility. Just because you're using a vendor's agent doesn't mean you outsource accountability. And just because you've custom-built one doesn't mean it will behave predictably. That's what makes human supervision essential.
Human-in-the-Loop: Why Supervision Matters
AI agents aren't plug-and-play. As Ori Entis reminds us, agents are non-deterministic; they won't always behave the same way twice. They respond to new data, shifting context, and dynamic prompts. And that means they can drift, hallucinate, or misfire.
"Most old-school systems are deterministic. AI agents have an element of non-determinism… On one side it gives you flexibility. On the other, you have a risk that the agent will behave unexpectedly."
That unpredictability is the tradeoff. And that's what makes ongoing supervision essential. As such, every high-functioning agent needs a Human-in-the-Loop (HITL).
HITL is not synonymous with micromanagement; it's about structured supervision.
Nina Butler describes this as the "teammate" model:
"You have these two extremes: fully human or fully automated. In reality, you want a teammate concept where the AI handles the repetitive tasks and the human jumps in for the critical, empathetic moments."
A Maturity Model for Supervision
As agents prove themselves, supervision can evolve. But it should never disappear. Ori Entis offers a practical roadmap:
Start with 100% review of outputs.
Every output should be checked until you're confident in the agent's performance.
Move to sampling as confidence grows.
Once early reviews show consistent results, shift from checking every output to sampling. Log failures, track metrics over time, and tighten thresholds where drift shows up.
"The human in the loop has to be there, but it has to adapt to the situation. So maybe the human in the loop does sampling versus checking every single [output]… usually that sampling rate tends to go down as a system matures."
Use agents to supervise agents.
In high-volume or high-stakes contexts, a secondary agent can monitor outputs or flag anomalies. Example: a support-response agent generates replies, while a second agent flags answers that exceed a certain length, include pricing references, or contain sentiment risks for human review.
Add stricter controls for sensitive use cases.
The more critical the workflow, the stricter the controls. Tighter sampling, stricter thresholds, faster escalation paths — and humans still own the performance loop.
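The progression above can be sketched in a few lines of code. This is a minimal illustration, not a prescribed implementation: the stage names, review rates, length limit, and keyword list are all assumptions chosen for the example, and the sentiment check mentioned earlier is left out because it would need a separate model.

```python
from __future__ import annotations

import random
from typing import Optional

# Hypothetical review rates per maturity stage; tune these to your own risk tolerance.
REVIEW_RATES = {"new": 1.0, "proven": 0.2, "mature": 0.05}
SENSITIVE_FLOOR = 0.5  # assumed minimum review rate for sensitive workflows

MAX_REPLY_CHARS = 1200  # illustrative length limit
PRICING_TERMS = ("price", "pricing", "discount", "refund")  # illustrative keywords


def review_rate(stage: str, sensitive: bool = False) -> float:
    """Return the share of outputs a human should review at this stage."""
    rate = REVIEW_RATES[stage]
    return max(rate, SENSITIVE_FLOOR) if sensitive else rate


def flag_reasons(reply: str) -> list[str]:
    """Simple stand-in for a secondary supervising agent: list reasons to escalate."""
    reasons = []
    if len(reply) > MAX_REPLY_CHARS:
        reasons.append("too_long")
    lowered = reply.lower()
    if any(term in lowered for term in PRICING_TERMS):
        reasons.append("mentions_pricing")
    return reasons


def needs_human_review(reply: str, stage: str, sensitive: bool = False,
                       rng: Optional[random.Random] = None) -> bool:
    """Flagged replies always go to a human; the rest are sampled at the stage rate."""
    if flag_reasons(reply):
        return True
    rng = rng or random
    return rng.random() < review_rate(stage, sensitive)
```

At the "new" stage every reply is reviewed. As the agent matures, only flagged replies and a shrinking random sample reach a human, and the sensitive-workflow floor keeps that sample from dropping too low.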
Guardrails & Ownership
Every agent needs a job description, and it has to come with a rulebook: one that clearly defines what the agent is allowed to do, when it should escalate, and who's ultimately responsible for its behavior.
That's what guardrails are: a way to define your boundaries without hard-coding every if-then. You don't need 60 pages of red tape. You just need clear answers to a few key questions, and a team that knows who owns what.
What Guardrails Actually Look Like
Here's a quick framework you can use to define agent limits, no matter the workflow:
Example: Your support agent can draft ticket replies and pull knowledge base articles, but must not initiate customer refunds independently.
Example: In campaign QA, if a subject line includes risky keywords (e.g. pricing, legal claims), the agent pauses the send and routes the draft to a human.
Example: Every 10th outbound sequence sent by the agent is sampled by a marketer for tone and accuracy.
Example: If the onboarding agent can't find a tutorial for a user's setup, it sends a fallback message and tags the CX team instead of generating a guess.
Example: A spike in unsubscribes triggers an alert to pause automated nurture sequences and review recent changes.
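One way to keep rules like these explicit is to write them down as data rather than scattering them across prompts. The sketch below is illustrative only: the `Guardrails` structure, the action names, the keyword list, and the `route` function are hypothetical, and the sampling interval of 10 simply mirrors the outbound-sequence example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    allowed_actions: frozenset  # what the agent may do on its own
    blocked_actions: frozenset  # what it must never do independently
    risky_keywords: tuple       # terms that force a human review
    sample_every: int           # every Nth item goes to a reviewer


# Hypothetical rulebook for a support agent.
SUPPORT_AGENT = Guardrails(
    allowed_actions=frozenset({"draft_reply", "search_knowledge_base"}),
    blocked_actions=frozenset({"issue_refund"}),
    risky_keywords=("pricing", "guarantee", "legal"),
    sample_every=10,
)


def route(rules: Guardrails, action: str, text: str, sequence_number: int) -> str:
    """Decide what happens to one agent action: block, escalate, sample, or allow."""
    # Anything not explicitly allowed is treated as out of scope.
    if action in rules.blocked_actions or action not in rules.allowed_actions:
        return "block"
    lowered = text.lower()
    if any(word in lowered for word in rules.risky_keywords):
        return "escalate"
    if sequence_number % rules.sample_every == 0:
        return "sample"
    return "allow"
```

Treating unlisted actions as blocked is a deliberate choice in this sketch: the agent's scope grows only when its owner adds to the allowed list.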
Who Owns the Agent?
When ownership is distributed or unclear, agents lose reliability fast. Nobody monitors them closely. Nobody tunes them when things go off. And nobody feels responsible when something breaks. Guardrails keep agents safe. Ownership keeps them accountable.
Every agent should have a clearly defined “agent owner”, the person or team responsible for:
- Setting and updating the agent's scope
- Reviewing and approving changes to behavior or prompts
- Monitoring output quality and performance metrics
- Leading postmortems when failures happen
- Communicating changes to stakeholders
Questions to Clarify Ownership
Use these prompts to define ownership before deployment:
Who is responsible for defining the agent's job and workflows?
Who reviews its outputs regularly — and how often?
Who is notified when the agent fails or triggers escalation?
Who is allowed to make changes to its prompts or behavior?
Who maintains visibility across versions, tools, and deployment contexts?
Don't let vendors “own” the agent's purpose, even if they built it. You still own the outcomes. You still own the risk.
Measuring Agent Performance: What Good Actually Looks Like
It's tempting to evaluate AI agents the way you'd evaluate a new tool: uptime, output volume, maybe accuracy. But that misses the bigger picture.
Agents aren't just producing content or completing tasks. They're interacting with your workflows, your systems, and your team. And that means asking:
- Was the output useful?
- Did it save time or improve consistency?
- Did it help the team make a better decision, or a faster one?
Where Most Teams Go Wrong
Most early deployments track the wrong things.
To track real performance, measure across three layers:
Task Accuracy
Did the agent do what it was asked? Was the output complete, relevant, correct?
Workflow Impact
Did the agent reduce effort, increase consistency, or unblock the next step?
Team Adoption
Are people actually using it? Does it improve trust, speed, or clarity?
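As a rough illustration, the three layers can be rolled up from a simple review log. Everything here is an assumption made for the example: the `TaskLog` fields, the use of reviewer-estimated minutes saved as a proxy for workflow impact, and "the team used the output" as a proxy for adoption.

```python
from dataclasses import dataclass


@dataclass
class TaskLog:
    correct: bool         # reviewer judged the output complete, relevant, and correct
    minutes_saved: float  # reviewer's estimate versus doing the task manually
    used: bool            # the team actually used the output


def scorecard(logs: list) -> dict:
    """Summarize task accuracy, workflow impact, and team adoption for a batch of logs."""
    n = len(logs)
    if n == 0:
        return {"task_accuracy": 0.0, "avg_minutes_saved": 0.0, "adoption_rate": 0.0}
    return {
        "task_accuracy": sum(log.correct for log in logs) / n,
        "avg_minutes_saved": sum(log.minutes_saved for log in logs) / n,
        "adoption_rate": sum(log.used for log in logs) / n,
    }


example = [
    TaskLog(correct=True, minutes_saved=10, used=True),
    TaskLog(correct=False, minutes_saved=0, used=False),
    TaskLog(correct=True, minutes_saved=20, used=True),
    TaskLog(correct=True, minutes_saved=10, used=False),
]
print(scorecard(example))
```

Reading the three numbers side by side is the point: an agent with high task accuracy but low adoption is producing correct work that nobody uses.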
Remember: Impact ≠ Autonomy. An agent doesn't need to work independently to be valuable. It just needs to work reliably, and in the right context.
Field Notes
A year ago, Lattice made headlines for announcing the world's first AI “employee”: an HR agent to manage other agents. The internet mocked it; the use case felt dystopian.
Cut to today and that future no longer feels like a Silicon Valley parody. At enterprise scale, we're seeing the rise of internal “agent managers,” drag-and-drop agent workflows, and early-stage infrastructure for secure, accountable agent orchestration.
That's how fast this space is moving.
As we built this playbook, we spoke with operators and analysts to ground it in expert insight. But even outside formal interviews, we've had front-row seats to the AI agent rush. Here's what we're learning from the field:
Everyone wants agents. Few know what they're asking for.
We've seen $300k consulting deals signed just to advise orgs on “where to put an agent on the org chart.” We've heard of boards demanding agent adoption, only for exec teams to scramble so the next board meeting can report, “We've deployed two agents.” It's easy to laugh — but the truth is, every shift this big starts with confusion. We're currently in the phase where hype is moving faster than understanding.
Agentic systems are not yet the norm. Copilots are.
Most enterprise deployments today are still at the copilot stage: tools that enhance productivity through chat-like interfaces. This is the layer where adoption is actually happening. True AI agents capable of handling structured goals independently are still rare.
The breakthroughs will come when copilots grow up: from interface overlays to persistent agents that plan, act, and learn within bounded systems — not just responding to prompts. Right now, we're still in early innings.
The architecture matters more than the promise.
Many of today's so-called agents are just AI wrappers on brittle, legacy SaaS systems. And they break easily. Copilots are often poorly integrated, bolted on rather than built in. But if your copilot layer is brittle, your agent layer will be worse. The disruption will start with rethinking these foundations. And most commercial tools won't make that leap.
Vendors are still figuring it out, too.
It's not just buyers who are uncertain. Most vendors are also in test-and-learn mode. They need early users to discover what works. That means buyers aren't just adopting agents; they're co-developing them. And that's not a bad thing, as long as expectations are aligned.
The fear of looking behind the curve is real.
No one wants to be the one saying, “I don't get it.” So teams overspend, overpromise, and over-automate just to keep up appearances. But the most thoughtful leaders we met weren't rushing. They were asking better questions: What process are we trying to improve? What failure modes do we expect? Who owns the outcomes?
Confusion doesn't mean failure. This is what progress looks like.
Most of this playbook has already said it, so we'll keep this part short.
Don't fall for the myth that everyone else has it figured out. They don't. If you're experimenting with agents, start small. Scope tightly. Choose outcomes you can measure. Ask vendors sharp questions. And resist the urge to copy someone else's AI strategy; you know your business best.
That said, what we're seeing is also healthy. The doubt, the discovery, the urgency — all signs of a market waking up, not giving up. You're not falling behind if you're still figuring it out. You're doing it right.
That's what we're betting on at Petavue. And we're just as excited to see where this goes next.