📖 Lesson ⏱️ 90 minutes

Agentic Workflows and AI in FDE Deployments

Deploying LLM-powered agents that read the ontology, call tools, and amplify human operators

Why this lesson exists

Five years ago this lesson was not in the FDE curriculum. Today it is one of the most important. Customers ask about AI on day one of nearly every engagement. Many of them have been burned by an “AI pilot” that produced a demo and never crossed into operations. The FDE who can deploy a working, trusted, operationally adopted agent has a disproportionate impact on the engagement.

The good news: the foundations from Phases 2-4 — discovery, ontology, typed actions, operational app discipline — are exactly what a healthy agent deployment needs. If you have done that work, agents are an extension of it, not a separate discipline.

The bad news: many engagements skip those foundations and reach for agents directly, producing demos that wow executives and break the moment a real operator depends on them. This lesson is about not doing that.

The right place for agents

Agents are valuable when they:

  • Read across many sources to assemble context a human would have to alt-tab to gather
  • Suggest — not decide — consequential actions
  • Draft routine outputs (notifications, ticket descriptions, summary emails) for human review
  • Translate between unstructured (free text, voice, scanned docs) and structured (ontology actions)
  • Search and explore the ontology in response to operator questions

Agents are not valuable when they:

  • Replace a deterministic rule with a probabilistic one for no reason
  • Make terminal decisions on safety-critical workflows
  • Handle inputs the customer’s compliance regime hasn’t classified
  • Are deployed because a sponsor wants “AI in the demo”

A useful test: if you cannot articulate what the agent does that a well-designed app couldn’t do, the agent is the wrong choice.

Two agent shapes

Most FDE agent work falls into one of two shapes.

Shape 1 — The copilot

A chat surface alongside an operational app. The operator asks questions; the agent answers, citing the ontology. The operator drafts an action; the agent fills in plausible parameters; the operator reviews and submits.

Example for Northbound: Maria has a chat panel on the morning view. She types “what’s going on with NB-87423?” The agent assembles: the load’s current status, recent GPS positions, the assigned driver, similar loads from the customer, the last 5 LoadStatusChanges, the predicted ETA. It returns a short summary with links to drill-in.

Useful, low-risk, immediate value. The agent does not act — it informs.

Shape 2 — The autonomous workflow agent

A scheduled or event-triggered agent that takes a step or sequence of steps automatically, with human review at chosen gates.

Example for Northbound: at 6:00 AM, a “morning brief” agent runs. It pulls every slipping load, ranks them by impact, drafts a one-paragraph summary, identifies candidate reassignments, and posts the brief to a Slack channel where Maria sees it before she opens her laptop. She still does the reassignment — the agent does not act on its own.

Higher value per run, higher risk if mis-designed. Always keeps humans in the loop on the action.

The agent’s tools are the ontology’s actions

The single most important design principle for FDE-style agents:

An agent’s tools are the typed action types in the semantic layer.

Not free-text SQL. Not “browse the web.” Not “execute arbitrary Python.” The agent calls assignDriverToStops, reassignStop, markStopComplete — the same typed actions a human operator calls through the app.

Why this matters:

  • Validation is unified. The same preconditions that protect the system from a human mistake also protect it from an agent hallucination.
  • Audit is unified. Every agent action shows up in the same audit log as human actions, with a clear actor identity (agent:morning_brief_v3).
  • Reasoning is unified. When you debug a Friday incident, you don’t have to distinguish “the human path” from “the agent path” — both went through the same action surface.
  • Hand-off is unified. The customer’s engineers extend the agent by adding new actions to the ontology, not by hand-editing prompts.

If your platform requires you to expose tools to the agent, the rule is: every tool is a wrapper around exactly one semantic-layer action type or query. Nothing else.
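
A minimal sketch of that rule in Python, assuming a hypothetical semantic_layer client; the method names mirror the actions above but are illustrative, not a real SDK:

from dataclasses import dataclass

# Sketch of the one-tool-one-action rule. `semantic_layer` and its method
# signatures are assumptions; substitute your platform's typed action client.

@dataclass
class ToolResult:
    ok: bool
    detail: str

def make_reassign_stop_tool(semantic_layer):
    """Expose exactly one ontology action as an agent tool, nothing else."""
    def reassign_stop(stop_id: str, driver_id: str) -> ToolResult:
        # The same typed action a human operator calls through the app: the
        # semantic layer's preconditions and audit log apply to the agent too.
        result = semantic_layer.actions.reassignStop(
            stop_id=stop_id,
            driver_id=driver_id,
            actor="agent:morning_brief_v3",  # clear actor identity for audit
        )
        return ToolResult(ok=result.succeeded, detail=result.message)
    return reassign_stop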

A reference agent architecture

A pattern that works across most engagements:

   ┌────────────────────────────────────────────────────────────────┐
   │                       Operator (Maria)                         │
   └────────────────────────────────────────────────────────────────┘

                                  │ chat / brief / drafted action

   ┌────────────────────────────────────────────────────────────────┐
   │                         AGENT                                  │
   │                                                                │
   │   ┌──────────────┐    ┌─────────────┐    ┌──────────────┐      │
   │   │   Planner    │ ── │   Critic    │ ── │   Drafter    │      │
   │   │   (which     │    │  (validate  │    │  (write the  │      │
   │   │   tools?)    │    │   plan)     │    │   action)    │      │
   │   └──────────────┘    └─────────────┘    └──────────────┘      │
   │                                  │                             │
   └──────────────────────────────────┼─────────────────────────────┘

            ┌────────────────────────────────────────────┐
            │      Typed tools = ontology actions        │
            │                                            │
            │  query.loadsSlippingNow()                  │
            │  query.driverHistory(driver_id)            │
            │  action.reassignStop(stop_id, driver_id)   │
            └────────────────────────────────────────────┘


                          ┌─────────────────────┐
                          │   Semantic layer    │
                          └─────────────────────┘

Three roles inside the agent:

  • Planner — decides which tools to call, in what order
  • Critic — validates the plan against constraints (this driver is on rest hours; this stop is too far)
  • Drafter — produces a human-readable output (chat answer, brief, drafted action)

Some platforms collapse these into one LLM call; some separate them. The conceptual split is useful regardless.
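
As a sketch of the split, here are the three roles as three separate calls. llm_complete and parse_tool_calls are placeholders for your platform's completion API and output parsing; a single-call design folds the same logic into one prompt:

def llm_complete(prompt: str) -> str:
    """Placeholder for a vendor completion call; wire up your model here."""
    raise NotImplementedError

def parse_tool_calls(plan: str) -> list[str]:
    """Placeholder: extract the ordered tool names from the planner's output."""
    raise NotImplementedError

def run_agent(question: str, tools: dict) -> str:
    # Planner: decide which tools to call, in what order.
    plan = llm_complete(
        f"Question: {question!r}. Available tools: {sorted(tools)}. "
        "List the tool calls to make, in order."
    )
    # Critic: validate the plan against operational constraints
    # (rest hours, distance limits) before anything executes.
    verdict = llm_complete(f"Check this plan against constraints: {plan}. APPROVE or REVISE?")
    if "APPROVE" not in verdict:
        return "Plan rejected by critic; no tools were called."
    observations = [tools[name]() for name in parse_tool_calls(plan)]
    # Drafter: produce the human-readable output, grounded in the observations.
    return llm_complete(f"Answer {question!r} using only these observations: {observations}")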

The human-in-the-loop spectrum

Not every agent decision needs human review, and not every decision should run autonomously. Pick the right gate per decision class.

| Decision class | Example | Gate |
|---|---|---|
| Read-only summary | "Brief me on slipping loads" | None |
| Draft for review | "Suggested reassignment for NB-87423" | Human approves before submit |
| Reversible action with limit | "Send Maria a Slack ping when GPS goes stale" | Rate-limited; human can disable |
| Reversible action without limit | "Reassign a stop to a closer driver" | Always human-approved |
| Irreversible action | "Cancel a load" | Always human-approved + reason logged |
| Safety-critical action | "Take a driver off duty" | Almost never agent-driven; if so, multi-party approval |

The mistake is to apply one gate uniformly. Senior FDEs map each agent capability to a class and choose the gate deliberately.
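
One way to make that choice deliberate is to encode the class-to-gate mapping as data and fail closed. A sketch, with illustrative names:

from enum import Enum

class Gate(Enum):
    NONE = "no review"
    HUMAN_APPROVES = "human approves before submit"
    RATE_LIMITED = "rate-limited; human can disable"
    HUMAN_PLUS_REASON = "human approves + reason logged"
    MULTI_PARTY = "multi-party approval"

GATES = {
    "read_only_summary":        Gate.NONE,
    "draft_for_review":         Gate.HUMAN_APPROVES,
    "reversible_with_limit":    Gate.RATE_LIMITED,
    "reversible_without_limit": Gate.HUMAN_APPROVES,
    "irreversible":             Gate.HUMAN_PLUS_REASON,
    "safety_critical":          Gate.MULTI_PARTY,
}

def gate_for(capability_class: str) -> Gate:
    # Fail closed: an unclassified capability gets the strictest gate.
    return GATES.get(capability_class, Gate.MULTI_PARTY)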

Designing the Northbound iteration-3 agent

By iteration 3, the dispatcher’s morning view is working and Maria asks: “could the system tell me what to do, not just what’s wrong?”

The candidate agent: a morning brief generator that runs at 5:45 AM, before Maria sits down, and posts to a Slack channel.

What it reads

  • All loads slipping by >15 min (via loadsSlippingNow())
  • For each, the load’s recent GPS trail, driver, route history
  • Today’s hub capacity (drivers on duty, available tractors)
  • Yesterday’s similar loads — were they recovered, how?

What it writes

A one-message brief, in plain English, that Maria can read on her phone before 6:00:

Good morning, Maria.

47 active loads today; 4 are slipping.

🔴 NB-87423 (Acme → DET): +47 min behind. Driver Petrov, tractor T-4011,
    last GPS 6 min ago south of Toledo. Has missed his last 3 hub
    check-ins on time. Suggested: reassign final stop to Watson
    (currently 18 min closer, capacity available).

🟠 NB-87412 (Acme → CLE): +22 min. Driver Hernandez (no relation 😉),
    GPS healthy, weather looks OK. Likely recovers without action.

🟠 NB-87418 (Riverpoint → IND): +18 min. Driver Watson is going to
    receive your NB-87423 reassign if you approve — capacity reused.

🟢 NB-87431 (Linden → CHI): +3 min, on track.

Hub status: East hub has 12 drivers on duty, 3 tractors free.

[Open Morning View]   [Take suggested actions]

What it does not write

Anything. Every “suggested action” is a draft. Maria clicks through to the app to execute. The agent’s authority is to propose, never to act.

The tool surface

The agent has exactly four read-only tools and zero write tools:

tools: [
  "ontology.query.loadsSlippingNow()",
  "ontology.query.driverNearby(stop_id, max_distance_km)",
  "ontology.query.loadHistory(load_id, window_days)",
  "ontology.query.hubCapacity(hub_id, window_minutes)",
]

No reassignStop. No cancelLoad. The agent reads; the human acts. This is a deliberate gate, not an oversight.

What success looks like

Two things, measurable in week 4:

  1. Maria reads the brief on her phone before opening her laptop, on at least 4 of 5 mornings.
  2. The brief’s “suggested reassignment” matches what Maria actually does at least 70% of the time. (Not 100% — disagreement is a signal of where the model is missing context.)

If both metrics hold for two weeks, iteration 4 considers loosening the gate (a one-tap “approve suggestion” that submits the action). Until then, the gate stays hard.

Evaluation: how do you know the agent is good?

LLM-powered agents need real evaluation, not vibes. Many FDE engagements stall here.

The pattern:

Build a labeled set

Pull 50-100 real historical situations from the customer’s data. For each, record what the operator actually did. This is your gold-standard set. Build it in week 1 of the agent’s iteration — before you write the prompt.

Run the agent against the set

Replay each situation as input. Capture what the agent produces. Compare to what the operator did.

Use metrics that match the agent’s job

| Agent type | Useful metrics |
|---|---|
| Summary / brief | Coverage (did it mention all the slipping loads?), accuracy (no hallucinated facts), readability |
| Drafted action | Agreement rate with the operator's choice; severity of disagreements |
| Q&A | Precision (no false answers), citations to the ontology |
| Autonomous action | Action-level accuracy; rollback rate; harm avoidance |

For each metric, pick a target before you measure. “We want 85% agreement on suggested reassignments, with no severity-2-or-higher disagreements.” Now you can answer “is the agent good enough?” with a number.
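
A sketch of the replay-and-score loop for a drafted-action agent; the gold-set record shape and run_agent are assumptions:

def evaluate(gold_set: list[dict], run_agent, target: float = 0.85) -> bool:
    # Each gold case: {"situation": ..., "operator_action": ..., "severity_if_wrong": 1}
    agreements, severe = 0, []
    for case in gold_set:
        suggestion = run_agent(case["situation"])
        if suggestion == case["operator_action"]:
            agreements += 1
        elif case.get("severity_if_wrong", 1) >= 2:
            severe.append(case)
    rate = agreements / len(gold_set)
    passed = rate >= target and not severe
    print(f"agreement {rate:.0%} (target {target:.0%}); "
          f"severity>=2 disagreements: {len(severe)}; {'PASS' if passed else 'FAIL'}")
    return passed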

Re-evaluate on every change

Prompt change? Re-run the eval. Tool change? Re-run the eval. New objects in the ontology? Re-run the eval. Versioned evaluation runs are the only way to catch regressions — and prompt-driven systems regress easily.

A simple but powerful default: every Friday, run the full eval set on the production agent and post the report.

Failure modes specific to agents

A few that bite repeatedly.

Hallucinated entities

The agent reports “NB-87499 is slipping” but NB-87499 doesn’t exist. Cause: the LLM is generating IDs from prior context instead of looking them up.

Fix: structure the agent so every entity in the output is grounded in a tool call. Reject any output containing IDs not in the tool-call results.
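
A sketch of that grounding check; the NB-prefixed ID pattern is specific to this example, so adapt it to the customer's scheme:

import re

def grounded(output: str, tool_results: list[dict]) -> bool:
    # Every load ID in the output must appear in some tool-call result.
    known_ids = {r["load_id"] for r in tool_results if "load_id" in r}
    mentioned = set(re.findall(r"NB-\d+", output))
    hallucinated = mentioned - known_ids
    if hallucinated:
        # Reject and re-prompt rather than shipping an invented entity.
        print(f"rejected: IDs not in tool results: {sorted(hallucinated)}")
        return False
    return True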

Stale context

The agent reads a load’s status, drafts a suggestion, but by the time the human reviews the draft, the status has changed. Suggestion is now wrong.

Fix: every drafted action includes the timestamp of the data it was based on. Apps surface this. Re-validate at submit time.
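
A sketch of the timestamped draft plus a submit-time check; the field names and the semantic_layer query are assumptions:

from dataclasses import dataclass
from datetime import datetime

class StaleDraftError(Exception):
    pass

@dataclass
class DraftedAction:
    action_type: str
    params: dict
    based_on_status: str
    as_of: datetime  # surfaced in the app next to the draft

def submit(draft: DraftedAction, semantic_layer) -> None:
    # Re-validate at submit time: the world may have moved since drafting.
    current = semantic_layer.query.loadStatus(draft.params["load_id"])
    if current != draft.based_on_status:
        raise StaleDraftError(f"status changed: {draft.based_on_status} -> {current}")
    semantic_layer.actions.submit(draft.action_type, **draft.params)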

Tool misuse

The agent calls loadsSlippingNow() with a parameter the function never exposed, or formats arguments wrong.

Fix: pass tools through a strict-typed shim that rejects malformed calls and reports back to the agent — better than silently letting it succeed on a string-coerced parameter.
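
A sketch of such a shim; the schema format is illustrative:

SCHEMAS = {
    "loadsSlippingNow": {},  # takes no parameters
    "driverNearby": {"stop_id": str, "max_distance_km": float},
}

def call_tool(name: str, kwargs: dict, implementations: dict) -> dict:
    # Reject malformed calls and report back to the agent instead of coercing.
    schema = SCHEMAS.get(name)
    if schema is None:
        return {"error": f"unknown tool {name!r}"}
    unexpected = set(kwargs) - set(schema)
    if unexpected:
        return {"error": f"{name} does not accept {sorted(unexpected)}"}
    for param, typ in schema.items():
        if param not in kwargs or not isinstance(kwargs[param], typ):
            return {"error": f"{name} requires {param}: {typ.__name__}"}
    return {"result": implementations[name](**kwargs)}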

Quiet drift

A vendor LLM is upgraded mid-engagement; behavior drifts; nobody notices for a week.

Fix: pin model versions. Treat the model as a dependency with a version number. Eval before adopting a new version.
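
A sketch of what pinning can look like in practice; the keys, version string, and dates are placeholders:

AGENT_CONFIG = {
    "model": "vendor-model-2025-06-01",    # pinned dated version, never "latest"
    "prompt_version": "morning_brief_v3",  # versioned alongside the model
    "last_eval_pass": "2025-06-13",        # when this model+prompt pair last passed eval
}
# Upgrading means changing this file in a PR, re-running the eval set,
# and deploying the model and prompt together.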

Single-vendor dependency

The customer’s air-gapped environment cannot reach OpenAI. Or the customer’s compliance team forbids Anthropic. Or the contract requires US-only inference.

Fix: design the agent layer to be vendor-portable. Define your tool calling and prompt structure abstractly; let the implementation swap between Claude / GPT / Llama / Mistral / a local model.
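
A sketch of that seam in Python; the method shape is an assumption, not any vendor's real SDK:

from typing import Protocol

class ChatModel(Protocol):
    """Vendor-neutral completion interface the agent code depends on."""
    def complete(self, system: str, messages: list[dict], tools: list[dict]) -> dict: ...

# Each deployment supplies one implementation: a wrapper around Anthropic,
# Azure OpenAI, Bedrock, or a local vLLM endpoint. Nothing above this seam
# changes when the vendor does.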

Agents in regulated and air-gapped environments

A reprise of the secure on-prem lesson, agent-specific edition.

For most regulated environments:

  • Public LLM APIs are off-limits for any input that touches regulated data
  • Customer-hosted LLM endpoints (Azure OpenAI in their tenant, Bedrock in their account, a self-hosted Llama / Mistral) are the path forward
  • Inference logs are themselves regulated data — your evals and any debugging artifacts inherit the classification of the input
  • On-prem / air-gapped environments require a self-hosted model; coordinate model size with the customer’s GPU budget early

The model-selection conversation often happens in week 1 of an agent-bearing engagement and determines what is and is not possible. Have it explicitly.

Cost and latency budgets

Agents are easy to make expensive and slow. Both kill adoption.

Budgets to set per agent:

  • Cost per invocation. A daily morning brief that costs $0.50 is fine; a per-message chat that costs $0.50 is not.
  • Latency. Maria’s chat panel must respond in under 3 seconds (p99) for her to use it. A morning brief can take 30s — nobody is waiting.
  • Token budget per session. Long conversations balloon context windows; pin a maximum.

These budgets often drive architectural choices: a small fast model for the chat panel, a larger model for the morning brief, batched preprocessing of context so the LLM call itself is cheap.
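
A sketch of making those budgets explicit and enforced; the numbers echo the text where it gives them, and the chat-panel cost cap and token caps are assumptions:

BUDGETS = {
    "chat_panel":    {"max_cost_usd": 0.02, "p99_latency_s": 3,  "max_tokens": 4_000},
    "morning_brief": {"max_cost_usd": 0.50, "p99_latency_s": 30, "max_tokens": 50_000},
}

def check_budget(agent: str, cost_usd: float, latency_s: float) -> None:
    # Per-call check against the p99 target is a simplification; token caps
    # are enforced by the session loop (not shown).
    b = BUDGETS[agent]
    if cost_usd > b["max_cost_usd"] or latency_s > b["p99_latency_s"]:
        # Alert rather than block: a breach is an engineering signal, not an outage.
        print(f"{agent} over budget: ${cost_usd:.2f}, {latency_s:.1f}s")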

Hand-off and ownership

When you leave, who owns the agent?

The realistic answer for most engagements:

  • The customer’s engineers own the tool surface — the actions and queries the agent uses
  • The customer’s analysts can own the evaluation set — they know the real situations better than anyone
  • The prompts and the model selection stay with you (or with a successor FDE) for the first 6-12 months
  • A long-term renewal handles ongoing prompt tuning and model upgrades

This is one area where pure customer hand-off is harder than other parts of the system. Be honest with the customer about this from the start — agents need ongoing attention, not just deployment.

A short FAQ

“Can we just put a chatbot on the data and call it done?” You can. It will demo well and operationally underperform. The valuable agents are the ones grounded in typed tools, not the ones with free run of the data.

“What’s the difference between an agent and a workflow?” Workflows are deterministic — same input, same output, every time. Agents are probabilistic. Use a workflow when you can; reach for an agent when the input is unstructured or the decision space is genuinely large.

“Should the agent be allowed to take actions autonomously?” Almost always no, in the first 6 months. Once you have measured agreement rate and rollback rate, you can loosen specific gates with the customer’s explicit approval. Default to humans in the loop.

“What about RAG / vector search?” Useful as one tool among many — for searching unstructured customer documents (contracts, runbooks, manuals). Less useful as a substitute for the ontology — RAG over your structured data is almost always worse than calling a typed query against it.

“How do we keep the prompt versioned and tested?” Treat the prompt like code. Versioned in Git, reviewed in PRs, exercised by the eval set on every change, deployed atomically with model selection.

Key terms to remember

  • Agentic workflow — an LLM-driven sequence that reads context, plans, and produces output
  • Copilot vs autonomous workflow agent — the two common shapes
  • Tools = actions — agents call the same typed actions as humans
  • Human-in-the-loop spectrum — match the gate to the reversibility and stakes of the decision
  • Eval set — labeled historical situations used to measure agent quality
  • Model pinning — fixing a vendor model version to prevent drift
  • Cost / latency / token budgets — explicit limits per invocation

What’s next

You have apps and you have agents — both reading and writing through the same semantic layer. The last lesson of Phase 4 covers the display surfaces: dashboards, reports, and operator UX. These are how the customer’s executives, analysts, and operators see the system from a thousand feet without losing the trust the operational apps built up close.