Deploying to Production at the Customer
Cutover plans, dual-running with the old system, rollback procedures, and the first week of live operations
What “production” means at the customer
Up to this point in the engagement, your system has lived in a developer or pilot environment. Maria opens a bookmark on her laptop; the URL points to a host inside the customer’s network but outside “the production stack”; the data feeding it is real, but the consequence of being wrong is small. If the morning view shows the wrong ETA, Maria can still fall back to the spreadsheet.
Production cutover is the moment that changes. After cutover:
- The platform is the system of record for the workflows it owns
- The spreadsheet, the legacy screen, the morning email — those are gone, or at minimum no longer trusted by the operators
- The customer’s change-management process governs every future change
- An incident affecting your platform is now an operational incident with real consequences (missed loads, financial impact, customer-facing fallout)
Many FDE engagements never get here. They produce a working pilot and a friendly demo, but the customer never lets the system carry real weight, and the engagement ends as a fond memory rather than a renewed contract. Production cutover is the inflection point that determines which kind of engagement you are running.
What you need to be true before you cut over
Cutover is not a milestone you schedule; it is a state you reach. The state has prerequisites — most of which we set up across earlier phases.
A pre-cutover gate checklist, distilled from real engagements:
| Category | Gate |
|---|---|
| Data | Every datasource has been running on the production schedule for at least two weeks with PASS validation |
| Model | The ontology is versioned; current version tagged; migration plan exists |
| Apps | Every screen in the cutover scope has been used in pilot for two weeks; no outstanding P1 bugs |
| Actions | Every action has been exercised by real operators on real data; audit logs verified |
| Agents | If any are in cutover scope: eval set has been run weekly for the last four weeks with passing scores |
| Security | InfoSec sign-off received; pen test done if required; secrets inventory complete and rotated |
| Operability | On-call rotation defined, runbooks written, incident playbooks tested |
| Communications | Cutover plan distributed to stakeholders 2+ weeks in advance |
| Reversibility | Rollback procedure documented and dry-run at least once |
| Sign-off | Sponsor, IT director, security lead, and at least one operator have signed off |
If any gate is “no,” you are not ready to cut over. Each gate is a hard line, not a soft preference.
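None of this needs to live only in a document. A minimal sketch of the same checklist as data, so the go/no-go answer is mechanical (the `Gate` structure and the names below are illustrative, not a standard):

```python
from dataclasses import dataclass

@dataclass
class Gate:
    category: str        # e.g. "Data", "Security", "Reversibility"
    description: str
    passed: bool
    signed_off_by: str = ""   # a named human, not "the team"

def go_no_go(gates: list[Gate]) -> bool:
    """Every gate is a hard line: a single failure blocks the cutover."""
    failures = [g for g in gates if not g.passed]
    for g in failures:
        print(f"BLOCKED [{g.category}] {g.description}")
    return not failures

gates = [
    Gate("Data", "All datasources on production schedule 2+ weeks with PASS validation",
         passed=True, signed_off_by="it-lead"),
    Gate("Reversibility", "Rollback procedure documented and dry-run at least once",
         passed=False),
]
assert go_no_go(gates) is False   # one "no" blocks the cutover
```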
The cutover plan
A cutover plan is a written, dated, signed document that lists every step from the start of the cutover window through the first week of live operations. It is the single most important artifact you produce in Phase 5.
Structure:
NORTHBOUND DISPATCH PLATFORM — CUTOVER PLAN
Cutover window: 2026-06-15 (Mon) 06:00 CST → 2026-06-22 (Mon) 06:00 CST
Cutover owner: <FDE lead> Customer owner: <ops VP>
T-14d (June 1) Cutover plan distributed to all stakeholders
T-14d Operator training sessions begin (3 sessions x 2 hrs)
T-7d Final dry-run of cutover sequence (no production data)
T-7d Rollback dry-run
T-7d On-call rotations confirmed
T-3d Comms to all dispatchers, hub managers, and IT
T-1d Pre-cutover go/no-go meeting (60 min)
T-0 06:00 Cutover window opens
T+0 06:00 → Continue dual-run from previous two weeks
T+0 07:00 Morning view is the canonical screen at the East hub
T+0 09:00 Reassignment action goes live (was pilot-only)
T+1 06:00 Doug's hub view active for shift handoff
T+1 06:30 First "is everything OK?" sync (15 min)
T+2 06:30 Daily check-in
...
T+7 06:00 End of cutover window
T+7 06:30 Cutover retro; declare cutover complete (or extend)
ROLLBACK TRIGGERS — any one of:
- Two or more P1 incidents within a single shift
- On-time delivery KPI drops > 5pts week-over-week
- Sponsor request
ROLLBACK PROCEDURE
1. ...

The plan looks dry. That is exactly its job. Surprise is the enemy of cutover. Everything the customer’s team sees during the window should appear on this plan, with a date and an owner. When something does go wrong, the deviation is visible against the plan.
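Two of the three rollback triggers above are mechanical enough to check from monitoring data rather than by gut feel. A hypothetical sketch, assuming incident records and the weekly KPI are queryable (every name below is an assumption):

```python
def rollback_triggers_fired(incidents, kpi_this_week, kpi_last_week,
                            shift_start, shift_end):
    """Return the list of fired rollback triggers; empty means keep going.

    incidents: list of {"severity": str, "opened_at": datetime} records
    kpi_*: on-time delivery percentage for this week and last week
    """
    fired = []
    # Trigger: two or more P1 incidents within a single shift
    p1 = [i for i in incidents
          if i["severity"] == "P1" and shift_start <= i["opened_at"] <= shift_end]
    if len(p1) >= 2:
        fired.append(f"{len(p1)} P1 incidents in one shift")
    # Trigger: on-time delivery KPI drops more than 5 points week-over-week
    drop = kpi_last_week - kpi_this_week
    if drop > 5.0:
        fired.append(f"on-time delivery down {drop:.1f} pts week-over-week")
    # The third trigger, a sponsor request, is human-invoked and never automated.
    return fired
```

Whatever fires, the decision to roll back remains a named human’s call, as the next section spells out.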
Dual-running, properly
The most useful cutover technique, and the most consistently misused, is dual-running: operating the old system and the new system in parallel for a window. Operators keep working the legacy way while also using your new platform, with both reading the same data and (ideally) producing the same decisions.
What dual-running gets you:
- Comparison signal. When the two systems disagree, you find out which one is wrong, in production, with a safety net.
- Trust accumulation. Operators see the new system produce the same answers the old one does, over time, and start to prefer the new one.
- Reversibility. If the new system fails badly, the old one is warm — not cold-starting in panic.
Where dual-running goes wrong:
- Double work for operators. If they have to enter every action in both systems, you’ve doubled their workload. They’ll stop using one — usually the new one.
- Drift. If the two systems read from different data and diverge, you create a third problem: nobody knows which is right.
- Forever-dual. Without a hard end-date, dual-running becomes the status quo and the cutover never happens.
Doing it right:
- Read-dual, write-single. Both systems display the data; only one accepts writes. Operators read from the new system; writes still go through the old until you flip them, one workflow at a time.
- Comparison reports. Every morning, an automated diff highlights any disagreement between the two systems on key metrics (a minimal sketch follows this list). The cutover owner reviews. Disagreements are tickets, not surprises.
- Hard end-date. The cutover plan specifies when dual-running ends. The customer’s team plans around that date. Slipping it requires sponsor approval.
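The comparison report does not need to be a platform feature; a small script run before the morning stand-up is enough. A minimal sketch, assuming both systems can export their key metrics as name-to-value maps (all names are illustrative):

```python
def dual_run_diff(legacy: dict[str, float], new: dict[str, float],
                  tolerance: float = 0.01) -> list[str]:
    """Compare key metrics from both systems; every disagreement becomes a ticket."""
    problems = []
    for metric in sorted(set(legacy) | set(new)):
        if metric not in legacy or metric not in new:
            problems.append(f"{metric}: present in only one system")
        elif abs(legacy[metric] - new[metric]) > tolerance:
            problems.append(f"{metric}: legacy={legacy[metric]} new={new[metric]}")
    return problems

# Run every morning before the 06:30 stand-up; file a ticket per line.
for line in dual_run_diff({"loads_dispatched": 142, "on_time_pct": 87.1},
                          {"loads_dispatched": 142, "on_time_pct": 86.4}):
    print("DISAGREEMENT:", line)
```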
For Northbound: weeks 5-6 are dual-running. Maria reads the morning view as her primary display, but the spreadsheet is still updated by the night shift “just in case.” On Monday of week 7 the spreadsheet is retired — the cutover plan made the retirement a planned event, not a surprise.
Rollback: planning to need it
Every cutover plan includes a rollback procedure. Most engagements never use theirs. All engagements should be ready to.
A rollback procedure that actually works has three properties:
Property 1 — It is dry-run before cutover
Schedule a half-day before the cutover window where you execute the rollback procedure end-to-end on the staging environment. If you find out at 7 PM Friday during a real incident that step 4 doesn’t actually work, you have already failed.
Property 2 — It is invoked by a named human, on documented triggers
The cutover plan lists explicit rollback triggers (the example above: two P1s in a shift; KPI regression; sponsor request). When any trigger fires, a named person (the cutover owner) makes the call. Not “the team” — one person, named, with the authority to invoke.
Property 3 — It is a documented sequence, not a checklist of guesses
Each step has:
- What: what command, what config change, what manual action
- Who: which named person executes it
- Verify: how do you know it succeeded
- Time: how long it takes
- Reversibility: can this step itself be undone
A skeletal Northbound rollback:
ROLLBACK PROCEDURE — DISPATCH PLATFORM
T-0 Invocation
- Cutover owner declares rollback to the war-room channel
- Time-stamped notice posted to ops, IT, sponsor
T+0 Restore the spreadsheet pipeline
- Re-enable the legacy SAP → SharePoint export
- Verify file appears (15 min)
- Owner: <IT lead>
T+15 Redirect dispatchers
- Email + Slack notice to dispatchers: use spreadsheet until further notice
- Bookmark removed from shared deck
- Owner: <ops manager>
T+30 Disable write actions
- Toggle feature flag fde.dispatch.write to false
- Verify reassignment form is no longer submittable
- Verify any in-flight actions are completed or rolled back
- Owner: <FDE lead>
T+45 Comms to executives
- VP of Operations and CEO notified with brief: what triggered, what is now live, ETA
- Owner: <FDE lead>
T+60 Incident review scheduled
- 24-hr incident post-mortem scheduled
- Resume-plan owner appointed

Every step has an owner, a verification, and a time. When you read it under stress at 2 AM, it tells you exactly what to do next.
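One way to keep the procedure honest is to encode each step as data and lint the plan before the dry-run, so a missing owner or verification fails loudly in staging instead of at 2 AM. A hypothetical sketch:

```python
from dataclasses import dataclass

@dataclass
class RollbackStep:
    offset_min: int      # minutes after invocation (T+0, T+15, ...)
    what: str            # command, config change, or manual action
    owner: str           # the named person who executes it
    verify: str          # how you know it succeeded
    duration_min: int    # how long it takes
    reversible: bool     # can this step itself be undone

steps = [
    RollbackStep(0, "Re-enable legacy SAP -> SharePoint export", "it-lead",
                 "export file appears", 15, True),
    RollbackStep(30, "Set feature flag fde.dispatch.write = false", "fde-lead",
                 "reassignment form no longer submittable", 5, True),
]

def lint_plan(steps: list[RollbackStep]) -> None:
    """Fail the dry-run, not the 2 AM incident, on an incomplete step."""
    for s in steps:
        assert s.owner and s.verify, f"step at T+{s.offset_min} lacks owner or verification"
    offsets = [s.offset_min for s in steps]
    assert offsets == sorted(offsets), "steps are out of chronological order"

lint_plan(steps)
```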
The cutover window: hours and days, not minutes
In consumer software, “cutover” can be a single deploy command run at 03:00. In FDE work it is rarely a single moment — it is a window, usually a week, sometimes longer.
Inside that window:
- The first 24 hours are the highest-risk period. New data shapes hit your pipeline for the first time. Users hit screens they didn’t pilot. Edge cases that took four weeks to surface in pilot hit on day one of production, simply because the system now sees 10x the usage.
- The first three days are when most surprise issues surface. Plan to be at the customer site, in person, the full week.
- The first week is when the customer’s team forms its long-term opinion of the system. “Did the new platform hold up the first week?” decides whether they trust it for the next year.
During the cutover window, daily ritual matters more than feature velocity:
- Daily 06:30 stand-up with the customer’s ops team, IT, and the FDE team. 15 minutes. Open with: “Anything broken? Anything weird? Anything you need from us?”
- End-of-day status post in the customer’s preferred channel. What happened, what’s resolved, what’s still outstanding.
- Operator drop-in time — be physically present where the operators are working, at least an hour at peak shift each day.
This is the FDE at peak embedding. You are not at your desk. You are visible. You are calm. You are taking notes.
What goes wrong in the first week (and how to handle it)
A short, non-exhaustive list of things that have gone wrong in real first-week cutovers. Expect at least three of these:
Stale data
The SAP export is 24 hours stale by the second morning. The dispatcher opens the morning view, sees outdated ETAs, and panics. Handle: surface freshness prominently (already a Phase 4 rule), instrument data-freshness alerts with an explicit staleness threshold, and have an IT contact ready for the SAP team.
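The freshness check itself is small enough to write on day one. A minimal sketch, assuming the pipeline records the timestamp of the last successful export (the threshold and names are illustrative):

```python
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(hours=2)   # illustrative; set per datasource

def check_freshness(last_export_at: datetime, alert) -> None:
    """Fire before a dispatcher discovers stale ETAs at 06:00."""
    age = datetime.now(timezone.utc) - last_export_at
    if age > STALENESS_THRESHOLD:
        alert(f"SAP export is {age} old (threshold {STALENESS_THRESHOLD})")

# Run on a schedule, with alert wired to whatever pages the FDE team.
```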
An unmapped enum value
The customer’s IT team adds a new Load.status value (COMPLETED) without telling you. Your domain validator (Phase 3) catches it and fails the pipeline. Handle: the validator is doing its job, but in production “fail loud” needs to come with a fallback path you can execute in 30 minutes. Default to continuing to ingest the row with status = UNKNOWN, log loudly, alert the FDE team, and surface the unknown status in the UI.
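A minimal sketch of that fallback in the ingest path (the set of known statuses and the function names are assumptions for illustration):

```python
import logging

log = logging.getLogger("ingest")
KNOWN_STATUSES = {"PLANNED", "IN_TRANSIT", "DELIVERED", "CANCELLED"}  # illustrative

def normalize_status(raw: str, alert) -> str:
    """Keep the pipeline running, but make the unknown value impossible to miss."""
    if raw in KNOWN_STATUSES:
        return raw
    log.error("Unmapped Load.status %r; ingesting as UNKNOWN", raw)
    alert(f"New Load.status value seen: {raw}")   # page the FDE team
    return "UNKNOWN"                              # surfaced as-is in the UI
```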
A driver does something the model didn’t anticipate
A driver checks in for a load at the wrong hub. The semantic layer rejects the action. Maria’s reassignment workflow won’t let her fix it because the validator says the driver’s home hub is wrong. Handle: an emergency override action — typed, audited, requires a reason — that lets a senior dispatcher bypass a validation with intent. Designed in Phase 3, deployed for cutover.
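A minimal sketch of what “typed, audited, requires a reason” can look like (the record shape and function are illustrative, not a real platform API):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class OverrideRecord:
    action: str
    operator: str        # must hold the senior-dispatcher role
    reason: str          # required free text, visible in the audit trail
    bypassed_check: str
    at: datetime

def check_in_with_override(load_id: str, hub_id: str, operator: str,
                           reason: str, audit_log: list) -> None:
    """Bypass the home-hub validation deliberately, with intent on record."""
    if not reason.strip():
        raise ValueError("an override requires a written reason")
    audit_log.append(OverrideRecord(
        action=f"checkIn({load_id}, {hub_id})",
        operator=operator,
        reason=reason,
        bypassed_check="driver home hub validation",
        at=datetime.now(timezone.utc),
    ))
    # ... perform the check-in with that one validator suppressed ...
```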
A scheduled job times out
The hourly HR sync starts taking 90 minutes during business hours because the customer’s database is under unexpected load. Handle: time budgets on every job (from the API integration lesson) kill the job before it cascades; alert fires; you reschedule for off-hours.
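A minimal sketch of such a time budget, using a worker process so a hung database call can actually be killed (a Python thread cannot be forcibly stopped); all names are illustrative:

```python
import multiprocessing

def run_with_budget(job, budget_sec: int, alert) -> None:
    """Kill the job before it cascades into business hours, then alert."""
    p = multiprocessing.Process(target=job)
    p.start()
    p.join(timeout=budget_sec)
    if p.is_alive():
        p.terminate()
        p.join()
        alert(f"{job.__name__} exceeded its {budget_sec}s budget and was killed")

def hr_sync():
    ...   # placeholder for the hourly HR sync

# Inside an `if __name__ == "__main__":` guard on spawn-based platforms:
# run_with_budget(hr_sync, budget_sec=15 * 60, alert=print)
```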
An operator makes a typo in an action
Maria assigns a stop to driver 4012 instead of 4021. The action succeeds (both are valid drivers); driver 4012 wonders why he has a new pickup. Handle: the audit trail makes this trivially reversible. A reassignStop action runs in reverse with a note. Total elapsed time: 90 seconds.
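A minimal sketch of why the audit trail makes the fix trivial: the reversal is the same action run with the arguments swapped, plus a note (the action and log shapes are assumptions):

```python
def reassign_stop(stop_id: str, from_driver: str, to_driver: str,
                  note: str, audit_log: list) -> None:
    audit_log.append({"action": "reassignStop", "stop": stop_id,
                      "from": from_driver, "to": to_driver, "note": note})
    # ... apply the reassignment in the platform ...

def undo_last_reassignment(audit_log: list) -> None:
    """Run the most recent reassignStop in reverse, with a note."""
    last = audit_log[-1]
    assert last["action"] == "reassignStop"
    reassign_stop(last["stop"], from_driver=last["to"], to_driver=last["from"],
                  note=f"reversal of mistaken assignment to driver {last['to']}",
                  audit_log=audit_log)

log: list = []
reassign_stop("stop-88", from_driver="4021", to_driver="4012", note="typo", audit_log=log)
undo_last_reassignment(log)   # stop-88 is back with driver 4021 in ~90 seconds
```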
A dashboard tile loads slowly under real traffic
The executive dashboard, which loaded in 800ms in pilot, takes 4 seconds with the full operator pool hitting it Monday morning. Handle: identified during the Monday check-in; the function backing the slow tile gets a 5-minute materialization (Phase 3); resolved by Tuesday morning’s open.
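The fix pattern is independent of platform: stop recomputing the tile’s backing query on every request and serve a result refreshed on a schedule instead. A minimal sketch of a 5-minute materialization as a TTL cache (illustrative, not the platform’s actual mechanism):

```python
import time

class Materialized:
    """Recompute an expensive function at most once per ttl_sec; every
    dashboard request in between is served from the cached result."""
    def __init__(self, fn, ttl_sec: float = 300.0):   # 300s = the 5-minute window
        self.fn, self.ttl_sec = fn, ttl_sec
        self._value, self._computed_at = None, float("-inf")

    def get(self):
        if time.monotonic() - self._computed_at > self.ttl_sec:
            self._value = self.fn()               # the slow 4-second query
            self._computed_at = time.monotonic()
        return self._value

# tile = Materialized(compute_exec_tile)   # hypothetical backing function
# tile.get()                               # fast for everyone after the first hit
```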
The customer’s IT team rolls out a security patch overnight
A patch to the customer’s identity provider breaks SSO. Operators can’t log in Monday at 06:00. Handle: this is not your platform’s fault, but it is your problem. Have the IT lead’s mobile number. Have a documented break-glass auth for at least one cutover-owner account.
The pattern across all of these: the platform’s defensive design (Phases 2-4) absorbs the surprises, and the cutover discipline (this lesson) makes resolution fast and visible.
Communication during the cutover
Cutover surfaces the emotional dimension of the work. Operators are nervous about a new system that is now real. Executives are watching the KPI. The IT director is watching their incident dashboard. Everyone is on edge.
A communication discipline that absorbs that pressure:
- Pre-cutover memo (T-14d). One-page summary of what is changing, when, what to do if something breaks. Sent to every operator and every relevant exec. Plain English.
- Daily morning email (during window). Short status — what is live, what was resolved overnight, what to expect today. Written for operators, not engineers.
- Real-time channel (during window). Slack channel, Teams channel, whatever the customer uses. Watched continuously. Every operator question gets an answer within 15 minutes.
- End-of-day post (during window). Slightly more detailed than the morning email. Acknowledges anything that went wrong; explicitly names what is resolved.
- Weekly cutover-status summary to sponsors. Three numbers, two sentences each. The sponsor sees momentum without having to ask.
- Cutover-complete declaration. Explicit, on the planned date (or with a documented one-week extension). “Cutover is complete. The new platform is the system of record.”
Operators in particular need to see you communicating, calmly, regularly. The single biggest signal that the cutover is going well is not technical — it is the absence of panic in the customer’s communication. A calm cutover owner produces a calm customer team.
Declaring cutover complete
Cutover ends with an explicit declaration. Not “things are mostly OK now” — a documented, communicated moment.
The criteria for declaring complete:
- All planned workflows are live in the new system
- No P1 or P2 incidents in the previous 72 hours
- All dual-run comparison reports show parity (or documented expected differences)
- Operators are using the new system as their primary; the old fallback is documented as deprecated
- Three measurable outcomes (the ones from the iteration MVP plans) are tracking on target
When all are true, the cutover owner writes the cutover-complete memo: what was deployed, what is now retired, what the ongoing operational model is, who owns it, and what the next phase looks like.
For Northbound, the memo might end:
Cutover declared complete 2026-06-22 06:00 CST.
Now live:
- Morning view (3 dispatchers, 2 shifts)
- Reassignment workflow (Maria + Jorge)
- Doug's hub view
- Morning brief agent (5:45 AM daily)
- VP weekly report (Sundays 21:00)
- Analyst workbench (Raj)
Retired:
- Manual SAP-spreadsheet morning consolidation
- The 06:15 status email Doug used to send
Owners:
- Platform operations: <customer ops engineer>
- Ontology + apps: <customer senior engineer>, FDE retainer for 6 months
- Agent + prompts: FDE retainer
Outcomes (week of cutover):
- On-time delivery: 87.1% (▲ from 78% baseline; target 92%)
- Avg morning batch time for Maria: 6 min (was 45)
- Doug staffs accurately (per his self-report): 5/5 mornings
Next phase: external customer tracking; iteration 7-10 scope kickoff July 15.

The memo is signed by the cutover owner, the customer’s ops VP, and the IT director. It becomes the artifact the engagement is renewed against.
Common failure modes
A short list of cutover mistakes that have ended otherwise-good engagements.
Cutting over without sign-offs
The sponsor wants it live by month-end; the IT director hasn’t signed off; you ship anyway. A week later, an IT incident review finds your platform is the cause of a network event and demands an emergency rollback you weren’t planning for.
Cutting over without a rollback dry-run
You’re confident the rollback will work. You haven’t actually executed it end-to-end. When you need it Saturday at 11 PM, step 4 fails because nobody has the credential for the legacy job re-enable.
Cutting over on a weekend without on-site presence
The customer’s team has questions Monday morning. You’re flying back. You handle it over Slack. They feel abandoned. The next renewal has a “they were responsive” question that gets a “mostly” answer.
Forever-dual
Three months after cutover, both systems are still running. Nobody trusts the new one fully because the old one is still there. Nobody trusts the old one fully because the new one is supposedly authoritative. You have two systems, twice the maintenance, no clarity. Force the end-date.
Cutting over agents along with apps
You ship the agent in the same cutover window as the operational app. The agent has issues — predictable, recoverable issues — but they fold into the cutover narrative as “the platform is unreliable.” Cut over apps first; ship agents as a separate cutover, days or weeks later.
Skipping the pre-cutover go/no-go meeting
It feels redundant. You’ve talked to everyone individually. But the meeting forces every stakeholder to put their sign-off on paper, in front of each other. Skipping it is how surprise objections surface mid-cutover.
What the FDE actually does during the cutover week
A practical breakdown:
| Time | Activity |
|---|---|
| 05:30 | Arrive at the customer site (or open your laptop in the hotel next door) |
| 06:00-07:00 | Be visible at the dispatcher’s desk; observe the first hour |
| 07:00-07:15 | Joint stand-up |
| 07:15-09:00 | Deep on whatever surfaced |
| 09:00-12:00 | Build / triage anything that needs immediate attention |
| 12:00-13:00 | Lunch in the customer’s cafeteria, not at your laptop |
| 13:00-15:00 | Office hours with operators; talk through anything weird |
| 15:00-17:00 | Build out next-week iteration; review monitoring |
| 17:00-17:30 | End-of-day status post; sync with customer’s IT |
| Evening | On call. Pager nearby. |
Five days of this. By Friday evening, the rhythms have settled, the surprises have surfaced, and the customer’s team has formed its opinion. Either you have earned the cutover-complete declaration or you have a clear punch list to extend the window.
Closing thoughts
Production cutover is where the lessons of every prior phase compound. The clean ontology means the rollback is mechanical. The validated plumbing means the surprises are loud, not silent. The operator-shaped apps mean adoption happens organically. The audit trail means every weird event is investigable. The agents-with-humans-in-the-loop means AI failures are recoverable.
If any one of those prior phases was skipped or rushed, cutover surfaces it — visibly and at the worst possible time. If the foundations are solid, cutover is hard work but not heroics.
Key terms to remember
- Cutover — the moment the platform becomes the system of record
- Cutover plan — the dated, signed document specifying every step of the window
- Dual-run — reading both systems in parallel; writing through one
- Rollback procedure — dry-run, owner-named, step-by-step
- Cutover window — typically a full week; the first 24 hours are highest-risk
- Cutover-complete memo — the artifact declaring cutover done
What’s next
The system is live. But “operating” is not the same as “adopted.” Operators can use a system grudgingly, or quietly route around it, or use it for the first three weeks and then drift back to old habits. The next lesson covers change management and adoption — the human side of getting the platform actually used past the cutover honeymoon.