Course Content
Data Plumbing in the Wild
Sourcing data from legacy systems, broken exports, and reluctant DBAs — getting from raw bytes to a usable pipeline
Monday of week 3
You walk into the customer’s office on Monday of week 3 with an MVP plan in your bag. You sit down. You open your editor. And you immediately remember the most consistently underestimated fact about FDE work:
You don’t have the data yet.
You have promises of the data. You have exports of the data, maybe a sample, possibly from a stale environment. You have queries the customer’s analyst wrote a year ago that returned data once. What you don’t have is a working pipeline from a real source system into your platform that you can build against.
Phase 3 starts here, because the rest of the engagement depends on it. No data, no model. No model, no app. No app, no outcome. Data plumbing is not glamorous — but it is where the engagement either gets traction or gets stuck.
The plumbing mindset
Two principles to internalize:
Principle 1 — The data is worse than you think
Whatever you saw in interviews, whatever the analyst showed you in their notebook, whatever the IT director assured you about — the data, when you pull it for real, will be worse. Missing values, inconsistent encodings, duplicate keys, surprise NULLs, timestamps in five timezones, and a charming variety of “we used to do it differently before 2019.”
Plan for it. Build cheap, loud validation at every ingest point. The bug you find on day one of plumbing is free; the same bug you find on day four of building is expensive; the same bug you find on day one of demo is catastrophic.
Principle 2 — Plumbing is not a sub-phase, it is a craft
In a traditional project plan, “data engineering” is one box you cross off before “application development.” In an FDE engagement, plumbing is a craft you practice across the entire engagement, in tight loops, with the same iteration discipline you use for app development. The customer’s data will continue to surprise you in week 6. Build for that.
The seven ingestion patterns
Every customer dataset you ever ingest will fit into one of seven patterns. Learn to recognize them on sight.
1. Direct DB extract (read replica or batch job)
The customer has a database — Oracle, Postgres, SQL Server — and you can run queries against it.
Typical setup: read replica of the production OLTP database, accessible from a jump host inside the customer’s network.
Pros: structured data, queryable, joinable. The closest thing to “easy mode.”
Cons: queries against production are unwelcome; replicas may lag; schemas change without warning.
FDE moves:
- Always pull from a replica, never primary
- Materialize a daily/hourly snapshot rather than querying live
- Pin your queries to a stable schema view if the DBAs will create one
- Wrap every ingest in a row-count and freshness check (more on this below)
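To make the "materialize a snapshot" move concrete, here is a minimal sketch of a date-pinned extract from a read replica, assuming Postgres and SQLAlchemy; the connection string, table, and column names are illustrative, not the customer's real schema.

```python
# Sketch: pull one day of loads from the read replica and materialize it locally.
# Everything downstream reads the snapshot file, so the replica is hit once per run.
import pathlib

import pandas as pd
import sqlalchemy as sa

REPLICA_URL = "postgresql://readonly@replica.customer.internal:5432/tms"  # replica, never primary

def snapshot_loads(run_date: str) -> pd.DataFrame:
    engine = sa.create_engine(REPLICA_URL)
    query = sa.text(
        "SELECT load_id, customer_id, origin_hub, dest_hub, "
        "planned_pickup, planned_delivery, status "
        "FROM loads WHERE updated_at::date = :run_date"
    )
    df = pd.read_sql(query, engine, params={"run_date": run_date})
    pathlib.Path("snapshots").mkdir(exist_ok=True)
    df.to_parquet(f"snapshots/loads_{run_date}.parquet", index=False)
    return df
```

Parameterizing by date also gives you the replay property described later in this lesson for free.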
2. SFTP file drop
The customer’s other team writes a CSV / TSV / fixed-width / XML file to a server, on a schedule. You pull it.
Typical setup: a vendor system writes to /exports/orders_YYYYMMDD.csv every night at 02:00.
Pros: dead simple to consume. Files are durable artifacts you can replay.
Cons: schema is in someone’s head; encoding varies; “the file didn’t arrive” is a frequent ticket.
FDE moves:
- Always pull the file by date, not “the latest”
- Verify file checksums or expected size before processing
- Keep the raw files indefinitely — they are your only backup if downstream pipelines break
- Build a “no file today” alert; do not silently fail when the customer’s system stops dropping files
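A sketch of the first three moves together, using paramiko: pull by date, sanity-check the size before processing, and keep the raw file. The host, key path, and file paths are placeholders.

```python
# Sketch: date-addressed SFTP pull with a size sanity check before processing.
import logging
import pathlib

import paramiko

log = logging.getLogger("sftp_ingest")

def pull_orders_file(run_date: str, min_bytes: int = 1024) -> str:
    remote = f"/exports/orders_{run_date}.csv"   # pull by date, never "the latest"
    local = f"raw/orders_{run_date}.csv"         # raw files are kept indefinitely
    pathlib.Path("raw").mkdir(exist_ok=True)
    transport = paramiko.Transport(("sftp.customer.internal", 22))
    transport.connect(username="fde_ingest",
                      pkey=paramiko.RSAKey.from_private_key_file("ingest_key"))  # placeholder key
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        size = sftp.stat(remote).st_size
        if size < min_bytes:
            raise ValueError(f"{remote} is suspiciously small ({size} bytes)")
        sftp.get(remote, local)
    except IOError:
        log.error("no SFTP file for %s; wire this to your 'no file today' alert", run_date)
        raise
    finally:
        transport.close()
    return local
```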
3. REST API poll
The customer (or a vendor) exposes a REST endpoint. You poll it.
Typical setup: GPS vendor’s /vehicles/positions endpoint, paged, rate-limited, with a 5-minute SLA on updates.
Pros: modern, typed, well-documented (sometimes).
Cons: rate limits, pagination quirks, auth that expires, “the endpoint is being deprecated in Q3.”
FDE moves:
- Page conservatively; assume the vendor’s pagination is buggy
- Cache responses with the response’s own freshness header
- Build retry-with-backoff into every call from day one
- Capture and store the raw JSON responses for replay
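A sketch of those moves in code, assuming a paged JSON endpoint with bearer-token auth. The URL, query parameters, and response shape are placeholders for whatever the vendor actually exposes.

```python
# Sketch: conservative paging, retry with exponential backoff, raw responses kept for replay.
import json
import pathlib
import time

import requests

BASE = "https://gps.vendor.example/vehicles/positions"

def fetch_page(page: int, token: str, retries: int = 5) -> dict:
    for attempt in range(retries):
        resp = requests.get(
            BASE,
            params={"page": page, "page_size": 200},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(2 ** attempt)   # back off on rate limits and server flaps
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"giving up on page {page} after {retries} attempts")

def poll_positions(run_ts: str, token: str) -> None:
    raw_dir = pathlib.Path(f"raw/gps/{run_ts}")
    raw_dir.mkdir(parents=True, exist_ok=True)
    page = 1
    while True:
        body = fetch_page(page, token)
        (raw_dir / f"page_{page}.json").write_text(json.dumps(body))  # raw capture for replay
        if not body.get("vehicles"):   # assumed shape: an empty page means we are done
            break
        page += 1
```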
4. Webhook / push
The customer (or a vendor) calls you when an event happens.
Typical setup: the load-tendering vendor POSTs to your endpoint on every load-status change.
Pros: real-time. No polling waste.
Cons: you need a reachable endpoint inside the customer’s network (often hard); duplicate events; out-of-order events; missed events when your endpoint was down.
FDE moves:
- Always design for idempotency — every webhook may arrive twice
- Sequence by the event’s own timestamp, not arrival order
- Capture the raw payload before processing
- Build a “replay last 24 hours” command you can run if your endpoint was down
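A minimal sketch of a receiver that follows those rules, written with Flask. The route, payload fields, and in-memory dedup set are assumptions; in practice the seen-event set lives in a durable table.

```python
# Sketch: idempotent webhook receiver that captures the raw payload before processing.
import json
import pathlib

from flask import Flask, request

app = Flask(__name__)
SEEN: set = set()   # in production, a table keyed by the vendor's event_id

@app.post("/webhooks/load-status")
def load_status():
    payload = request.get_json(force=True)
    event_id = payload["event_id"]   # assumed: the vendor sends a stable event identifier
    raw_dir = pathlib.Path("raw/webhooks")
    raw_dir.mkdir(parents=True, exist_ok=True)
    (raw_dir / f"{event_id}.json").write_text(json.dumps(payload))  # raw first, always
    if event_id in SEEN:             # duplicates are expected; make them no-ops
        return {"status": "duplicate"}, 200
    SEEN.add(event_id)
    upsert_load_status(payload["load_id"], payload["status"], payload["event_time"])
    return {"status": "ok"}, 200

def upsert_load_status(load_id: str, status: str, event_time: str) -> None:
    ...  # write keyed by load_id, ordered by the event's own timestamp, not arrival order
```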
5. Stream (Kafka, Pulsar, Kinesis, JMS)
The customer has a message bus.
Typical setup: a Kafka topic with vehicle telemetry, 10K events/second, retained 7 days.
Pros: real-time, replayable, native to modern enterprise stacks.
Cons: schema evolution chaos; consumer group politics; backfills are an ordeal.
FDE moves:
- Use a schema registry if one exists; if not, write down the schema you observe and version it yourself
- Always start with the smallest possible consumer group that does not interfere with existing teams
- Test your consumer with a deliberate offset reset before going to production — you will need to do it
- Sample-decode every message type into your own logs at startup; you will be surprised what you find
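A sketch of a cautious consumer, using kafka-python. The topic, broker, and group names are illustrative, and write_to_ontology stands in for whatever idempotent writer you build.

```python
# Sketch: small dedicated consumer group, manual commits, sample-decode logging at startup.
import json
import logging

from kafka import KafkaConsumer

log = logging.getLogger("telemetry_ingest")

def write_to_ontology(event: dict) -> None:
    ...  # placeholder for your idempotent writer

consumer = KafkaConsumer(
    "vehicle.telemetry",
    bootstrap_servers=["kafka.customer.internal:9092"],
    group_id="fde-telemetry-ingest",   # our own small group, separate from existing teams
    auto_offset_reset="latest",
    enable_auto_commit=False,          # commit only after a successful write
)

seen_types: set = set()
for message in consumer:
    event = json.loads(message.value)
    event_type = event.get("type")
    if event_type not in seen_types:   # log the first example of every message type we see
        seen_types.add(event_type)
        log.info("first %r event observed: %s", event_type, event)
    write_to_ontology(event)
    consumer.commit()
```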
6. Manual file / email attachment
Someone — a person — emails you a file every morning, or drops it in SharePoint.
Typical setup: an analyst exports a report from a legacy mainframe each morning and emails it to ops. Now they email it to you too.
Pros: zero infrastructure required to start.
Cons: it is a person. People take vacations, get sick, change formats, and leave the company.
FDE moves:
- Treat this as explicitly temporary. Day one of using it, write down the plan to replace it.
- Build automatic format validation; loudly alert when the file changes shape
- Get a second person CC’d on the email so the pipeline does not die when one person is out
- Move it to SFTP or an API as soon as it is politically possible
7. Screen scrape / DOM extraction
The customer’s only access to a legacy system is through a web UI. You automate clicks against it.
Typical setup: a 2003-era J2EE app that has no API. Operators log in and copy values into Excel.
Pros: sometimes the only option.
Cons: every UI change breaks you; the customer’s IT is uncomfortable; auth and session handling are fragile.
FDE moves:
- Use this only if you have negotiated explicit permission; never on a system you weren’t told about
- Pin to specific DOM selectors, log every change in the page structure, fail loudly
- Make the schedule conservative — multiple full crawls per day will surface as a security event
- Document this honestly to the customer as technical debt; do not pretend it is a permanent solution
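For completeness, a sketch of what pinned selectors and fail-loud behavior look like with Selenium. The URL and selectors are placeholders, and none of this substitutes for the permission conversation above.

```python
# Sketch: pinned selectors, loud failure the moment the page structure changes.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

EXPECTED_SELECTORS = ["#loadTable", "#loadTable .status-cell"]   # pinned and documented

def scrape_load_statuses(url: str) -> list[str]:
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        for sel in EXPECTED_SELECTORS:
            try:
                driver.find_element(By.CSS_SELECTOR, sel)
            except NoSuchElementException:
                # The page changed shape: fail loudly rather than returning wrong data.
                raise RuntimeError(f"legacy UI changed: selector {sel!r} not found")
        cells = driver.find_elements(By.CSS_SELECTOR, "#loadTable .status-cell")
        return [cell.text for cell in cells]
    finally:
        driver.quit()
```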
A worked Northbound plumbing plan
Walking through what plumbing actually looks like for the iteration-1 MVP:
| Source | Pattern | What we need | Cadence | Owner |
|---|---|---|---|---|
| SAP load board | Nightly SFTP drop | Active loads, planned ETAs | Daily 02:00 | SAP team |
| GPS portal | REST API poll | Vehicle positions, last check-in | Every 5 min | GPS vendor |
| Hubs reference | Manual CSV (one-time) | Hub IDs, codes, locations | Static, refresh quarterly | Ops admin |
| Drivers + tractors | DB extract | Driver-to-tractor assignments | Hourly | HR team |
Already, in week 3, you have four data sources to wrangle. Each one needs a working pull, validation, storage, and observability before Maria sees a single screen.
The right week-3 sequence:
- Day 1 (Mon): Get one record from each data source into your environment, even by hand. Eyeball it. Note surprises.
- Day 2 (Tue): Build the SAP nightly ingest. It is the largest and most stable; everything else depends on it.
- Day 3 (Wed): Build the GPS poll. Cache responses. Handle 429s.
- Day 4 (Thu): Load the static hubs CSV. Hourly driver/tractor refresh. Cross-join everything to verify the ontology is wired.
- Day 5 (Fri): Run the full pipeline end-to-end. Wire the first screen to it. Demo.
Notice what is missing from this list: hand-rolled streaming, complex CDC, schema migrations. Those are iteration-2+ concerns. Iteration 1 wants the simplest pipeline that delivers correct enough data to Maria’s morning view.
Validation: cheap, fast, loud
Every dataset ingest should run through a small bank of validations on every load. The pattern is the same for every source:
Row-count check
“Today’s SAP export had 387 loads. Yesterday had 412. Two-day rolling average is 405. Difference is within 10%. Pass.”
If the row count drops by 50%, something happened — SAP partial export, regional outage, schedule change. You want to know within minutes, not on Friday at demo time.
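The check barely deserves a function, but writing it down keeps the threshold explicit and testable. A sketch using the numbers from the example above; the two-day window and 10% tolerance are starting points to tune per source.

```python
# Sketch: today's row count against a short rolling average of previous days.
def check_row_count(today: int, recent_counts: list[int], tolerance: float = 0.10) -> bool:
    baseline = sum(recent_counts) / len(recent_counts)
    return abs(today - baseline) / baseline <= tolerance

# Mirrors the worked example: 387 against a 405 baseline is roughly a 4% drift, so it passes.
assert check_row_count(387, [412, 398])
```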
Freshness check
“Latest GPS ping observed at 07:12 UTC. Now is 07:18. Lag is 6 minutes. Threshold is 15. Pass.”
Stale data is the single most common cause of “the system is wrong” complaints from operators. Surface staleness prominently in the app (we’ll see this in Phase 4) and alert on it before they do.
Schema check
“Today’s columns: load_id, customer_id, origin_hub, dest_hub, planned_pickup, planned_delivery, status. Expected: load_id, customer_id, origin_hub, dest_hub, planned_pickup, planned_delivery, status. Match. Pass.”
When a customer team adds a column (“oh yeah, we added priority last week”), your pipeline either handles it or breaks. You want it to handle it (extra columns OK, missing required columns fail loud) and tell you about it.
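A sketch of that policy: extra columns are tolerated and reported, missing required columns are a hard failure. The required set mirrors the column list above.

```python
# Sketch: schema check that is permissive about additions and strict about removals.
REQUIRED = {
    "load_id", "customer_id", "origin_hub", "dest_hub",
    "planned_pickup", "planned_delivery", "status",
}

def check_schema(observed: set[str]) -> tuple[bool, set[str], set[str]]:
    missing = REQUIRED - observed
    extra = observed - REQUIRED   # e.g. the surprise 'priority' column: log it, don't fail
    return (not missing, missing, extra)
```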
Domain check
“All status values in {OPEN, TENDERED, ACCEPTED, IN_TRANSIT, DELIVERED, CANCELLED}. Pass.”
Or: “1 row with status=‘COMPLETED’ — not in expected set. Fail.”
The customer team will, without telling you, add COMPLETED because they are migrating off the old DELIVERED value. You want to find this on day one, not in week 5.
Referential check
“187 stop records reference load_id values. 187 of 187 match a load in today’s load export. Pass.”
Or: “12 of 187 don’t match. Fail.”
If you reference an entity that does not exist, your downstream queries get silently wrong. Loudly fail at ingest.
A practical pattern: every ingest job ends with a validation summary line in the log. Something like:
```
[2026-05-15T07:13:04Z] sap_loads ingest complete:
  rows=387 freshness=4h12m schema=ok domain=ok refs=ok STATUS=PASS
```
When that line says STATUS=FAIL, you alert. When it says PASS, you sleep.
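A sketch of a helper that assembles that line; the check names and values come from whatever validations the source runs.

```python
# Sketch: turn the per-source validation results into a single PASS/FAIL summary line.
from datetime import datetime, timezone

def summary_line(source: str, rows: int, freshness: str, checks: dict[str, bool]) -> str:
    status = "PASS" if all(checks.values()) else "FAIL"
    detail = " ".join(f"{name}={'ok' if ok else 'FAIL'}" for name, ok in checks.items())
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"[{ts}] {source} ingest complete: rows={rows} freshness={freshness} {detail} STATUS={status}"

print(summary_line("sap_loads", 387, "4h12m", {"schema": True, "domain": True, "refs": True}))
```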
Replays and idempotency
Two design rules you’ll be glad you followed:
Make every ingest replayable by date
For batch (SFTP, REST poll, DB extract): the input to every job is a date or a range, never “the latest.” This means when the Thursday job fails, you can re-run just Thursday on Friday morning without reprocessing the week.
```
# Wrong — you can never replay
$ ingest_sap_loads.sh

# Right — replay is one parameter away
$ ingest_sap_loads.sh --date 2026-05-15
```
Make every write idempotent
For every record you write into the ontology, the same input must produce the same output regardless of how many times you run it. Use the source system’s stable identifier as your primary key. If a record comes in twice, the second write is a no-op (or a deterministic update).
This becomes critical when (not if) a vendor replays the last 7 days of webhooks because they had an outage. If you handle replays correctly, this is a non-event. If you don’t, you spend Saturday cleaning up duplicate rows.
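One shape that delivers this is a keyed upsert: use the source system's identifier as the conflict key, so a replay becomes a deterministic update. A sketch in Postgres-flavored SQL driven from Python (psycopg2-style parameters; the table and column names are illustrative).

```python
# Sketch: idempotent batch write keyed on the source system's stable identifier.
UPSERT_LOAD = """
INSERT INTO ontology_loads (load_id, status, planned_delivery, updated_at)
VALUES (%(load_id)s, %(status)s, %(planned_delivery)s, %(updated_at)s)
ON CONFLICT (load_id) DO UPDATE
SET status = EXCLUDED.status,
    planned_delivery = EXCLUDED.planned_delivery,
    updated_at = EXCLUDED.updated_at;
"""

def write_loads(conn, records: list[dict]) -> None:
    with conn.cursor() as cur:
        cur.executemany(UPSERT_LOAD, records)   # running this twice leaves the same final state
    conn.commit()
```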
Schema drift: the slow disaster
Schema drift is when the customer’s source system changes shape in a way your pipeline did not anticipate. It is almost never announced. It manifests as:
- A new column you ignore (silent — fine, log it)
- A renamed column (silent disaster — your value is suddenly NULL)
- A removed column (loud — pipeline fails, you fix it)
- A retyped column (the worst — the pipeline succeeds but values are wrong)
Defenses:
- Compare schema on every ingest. Hash the column-name list; if it differs from yesterday, alert (don’t fail) and log a diff.
- Validate domain on critical columns. A retyped column will fail your domain checks.
- Build a single weekly “schema review” alert that fires every Monday with any drift observed in the last 7 days.
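A sketch of the first defense: hash the column list on every ingest, compare it to the previous run, and alert with a diff rather than failing. Keeping the previous schema in a small state file is one simple option.

```python
# Sketch: detect schema drift by comparing today's column list to the last observed one.
import hashlib
import json
import logging
import pathlib

log = logging.getLogger("schema_review")
STATE = pathlib.Path("state/sap_loads_schema.json")

def check_schema_drift(columns: list[str]) -> None:
    digest = hashlib.sha256("|".join(columns).encode()).hexdigest()
    if STATE.exists():
        previous = json.loads(STATE.read_text())
        if previous["digest"] != digest:
            added = sorted(set(columns) - set(previous["columns"]))
            removed = sorted(set(previous["columns"]) - set(columns))
            # Alert, don't fail; the diff feeds the Monday schema-review summary.
            log.warning("schema drift: added=%s removed=%s", added, removed)
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps({"digest": digest, "columns": columns}))
```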
The DBA negotiation
Most plumbing struggles in an FDE engagement are not technical — they are political. The customer’s database team owns the data, and you need their cooperation.
A few moves that consistently work:
Lead with their constraints
Don’t open with “we need a read replica with these 12 tables.” Open with: “What’s the right way to get a daily extract of loads and stops without touching your production load?” Let them propose. They will often offer something better than what you would have asked for.
Be a customer of their system
Operate like an external consumer, not a privileged insider. Use the staging environment. Hit the documented APIs. Don’t ask for direct write access to anything. This earns trust quickly.
Bring them a problem they want fixed too
DBAs are usually under-resourced and over-blamed. Identify something they would benefit from — maybe their export job has been failing silently, or their schema has undocumented columns nobody understands. Offer to help fix it. You’ll have a friend for the rest of the engagement.
Document everything in writing
When the DBA says “yeah, we’ll add a column to the export,” follow up that day with an email summarizing the agreement and any timeline. Memory is short and DBAs work for the customer, not for your project — they will not chase you.
Escalate gracefully if you must
If the DBA is genuinely blocking the engagement (and sometimes they are, for valid reasons), escalate to the IT director with a specific ask, not a complaint. “We need access to the load table by Wednesday so iteration 1 can demo on Friday — what’s the right path?” Most blocks resolve at that level.
What gets ingested versus what stays at the source
A common mistake: pulling everything into the ontology because you can.
You should pull what you’ll query, join, or display. You should leave at the source what is:
- High-volume and low-query-rate (raw GPS pings from 5 years ago)
- Sensitive in a way that doesn’t need to be in the ontology (PII you don’t display)
- Owned by another team’s system of record that they will keep updating (HR data — read from it, but don’t own it)
The ontology is not a data lake. It is a typed model of what your apps and agents need. Be deliberate about what enters it.
Common failure modes
A short list of mistakes you’ll see in junior FDE engagements:
- Pulling production live instead of from a replica → DBA escalation, sometimes a real outage
- Polling without backoff → vendor rate-limits you, GPS stops working
- Trusting timestamps from the source → off-by-timezone bugs in week 4
- Skipping validation to “save time” → silent bad data in week 5, demo blowup in week 6
- No replay capability → spending Saturday morning fixing Thursday’s outage
- Mixing identifiers across systems → SAP’s load_id is not the same as the GPS vendor’s load_id; if you assume they are, you get cross-joined nonsense
- Ingesting everything → 90% of what you pulled never gets used, but you spent 3 days pulling it
A plumbing-ready checklist
Before you put any source into your iteration-1 build, you should be able to answer yes to all of these:
- I have a working pull from this source running on a schedule
- The pull is parameterized by date (or watermark) and can be replayed
- The output schema is checked on every run
- Row count and freshness are checked and alert on anomaly
- Domain values on critical columns are validated
- Referential integrity to upstream sources is checked
- Writes to the ontology are idempotent
- The raw payload is preserved (file, JSON dump, or snapshot)
- I know the timezone of every timestamp
- I have a human contact on the customer’s side I can ping if the source fails
- I have written documentation: source name, owner, cadence, ID strategy
That is an 11-item checklist. It looks like a lot. It saves you weeks.
Key terms to remember
- Ingestion pattern — one of the seven canonical shapes (DB, SFTP, REST, webhook, stream, manual, scrape)
- Watermark — the boundary timestamp that defines what is “new” since last run
- Idempotency — re-running an ingest produces the same downstream state
- Replay — the ability to re-process a specific date or range
- Schema drift — silent change in source shape
- Validation summary line — the per-ingest log line that says PASS or FAIL
What’s next
You can move data. Now we shape it into a typed semantic layer the rest of the system reads from. The next lesson takes the Northbound model from Domain Capture and walks through committing it to the platform as a working ontology — and how to make the FDE-specific design calls that distinguish a maintainable semantic layer from a brittle one.