Course Content
Data Plumbing in the Wild
Sourcing data from legacy systems, broken exports, and reluctant DBAs — getting from raw bytes to a usable pipeline
Monday of week 3
You walk into the customer’s office on Monday of week 3 with an MVP plan in your bag. You sit down. You open your editor. And you immediately remember the most consistently underestimated fact about FDE work:
You don’t have the data yet.
You have promises of the data. You have exports of the data, maybe a sample, possibly from a stale environment. You have queries the customer’s analyst wrote a year ago that returned data once. What you don’t have is a working pipeline from a real source system into your platform that you can build against.
Phase 3 starts here, because the rest of the engagement depends on it. No data, no model. No model, no app. No app, no outcome. Data plumbing is not glamorous — but it is where the engagement either gets traction or gets stuck.
The plumbing mindset
Two principles to internalize:
Principle 1 — The data is worse than you think
Whatever you saw in interviews, whatever the analyst showed you in their notebook, whatever the IT director assured you about — the data, when you pull it for real, will be worse. Missing values, inconsistent encodings, duplicate keys, surprise NULLs, timestamps in five timezones, and a charming variety of “we used to do it differently before 2019.”
Plan for it. Build cheap, loud validation at every ingest point. The bug you find on day one of plumbing is free; the same bug you find on day four of building is expensive; the same bug you find on day one of demo is catastrophic.
Principle 2 — Plumbing is not a sub-phase, it is a craft
In a traditional project plan, “data engineering” is one box you cross off before “application development.” In an FDE engagement, plumbing is a craft you practice across the entire engagement, in tight loops, with the same iteration discipline you use for app development. The customer’s data will continue to surprise you in week 6. Build for that.
The seven ingestion patterns
Every customer dataset you ever ingest will fit into one of seven patterns. Learn to recognize them on sight.
1. Direct DB extract (read replica or batch job)
The customer has a database — Oracle, Postgres, SQL Server — and you can run queries against it.
Typical setup: read replica of the production OLTP database, accessible from a jump host inside the customer’s network.
Pros: structured data, queryable, joinable. The closest thing to “easy mode.”
Cons: queries against production are unwelcome; replicas may lag; schemas change without warning.
FDE moves:
- Always pull from a replica, never primary
- Materialize a daily/hourly snapshot rather than querying live
- Pin your queries to a stable schema view if the DBAs will create one
- Wrap every ingest in a row-count and freshness check (more on this below)
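To make the "materialize a snapshot" move concrete, here is a minimal sketch of a date-pinned extract from a read replica, assuming Postgres and SQLAlchemy; the connection string, table, and column names are illustrative, not the customer's real schema.

```python
# Sketch: pull one day of loads from the read replica and materialize it locally.
# Everything downstream reads the snapshot file, so the replica is hit once per run.
import pathlib

import pandas as pd
import sqlalchemy as sa

REPLICA_URL = "postgresql://readonly@replica.customer.internal:5432/tms"  # replica, never primary

def snapshot_loads(run_date: str) -> pd.DataFrame:
    engine = sa.create_engine(REPLICA_URL)
    query = sa.text(
        "SELECT load_id, customer_id, origin_hub, dest_hub, "
        "planned_pickup, planned_delivery, status "
        "FROM loads WHERE updated_at::date = :run_date"
    )
    df = pd.read_sql(query, engine, params={"run_date": run_date})
    pathlib.Path("snapshots").mkdir(exist_ok=True)
    df.to_parquet(f"snapshots/loads_{run_date}.parquet", index=False)
    return df
```

Parameterizing by date also gives you the replay property described later in this lesson for free.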
2. SFTP file drop
The customer’s other team writes a CSV / TSV / fixed-width / XML file to a server, on a schedule. You pull it.
Typical setup: a vendor system writes to /exports/orders_YYYYMMDD.csv every night at 02:00.
Pros: dead simple to consume. Files are durable artifacts you can replay.
Cons: schema is in someone’s head; encoding varies; “the file didn’t arrive” is a frequent ticket.
FDE moves:
- Always pull the file by date, not “the latest”
- Verify file checksums or expected size before processing
- Keep the raw files indefinitely — they are your only backup if downstream pipelines break
- Build a “no file today” alert; do not silently fail when the customer’s system stops dropping files
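A sketch of the first three moves together, using paramiko: pull by date, sanity-check the size before processing, and keep the raw file. The host, key path, and file paths are placeholders.

```python
# Sketch: date-addressed SFTP pull with a size sanity check before processing.
import logging
import pathlib

import paramiko

log = logging.getLogger("sftp_ingest")

def pull_orders_file(run_date: str, min_bytes: int = 1024) -> str:
    remote = f"/exports/orders_{run_date}.csv"   # pull by date, never "the latest"
    local = f"raw/orders_{run_date}.csv"         # raw files are kept indefinitely
    pathlib.Path("raw").mkdir(exist_ok=True)
    transport = paramiko.Transport(("sftp.customer.internal", 22))
    transport.connect(username="fde_ingest",
                      pkey=paramiko.RSAKey.from_private_key_file("ingest_key"))  # placeholder key
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        size = sftp.stat(remote).st_size
        if size < min_bytes:
            raise ValueError(f"{remote} is suspiciously small ({size} bytes)")
        sftp.get(remote, local)
    except IOError:
        log.error("no SFTP file for %s; wire this to your 'no file today' alert", run_date)
        raise
    finally:
        transport.close()
    return local
```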
3. REST API poll
The customer (or a vendor) exposes a REST endpoint. You poll it.
Typical setup: GPS vendor’s /vehicles/positions endpoint, paged, rate-limited, with a 5-minute SLA on updates.
Pros: modern, typed, well-documented (sometimes).
Cons: rate limits, pagination quirks, auth that expires, “the endpoint is being deprecated in Q3.”
FDE moves:
- Page conservatively; assume the vendor’s pagination is buggy
- Cache responses with the response’s own freshness header
- Build retry-with-backoff into every call from day one
- Capture and store the raw JSON responses for replay
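A sketch of those moves in code, assuming a paged JSON endpoint with bearer-token auth. The URL, query parameters, and response shape are placeholders for whatever the vendor actually exposes.

```python
# Sketch: conservative paging, retry with exponential backoff, raw responses kept for replay.
import json
import pathlib
import time

import requests

BASE = "https://gps.vendor.example/vehicles/positions"

def fetch_page(page: int, token: str, retries: int = 5) -> dict:
    for attempt in range(retries):
        resp = requests.get(
            BASE,
            params={"page": page, "page_size": 200},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(2 ** attempt)   # back off on rate limits and server flaps
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"giving up on page {page} after {retries} attempts")

def poll_positions(run_ts: str, token: str) -> None:
    raw_dir = pathlib.Path(f"raw/gps/{run_ts}")
    raw_dir.mkdir(parents=True, exist_ok=True)
    page = 1
    while True:
        body = fetch_page(page, token)
        (raw_dir / f"page_{page}.json").write_text(json.dumps(body))  # raw capture for replay
        if not body.get("vehicles"):   # assumed shape: an empty page means we are done
            break
        page += 1
```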
4. Webhook / push
The customer (or a vendor) calls you when an event happens.
Typical setup: the load-tendering vendor POSTs to your endpoint on every load-status change.
Pros: real-time. No polling waste.
Cons: you need a reachable endpoint inside the customer’s network (often hard); duplicate events; out-of-order events; missed events when your endpoint was down.
FDE moves:
- Always design for idempotency — every webhook may arrive twice
- Sequence by the event’s own timestamp, not arrival order
- Capture the raw payload before processing
- Build a “replay last 24 hours” command you can run if your endpoint was down
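A minimal sketch of a receiver that follows those rules, written with Flask. The route, payload fields, and in-memory dedup set are assumptions; in practice the seen-event set lives in a durable table.

```python
# Sketch: idempotent webhook receiver that captures the raw payload before processing.
import json
import pathlib

from flask import Flask, request

app = Flask(__name__)
SEEN: set = set()   # in production, a table keyed by the vendor's event_id

@app.post("/webhooks/load-status")
def load_status():
    payload = request.get_json(force=True)
    event_id = payload["event_id"]   # assumed: the vendor sends a stable event identifier
    raw_dir = pathlib.Path("raw/webhooks")
    raw_dir.mkdir(parents=True, exist_ok=True)
    (raw_dir / f"{event_id}.json").write_text(json.dumps(payload))  # raw first, always
    if event_id in SEEN:             # duplicates are expected; make them no-ops
        return {"status": "duplicate"}, 200
    SEEN.add(event_id)
    upsert_load_status(payload["load_id"], payload["status"], payload["event_time"])
    return {"status": "ok"}, 200

def upsert_load_status(load_id: str, status: str, event_time: str) -> None:
    ...  # write keyed by load_id, ordered by the event's own timestamp, not arrival order
```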
5. Stream (Kafka, Pulsar, Kinesis, JMS)
The customer has a message bus.
Typical setup: a Kafka topic with vehicle telemetry, 10K events/second, retained 7 days.
Pros: real-time, replayable, native to modern enterprise stacks.
Cons: schema evolution chaos; consumer group politics; backfills are an ordeal.
FDE moves:
- Use a schema registry if one exists; if not, write down the schema you observe and version it yourself
- Always start with the smallest possible consumer group that does not interfere with existing teams
- Test your consumer with a deliberate offset reset before going to production — you will need to do it
- Sample-decode every message type into your own logs at startup; you will be surprised what you find
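A sketch of a cautious consumer, using kafka-python. The topic, broker, and group names are illustrative, and write_to_ontology stands in for whatever idempotent writer you build.

```python
# Sketch: small dedicated consumer group, manual commits, sample-decode logging at startup.
import json
import logging

from kafka import KafkaConsumer

log = logging.getLogger("telemetry_ingest")

def write_to_ontology(event: dict) -> None:
    ...  # placeholder for your idempotent writer

consumer = KafkaConsumer(
    "vehicle.telemetry",
    bootstrap_servers=["kafka.customer.internal:9092"],
    group_id="fde-telemetry-ingest",   # our own small group, separate from existing teams
    auto_offset_reset="latest",
    enable_auto_commit=False,          # commit only after a successful write
)

seen_types: set = set()
for message in consumer:
    event = json.loads(message.value)
    event_type = event.get("type")
    if event_type not in seen_types:   # log the first example of every message type we see
        seen_types.add(event_type)
        log.info("first %r event observed: %s", event_type, event)
    write_to_ontology(event)
    consumer.commit()
```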
6. Manual file / email attachment
Someone — a person — emails you a file every morning, or drops it in SharePoint.
Typical setup: an analyst exports a report from a legacy mainframe each morning and emails it to ops. Now they email it to you too.
Pros: zero infrastructure required to start.
Cons: it is a person. People take vacations, get sick, change formats, and leave the company.
FDE moves:
- Treat this as explicitly temporary. Day one of using it, write down the plan to replace it.
- Build automatic format validation; loudly alert when the file changes shape
- Get a second person CC’d on the email so the pipeline does not die when one person is out
- Move it to SFTP or an API as soon as it is politically possible
7. Screen scrape / DOM extraction
The customer’s only access to a legacy system is through a web UI. You automate clicks against it.
Typical setup: a 2003-era J2EE app that has no API. Operators log in and copy values into Excel.
Pros: sometimes the only option.
Cons: every UI change breaks you; the customer’s IT is uncomfortable; auth and session handling are fragile.
FDE moves:
- Use this only if you have negotiated explicit permission; never on a system you weren’t told about
- Pin to specific DOM selectors, log every change in the page structure, fail loudly
- Make the schedule conservative — multiple full crawls per day will surface as a security event
- Document this honestly to the customer as technical debt; do not pretend it is a permanent solution
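For completeness, a sketch of what pinned selectors and fail-loud behavior look like with Selenium. The URL and selectors are placeholders, and none of this substitutes for the permission conversation above.

```python
# Sketch: pinned selectors, loud failure the moment the page structure changes.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

EXPECTED_SELECTORS = ["#loadTable", "#loadTable .status-cell"]   # pinned and documented

def scrape_load_statuses(url: str) -> list[str]:
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        for sel in EXPECTED_SELECTORS:
            try:
                driver.find_element(By.CSS_SELECTOR, sel)
            except NoSuchElementException:
                # The page changed shape: fail loudly rather than returning wrong data.
                raise RuntimeError(f"legacy UI changed: selector {sel!r} not found")
        cells = driver.find_elements(By.CSS_SELECTOR, "#loadTable .status-cell")
        return [cell.text for cell in cells]
    finally:
        driver.quit()
```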
A worked Northbound plumbing plan
Walking through what plumbing actually looks like for the iteration-1 MVP:
| Source | Pattern | What we need | Cadence | Owner |
|---|---|---|---|---|
| SAP load board | Nightly SFTP drop | Active loads, planned ETAs | Daily 02:00 | SAP team |
| GPS portal | REST API poll | Vehicle positions, last check-in | Every 5 min | GPS vendor |
| Hubs reference | Manual CSV (one-time) | Hub IDs, codes, locations | Static, refresh quarterly | Ops admin |
| Drivers + tractors | DB extract | Driver-to-tractor assignments | Hourly | HR team |
Already, in week 3, you have four data sources to wrangle. Each one needs a working pull, validation, storage, and observability before Maria sees a single screen.
The right week-3 sequence:
- Day 1 (Mon): Get one record from each data source into your environment, even by hand. Eyeball it. Note surprises.
- Day 2 (Tue): Build the SAP nightly ingest. It is the largest and most stable; everything else depends on it.
- Day 3 (Wed): Build the GPS poll. Cache responses. Handle 429s.
- Day 4 (Thu): Load the static hubs CSV. Hourly driver/tractor refresh. Cross-join everything to verify the ontology is wired.
- Day 5 (Fri): Run the full pipeline end-to-end. Wire the first screen to it. Demo.
Notice what is missing from this list: hand-rolled streaming, complex CDC, schema migrations. Those are iteration-2+ concerns. Iteration 1 wants the simplest pipeline that delivers correct enough data to Maria’s morning view.
Validation: cheap, fast, loud
Every dataset ingest should run through a small bank of validations on every load. The pattern is the same for every source:
Row-count check
“Today’s SAP export had 387 loads. Yesterday had 412. Two-day rolling average is 405. Difference is within 10%. Pass.”
If the row count drops by 50%, something happened — SAP partial export, regional outage, schedule change. You want to know within minutes, not on Friday at demo time.
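The check barely deserves a function, but writing it down keeps the threshold explicit and testable. A sketch using the numbers from the example above; the two-day window and 10% tolerance are starting points to tune per source.

```python
# Sketch: today's row count against a short rolling average of previous days.
def check_row_count(today: int, recent_counts: list[int], tolerance: float = 0.10) -> bool:
    baseline = sum(recent_counts) / len(recent_counts)
    return abs(today - baseline) / baseline <= tolerance

# Mirrors the worked example: 387 against a 405 baseline is roughly a 4% drift, so it passes.
assert check_row_count(387, [412, 398])
```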
Freshness check
“Latest GPS ping observed at 07:12 UTC. Now is 07:18. Lag is 6 minutes. Threshold is 15. Pass.”
Stale data is the single most common cause of “the system is wrong” complaints from operators. Surface staleness prominently in the app (we’ll see this in Phase 4) and alert on it before they do.
Schema check
“Today’s columns: load_id, customer_id, origin_hub, dest_hub, planned_pickup, planned_delivery, status. Expected: load_id, customer_id, origin_hub, dest_hub, planned_pickup, planned_delivery, status. Match. Pass.”
When a customer team adds a column (“oh yeah, we added priority last week”), your pipeline either handles it or breaks. You want it to handle it (extra columns OK, missing required columns fail loud) and tell you about it.
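A sketch of that policy: extra columns are tolerated and reported, missing required columns are a hard failure. The required set mirrors the column list above.

```python
# Sketch: schema check that is permissive about additions and strict about removals.
REQUIRED = {
    "load_id", "customer_id", "origin_hub", "dest_hub",
    "planned_pickup", "planned_delivery", "status",
}

def check_schema(observed: set[str]) -> tuple[bool, set[str], set[str]]:
    missing = REQUIRED - observed
    extra = observed - REQUIRED   # e.g. the surprise 'priority' column: log it, don't fail
    return (not missing, missing, extra)
```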
Domain check
“All status values in {OPEN, TENDERED, ACCEPTED, IN_TRANSIT, DELIVERED, CANCELLED}. Pass.”
Or: “1 row with status=‘COMPLETED’ — not in expected set. Fail.”
The customer team will, without telling you, add COMPLETED because they are migrating off the old DELIVERED value. You want to find this on day one, not in week 5.
Referential check
“187 stop records reference load_id values. 187 of 187 match a load in today’s load export. Pass.”
Or: “12 of 187 don’t match. Fail.”
If you reference an entity that does not exist, your downstream queries get silently wrong. Loudly fail at ingest.
A practical pattern: every ingest job ends with a validation summary line in the log. Something like:
```
[2026-05-15T07:13:04Z] sap_loads ingest complete:
  rows=387 freshness=4h12m schema=ok domain=ok refs=ok STATUS=PASS
```
When that line says STATUS=FAIL, you alert. When it says PASS, you sleep.
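A sketch of a helper that assembles that line; the check names and values come from whatever validations the source runs.

```python
# Sketch: turn the per-source validation results into a single PASS/FAIL summary line.
from datetime import datetime, timezone

def summary_line(source: str, rows: int, freshness: str, checks: dict[str, bool]) -> str:
    status = "PASS" if all(checks.values()) else "FAIL"
    detail = " ".join(f"{name}={'ok' if ok else 'FAIL'}" for name, ok in checks.items())
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"[{ts}] {source} ingest complete: rows={rows} freshness={freshness} {detail} STATUS={status}"

print(summary_line("sap_loads", 387, "4h12m", {"schema": True, "domain": True, "refs": True}))
```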
Replays and idempotency
Two design rules you’ll be glad you followed:
Make every ingest replayable by date
For batch (SFTP, REST poll, DB extract): the input to every job is a date or a range, never “the latest.” This means when the Thursday job fails, you can re-run just Thursday on Friday morning without reprocessing the week.
```
# Wrong — you can never replay
$ ingest_sap_loads.sh

# Right — replay is one parameter away
$ ingest_sap_loads.sh --date 2026-05-15
```
Make every write idempotent
For every record you write into the ontology, the same input must produce the same output regardless of how many times you run it. Use the source system’s stable identifier as your primary key. If a record comes in twice, the second write is a no-op (or a deterministic update).
This becomes critical when (not if) a vendor replays the last 7 days of webhooks because they had an outage. If you handle replays correctly, this is a non-event. If you don’t, you spend Saturday cleaning up duplicate rows.
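One shape that delivers this is a keyed upsert: use the source system's identifier as the conflict key, so a replay becomes a deterministic update. A sketch in Postgres-flavored SQL driven from Python (psycopg2-style parameters; the table and column names are illustrative).

```python
# Sketch: idempotent batch write keyed on the source system's stable identifier.
UPSERT_LOAD = """
INSERT INTO ontology_loads (load_id, status, planned_delivery, updated_at)
VALUES (%(load_id)s, %(status)s, %(planned_delivery)s, %(updated_at)s)
ON CONFLICT (load_id) DO UPDATE
SET status = EXCLUDED.status,
    planned_delivery = EXCLUDED.planned_delivery,
    updated_at = EXCLUDED.updated_at;
"""

def write_loads(conn, records: list[dict]) -> None:
    with conn.cursor() as cur:
        cur.executemany(UPSERT_LOAD, records)   # running this twice leaves the same final state
    conn.commit()
```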
Schema drift: the slow disaster
Schema drift is when the customer’s source system changes shape in a way your pipeline did not anticipate. It is almost never announced. It manifests as:
- A new column you ignore (silent — fine, log it)
- A renamed column (silent disaster — your value is suddenly NULL)
- A removed column (loud — pipeline fails, you fix it)
- A retyped column (the worst — the pipeline succeeds but values are wrong)
Defenses:
- Compare schema on every ingest. Hash the column-name list; if it differs from yesterday, alert (don’t fail) and log a diff.
- Validate domain on critical columns. A retyped column will fail your domain checks.
- Build a single weekly “schema review” alert that fires every Monday with any drift observed in the last 7 days.
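A sketch of the first defense: hash the column list on every ingest, compare it to the previous run, and alert with a diff rather than failing. Keeping the previous schema in a small state file is one simple option.

```python
# Sketch: detect schema drift by comparing today's column list to the last observed one.
import hashlib
import json
import logging
import pathlib

log = logging.getLogger("schema_review")
STATE = pathlib.Path("state/sap_loads_schema.json")

def check_schema_drift(columns: list[str]) -> None:
    digest = hashlib.sha256("|".join(columns).encode()).hexdigest()
    if STATE.exists():
        previous = json.loads(STATE.read_text())
        if previous["digest"] != digest:
            added = sorted(set(columns) - set(previous["columns"]))
            removed = sorted(set(previous["columns"]) - set(columns))
            # Alert, don't fail; the diff feeds the Monday schema-review summary.
            log.warning("schema drift: added=%s removed=%s", added, removed)
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps({"digest": digest, "columns": columns}))
```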
The DBA negotiation
Most plumbing struggles in an FDE engagement are not technical — they are political. The customer’s database team owns the data, and you need their cooperation.
A few moves that consistently work:
Lead with their constraints
Don’t open with “we need a read replica with these 12 tables.” Open with: “What’s the right way to get a daily extract of loads and stops without touching your production load?” Let them propose. They will often offer something better than what you would have asked for.
Be a customer of their system
Operate like an external consumer, not a privileged insider. Use the staging environment. Hit the documented APIs. Don’t ask for direct write access to anything. This earns trust quickly.
Bring them a problem they want fixed too
DBAs are usually under-resourced and over-blamed. Identify something they would benefit from — maybe their export job has been failing silently, or their schema has undocumented columns nobody understands. Offer to help fix it. You’ll have a friend for the rest of the engagement.
Document everything in writing
When the DBA says “yeah, we’ll add a column to the export,” follow up that day with an email summarizing the agreement and any timeline. Memory is short and DBAs work for the customer, not for your project — they will not chase you.
Escalate gracefully if you must
If the DBA is genuinely blocking the engagement (and sometimes they are, for valid reasons), escalate to the IT director with a specific ask, not a complaint. “We need access to the load table by Wednesday so iteration 1 can demo on Friday — what’s the right path?” Most blocks resolve at that level.
What gets ingested versus what stays at the source
A common mistake: pulling everything into the ontology because you can.
You should pull what you’ll query, join, or display. You should leave at the source what is:
- High-volume and low-query-rate (raw GPS pings from 5 years ago)
- Sensitive in a way that doesn’t need to be in the ontology (PII you don’t display)
- Owned by another team’s system of record that they will keep updating (HR data — read from it, but don’t own it)
The ontology is not a data lake. It is a typed model of what your apps and agents need. Be deliberate about what enters it.
Common failure modes
A short list of mistakes you’ll see in junior FDE engagements:
- Pulling production live instead of from a replica → DBA escalation, sometimes a real outage
- Polling without backoff → vendor rate-limits you, GPS stops working
- Trusting timestamps from the source → off-by-timezone bugs in week 4
- Skipping validation to “save time” → silent bad data in week 5, demo blowup in week 6
- No replay capability → spending Saturday morning fixing Thursday’s outage
- Mixing identifiers across systems → SAP’s load_id is not the same as the GPS vendor’s load_id; if you assume they are, you get cross-joined nonsense
- Ingesting everything → 90% of what you pulled never gets used, but you spent 3 days pulling it
A plumbing-ready checklist
Before you put any source into your iteration-1 build, you should be able to answer yes to all of these:
- I have a working pull from this source running on a schedule
- The pull is parameterized by date (or watermark) and can be replayed
- The output schema is checked on every run
- Row count and freshness are checked and alert on anomaly
- Domain values on critical columns are validated
- Referential integrity to upstream sources is checked
- Writes to the ontology are idempotent
- The raw payload is preserved (file, JSON dump, or snapshot)
- I know the timezone of every timestamp
- I have a human contact on the customer’s side I can ping if the source fails
- I have written documentation: source name, owner, cadence, ID strategy
That is an 11-item checklist. It looks like a lot. It saves you weeks.
Key terms to remember
- Ingestion pattern — one of the seven canonical shapes (DB, SFTP, REST, webhook, stream, manual, scrape)
- Watermark — the boundary timestamp that defines what is “new” since last run
- Idempotency — re-running an ingest produces the same downstream state
- Replay — the ability to re-process a specific date or range
- Schema drift — silent change in source shape
- Validation summary line — the per-ingest log line that says PASS or FAIL
What’s next
You can move data. Now we shape it into a typed semantic layer the rest of the system reads from. The next lesson takes the Northbound model from Domain Capture and walks through committing it to the platform as a working ontology — and how to make the FDE-specific design calls that distinguish a maintainable semantic layer from a brittle one.