Datasource Integration
Back object types with datasets, streams, REST sources — and keep them in sync
The datasource layer revisited
An object type without a datasource is just a definition — no instances, no data. Datasource integration is the bridge from raw data systems into the ontology.
Three flavors to think about:
| Kind | Latency | Examples | Typical use |
|---|---|---|---|
| Batch dataset | Hours | Snowflake table, Parquet on S3, Iceberg | Most reference data: customers, contracts, history |
| Stream | Seconds | Kafka topic, CDC stream, Kinesis | Telemetry, events, real-time state |
| External API | Live, on-demand | Salesforce, Workday, internal REST | Authoritative but external systems |
A single object type can be backed by multiple datasources — e.g., Driver backed by the HR system (for identity) and a stream (for live status).
Backing an object type with a dataset
The simplest case: an object type sourced from a table.
```yaml
objectType: Customer
backingDatasource:
  kind: dataset
  ref: warehouse.crm.customers_v3
  primaryKey: customer_id
  mode: batch
  refresh: daily
  mapping:
    - source: customer_id
      target: customerId
    - source: company_name
      target: companyName
    - source: signed_dt
      target: signedAt
      transform: parseTimestamp(format = "yyyy-MM-dd")
    - source: country_iso
      target: billingAddress.country
      transform: upper()
```
The mapping block does three jobs:
- Renames columns to your ontology conventions.
- Transforms values to ontology types (parse timestamps, normalize codes).
- Drops, validates, or pivots as needed.
Things to enforce at mapping time (a code sketch follows the list):
- Reject rows missing a primary key. Better to lose a row than to mint a fake ID.
- Coerce nullability. If your ontology says `signedAt` is non-null but 0.1% of rows have NULL, either fix the source, exclude those rows, or relax the ontology. Pick one consciously; don’t let bad data sneak in.
- Validate enums. If the source has `country = "DE"`, `"de"`, `"Germany"`, and `"GER"`, normalize before storing. The ontology should not see four spellings of “Germany.”
Streaming sources
Streams update the ontology in near real-time. Common shapes:
CDC (change data capture). Every change in the operational DB flows in as an insert/update/delete event. The ontology applies it incrementally.
```yaml
backingDatasource:
  kind: stream
  ref: kafka.dispatch.shipments_cdc
  primaryKey: shipment_id
  mode: cdc                  # insert/update/delete events
  freshnessSlaSeconds: 30    # alert if events are stale > 30s
  mapping: ...
```
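To make “applies it incrementally” concrete, here is a minimal sketch of folding CDC events into keyed state; the event fields (`op`, `key`, `after`) are an assumed shape, not a standard CDC format:
```python
# Hypothetical CDC apply loop: each event carries an operation, the primary key,
# and (for inserts/updates) the new row image.
shipments: dict[str, dict] = {}  # keyed ontology state, indexed by shipment_id

def apply_cdc_event(event: dict) -> None:
    key = event["key"]   # e.g. the shipment_id
    op = event["op"]     # "insert" | "update" | "delete"
    if op == "delete":
        shipments.pop(key, None)          # removal is idempotent
    else:
        shipments[key] = event["after"]   # insert and update are both upserts
```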
Event stream. Each event is a record; you aggregate or project onto object types.
```yaml
# A stream of telemetry pings, projected as the "currentLocation" property
projection:
  fromStream: kafka.telemetry.vehicle_pings
  ontoObjectType: Vehicle
  matchKey: vehicleId
  set:
    currentLocation: GeoPoint(payload.lat, payload.lng)
    locationUpdatedAt: event_time
  retentionPolicy: latestPerKey
```
A few things to plan for in streams:
Out-of-order events. Your locationUpdatedAt should not regress when a delayed event arrives. Filter event.timestamp > current.locationUpdatedAt before writing.
Duplicates. Network retries cause duplicates. Idempotent projections (set, not append) handle this naturally.
Schema evolution. Your stream’s schema changes. The ontology mapping should fail loudly on missing required fields, not silently produce NULL.
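A rough sketch of a projection that handles the first two concerns, assuming each ping carries `vehicleId`, `lat`, `lng`, and a comparable `event_time`; the names are illustrative:
```python
# Hypothetical projection of telemetry pings onto Vehicle: idempotent "set"
# semantics plus a guard so out-of-order events never regress the state.
vehicles: dict[str, dict] = {}  # keyed by vehicleId

def apply_ping(ping: dict) -> None:
    vehicle = vehicles.setdefault(ping["vehicleId"], {})

    # Out-of-order / duplicate guard: locationUpdatedAt must never go backwards.
    last = vehicle.get("locationUpdatedAt")
    if last is not None and ping["event_time"] <= last:
        return  # stale or duplicate event; drop it

    # Idempotent "set" (not append): re-applying the same event changes nothing.
    vehicle["currentLocation"] = (ping["lat"], ping["lng"])
    vehicle["locationUpdatedAt"] = ping["event_time"]
```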
External API sources
Some data should not be copied — it should be federated from a system of record.
```yaml
objectType: HrDriver
backingDatasource:
  kind: externalApi
  endpoint: https://workday.internal/api/v2/workers
  primaryKey: workerId
  authentication: oauth2(workday_app)
  fetchStrategy:
    listing: cursored
    individual: byId
  cache:
    ttl: 5m
  mapping: ...
```
When to federate vs. copy:
| Federate (external API) | Copy (batch into dataset) |
|---|---|
| Data changes frequently, copy would always be stale | Stable data, easy to refresh |
| You have strict rate limits or licensing constraints | Volume is large and traffic spiky |
| Single source-of-truth must remain external | You need to join across systems efficiently |
Federation has a cost — every read calls the external API. Cache aggressively, but invalidate carefully.
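As a rough illustration of that trade-off, here is a TTL cache in front of a federated read; `get_hr_driver` and `fetch_worker_from_workday` are invented placeholders, and the TTL simply mirrors the 5m value in the config above:
```python
import time

# Hypothetical TTL cache for a federated Workday lookup.
CACHE_TTL_SECONDS = 300  # mirrors the 5m ttl in the config above
_cache: dict[str, tuple[float, dict]] = {}  # workerId -> (fetched_at, record)

def fetch_worker_from_workday(worker_id: str) -> dict:
    raise NotImplementedError  # stand-in for the real external API call

def get_hr_driver(worker_id: str) -> dict:
    now = time.time()
    hit = _cache.get(worker_id)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]  # fresh enough: serve from cache, no external call

    record = fetch_worker_from_workday(worker_id)  # every miss costs an API call
    _cache[worker_id] = (now, record)
    return record
```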
Multi-source object types
A Driver might have:
- Identity, name, address from Workday (external API).
- License + certifications from a batch dataset.
- Current location from a Kafka stream.
Each source provides a slice of the same object type:
```yaml
objectType: Driver
sources:
  - role: identity
    kind: externalApi
    ref: workday.workers
    primaryKey: workerId
    properties: [firstName, lastName, employeeStatus, ...]
  - role: certification
    kind: dataset
    ref: warehouse.compliance.driver_certs
    primaryKey: worker_id
    properties: [licenseClass, certifications, ...]
  - role: telemetry
    kind: stream
    ref: kafka.telemetry.driver_status
    primaryKey: driver_id
    properties: [currentLocation, currentStatus, lastSeenAt]
```
Each source is responsible for its slice. They must agree on the primary key: the system needs to know that `Workday.workerId`, `compliance.worker_id`, and `kafka.driver_id` refer to the same person. That mapping (sometimes a separate “identity resolution” layer) is what makes multi-source object types work.
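One way to picture that identity-resolution step is a small crosswalk from each source's key to a canonical primary key. A minimal sketch, with invented IDs:
```python
# Hypothetical identity-resolution crosswalk: each source key maps to one
# canonical Driver primary key before that source's slice is merged in.
CROSSWALK: dict[tuple[str, str], str] = {
    ("workday", "W-1042"): "driver-7",    # workday workerId
    ("compliance", "1042"): "driver-7",   # warehouse worker_id
    ("telemetry", "D-7"): "driver-7",     # kafka driver_id
}

drivers: dict[str, dict] = {}  # merged Driver objects, keyed by canonical id

def merge_slice(source: str, source_key: str, slice_props: dict) -> None:
    canonical = CROSSWALK.get((source, source_key))
    if canonical is None:
        return  # unresolved identity: quarantine rather than guess
    # Each source owns its slice of properties.
    drivers.setdefault(canonical, {}).update(slice_props)
```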
Freshness, SLAs, and observability
Every datasource has a freshness profile. Make it explicit:
- Source freshness: how stale can the source itself be? (CDC lag, batch interval)
- Ingestion lag: how long after the source updates does the ontology see it?
- Index lag: how long after ingestion is the data queryable?
A typical SLA might read: “Customer is fresh within 24 hours; Shipment location is fresh within 30 seconds.”
Wire alerting to those SLAs. If Shipment.currentLocation is stale by 5 minutes when the SLA is 30 seconds, you have an outage — not a “we’ll look at it Tuesday” issue. The ontology is now infrastructure.
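A minimal sketch of wiring freshness to alerting, using the SLA numbers above; `page_oncall` is a placeholder for whatever your observability stack provides:
```python
from datetime import datetime, timezone

# Hypothetical freshness SLAs in seconds, per object type or property.
FRESHNESS_SLAS = {
    "Customer": 24 * 3600,            # fresh within 24 hours
    "Shipment.currentLocation": 30,   # fresh within 30 seconds
}

def page_oncall(message: str) -> None:
    print(message)  # placeholder for the real alerting integration

def check_freshness(target: str, last_update: datetime) -> None:
    # last_update must be timezone-aware for the comparison to be meaningful.
    age = (datetime.now(timezone.utc) - last_update).total_seconds()
    sla = FRESHNESS_SLAS[target]
    if age > sla:
        # Stale beyond SLA is treated as an outage, not a backlog item.
        page_oncall(f"{target} is {age:.0f}s stale (SLA {sla}s)")
```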
Write-back
So far we have talked about data flowing into the ontology. Sometimes the ontology needs to push state back to a source system — e.g., when an action updates a customer’s tier, the CRM should know.
```yaml
actionType: changeCustomerTier
effects:
  - update: Customer[customerId]
    set: { tier: newTier }
  - writeBack:
      target: salesforce.account
      match: { id: customer.salesforceId }
      set: { Account_Tier__c: newTier }
```
Two patterns for write-back:
Synchronous. The action does not succeed until the external system confirms. Strong consistency, but ties your action’s latency and availability to the external system.
Asynchronous. The action commits in the ontology and enqueues an event; a worker pushes to the external system. Eventual consistency, but resilient.
Most actions want asynchronous write-back with retry-and-alert. Make the action complete based on the ontology change; let a separate process handle external propagation.
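A rough sketch of that asynchronous pattern with retry-and-alert, using an in-memory queue for illustration; `push_to_salesforce` and `alert` are placeholders for the real integrations:
```python
import queue
import time

# Hypothetical async write-back worker: the action commits in the ontology,
# enqueues a task, and returns; propagation happens out of band, with retries.
writeback_queue: queue.Queue = queue.Queue()

def enqueue_writeback(customer_id: str, new_tier: str) -> None:
    # Called by the action after the ontology update has committed.
    writeback_queue.put({"customerId": customer_id, "tier": new_tier, "attempts": 0})

def push_to_salesforce(task: dict) -> None:
    raise NotImplementedError  # stand-in for the real CRM call

def alert(message: str) -> None:
    print(message)  # stand-in for real alerting

def writeback_worker(max_attempts: int = 5) -> None:
    while True:
        task = writeback_queue.get()
        try:
            push_to_salesforce(task)
        except Exception:
            task["attempts"] += 1
            if task["attempts"] >= max_attempts:
                alert(f"write-back failed after {max_attempts} attempts: {task}")
            else:
                time.sleep(2 ** task["attempts"])  # exponential backoff, then retry
                writeback_queue.put(task)
```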
Reconciliation
Sources drift. Even with CDC, you will occasionally have an ontology object that disagrees with the source.
Build a reconciliation job:
- Periodically sample a slice of the ontology against the source.
- Compute a diff: ontology has X, source has Y.
- For known categories of drift, auto-correct. For unknown, alert.
Without reconciliation, drift accumulates silently and you find out about it from an angry customer.
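A minimal sketch of such a job, assuming sampled keys and placeholder lookups for the ontology and the source of record:
```python
import random

# Hypothetical reconciliation job; the fetch/write/alert functions are placeholders.
AUTO_CORRECTABLE = {"status"}  # known-safe categories of drift: trust the source

def fetch_from_ontology(key: str) -> dict: ...
def fetch_from_source(key: str) -> dict: ...
def write_to_ontology(key: str, patch: dict) -> None: ...
def alert(message: str) -> None: print(message)

def reconcile(all_keys: list[str], sample_size: int = 1000) -> None:
    # 1. Periodically sample a slice of the ontology against the source.
    for key in random.sample(all_keys, min(sample_size, len(all_keys))):
        ours, theirs = fetch_from_ontology(key), fetch_from_source(key)
        # 2. Compute the diff: ontology has X, source has Y.
        drifted = {f for f in set(ours) | set(theirs) if ours.get(f) != theirs.get(f)}
        if not drifted:
            continue
        # 3. Auto-correct known categories of drift; alert on anything else.
        if drifted <= AUTO_CORRECTABLE:
            write_to_ontology(key, {f: theirs.get(f) for f in drifted})
        else:
            alert(f"unexplained drift on {key}: {sorted(drifted)}")
```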
A small worked example
A logistics startup’s Shipment object, end-to-end:
```yaml
objectType: Shipment
primaryKey: shipmentId
sources:
  - role: master
    kind: stream
    ref: kafka.dispatch.shipments_cdc
    mode: cdc
    freshnessSlaSeconds: 30
    mapping:
      shipment_id -> shipmentId
      status_code -> status (mapEnum statusMap)
      created_ts  -> createdAt
      origin_hub  -> originHubId
      dest_hub    -> destinationHubId
      weight_kg   -> weightKg
      ...
  - role: telemetry
    kind: stream
    ref: kafka.telemetry.shipment_pings
    projection:
      matchKey: shipmentId
      set:
        currentLocation: GeoPoint(lat, lng)
        locationUpdatedAt: event_time
      retentionPolicy: latestPerKey
  - role: documents
    kind: dataset
    ref: warehouse.docs.shipment_attachments
    mode: batch
    refresh: hourly
    primaryKey: shipment_id
    mapping:
      proof_of_delivery_url -> attachments.proofOfDelivery
```
Three sources, one object type, one consumer-facing API. The ontology layer hides all of this: every consumer simply reads `shipment.status`, `shipment.currentLocation`, and `shipment.attachments.proofOfDelivery`.
Key takeaways
- Object types are backed by datasources — batch, stream, or external API.
- The mapping is where source schemas become ontology schemas — invest in it.
- A single object type can have multiple sources, each responsible for a slice.
- Freshness is part of the contract. Treat staleness as an outage, not an annoyance.
- Write-back closes the loop when the ontology needs to update upstream systems.
What’s next
That completes the conceptual half of the course. From here we build. The next lesson — Setting Up Your Environment — gets you a working ontology workspace ready for the hands-on phase.
The data flows in. The ontology gives it meaning. 🚰