📖 Lesson ⏱️ 90 minutes

Datasource Integration

Back object types with datasets, streams, REST sources — and keep them in sync

The datasource layer revisited

An object type without a datasource is just a definition — no instances, no data. Datasource integration is the bridge from raw data systems into the ontology.

Three flavors to think about:

| Kind | Latency | Examples | Typical use |
| --- | --- | --- | --- |
| Batch dataset | Hours | Snowflake table, Parquet on S3, Iceberg | Most reference data: customers, contracts, history |
| Stream | Seconds | Kafka topic, CDC stream, Kinesis | Telemetry, events, real-time state |
| External API | Live, on-demand | Salesforce, Workday, internal REST | Authoritative but external systems |

A single object type can be backed by multiple datasources — e.g., Driver backed by the HR system (for identity) and a stream (for live status).

Backing an object type with a dataset

The simplest case: an object type sourced from a table.

objectType: Customer
backingDatasource:
  kind: dataset
  ref:  warehouse.crm.customers_v3
  primaryKey: customer_id
  mode: batch
  refresh: daily
  mapping:
    - source: customer_id
      target: customerId
    - source: company_name
      target: companyName
    - source: signed_dt
      target: signedAt
      transform: parseTimestamp(format = "yyyy-MM-dd")
    - source: country_iso
      target: billingAddress.country
      transform: upper()

The mapping block does three jobs:

  1. Renames columns to your ontology conventions.
  2. Transforms values to ontology types (parse timestamps, normalize codes).
  3. Drops, validates, or pivots as needed.

Things to enforce at mapping time:

  • Reject rows missing a primary key. Better to lose a row than to mint a fake ID.
  • Coerce nullability. If your ontology says signedAt is non-null but 0.1% of rows have NULL — either fix the source, exclude those rows, or relax the ontology. Pick one consciously; don’t let bad data sneak in.
  • Validate enums. If the source has country = "DE", "de", "Germany", "GER" — normalize before storing. The ontology should not see four spellings of “Germany.”

Streaming sources

Streams update the ontology in near real-time. Common shapes:

CDC (change data capture). Every change in the operational DB flows in as an insert/update/delete event. The ontology applies it incrementally.

backingDatasource:
  kind: stream
  ref:  kafka.dispatch.shipments_cdc
  primaryKey: shipment_id
  mode: cdc                  # insert/update/delete events
  freshnessSlaSeconds: 30    # alert if events are stale > 30s
  mapping: ...

Event stream. Each event is a record; you aggregate or project onto object types.

# A stream of telemetry pings, projected as the "currentLocation" property
projection:
  fromStream: kafka.telemetry.vehicle_pings
  ontoObjectType: Vehicle
  matchKey: vehicleId
  set:
    currentLocation: GeoPoint(payload.lat, payload.lng)
    locationUpdatedAt: event_time
  retentionPolicy: latestPerKey

A few things to plan for in streams:

Out-of-order events. Your locationUpdatedAt should not regress when a delayed event arrives. Apply an event only when event.timestamp > current.locationUpdatedAt; otherwise drop it.

Duplicates. Network retries cause duplicates. Idempotent projections (set, not append) handle this naturally.

Schema evolution. Your stream’s schema changes. The ontology mapping should fail loudly on missing required fields, not silently produce NULL.
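The first two concerns combine into one guard in the projection itself. A minimal in-memory sketch (the `state` dict stands in for the real object store; field names follow the telemetry example above):

```python
# Idempotent "latest per key" projection that tolerates out-of-order
# and duplicate events. State and event shapes are illustrative.
state: dict[str, dict] = {}  # vehicleId -> {"currentLocation": ..., "locationUpdatedAt": ...}

def apply_ping(event: dict) -> bool:
    """Apply a telemetry ping; return True if state was updated."""
    current = state.get(event["vehicleId"])
    # Out-of-order guard: never let locationUpdatedAt regress.
    # The same check drops duplicates, since a retry carries the same timestamp.
    if current and event["event_time"] <= current["locationUpdatedAt"]:
        return False
    state[event["vehicleId"]] = {
        "currentLocation": (event["lat"], event["lng"]),
        "locationUpdatedAt": event["event_time"],
    }
    return True
```

Because the projection sets rather than appends, replaying any suffix of the stream converges to the same state — which is exactly what you want when a consumer restarts from an old offset.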

External API sources

Some data should not be copied — it should be federated from a system of record.

objectType: HrDriver
backingDatasource:
  kind: externalApi
  endpoint: https://workday.internal/api/v2/workers
  primaryKey: workerId
  authentication: oauth2(workday_app)
  fetchStrategy:
    listing: cursored
    individual: byId
  cache:
    ttl: 5m
  mapping: ...

When to federate vs. copy:

| Federate (external API) | Copy (batch into dataset) |
| --- | --- |
| Data changes frequently; a copy would always be stale | Stable data, easy to refresh |
| You have strict rate limits or licensing constraints | Volume is large and traffic spiky |
| Single source of truth must remain external | You need to join across systems efficiently |

Federation has a cost — every read calls the external API. Cache aggressively, but invalidate carefully.
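"Cache aggressively, invalidate carefully" is worth making concrete. A minimal TTL cache sketch, assuming a fetch callable that stands in for the real external API client (all names here are illustrative):

```python
import time

class TtlCache:
    """Cache federated reads so each key hits the external API at most once per TTL."""

    def __init__(self, fetch, ttl_seconds: float, clock=time.monotonic):
        self._fetch = fetch          # stand-in for the external API call
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict = {}     # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry and entry[0] > now:
            return entry[1]          # fresh: serve from cache, no API call
        value = self._fetch(key)     # stale or missing: call the source of record
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call after a write-back so the next read sees the new upstream state.
        self._entries.pop(key, None)
```

The `invalidate` hook is the "carefully" part: any path that mutates the external system must also evict the cached entry, or reads will serve the old value for up to one full TTL.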

Multi-source object types

A Driver might have:

  • Identity, name, address from Workday (external API).
  • License + certifications from a batch dataset.
  • Current location from a Kafka stream.

Each source provides a slice of the same object type:

objectType: Driver
sources:
  - role: identity
    kind: externalApi
    ref:  workday.workers
    primaryKey: workerId
    properties: [firstName, lastName, employeeStatus, ...]
  - role: certification
    kind: dataset
    ref:  warehouse.compliance.driver_certs
    primaryKey: worker_id
    properties: [licenseClass, certifications, ...]
  - role: telemetry
    kind: stream
    ref:  kafka.telemetry.driver_status
    primaryKey: driver_id
    properties: [currentLocation, currentStatus, lastSeenAt]

Each source is responsible for its slice. They must agree on the primary key — the system needs to know that Workday.workerId, compliance.worker_id, and kafka.driver_id refer to the same person. That mapping (sometimes a separate “identity resolution” layer) is what makes multi-source object types work.
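The merge itself is simple once keys are resolved. A sketch, where the hard-coded `KEY_MAP` stands in for a real identity-resolution layer and all identifiers are invented for illustration:

```python
# Merge per-source slices into one Driver object.
KEY_MAP = {  # (source, source_key) -> canonical driverId
    ("workday", "WD-77"): "D-1",
    ("compliance", "9031"): "D-1",
    ("kafka", "drv-abc"): "D-1",
}

def merge_slices(slices: list[tuple[str, str, dict]]) -> dict[str, dict]:
    """Each slice is (source, source_key, properties); group by canonical id."""
    objects: dict[str, dict] = {}
    for source, source_key, props in slices:
        canonical = KEY_MAP[(source, source_key)]  # KeyError = unresolved identity, fail loudly
        objects.setdefault(canonical, {}).update(props)
    return objects
```

Note that each source writes only its own properties, so there is no conflict to resolve — the `sources` declaration above is effectively a partition of the object type's schema.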

Freshness, SLAs, and observability

Every datasource has a freshness profile. Make it explicit:

  • Source freshness: how stale can the source itself be? (CDC lag, batch interval)
  • Ingestion lag: how long after the source updates does the ontology see it?
  • Index lag: how long after ingestion is the data queryable?

A typical SLA might read: “Customer is fresh within 24 hours; Shipment location is fresh within 30 seconds.”

Wire alerting to those SLAs. If Shipment.currentLocation is stale by 5 minutes when the SLA is 30 seconds, you have an outage — not a “we’ll look at it Tuesday” issue. The ontology is now infrastructure.
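A freshness monitor can be as small as a table of budgets and a comparison. This sketch uses the two SLAs quoted above; the property names and timestamp handling are illustrative:

```python
# Declared freshness budgets, in seconds (from the example SLA above).
FRESHNESS_SLA_SECONDS = {"Customer": 24 * 3600, "Shipment.currentLocation": 30}

def sla_breaches(last_updated: dict[str, float], now: float) -> list[str]:
    """Return the properties whose observed staleness exceeds their SLA."""
    return [
        name
        for name, ts in last_updated.items()
        if now - ts > FRESHNESS_SLA_SECONDS[name]
    ]
```

Run this on a schedule much shorter than your tightest SLA and page on any non-empty result — a 30-second budget checked once an hour is not really a 30-second budget.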

Write-back

So far we have talked about data flowing into the ontology. Sometimes the ontology needs to push state back to a source system — e.g., when an action updates a customer’s tier, the CRM should know.

actionType: changeCustomerTier
effects:
  - update: Customer[customerId]
    set: { tier: newTier }
  - writeBack:
      target: salesforce.account
      match: { id: customer.salesforceId }
      set:   { Account_Tier__c: newTier }

Two patterns for write-back:

Synchronous. The action does not succeed until the external system confirms. Strong consistency, but ties your action’s latency and availability to the external system.

Asynchronous. The action commits in the ontology and enqueues an event; a worker pushes to the external system. Eventual consistency, but resilient.

Most actions want asynchronous write-back with retry-and-alert. Make the action complete based on the ontology change; let a separate process handle external propagation.
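The asynchronous pattern reduces to a queue, a worker, and a retry budget. A deliberately minimal sketch — the in-memory deque stands in for a durable queue, and `push_to_crm` for the real Salesforce update:

```python
from collections import deque

queue: deque = deque()
MAX_ATTEMPTS = 3  # illustrative retry budget

def change_customer_tier(store: dict, customer_id: str, new_tier: str) -> None:
    store[customer_id] = new_tier            # action completes on the ontology change...
    queue.append({"id": customer_id, "tier": new_tier, "attempts": 0})  # ...propagation is deferred

def drain(push_to_crm, alert) -> None:
    """Worker loop: push each queued change; retry on failure, alert when exhausted."""
    while queue:
        msg = queue.popleft()
        try:
            push_to_crm(msg)                 # stand-in for the external-system write
        except Exception:
            msg["attempts"] += 1
            if msg["attempts"] < MAX_ATTEMPTS:
                queue.append(msg)            # retry later
            else:
                alert(msg)                   # give up loudly, never silently
```

In production the queue must be durable (outbox table, Kafka topic) and retries should back off, but the shape is the same: the action's latency depends only on the ontology commit.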

Reconciliation

Sources drift. Even with CDC, you will occasionally have an ontology object that disagrees with the source.

Build a reconciliation job:

  1. Periodically sample a slice of the ontology against the source.
  2. Compute a diff: ontology has X, source has Y.
  3. For known categories of drift, auto-correct. For unknown, alert.

Without reconciliation, drift accumulates silently and you find out about it from an angry customer.
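The three steps above fit in one small function. A sketch over dict-shaped rows, where the `AUTO_CORRECTABLE` set and the field names are assumptions for the example:

```python
# Reconciliation pass: diff sampled keys, auto-correct known drift, alert on the rest.
AUTO_CORRECTABLE = {"status"}  # fields where we unconditionally trust the source

def reconcile(ontology: dict, source: dict, sample_keys, alert) -> int:
    """Compare ontology vs. source rows for the sampled keys; return fields auto-corrected."""
    fixed = 0
    for key in sample_keys:
        onto_row, src_row = ontology.get(key, {}), source.get(key, {})
        for field in set(onto_row) | set(src_row):
            if onto_row.get(field) != src_row.get(field):
                if field in AUTO_CORRECTABLE:
                    ontology.setdefault(key, {})[field] = src_row.get(field)
                    fixed += 1
                else:
                    alert((key, field, onto_row.get(field), src_row.get(field)))
    return fixed
```

Start with an empty `AUTO_CORRECTABLE` set — everything alerts — and promote a drift category to auto-correct only after you understand why it happens.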

A small worked example

A logistics startup’s Shipment object, end-to-end:

objectType: Shipment
primaryKey: shipmentId

sources:
  - role: master
    kind: stream
    ref:  kafka.dispatch.shipments_cdc
    mode: cdc
    freshnessSlaSeconds: 30
    mapping:
      shipment_id  -> shipmentId
      status_code  -> status (mapEnum statusMap)
      created_ts   -> createdAt
      origin_hub   -> originHubId
      dest_hub     -> destinationHubId
      weight_kg    -> weightKg
      ...

  - role: telemetry
    kind: stream
    ref:  kafka.telemetry.shipment_pings
    projection:
      matchKey: shipmentId
      set:
        currentLocation: GeoPoint(lat, lng)
        locationUpdatedAt: event_time
      retentionPolicy: latestPerKey

  - role: documents
    kind: dataset
    ref:  warehouse.docs.shipment_attachments
    mode: batch
    refresh: hourly
    primaryKey: shipment_id
    mapping:
      proof_of_delivery_url -> attachments.proofOfDelivery

Three sources, one object type, one consumer-facing API. The ontology layer hides all of this — every consumer simply reads shipment.status, shipment.currentLocation, shipment.attachments.proofOfDelivery.

Key takeaways

  • Object types are backed by datasources — batch, stream, or external API.
  • The mapping is where source schemas become ontology schemas — invest in it.
  • A single object type can have multiple sources, each responsible for a slice.
  • Freshness is part of the contract. Treat staleness as an outage, not an annoyance.
  • Write-back closes the loop when the ontology needs to update upstream systems.

What’s next

That completes the conceptual half of the course. From here we build. The next lesson — Setting Up Your Environment — gets you a working ontology workspace ready for the hands-on phase.


The data flows in. The ontology gives it meaning. 🚰