Watchdog

The watchdog is one of six Lambda functions in the Interlock framework. It is invoked independently on an EventBridge schedule (default: every 5 minutes) and executes eight checks in a table-driven loop to detect silent failures:

Stale triggers – a Step Function execution started but never completed (timeout, infrastructure failure)
Missed schedules – a cron-scheduled pipeline’s expected start time passed with no trigger record
Missed inclusion schedules – a pipeline with an inclusion calendar has no trigger on a scheduled date
Sensor-trigger reconciliation – a sensor-triggered pipeline’s conditions are met but no trigger exists (self-heals missed triggers)
SLA scheduling – proactively ensures EventBridge Scheduler entries exist for pipelines with SLA configs
Trigger deadlines – a sensor-triggered pipeline’s auto-trigger window has expired with no trigger
Missing post-run sensors – a pipeline completed but the expected post-run sensor never arrived
Relative SLA breaches – a pipeline with maxDuration SLA has exceeded its time budget since first sensor arrival

In STAMP terms, these are safety constraint violations caused by what didn’t happen rather than what went wrong.

Problem

Interlock’s event-driven architecture (sensor write -> DynamoDB Stream -> Step Function) acts only when data arrives. Two failure modes in particular go unnoticed under that model:

Stale triggers: A Step Function execution starts but gets stuck or times out silently. The trigger lock remains in RUNNING status indefinitely.

Missed schedules: Upstream ingestion fails silently for a cron-scheduled pipeline. No sensor data arrives, no trigger fires, no SLA check ever runs. The pipeline is silently skipped with zero alerts.

This is a classic STAMP gap: the control structure assumes the controlled process always produces feedback. When it doesn’t, the controller never acts.

Stale Trigger Detection

The watchdog scans the control table for TRIGGER# records with RUNNING status and checks whether their TTL has expired.

Algorithm

Scan all TRIGGER# records with status=RUNNING

For each trigger:
  If TTL > 0 and now > TTL → stale
  Otherwise → not stale (skip)

  Parse pipeline ID, schedule, date from PK/SK
  Publish SFN_TIMEOUT event to EventBridge
  Set trigger status to FAILED_FINAL
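
The steps above can be sketched in Python as follows. This is a minimal illustration, not the framework's implementation: the Trigger record type, the "PIPELINE#<id>" partition-key layout, and the function names are assumptions made for the example. Only the TRIGGER#{schedule}#{date} sort-key shape, the RUNNING and FAILED_FINAL statuses, and the TTL rule come from this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trigger:
    pk: str      # assumed layout: "PIPELINE#<pipelineId>" (hypothetical)
    sk: str      # "TRIGGER#<schedule>#<date>"
    status: str  # e.g. "RUNNING", "COMPLETED", "FAILED_FINAL"
    ttl: int     # epoch seconds; 0 means no TTL was set


def is_stale(trigger: Trigger, now: int) -> bool:
    """A RUNNING trigger is stale only if it has a TTL and that TTL has passed."""
    return trigger.status == "RUNNING" and trigger.ttl > 0 and now > trigger.ttl


def parse_keys(trigger: Trigger) -> Tuple[str, str, str]:
    """Extract (pipeline ID, schedule ID, date) from the PK/SK pair."""
    pipeline_id = trigger.pk.split("#", 1)[1]
    _, schedule_id, date = trigger.sk.split("#", 2)
    return pipeline_id, schedule_id, date


def sweep_stale(triggers: List[Trigger], now: int) -> List[Tuple[str, str, str]]:
    """Mark stale triggers FAILED_FINAL and return the keys to report as SFN_TIMEOUT."""
    timed_out = []
    for trigger in triggers:
        if not is_stale(trigger, now):
            continue  # not stale: skip
        timed_out.append(parse_keys(trigger))
        trigger.status = "FAILED_FINAL"
    return timed_out
```

In a real deployment the status change would be a DynamoDB update and each returned tuple would become an SFN_TIMEOUT event; both are left out here to keep the sketch self-contained.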

Stale Threshold

The default stale trigger threshold is 24 hours. Triggers whose TTL has expired are considered timed-out Step Function executions. The watchdog transitions them to FAILED_FINAL status, which prevents the stream-router from attempting further retries.

Event Format

{
  "source": "interlock",
  "detail-type": "SFN_TIMEOUT",
  "detail": {
    "pipelineId": "silver-orders",
    "scheduleId": "stream",
    "date": "2026-03-01",
    "message": "step function timed out for silver-orders/stream/2026-03-01",
    "timestamp": "2026-03-01T12:30:00Z"
  }
}
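
As a rough illustration of how such an event could be assembled for the EventBridge PutEvents API, the sketch below builds a request entry from the fields shown above. The helper names and the bus name "interlock-events" are hypothetical; the entry keys (Source, DetailType, Detail, EventBusName) are the standard PutEvents request fields, and Detail must be a JSON string.

```python
import json
from typing import Dict


def sfn_timeout_detail(pipeline_id: str, schedule_id: str, date: str, timestamp: str) -> Dict[str, str]:
    """Build the detail payload in the shape documented above."""
    return {
        "pipelineId": pipeline_id,
        "scheduleId": schedule_id,
        "date": date,
        "message": f"step function timed out for {pipeline_id}/{schedule_id}/{date}",
        "timestamp": timestamp,
    }


def to_put_events_entry(detail_type: str, detail: Dict[str, str], bus_name: str) -> Dict[str, str]:
    """Wrap a detail payload as one PutEvents entry; Detail is serialized JSON."""
    return {
        "Source": "interlock",
        "DetailType": detail_type,
        "Detail": json.dumps(detail),
        "EventBusName": bus_name,  # hypothetical bus name supplied by the caller
    }


entry = to_put_events_entry(
    "SFN_TIMEOUT",
    sfn_timeout_detail("silver-orders", "stream", "2026-03-01", "2026-03-01T12:30:00Z"),
    "interlock-events",
)
```

With boto3, an entry like this would be sent with events_client.put_events(Entries=[entry]).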

Missed Schedule Detection

The watchdog loads all pipeline configs and checks cron-scheduled pipelines for missing trigger records.

Algorithm

Load all pipeline configs (via config cache)

For each pipeline with a cron schedule:
  Skip if excluded by calendar (weekends, specific dates)

  Resolve schedule ID ("cron" for cron-scheduled pipelines)
  Check if a TRIGGER# record exists for today's date

  If trigger exists → not missed (skip)

  If schedule.time is configured:
    Resolve timezone (UTC if not specified)
    If current time < expected start time → skip (not yet due)

  Publish SCHEDULE_MISSED event to EventBridge

Deadline Resolution

The watchdog determines whether a schedule is missed by checking:

Trigger record existence – if a TRIGGER#{schedule}#{date} record exists in the control table, the pipeline has already been triggered today
Expected start time – if the pipeline config includes a schedule.time, the watchdog only alerts after that time has passed in the configured timezone

If no schedule.time is configured, the watchdog alerts as soon as it detects a missing trigger record for today’s date.
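
The decision described above can be sketched as a small pure function. This is an illustration under stated assumptions: the function name and parameters are hypothetical, schedule.time is assumed to be an "HH:MM" string, and the trigger lookup is reduced to a boolean supplied by the caller.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional


def is_schedule_missed(
    now_utc: datetime,
    trigger_exists: bool,
    schedule_time: Optional[str] = None,  # assumed "HH:MM" format
    tz: tzinfo = timezone.utc,            # UTC when no timezone is configured
) -> bool:
    """Return True when a cron-scheduled pipeline should be reported as missed."""
    if trigger_exists:
        return False  # already triggered today
    if schedule_time is None:
        return True   # no expected start time: alert as soon as the trigger is missing
    local_now = now_utc.astimezone(tz)
    hour, minute = (int(part) for part in schedule_time.split(":"))
    expected = local_now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    return local_now >= expected  # before the expected start time it is not yet due


now = datetime(2026, 3, 1, 9, 10, tzinfo=timezone.utc)
missed_utc = is_schedule_missed(now, trigger_exists=False, schedule_time="09:00")
missed_est = is_schedule_missed(
    now, trigger_exists=False, schedule_time="09:00", tz=timezone(timedelta(hours=-5))
)
```

A named zone such as zoneinfo.ZoneInfo("America/New_York") can be passed as tz in place of the fixed offset used here. In the example, 09:10 UTC is past a 09:00 UTC start time but is only 04:10 at UTC-5, so the same pipeline is not yet due there.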

Event Format

{
  "source": "interlock",
  "detail-type": "SCHEDULE_MISSED",
  "detail": {
    "pipelineId": "gold-revenue",
    "scheduleId": "cron",
    "date": "2026-03-01",
    "message": "missed schedule for gold-revenue on 2026-03-01",
    "timestamp": "2026-03-01T09:10:00Z"
  }
}

Missing Post-Run Sensor Detection

For pipelines with postRun config, the watchdog checks whether post-run sensors have arrived within a configurable grace period after job completion.

Algorithm

Load all pipeline configs (via config cache)

For each pipeline with postRun config:
  Find TRIGGER# record for today with status=COMPLETED
  If no completed trigger → skip

  Calculate elapsed time since trigger completion
  If elapsed < sensorTimeout → skip (still within grace period)

  For each postRun rule key:
    Check if SENSOR#{key} exists with data newer than trigger completion
    If sensor exists → skip

  Publish POST_RUN_SENSOR_MISSING event to EventBridge
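
A minimal sketch of this check is shown below. The function name and the dictionary of sensor timestamps are hypothetical stand-ins for the control-table lookups, and the sketch assumes an event is warranted when at least one expected sensor is absent or older than the trigger completion.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def missing_post_run_sensors(
    completed_at: Optional[datetime],
    now: datetime,
    sensor_timeout: timedelta,
    rule_keys: List[str],
    sensor_updated_at: Dict[str, datetime],
) -> List[str]:
    """Return the postRun rule keys whose sensors did not arrive after completion."""
    if completed_at is None:
        return []  # no completed trigger today: skip
    if now - completed_at < sensor_timeout:
        return []  # still within the grace period
    missing = []
    for key in rule_keys:
        seen = sensor_updated_at.get(key)
        # The sensor only counts if its data is newer than the trigger completion.
        if seen is None or seen <= completed_at:
            missing.append(key)
    return missing


done = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
check_time = datetime(2026, 3, 1, 14, 30, tzinfo=timezone.utc)
missing = missing_post_run_sensors(
    done, check_time, timedelta(hours=2), ["output-row-count"], {}
)
```

A non-empty result would correspond to publishing POST_RUN_SENSOR_MISSING for the pipeline.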

Configuration

The sensorTimeout field on postRun controls the grace period. It defaults to "2h" (2 hours).

postRun:
  sensorTimeout: "2h"
  rules:
    - key: output-row-count
      check: gte
      field: count
      value: 1000
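
For illustration, a duration string such as "2h" could be turned into a time span as sketched below. The accepted syntax (whole numbers followed by h, m, or s, optionally chained as in "1h30m") is an assumption for this example; the page only shows the "2h" form, and the parser name is hypothetical.

```python
import re
from datetime import timedelta
from typing import Optional

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_timeout(value: Optional[str], default: str = "2h") -> timedelta:
    """Parse strings like "2h" or "1h30m"; fall back to the default when unset."""
    text = value or default
    total = timedelta()
    pos = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")  # stray characters between tokens
        total += timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")  # nothing parsed or trailing junk
    return total


grace = parse_timeout("2h")
```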

Event Format

{
  "source": "interlock",
  "detail-type": "POST_RUN_SENSOR_MISSING",
  "detail": {
    "pipelineId": "silver-orders",
    "scheduleId": "stream",
    "date": "2026-03-01",
    "message": "post-run sensor not received within 2h of completion",
    "timestamp": "2026-03-01T14:30:00Z"
  }
}

Deployment

The watchdog Lambda is invoked by an EventBridge scheduled rule on the default event bus:

EventBridge rule (rate) → watchdog Lambda → DynamoDB (read configs, triggers)
                                          → EventBridge custom bus (publish events)

The watchdog reads from all three DynamoDB tables (configs from control, job events from joblog, reruns from rerun) but writes only to the control table, where it updates the status of stale triggers.

IAM Permissions

Action                                      Scope
DynamoDB read (GetItem, Query, Scan, etc.)  All 3 tables + indexes
DynamoDB write (PutItem, UpdateItem)        Control table only
EventBridge PutEvents                       Custom event bus

Terraform Configuration

The watchdog schedule is configurable via the watchdog_schedule Terraform variable:

variable "watchdog_schedule" {
  description = "EventBridge schedule expression for watchdog invocations"
  type        = string
  default     = "rate(5 minutes)"
}

Error Handling

All eight checks run independently in sequence. An error in any check does not prevent the remaining checks from running. Errors are collected into an aggregate error and returned to the Lambda runtime.

When one or more checks fail, the watchdog publishes a WATCHDOG_DEGRADED event to EventBridge listing the failed check names. This allows operators to detect partial watchdog failures through the standard alerting pipeline.

{
  "source": "interlock",
  "detail-type": "WATCHDOG_DEGRADED",
  "detail": {
    "message": "watchdog checks failed: stale-triggers, sla-scheduling",
    "timestamp": "2026-03-01T12:30:00Z"
  }
}
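
The error-isolation pattern described above can be sketched as a table-driven loop. The check functions here are placeholders and the helper names are hypothetical; only the check names and the message format are taken from the example event.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], None]]


def run_checks(checks: List[Check]) -> List[Tuple[str, Exception]]:
    """Run every check in order; a failure is recorded but never stops the loop."""
    failed = []
    for name, check in checks:
        try:
            check()
        except Exception as exc:  # collect the error and keep going
            failed.append((name, exc))
    return failed


def degraded_message(failed: List[Tuple[str, Exception]]) -> Optional[str]:
    """Build the WATCHDOG_DEGRADED message, or None when every check passed."""
    if not failed:
        return None
    return "watchdog checks failed: " + ", ".join(name for name, _ in failed)


def _ok() -> None:
    pass


def _boom() -> None:
    raise RuntimeError("simulated failure")


ran = []
checks: List[Check] = [
    ("stale-triggers", _boom),
    ("missed-schedules", lambda: ran.append("missed-schedules")),
    ("sla-scheduling", _boom),
    ("post-run-sensors", _ok),
]
failures = run_checks(checks)
message = degraded_message(failures)
```

Keeping the checks in a list of (name, function) pairs means a new check is added by appending one row, and the failed names needed for the degraded event are collected without extra bookkeeping.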

Relationship to Step Functions

The watchdog is not part of the Step Functions state machine. It runs on its own schedule precisely to detect cases where the Step Function didn’t start (missed schedules) or where it started but got stuck (stale triggers). This independence is fundamental to its role as a safety net.
