Alerting

Interlock publishes all lifecycle and alert events to a custom EventBridge event bus. The framework includes two built-in consumers: an event-sink Lambda that logs all events to a DynamoDB events table, and an alert-dispatcher Lambda that delivers Slack notifications with message threading. You can also create custom rules to route events to SNS, SQS, Lambda, CloudWatch Logs, or any other EventBridge target.

EventBridge Event Bus

The Terraform module creates a custom event bus named {environment}-interlock-events. The four publishing Lambda functions (sla-monitor, orchestrator, stream-router, and watchdog) write events to this bus using events:PutEvents.

The bus name is available as a Terraform output:

output "event_bus_name" {
  value = module.interlock.event_bus_name
}

output "event_bus_arn" {
  value = module.interlock.event_bus_arn
}

Detail Types

Each event is published with a detail-type field that classifies the event. Use these values in EventBridge rule patterns for filtering and routing.

SLA Events

Published by the sla-monitor Lambda via EventBridge Scheduler callbacks and during SLA cleanup in the Step Functions state machine.

| Detail Type | Meaning | When |
| --- | --- | --- |
| SLA_WARNING | SLA warning threshold reached | Pipeline has not completed by the warning timestamp |
| SLA_BREACH | SLA deadline exceeded | Pipeline has not completed by the breach timestamp |
| SLA_MET | Job completed before SLA warning deadline | Pipeline completed before any SLA alert fired |

Lifecycle Events

Published by the orchestrator and stream-router Lambdas during the pipeline lifecycle.

| Detail Type | Meaning | When |
| --- | --- | --- |
| VALIDATION_PASSED | All validation rules passed | Readiness evaluation succeeds, before trigger |
| VALIDATION_EXHAUSTED | Evaluation window closed without passing | Max evaluation time exceeded |
| JOB_TRIGGERED | Pipeline job was triggered | Trigger fired successfully |
| JOB_COMPLETED | Triggered job completed successfully | Job polling detects success |
| JOB_FAILED | Triggered job failed | Job polling detects failure |
| JOB_TIMEOUT | Triggered job timed out | Job polling detects timeout |
| RETRY_EXHAUSTED | All retry attempts consumed | Job failed maxRetries (or maxCodeRetries for permanent failures) times without success |
| JOB_POLL_EXHAUSTED | Job polling window exceeded | Orchestrator stopped checking job status after jobPollWindowSeconds elapsed without a terminal result |
| INFRA_FAILURE | Unrecoverable infrastructure error | Step Functions execution reaches Fail state |
| SFN_TIMEOUT | Step Functions execution timed out | Global TimeoutSeconds exceeded (configurable via sfn_timeout_seconds Terraform variable) |
| DATA_DRIFT | Post-run drift detected | Post-run evaluation detected data quality drift against baseline |
| POST_RUN_DRIFT | Post-run sensor changed after completion | Sensor value drifted from baseline after job completed |
| POST_RUN_DRIFT_INFLIGHT | Post-run sensor changed while job running | Informational drift detected during an active execution |
| POST_RUN_SENSOR_MISSING | No post-run sensor data received | Watchdog detected no post-run sensor within sensorTimeout |
| BASELINE_CAPTURE_FAILED | Baseline capture error | Error occurred while capturing the post-run baseline at trigger completion |
| SENSOR_DEADLINE_EXPIRED | Sensor trigger window closed without pipeline starting | Manual restart required via RERUN_REQUEST |
| PIPELINE_EXCLUDED | Pipeline excluded by calendar | Sensor, rerun, job-failure, or post-run drift skipped due to calendar exclusion |

Rerun Events

Published by the stream-router when processing rerun requests and late data arrivals.

| Detail Type | Meaning | When |
| --- | --- | --- |
| LATE_DATA_ARRIVAL | Sensor updated after job completed | Sensor updatedAt is newer than joblog completedAt |
| RERUN_REJECTED | Rerun request rejected by circuit breaker | No new sensor data since last job completion |
| RERUN_ACCEPTED | Rerun request accepted | Rerun passed circuit breaker validation and trigger lock was reset |

Dry-Run Events

Published by the stream-router Lambda for pipelines with dryRun: true. These events record what Interlock would do without executing any jobs or starting Step Functions.

| Detail Type | Meaning | When |
| --- | --- | --- |
| DRY_RUN_WOULD_TRIGGER | All validation rules passed | Interlock would have triggered the pipeline job at this time |
| DRY_RUN_LATE_DATA | Sensor updated after trigger point | Sensor data arrived after the dry-run trigger was recorded; would have triggered a re-run |
| DRY_RUN_SLA_PROJECTION | Estimated completion vs. deadline | Projects whether the SLA would be met or breached based on expectedDuration and deadline |
| DRY_RUN_DRIFT | Post-run sensor data changed | Sensor value drifted from the baseline captured at trigger time; would have triggered a drift re-run |
| DRY_RUN_COMPLETED | Observation loop closed | Terminal event for the evaluation period; carries the SLA verdict (met, breach, or n/a) |

The DRY_RUN_SLA_PROJECTION detail includes status ("met" or "breach"), estimatedCompletion, deadline, and marginSeconds fields.

The DRY_RUN_COMPLETED detail includes triggeredAt, slaStatus ("met", "breach", or "n/a"), and optionally estimatedCompletion and deadline when SLA is configured.
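Putting those fields together, a DRY_RUN_COMPLETED detail might look like the sketch below. Field names follow the description above; all values are illustrative:

```json
{
  "pipelineId": "gold-revenue",
  "triggeredAt": "2026-03-01T08:15:00Z",
  "slaStatus": "met",
  "estimatedCompletion": "2026-03-01T09:45:00Z",
  "deadline": "2026-03-01T10:00:00Z"
}
```

estimatedCompletion and deadline are present only when the pipeline has an SLA configured.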

Watchdog Events

Published by the watchdog Lambda, invoked on an EventBridge schedule (default: every 5 minutes).

| Detail Type | Meaning | When |
| --- | --- | --- |
| SCHEDULE_MISSED | No evaluation started by the schedule deadline | Watchdog detects absence of expected pipeline activity |
| TRIGGER_RECOVERED | Sensor trigger condition met but no trigger existed | Watchdog re-evaluated sensor data and self-healed a missed trigger |

Event Payload Structure

Events follow the standard EventBridge envelope format:

{
  "version": "0",
  "id": "abc123",
  "source": "interlock",
  "detail-type": "SLA_WARNING",
  "time": "2026-03-01T10:00:00Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "pipelineId": "gold-revenue",
    "scheduleId": "daily",
    "date": "2026-03-01",
    "message": "Pipeline gold-revenue has not completed by warning deadline 10:00 UTC"
  }
}

The detail object contains pipeline-specific context. The exact fields vary by detail type but always include pipelineId.

Creating EventBridge Rules

Subscribe to events by creating EventBridge rules that match on detail-type. Route matched events to any supported EventBridge target.

Route SLA Alerts to SNS

resource "aws_cloudwatch_event_rule" "sla_alerts" {
  name           = "interlock-sla-alerts"
  event_bus_name = module.interlock.event_bus_name

  event_pattern = jsonencode({
    "detail-type" = ["SLA_WARNING", "SLA_BREACH"]
  })
}

resource "aws_cloudwatch_event_target" "sla_to_sns" {
  rule           = aws_cloudwatch_event_rule.sla_alerts.name
  event_bus_name = module.interlock.event_bus_name
  target_id      = "sla-to-sns"
  arn            = aws_sns_topic.alerts.arn
}
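For delivery to succeed, EventBridge must also be allowed to publish to the topic. A minimal topic-policy sketch, reusing the resource names from the example above:

```hcl
# Grant EventBridge permission to publish to the alerts topic.
# Without this, matched events are dropped without an error in your stack.
resource "aws_sns_topic_policy" "allow_eventbridge" {
  arn = aws_sns_topic.alerts.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "AllowEventBridgePublish"
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sns:Publish"
      Resource  = aws_sns_topic.alerts.arn
      Condition = {
        ArnEquals = { "aws:SourceArn" = aws_cloudwatch_event_rule.sla_alerts.arn }
      }
    }]
  })
}
```

The aws:SourceArn condition scopes the grant to this one rule rather than all of EventBridge.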

Route All Events to CloudWatch Logs

resource "aws_cloudwatch_event_rule" "all_events" {
  name           = "interlock-all-events"
  event_bus_name = module.interlock.event_bus_name

  event_pattern = jsonencode({
    source = ["interlock"]
  })
}

resource "aws_cloudwatch_event_target" "to_cloudwatch" {
  rule           = aws_cloudwatch_event_rule.all_events.name
  event_bus_name = module.interlock.event_bus_name
  target_id      = "to-cloudwatch"
  arn            = aws_cloudwatch_log_group.interlock_events.arn
}
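CloudWatch Logs targets additionally require a log resource policy granting EventBridge write access; without it, deliveries are rejected. A sketch using the log group from the example above:

```hcl
# Allow EventBridge to create streams and put events into the log group.
data "aws_iam_policy_document" "eventbridge_to_logs" {
  statement {
    actions   = ["logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["${aws_cloudwatch_log_group.interlock_events.arn}:*"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com", "delivery.logs.amazonaws.com"]
    }
  }
}

resource "aws_cloudwatch_log_resource_policy" "eventbridge_to_logs" {
  policy_name     = "interlock-eventbridge-to-logs"
  policy_document = data.aws_iam_policy_document.eventbridge_to_logs.json
}
```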

Route Job Failures to a Lambda for Custom Handling

resource "aws_cloudwatch_event_rule" "job_failures" {
  name           = "interlock-job-failures"
  event_bus_name = module.interlock.event_bus_name

  event_pattern = jsonencode({
    "detail-type" = ["JOB_FAILED", "JOB_TIMEOUT"]
  })
}

resource "aws_cloudwatch_event_target" "failure_handler" {
  rule           = aws_cloudwatch_event_rule.job_failures.name
  event_bus_name = module.interlock.event_bus_name
  target_id      = "failure-handler"
  arn            = aws_lambda_function.failure_handler.arn
}
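The Lambda target also needs a resource-based permission so the rule is allowed to invoke it; otherwise invocations fail silently. A minimal sketch for the function above:

```hcl
# Permit the job-failures rule to invoke the handler function.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowInterlockJobFailures"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.failure_handler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.job_failures.arn
}
```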

Filter by Pipeline ID

resource "aws_cloudwatch_event_rule" "gold_pipeline_events" {
  name           = "interlock-gold-pipeline"
  event_bus_name = module.interlock.event_bus_name

  event_pattern = jsonencode({
    source      = ["interlock"]
    detail = {
      pipelineId = ["gold-revenue"]
    }
  })
}
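Pattern fields combine with AND semantics, so detail-type and detail filters can be used together. A hypothetical rule matching only SLA breaches for a single pipeline:

```hcl
# Hypothetical: SLA_BREACH events for the gold-revenue pipeline only.
resource "aws_cloudwatch_event_rule" "gold_sla_breach" {
  name           = "interlock-gold-sla-breach"
  event_bus_name = module.interlock.event_bus_name

  event_pattern = jsonencode({
    "detail-type" = ["SLA_BREACH"]
    detail = {
      pipelineId = ["gold-revenue"]
    }
  })
}
```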

Built-In Observability

Event Sink

The Terraform module deploys an event-sink Lambda that captures all EventBridge events to a DynamoDB events table. This provides a queryable audit log without any additional configuration.

The events table uses:

  • PK: PIPELINE#{pipelineId} — all events for a pipeline
  • SK: {tsMillis}#{eventType} — sorted by timestamp
  • GSI1: eventType + timestamp — query all events of a given type

Records expire after a configurable TTL (default: 90 days via events_ttl_days variable).

Alert Dispatcher (Slack)

The module deploys an alert-dispatcher Lambda that reads from an SQS alert queue and posts formatted notifications to Slack using the Bot API (chat.postMessage).

Configuration:

| Variable | Description |
| --- | --- |
| slack_bot_token | Bot token with chat:write scope (sensitive) |
| slack_channel_id | Channel ID to post alerts to |
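Wiring these variables into the module might look like the sketch below. The SSM parameter name and channel ID are hypothetical; the point is to source the token from a secret store rather than hard-coding it:

```hcl
# Hypothetical secret location; any secret store works.
data "aws_ssm_parameter" "slack_bot_token" {
  name            = "/interlock/slack-bot-token"
  with_decryption = true
}

module "interlock" {
  source = "path/to/interlock/deploy/terraform"

  slack_bot_token  = data.aws_ssm_parameter.slack_bot_token.value
  slack_channel_id = "C0123456789"
  # ...
}
```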

Message threading: alerts for the same pipeline, schedule, and date are grouped into a single Slack thread. The first alert creates the thread; subsequent alerts reply in-thread. Thread records are stored in the events table and expire with the same TTL.

SLA warning suppression: when an SLA breach has already occurred, the fire-alert handler suppresses the corresponding SLA_WARNING to prevent duplicate notifications.

Consumer Patterns

Since EventBridge supports many target types, you can build any alert delivery pattern:

| Pattern | EventBridge Target | Use Case |
| --- | --- | --- |
| PagerDuty / Slack | SNS topic with subscription | On-call alerting for SLA breaches |
| Audit log | CloudWatch Logs log group | Compliance and debugging |
| Custom processing | Lambda function | Enrich, deduplicate, or aggregate events |
| Queue for batch processing | SQS queue | Downstream systems that process events in batches |
| Cross-account delivery | EventBridge in another account | Centralized observability |
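For the cross-account pattern, the target ARN is the remote account's event bus and the rule needs an IAM role EventBridge can assume to call events:PutEvents there. The account ID, bus name, and role below are assumptions:

```hcl
# Forward all Interlock events to a central observability account.
# Reuses the "all_events" rule defined earlier.
resource "aws_cloudwatch_event_target" "to_central_account" {
  rule           = aws_cloudwatch_event_rule.all_events.name
  event_bus_name = module.interlock.event_bus_name
  target_id      = "to-central-account"
  arn            = "arn:aws:events:us-east-1:222222222222:event-bus/central-observability"
  role_arn       = aws_iam_role.eventbridge_cross_account.arn
}
```

The receiving bus must also carry a resource policy that allows PutEvents from the sending account.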

CloudWatch Alarms

The Terraform module creates CloudWatch alarms that monitor infrastructure health independently of pipeline events. Alarm state changes are reshaped into INFRA_ALARM events via EventBridge input transformers and routed to both event-sink and alert-dispatcher — no additional Go code required.

Alarm Categories

| Category | Count | Metric | Threshold |
| --- | --- | --- | --- |
| Lambda errors | 6 (one per function) | Errors | >= 1 per 5-minute period |
| Step Functions failures | 1 | ExecutionsFailed | >= 1 per 5-minute period |
| DLQ depth | 3 (control, joblog, alert) | ApproximateNumberOfMessagesVisible | >= 1 |
| Stream iterator age | 2 (control, joblog) | IteratorAge | >= 300,000 ms (5 minutes) |

How It Works

  1. CloudWatch detects a metric threshold breach and transitions the alarm to ALARM state
  2. The alarm state change publishes to the default EventBridge bus
  3. An EventBridge rule with an input transformer reshapes the alarm into an INFRA_ALARM event with the standard Interlock event structure
  4. The transformed event routes to both event-sink (→ events table) and the SQS alert queue (→ alert-dispatcher → Slack)
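Steps 2 and 3 correspond to a rule on the default bus whose target carries an input transformer. The module manages this internally; the sketch below only illustrates the mechanism, and the queue reference and field mapping are assumptions:

```hcl
# Match CloudWatch alarm state changes on the *default* bus
# (no event_bus_name attribute), only when entering ALARM.
resource "aws_cloudwatch_event_rule" "alarm_state_change" {
  name = "interlock-alarm-state-change"

  event_pattern = jsonencode({
    source        = ["aws.cloudwatch"]
    "detail-type" = ["CloudWatch Alarm State Change"]
    detail        = { state = { value = ["ALARM"] } }
  })
}

resource "aws_cloudwatch_event_target" "alarm_to_queue" {
  rule      = aws_cloudwatch_event_rule.alarm_state_change.name
  target_id = "alarm-to-alert-queue"
  arn       = aws_sqs_queue.alerts.arn # hypothetical alert-queue reference

  # Reshape the alarm payload into an INFRA_ALARM-style event.
  input_transformer {
    input_paths = {
      alarm  = "$.detail.alarmName"
      reason = "$.detail.state.reason"
    }
    input_template = <<EOF
{"detail-type": "INFRA_ALARM", "detail": {"alarm": "<alarm>", "message": "<reason>"}}
EOF
  }
}
```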

SNS Integration

Optionally route alarm notifications to an SNS topic for external consumers (PagerDuty, email, etc.):

module "interlock" {
  source = "path/to/interlock/deploy/terraform"

  sns_alarm_topic_arn = aws_sns_topic.ops_alerts.arn
  # ...
}

When sns_alarm_topic_arn is set, all alarms add the topic as an alarm action alongside the EventBridge route.