Alerting
Interlock publishes all lifecycle and alert events to a custom EventBridge event bus. The framework includes two built-in consumers: an event-sink Lambda that logs all events to a DynamoDB events table, and an alert-dispatcher Lambda that delivers Slack notifications with message threading. You can also create custom rules to route events to SNS, SQS, Lambda, CloudWatch Logs, or any other EventBridge target.
EventBridge Event Bus
The Terraform module creates a custom event bus named {environment}-interlock-events. All four Lambda functions publish events to this bus using events:PutEvents.
The bus name and ARN are available as Terraform outputs:
output "event_bus_name" {
  value = module.interlock.event_bus_name
}

output "event_bus_arn" {
  value = module.interlock.event_bus_arn
}

Detail Types
Each event is published with a detail-type field that classifies the event. Use these values in EventBridge rule patterns for filtering and routing.
SLA Events
Published by the sla-monitor Lambda via EventBridge Scheduler callbacks and during SLA cleanup in the Step Functions state machine.
| Detail Type | Meaning | When |
|---|---|---|
| SLA_WARNING | SLA warning threshold reached | Pipeline has not completed by the warning timestamp |
| SLA_BREACH | SLA deadline exceeded | Pipeline has not completed by the breach timestamp |
| SLA_MET | Job completed before SLA warning deadline | Pipeline completed before any SLA alert fired |
Lifecycle Events
Published by the orchestrator and stream-router Lambdas during the pipeline lifecycle.
| Detail Type | Meaning | When |
|---|---|---|
| VALIDATION_PASSED | All validation rules passed | Readiness evaluation succeeds, before trigger |
| VALIDATION_EXHAUSTED | Evaluation window closed without passing | Max evaluation time exceeded |
| JOB_TRIGGERED | Pipeline job was triggered | Trigger fired successfully |
| JOB_COMPLETED | Triggered job completed successfully | Job polling detects success |
| JOB_FAILED | Triggered job failed | Job polling detects failure |
| JOB_TIMEOUT | Triggered job timed out | Job polling detects timeout |
| RETRY_EXHAUSTED | All retry attempts consumed | Job failed maxRetries (or maxCodeRetries for permanent failures) times without success |
| JOB_POLL_EXHAUSTED | Job polling window exceeded | Orchestrator stopped checking job status after jobPollWindowSeconds elapsed without a terminal result |
| INFRA_FAILURE | Unrecoverable infrastructure error | Step Functions execution reaches Fail state |
| SFN_TIMEOUT | Step Functions execution timed out | Global TimeoutSeconds exceeded (configurable via sfn_timeout_seconds Terraform variable) |
| DATA_DRIFT | Post-run drift detected | Post-run evaluation detected data quality drift against baseline |
| POST_RUN_DRIFT | Post-run sensor changed after completion | Sensor value drifted from baseline after job completed |
| POST_RUN_DRIFT_INFLIGHT | Post-run sensor changed while job running | Informational drift detected during an active execution |
| POST_RUN_SENSOR_MISSING | No post-run sensor data received | Watchdog detected no post-run sensor within sensorTimeout |
| BASELINE_CAPTURE_FAILED | Baseline capture error | Error occurred while capturing the post-run baseline at trigger completion |
| SENSOR_DEADLINE_EXPIRED | Sensor trigger window closed without pipeline starting | Sensor trigger window closed without pipeline starting; manual restart required via RERUN_REQUEST |
| PIPELINE_EXCLUDED | Pipeline excluded by calendar | Sensor, rerun, job-failure, or post-run drift skipped due to calendar exclusion |
Rerun Events
Published by the stream-router when processing rerun requests and late data arrivals.
| Detail Type | Meaning | When |
|---|---|---|
| LATE_DATA_ARRIVAL | Sensor updated after job completed | Sensor updatedAt is newer than joblog completedAt |
| RERUN_REJECTED | Rerun request rejected by circuit breaker | No new sensor data since last job completion |
| RERUN_ACCEPTED | Rerun request accepted | Rerun passed circuit breaker validation and trigger lock was reset |
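The late-data and circuit-breaker conditions in the table come down to a timestamp comparison. The following is a minimal sketch of that comparison, assuming ISO 8601 timestamps; the function names `is_late_data` and `classify_rerun` are hypothetical and not part of Interlock, and the real circuit breaker may check more than this.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_late_data(sensor_updated_at: str, job_completed_at: str) -> bool:
    """True when the sensor was updated after the job completed."""
    return parse_ts(sensor_updated_at) > parse_ts(job_completed_at)


def classify_rerun(sensor_updated_at: str, job_completed_at: str) -> str:
    """Return the detail type a rerun request would produce in this sketch.

    Assumption: "new sensor data" means updatedAt is strictly newer than
    the last job completion.
    """
    if is_late_data(sensor_updated_at, job_completed_at):
        return "RERUN_ACCEPTED"
    return "RERUN_REJECTED"
```

A rerun request made when the sensor has not changed since the last completion is rejected in this sketch, which mirrors the "no new sensor data" condition above.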
Dry-Run Events
Published by the stream-router Lambda for pipelines with dryRun: true. These events record what Interlock would do without executing any jobs or starting Step Functions.
| Detail Type | Meaning | When |
|---|---|---|
| DRY_RUN_WOULD_TRIGGER | All validation rules passed | Interlock would have triggered the pipeline job at this time |
| DRY_RUN_LATE_DATA | Sensor updated after trigger point | Sensor data arrived after the dry-run trigger was recorded; would have triggered a re-run |
| DRY_RUN_SLA_PROJECTION | Estimated completion vs. deadline | Projects whether the SLA would be met or breached based on expectedDuration and deadline |
| DRY_RUN_DRIFT | Post-run sensor data changed | Sensor value drifted from baseline captured at trigger time; would have triggered a drift re-run |
| DRY_RUN_COMPLETED | Observation loop closed | Terminal event for the evaluation period; carries the SLA verdict (met, breach, or n/a) |
The DRY_RUN_SLA_PROJECTION detail includes status ("met" or "breach"), estimatedCompletion, deadline, and marginSeconds fields.
The DRY_RUN_COMPLETED detail includes triggeredAt, slaStatus ("met", "breach", or "n/a"), and optionally estimatedCompletion and deadline when SLA is configured.
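To make the DRY_RUN_SLA_PROJECTION fields concrete, here is a rough sketch of how such a projection could be computed. It is not Interlock's implementation. It assumes expectedDuration is given in seconds, that marginSeconds is the deadline minus the estimated completion (so a negative margin means a breach), and that finishing exactly on the deadline counts as met. The function name `project_sla` is hypothetical.

```python
from datetime import datetime, timedelta


def project_sla(triggered_at: str, expected_duration_s: int, deadline: str) -> dict:
    """Estimate completion time and compare it with the SLA deadline."""
    start = datetime.fromisoformat(triggered_at.replace("Z", "+00:00"))
    due = datetime.fromisoformat(deadline.replace("Z", "+00:00"))
    estimated = start + timedelta(seconds=expected_duration_s)
    # Assumed sign convention: positive margin means time to spare.
    margin = int((due - estimated).total_seconds())
    return {
        "status": "met" if margin >= 0 else "breach",
        "estimatedCompletion": estimated.isoformat(),
        "deadline": due.isoformat(),
        "marginSeconds": margin,
    }
```

For a job triggered at 09:00 UTC with a 30-minute expected duration and a 10:00 UTC deadline, this sketch reports a status of "met" with a margin of 1800 seconds.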
Watchdog Events
Published by the watchdog Lambda, invoked on an EventBridge schedule (default: every 5 minutes).
| Detail Type | Meaning | When |
|---|---|---|
| SCHEDULE_MISSED | No evaluation started by the schedule deadline | Watchdog detects absence of expected pipeline activity |
| TRIGGER_RECOVERED | Sensor trigger condition met but no trigger existed | Watchdog re-evaluated sensor data and self-healed a missed trigger |
Event Payload Structure
Events follow the standard EventBridge envelope format:
{
  "version": "0",
  "id": "abc123",
  "source": "interlock",
  "detail-type": "SLA_WARNING",
  "time": "2026-03-01T10:00:00Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "pipelineId": "gold-revenue",
    "scheduleId": "daily",
    "date": "2026-03-01",
    "message": "Pipeline gold-revenue has not completed by warning deadline 10:00 UTC"
  }
}

The detail object contains pipeline-specific context. The exact fields vary by detail type but always include pipelineId.
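A custom Lambda consumer can read this envelope directly. The sketch below shows one way to do so in Python; the function names and the `isSla` flag are illustrative assumptions. Only pipelineId is treated as required, since other detail fields may be absent for some detail types.

```python
def summarize_event(event: dict) -> dict:
    """Pull the routing fields out of an EventBridge envelope."""
    detail = event.get("detail", {})
    return {
        "detailType": event["detail-type"],
        "pipelineId": detail["pipelineId"],  # documented as always present
        "scheduleId": detail.get("scheduleId"),  # may be absent for some detail types
        "date": detail.get("date"),
        "message": detail.get("message", ""),
    }


def handler(event: dict, context: object = None) -> dict:
    """Lambda-style entry point: classify the event and return a summary."""
    summary = summarize_event(event)
    # Hypothetical convenience flag for downstream routing.
    summary["isSla"] = summary["detailType"].startswith("SLA_")
    print(f"[{summary['detailType']}] {summary['pipelineId']}: {summary['message']}")
    return summary
```

Passing the sample payload above to `handler` yields a summary with `detailType` set to SLA_WARNING and `pipelineId` set to gold-revenue.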
Creating EventBridge Rules
Subscribe to events by creating EventBridge rules that match on detail-type. Route matched events to any supported EventBridge target.
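Before deploying a rule, it can help to check a pattern against a sample event locally. The following is a deliberately simplified sketch of EventBridge matching that covers only exact-value lists and nested objects; operators such as prefix, numeric, anything-but, and exists are not handled. The function name `matches` is hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Return True if the event satisfies an exact-value event pattern.

    Each pattern leaf is a list of allowed values; nested dicts in the
    pattern are matched against nested dicts in the event.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif isinstance(actual, list):
            # An array in the event matches if any element is allowed.
            if not any(item in expected for item in actual):
                return False
        elif actual not in expected:
            return False
    return True
```

For an authoritative result, the EventBridge TestEventPattern API evaluates a real pattern against a sample event.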
Route SLA Alerts to SNS
resource "aws_cloudwatch_event_rule" "sla_alerts" {
  name           = "interlock-sla-alerts"
  event_bus_name = module.interlock.event_bus_name

  event_pattern = jsonencode({
    "detail-type" = ["SLA_WARNING", "SLA_BREACH"]
  })
}

# The SNS topic policy must allow events.amazonaws.com to call sns:Publish.
resource "aws_cloudwatch_event_target" "sla_to_sns" {
  rule           = aws_cloudwatch_event_rule.sla_alerts.name
  event_bus_name = module.interlock.event_bus_name
  target_id      = "sla-to-sns"
  arn            = aws_sns_topic.alerts.arn
}

Route All Events to CloudWatch Logs
resource "aws_cloudwatch_event_rule" "all_events" {
  name           = "interlock-all-events"
  event_bus_name = module.interlock.event_bus_name

  event_pattern = jsonencode({
    source = ["interlock"]
  })
}

# The log group needs a CloudWatch Logs resource policy that lets
# events.amazonaws.com call logs:CreateLogStream and logs:PutLogEvents.
resource "aws_cloudwatch_event_target" "to_cloudwatch" {
  rule           = aws_cloudwatch_event_rule.all_events.name
  event_bus_name = module.interlock.event_bus_name
  target_id      = "to-cloudwatch"
  arn            = aws_cloudwatch_log_group.interlock_events.arn
}

Route Job Failures to a Lambda for Custom Handling
resource "aws_cloudwatch_event_rule" "job_failures" {
  name           = "interlock-job-failures"
  event_bus_name = module.interlock.event_bus_name

  event_pattern = jsonencode({
    "detail-type" = ["JOB_FAILED", "JOB_TIMEOUT"]
  })
}

# EventBridge also needs permission to invoke the function, typically an
# aws_lambda_permission with principal events.amazonaws.com.
resource "aws_cloudwatch_event_target" "failure_handler" {
  rule           = aws_cloudwatch_event_rule.job_failures.name
  event_bus_name = module.interlock.event_bus_name
  target_id      = "failure-handler"
  arn            = aws_lambda_function.failure_handler.arn
}

Filter by Pipeline ID
resource "aws_cloudwatch_event_rule" "gold_pipeline_events" {
  name           = "interlock-gold-pipeline"
  event_bus_name = module.interlock.event_bus_name

  event_pattern = jsonencode({
    source = ["interlock"]
    detail = {
      pipelineId = ["gold-revenue"]
    }
  })
}

Built-In Observability
Event Sink
The Terraform module deploys an event-sink Lambda that captures all EventBridge events to a DynamoDB events table. This provides a queryable audit log without any additional configuration.
The events table uses:
- PK: PIPELINE#{pipelineId} (all events for a pipeline)
- SK: {tsMillis}#{eventType} (sorted by timestamp)
- GSI1: eventType → timestamp (query all events of a given type)
Records expire after a configurable TTL (default: 90 days via events_ttl_days variable).
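The key layout above supports time-range queries per pipeline. The sketch below builds the keys and the parameters for a DynamoDB Query call in the low-level client format. It assumes the attributes are literally named PK and SK and that tsMillis is a decimal epoch-millisecond value with a fixed digit count, so that string ordering matches time ordering. The helper names are hypothetical, and the table name is left to the caller.

```python
def pipeline_pk(pipeline_id: str) -> str:
    """Partition key for all events of one pipeline."""
    return f"PIPELINE#{pipeline_id}"


def event_sk(ts_millis: int, event_type: str) -> str:
    """Sort key: timestamp first so events sort chronologically."""
    return f"{ts_millis}#{event_type}"


def time_range_query(pipeline_id: str, start_ms: int, end_ms: int) -> dict:
    """Build Query parameters for events between two timestamps (inclusive)."""
    return {
        "KeyConditionExpression": "PK = :pk AND SK BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":pk": {"S": pipeline_pk(pipeline_id)},
            ":lo": {"S": f"{start_ms}#"},
            # "\uffff" sorts after any event type name, so every event
            # stamped exactly end_ms is still included.
            ":hi": {"S": f"{end_ms}#\uffff"},
        },
    }
```

These parameters can be passed to a boto3 DynamoDB client as `client.query(TableName=..., **params)`, where the table name depends on your deployment.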
Alert Dispatcher (Slack)
The module deploys an alert-dispatcher Lambda that reads from an SQS alert queue and posts formatted notifications to Slack using the Bot API (chat.postMessage).
Configuration:
| Variable | Description |
|---|---|
| slack_bot_token | Bot token with chat:write scope (sensitive) |
| slack_channel_id | Channel ID to post alerts to |
Message threading: alerts for the same pipeline, schedule, and date are grouped into a single Slack thread. The first alert creates the thread; subsequent alerts reply in-thread. Thread records are stored in the events table and expire with the same TTL.
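The threading behavior can be sketched as a lookup keyed on pipeline, schedule, and date. This is an illustration rather than the alert-dispatcher's actual code: an in-memory dict stands in for the thread records in the events table, and `post` is a hypothetical callable that wraps chat.postMessage, accepts an optional `thread_ts`, and returns the timestamp of the posted message.

```python
from typing import Callable, Dict, Optional, Tuple

ThreadKey = Tuple[str, str, str]


class ThreadedNotifier:
    """Group alerts for the same pipeline, schedule, and date into one thread."""

    def __init__(self, post: Callable[[str, Optional[str]], str]) -> None:
        self._post = post
        self._threads: Dict[ThreadKey, str] = {}  # stand-in for stored thread records

    def notify(self, detail: dict, text: str) -> Optional[str]:
        """Post an alert and return the thread_ts it replied to, if any."""
        key = (
            detail["pipelineId"],
            detail.get("scheduleId", ""),
            detail.get("date", ""),
        )
        thread_ts = self._threads.get(key)
        ts = self._post(text, thread_ts)
        if thread_ts is None:
            # First alert for this key: its message becomes the thread root.
            self._threads[key] = ts
        return thread_ts
```

The first call for a given key posts a top-level message; later calls with the same key pass that message's timestamp as `thread_ts`, so they appear as replies.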
SLA warning suppression: when an SLA breach has already occurred, the fire-alert handler suppresses the corresponding SLA_WARNING to prevent duplicate notifications.
Consumer Patterns
Because EventBridge supports many target types, you can build a range of alert delivery patterns:
| Pattern | EventBridge Target | Use Case |
|---|---|---|
| PagerDuty / Slack | SNS topic with subscription | On-call alerting for SLA breaches |
| Audit log | CloudWatch Logs log group | Compliance and debugging |
| Custom processing | Lambda function | Enrich, deduplicate, or aggregate events |
| Queue for batch processing | SQS queue | Downstream systems that process events in batches |
| Cross-account delivery | EventBridge in another account | Centralized observability |
CloudWatch Alarms
The Terraform module creates CloudWatch alarms that monitor infrastructure health independently of pipeline events. Alarm state changes are reshaped into INFRA_ALARM events via EventBridge input transformers and routed to both event-sink and alert-dispatcher — no additional Go code required.
Alarm Categories
| Category | Count | Metric | Threshold |
|---|---|---|---|
| Lambda errors | 6 (one per function) | Errors | >= 1 per 5-minute period |
| Step Functions failures | 1 | ExecutionsFailed | >= 1 per 5-minute period |
| DLQ depth | 3 (control, joblog, alert) | ApproximateNumberOfMessagesVisible | >= 1 |
| Stream iterator age | 2 (control, joblog) | IteratorAge | >= 300,000ms (5 minutes) |
How It Works
- CloudWatch detects a metric threshold breach and transitions the alarm to ALARM state
- The alarm state change publishes to the default EventBridge bus
- An EventBridge rule with an input transformer reshapes the alarm into an INFRA_ALARM event with the standard Interlock event structure
- The transformed event routes to both event-sink (→ events table) and the SQS alert queue (→ alert-dispatcher → Slack)
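The reshaping step above is done by an EventBridge input transformer, not by code, but the mapping is easier to see written out as a function. The sketch below takes a CloudWatch Alarm State Change event and produces an INFRA_ALARM-shaped event. The input field names (`alarmName`, `state.value`, `state.reason`) follow the CloudWatch alarm event format; the output field names are assumptions, since the exact INFRA_ALARM detail schema is not specified here.

```python
from typing import Optional


def to_infra_alarm(alarm_event: dict) -> Optional[dict]:
    """Reshape a CloudWatch alarm state change into an INFRA_ALARM event.

    Returns None for transitions to states other than ALARM.
    """
    detail = alarm_event["detail"]
    state = detail["state"]
    if state["value"] != "ALARM":
        return None
    return {
        "source": "interlock",
        "detail-type": "INFRA_ALARM",
        "time": alarm_event.get("time"),
        "detail": {
            # Output field names below are illustrative assumptions.
            "alarmName": detail["alarmName"],
            "state": state["value"],
            "message": state.get("reason", ""),
        },
    }
```

A consumer that handles INFRA_ALARM alongside pipeline events should check which detail fields are actually present in its own deployment before relying on them.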
SNS Integration
Optionally route alarm notifications to an SNS topic for external consumers (PagerDuty, email, etc.):
module "interlock" {
  source              = "path/to/interlock/deploy/terraform"
  sns_alarm_topic_arn = aws_sns_topic.ops_alerts.arn
  # ...
}

When sns_alarm_topic_arn is set, all alarms add the topic as an alarm action alongside the EventBridge route.