Step Functions State Machine
Step Functions State Machine
This guide explains the ASL patterns used in Interlock’s Step Functions state machine. The state machine uses 18 sequential states to orchestrate pipeline evaluation, job triggering, and SLA monitoring via EventBridge Scheduler.
Architecture Overview
The state machine is defined in deploy/statemachine.asl.json. The Terraform module provisions it automatically with Lambda ARN substitution.
InitEvalLoop → Evaluate → IsReady
→ (passed) Trigger → CheckSLAConfig → ScheduleSLAAlerts → HasTriggerResult
→ WaitForJob → CheckJob → IsJobDone
→ (terminal) CheckCancelSLA → CancelSLASchedules → Done
→ (running) WaitForJob (loop)
→ (not ready) WaitInterval → IncrementElapsed → CheckWindowExhausted
→ (window remaining) Evaluate (loop)
→ (window exhausted) ValidationExhausted → CheckSLAConfig → ... → DoneSLA monitoring uses EventBridge Scheduler instead of a parallel branch. The sla-monitor Lambda creates one-time Scheduler entries that fire alerts at exact timestamps. When the job completes, unfired entries are cancelled.
Evaluation Loop
The evaluation loop calls the orchestrator Lambda with mode=evaluate to check validation rules against sensor data.
Evaluate and Poll
{
"Evaluate": {
"Type": "Task",
"Resource": "${orchestrator_arn}",
"Parameters": {
"mode": "evaluate",
"pipelineId.$": "$.pipelineId",
"scheduleId.$": "$.scheduleId",
"date.$": "$.date"
},
"ResultPath": "$.evaluateResult",
"Next": "IsReady"
},
"IsReady": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.evaluateResult.status",
"StringEquals": "passed",
"Next": "Trigger"
}
],
"Default": "WaitInterval"
},
"WaitInterval": {
"Type": "Wait",
"SecondsPath": "$.config.evaluationIntervalSeconds",
"Next": "IncrementElapsed"
}
}How it works
- Evaluate calls the orchestrator with
mode: evaluate. The orchestrator reads sensor data from DynamoDB and evaluates all validation rules defined in the pipeline config. - IsReady branches on the result: if
passed, proceed to trigger; otherwise, wait and re-evaluate. - WaitInterval uses
SecondsPathfor configurable poll intervals (set viaevaluation.intervalin the pipeline config). - IncrementElapsed tracks total elapsed time using
States.MathAdd. - CheckWindowExhausted compares elapsed time against
evaluationWindowSeconds. If the window is exhausted, the orchestrator publishes aVALIDATION_EXHAUSTEDevent.
Trigger and Job Polling
{
"Trigger": {
"Type": "Task",
"Resource": "${orchestrator_arn}",
"Parameters": {
"mode": "trigger",
"pipelineId.$": "$.pipelineId",
"scheduleId.$": "$.scheduleId",
"date.$": "$.date"
},
"ResultPath": "$.triggerResult",
"Next": "CheckSLAConfig",
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.TooManyRequestsException",
"States.TaskFailed"
],
"IntervalSeconds": 30,
"MaxAttempts": 4,
"BackoffRate": 2
}
],
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"ResultPath": "$.errorInfo",
"Next": "CheckCancelSLA"
}
]
},
"CheckJob": {
"Type": "Task",
"Resource": "${orchestrator_arn}",
"Parameters": {
"mode": "check-job",
"pipelineId.$": "$.pipelineId",
"scheduleId.$": "$.scheduleId",
"date.$": "$.date",
"runId.$": "$.triggerResult.runId",
"metadata.$": "$.triggerResult.metadata"
},
"ResultPath": "$.checkJobResult",
"Next": "IsJobDone"
},
"IsJobDone": {
"Type": "Choice",
"Choices": [
{
"Not": {
"Variable": "$.checkJobResult.event",
"IsPresent": true
},
"Next": "WaitForJob"
},
{
"Variable": "$.checkJobResult.event",
"StringEquals": "success",
"Next": "CheckCancelSLA"
},
{
"Variable": "$.checkJobResult.event",
"StringEquals": "fail",
"Next": "CheckCancelSLA"
},
{
"Variable": "$.checkJobResult.event",
"StringEquals": "timeout",
"Next": "CheckCancelSLA"
}
],
"Default": "WaitForJob"
}
}The Trigger state has a dedicated retry policy for infrastructure failures. When the orchestrator’s trigger execution fails (e.g., Glue ConcurrentRunsExceededException), it returns a Lambda error. Step Functions retries 4 times with exponential backoff (30s, 60s, 120s, 240s). If all retries exhaust, the Catch routes to CheckCancelSLA for graceful SLA cleanup instead of crashing.
Once triggered, the state machine polls job status via check-job mode. The IsJobDone Choice state first checks whether the event field is present (missing means still running), then routes terminal events (success, fail, timeout) to SLA cleanup.
SLA Scheduling
{
"CheckSLAConfig": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.config.sla",
"IsPresent": true,
"Next": "ScheduleSLAAlerts"
}
],
"Default": "HasTriggerResult"
},
"ScheduleSLAAlerts": {
"Type": "Task",
"Resource": "${sla_monitor_arn}",
"Parameters": {
"mode": "schedule",
"pipelineId.$": "$.pipelineId",
"scheduleId.$": "$.scheduleId",
"date.$": "$.date",
"deadline.$": "$.config.sla.deadline",
"expectedDuration.$": "$.config.sla.expectedDuration"
},
"ResultPath": "$.slaSchedule",
"Next": "HasTriggerResult"
},
"CancelSLASchedules": {
"Type": "Task",
"Resource": "${sla_monitor_arn}",
"Parameters": {
"mode": "cancel",
"pipelineId.$": "$.pipelineId",
"scheduleId.$": "$.scheduleId",
"date.$": "$.date",
"warningAt.$": "$.slaSchedule.warningAt",
"breachAt.$": "$.slaSchedule.breachAt"
},
"ResultPath": "$.slaResult",
"Next": "Done"
}
}How it works
- CheckSLAConfig skips SLA scheduling if no
slaconfig is present in the pipeline config. - ScheduleSLAAlerts creates two one-time EventBridge Scheduler entries — one for the warning timestamp and one for the breach timestamp. Each entry invokes the sla-monitor Lambda with
mode=fire-alertat the exact time. Entries auto-delete after execution. - After the job reaches a terminal state, CheckCancelSLA checks if SLA was configured (by checking for
$.slaSchedule). - CancelSLASchedules deletes any unfired Scheduler entries and publishes
SLA_METif the job completed before the warning deadline.
Error Handling
All Task states include Retry and Catch blocks:
{
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.TooManyRequestsException",
"States.TaskFailed"
],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2
}
],
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"ResultPath": "$.errorInfo",
"Next": "InfraFailure"
}
]
}The default retry policy handles transient Lambda errors with exponential backoff (2s, 4s, 8s). The Trigger state overrides this with a longer retry (30s, 60s, 120s, 240s) because infrastructure failures like Glue concurrency limits benefit from more time between attempts.
Catch routing varies by state:
- Most states: Catch routes to
InfraFailure(Fail state — terminates execution) - Trigger: Catch routes to
CheckCancelSLA(graceful termination with SLA cleanup) - ScheduleSLAAlerts: Catch routes to
HasTriggerResult(SLA failure is non-fatal) - CancelSLASchedules: Catch routes to
Done(cleanup failure is non-fatal)
Customization
The ASL is a template file (deploy/statemachine.asl.json) processed by Terraform’s templatefile function. The two substitution variables are:
| Variable | Terraform Resource |
|---|---|
${orchestrator_arn} | aws_lambda_function.orchestrator.arn |
${sla_monitor_arn} | aws_lambda_function.sla_monitor.arn |
The evaluation interval, window duration, and job check interval are driven by pipeline config values passed in the Step Functions input ($.config), not hardcoded in the ASL.