Triggers
When a pipeline’s validation rules pass, the orchestrator Lambda executes the configured job to start the downstream workload. The job block in the pipeline YAML defines the job type and its configuration.
Job Configuration
job:
type: glue
config:
jobName: gold-revenue-etl
maxRetries: 2Common Fields
| Field | Type | Default | Description |
|---|---|---|---|
type | string | – | Job type (required, see types below) |
config | map | – | Type-specific configuration |
maxRetries | int | 0 | Maximum retry attempts on failure |
Job Types
http
Sends an HTTP request to an arbitrary endpoint.
job:
type: http
config:
method: POST
url: https://api.example.com/jobs/start
headers:
Authorization: "Bearer ${TOKEN}"
Content-Type: application/json
body: '{"pipeline": "${PIPELINE}", "date": "${DATE}"}'| Field | Type | Description |
|---|---|---|
method | string | HTTP method (GET, POST, PUT) |
url | string | Target URL |
headers | map | HTTP headers |
body | string | Request body (supports template variables) |
command
Executes a shell command. Intended for use in development and testing scenarios.
job:
type: command
config:
command: /usr/local/bin/run-etl --pipeline my-pipeline
timeout: 300| Field | Type | Description |
|---|---|---|
command | string | Shell command to execute |
timeout | int | Execution timeout in seconds |
airflow
Triggers an Apache Airflow DAG run and polls for completion.
job:
type: airflow
config:
url: https://airflow.example.com
dagID: my_dag
headers:
Authorization: "Bearer ${AIRFLOW_TOKEN}"
pollInterval: 1m| Field | Type | Description |
|---|---|---|
url | string | Airflow webserver URL |
dagID | string | DAG identifier |
headers | map | Authentication headers |
pollInterval | string | Status poll interval (e.g. 1m) |
glue
Starts an AWS Glue job run.
job:
type: glue
config:
jobName: my-glue-etl
arguments:
"--pipeline": my-pipeline
"--date": "${DATE}"| Field | Type | Description |
|---|---|---|
jobName | string | Glue job name |
arguments | map | Job arguments (keys prefixed with --) |
Returned metadata: jobRunId – used by the orchestrator to poll GetJobRun.
emr
Submits a step to an existing EMR cluster.
job:
type: emr
config:
clusterID: j-1234567890ABC
arguments:
"--class": com.example.ETLJob
"--jar": s3://bucket/etl.jar| Field | Type | Description |
|---|---|---|
clusterID | string | EMR cluster ID |
arguments | map | Step arguments |
Returned metadata: clusterID, stepId – used by the orchestrator to poll DescribeStep.
emr-serverless
Starts a job run on EMR Serverless.
job:
type: emr-serverless
config:
applicationID: "00f00abcde12345"
arguments:
"--class": com.example.ETLJob
"--s3-input": s3://bucket/input/| Field | Type | Description |
|---|---|---|
applicationID | string | EMR Serverless application ID |
arguments | map | Job run arguments |
Returned metadata: applicationId, jobRunId – used by the orchestrator to poll GetJobRun.
step-function
Starts an AWS Step Functions execution.
job:
type: step-function
config:
stateMachineARN: arn:aws:states:us-east-1:123456789:stateMachine:my-workflow
input: '{"key": "value"}'| Field | Type | Description |
|---|---|---|
stateMachineARN | string | Step Function state machine ARN |
input | string | JSON input to the execution |
Returned metadata: executionArn – used by the orchestrator to poll DescribeExecution.
databricks
Submits a job run to Databricks via REST API 2.1.
job:
type: databricks
config:
workspaceURL: https://my-workspace.cloud.databricks.com
headers:
Authorization: "Bearer ${DATABRICKS_TOKEN}"
body: '{"job_id": 12345, "notebook_params": {"date": "${DATE}"}}'| Field | Type | Description |
|---|---|---|
workspaceURL | string | Databricks workspace URL |
headers | map | Authentication headers |
body | string | JSON request body (Jobs API 2.1 format) |
Returned metadata: runId – used by the orchestrator to poll via GET /api/2.1/jobs/runs/get.
Template Variables
The body, input, and arguments fields support template substitution. Variables are resolved at trigger time:
| Variable | Value |
|---|---|
${DATE} | Current date (YYYY-MM-DD) |
${PIPELINE} | Pipeline ID |
${SCHEDULE} | Schedule ID |
${RUN_ID} | Run ID (UUID v4) |
Status Polling
After a job executes, the orchestrator Lambda polls for completion using the returned metadata. Each job type implements CheckStatus():
| Type | SDK Call | Status Mapping |
|---|---|---|
glue | GetJobRun + CloudWatch RCA verification | RUNNING -> running, SUCCEEDED -> succeeded (verified), FAILED/STOPPED -> failed |
emr | DescribeStep | RUNNING/PENDING -> running, COMPLETED -> succeeded, FAILED/CANCELLED -> failed |
emr-serverless | GetJobRun | RUNNING/SUBMITTED -> running, SUCCESS -> succeeded, FAILED/CANCELLED -> failed |
step-function | DescribeExecution | RUNNING -> running, SUCCEEDED -> succeeded, FAILED/TIMED_OUT/ABORTED -> failed |
databricks | GET /runs/get | RUNNING/PENDING -> running, TERMINATED(SUCCESS) -> succeeded, others -> failed |
airflow | GET /dags/{id}/dagRuns | running -> running, success -> succeeded, failed -> failed |
The http and command types are fire-and-forget – they do not support status polling.
Glue RCA Verification
AWS Glue can report SUCCEEDED via the GetJobRun API even when the Spark job failed internally (the driver process catches the exception and exits cleanly). When a Glue job reports SUCCEEDED, the orchestrator cross-checks the CloudWatch RCA (root cause analysis) log stream for GlueExceptionAnalysisJobFailed events. If the RCA log confirms a failure, the job is marked as failed with the actual failure reason.
This check requires logs:FilterLogEvents permission on the Glue log group (granted automatically when enable_glue_trigger = true). If the permission is missing or the RCA log stream does not exist, the check degrades gracefully and trusts the Glue API response.
IAM Permissions
The Terraform module provides opt-in variables to grant the orchestrator Lambda permission to invoke each AWS job type:
| Variable | Default | Grants |
|---|---|---|
enable_glue_trigger | false | glue:StartJobRun, glue:GetJobRun, logs:FilterLogEvents (scoped to /aws-glue/jobs/logs-v2) |
enable_emr_trigger | false | elasticmapreduce:AddJobFlowSteps, elasticmapreduce:DescribeStep |
enable_emr_serverless_trigger | false | emr-serverless:StartJobRun, emr-serverless:GetJobRun |
enable_sfn_trigger | false | states:StartExecution, states:DescribeExecution |
Enable only the trigger types your pipelines use to maintain least-privilege IAM policies.