Triggers

When a pipeline’s validation rules pass, the orchestrator Lambda executes the configured job to start the downstream workload. The job block in the pipeline YAML defines the job type and its configuration.

Job Configuration

job:
  type: glue
  config:
    jobName: gold-revenue-etl
  maxRetries: 2

Common Fields

| Field | Type | Default | Description |
|---|---|---|---|
| type | string | – | Job type (required; see types below) |
| config | map | – | Type-specific configuration |
| maxRetries | int | 0 | Maximum retry attempts on failure |

Job Types

http

Sends an HTTP request to an arbitrary endpoint.

job:
  type: http
  config:
    method: POST
    url: https://api.example.com/jobs/start
    headers:
      Authorization: "Bearer ${TOKEN}"
      Content-Type: application/json
    body: '{"pipeline": "${PIPELINE}", "date": "${DATE}"}'

| Field | Type | Description |
|---|---|---|
| method | string | HTTP method (GET, POST, PUT) |
| url | string | Target URL |
| headers | map | HTTP headers |
| body | string | Request body (supports template variables) |
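As a sketch of what the http trigger amounts to, the request can be assembled directly from the config fields. This is illustrative Python, not the orchestrator's actual implementation; the function name and the default method are assumptions.

```python
import urllib.request

def build_http_job(config):
    """Build the HTTP request described by an http job config (sketch)."""
    body = config.get("body", "").encode() or None
    return urllib.request.Request(
        url=config["url"],
        method=config.get("method", "POST"),  # default method is an assumption
        headers=config.get("headers", {}),
        data=body,
    )

# Firing the trigger is then a single call; a non-2xx response raises
# HTTPError, i.e. a failed trigger:
# with urllib.request.urlopen(build_http_job(cfg), timeout=10) as resp:
#     resp.status
```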

command

Executes a shell command. Intended for use in development and testing scenarios.

job:
  type: command
  config:
    command: /usr/local/bin/run-etl --pipeline my-pipeline
    timeout: 300

| Field | Type | Description |
|---|---|---|
| command | string | Shell command to execute |
| timeout | int | Execution timeout in seconds |
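A minimal sketch of how a command job could be executed, with the timeout enforced via the standard library. The function name and the fallback timeout are illustrative, not taken from the orchestrator.

```python
import subprocess

def run_command_job(config):
    """Execute a command job's shell command, enforcing its timeout (sketch)."""
    result = subprocess.run(
        config["command"],
        shell=True,
        capture_output=True,
        timeout=config.get("timeout", 300),  # fallback value is an assumption
    )
    return result.returncode
```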

airflow

Triggers an Apache Airflow DAG run and polls for completion.

job:
  type: airflow
  config:
    url: https://airflow.example.com
    dagID: my_dag
    headers:
      Authorization: "Bearer ${AIRFLOW_TOKEN}"
    pollInterval: 1m

| Field | Type | Description |
|---|---|---|
| url | string | Airflow webserver URL |
| dagID | string | DAG identifier |
| headers | map | Authentication headers |
| pollInterval | string | Status poll interval (e.g. 1m) |
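pollInterval is a duration string. A parser for it might look like the following sketch; the example above only shows 1m, so the accepted units (s/m/h) are an assumption.

```python
import re

def parse_poll_interval(value):
    """Parse a pollInterval like '30s', '1m', or '2h' into seconds (sketch)."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if not match:
        raise ValueError(f"invalid poll interval: {value}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * {"s": 1, "m": 60, "h": 3600}[unit]
```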

glue

Starts an AWS Glue job run.

job:
  type: glue
  config:
    jobName: my-glue-etl
    arguments:
      "--pipeline": my-pipeline
      "--date": "${DATE}"

| Field | Type | Description |
|---|---|---|
| jobName | string | Glue job name |
| arguments | map | Job arguments (keys prefixed with --) |

Returned metadata: jobRunId – used by the orchestrator to poll GetJobRun.
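Translating the glue config into StartJobRun parameters can be sketched as follows; the function name is illustrative, and the SDK call is shown only in a comment so the sketch stays self-contained.

```python
def glue_start_args(config):
    """Translate a glue job config into StartJobRun parameters (sketch)."""
    params = {"JobName": config["jobName"]}
    if config.get("arguments"):
        # Glue argument keys carry their leading dashes, as in the YAML above.
        params["Arguments"] = {k: str(v) for k, v in config["arguments"].items()}
    return params

# With boto3 (not imported here), the call would be roughly:
# run = boto3.client("glue").start_job_run(**glue_start_args(cfg))
# run["JobRunId"]  # the jobRunId metadata used for polling
```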

emr

Submits a step to an existing EMR cluster.

job:
  type: emr
  config:
    clusterID: j-1234567890ABC
    arguments:
      "--class": com.example.ETLJob
      "--jar": s3://bucket/etl.jar

| Field | Type | Description |
|---|---|---|
| clusterID | string | EMR cluster ID |
| arguments | map | Step arguments |

Returned metadata: clusterID, stepId – used by the orchestrator to poll DescribeStep.

emr-serverless

Starts a job run on EMR Serverless.

job:
  type: emr-serverless
  config:
    applicationID: "00f00abcde12345"
    arguments:
      "--class": com.example.ETLJob
      "--s3-input": s3://bucket/input/

| Field | Type | Description |
|---|---|---|
| applicationID | string | EMR Serverless application ID |
| arguments | map | Job run arguments |

Returned metadata: applicationId, jobRunId – used by the orchestrator to poll GetJobRun.

step-function

Starts an AWS Step Functions execution.

job:
  type: step-function
  config:
    stateMachineARN: arn:aws:states:us-east-1:123456789:stateMachine:my-workflow
    input: '{"key": "value"}'

| Field | Type | Description |
|---|---|---|
| stateMachineARN | string | Step Function state machine ARN |
| input | string | JSON input to the execution |

Returned metadata: executionArn – used by the orchestrator to poll DescribeExecution.

databricks

Submits a job run to Databricks via REST API 2.1.

job:
  type: databricks
  config:
    workspaceURL: https://my-workspace.cloud.databricks.com
    headers:
      Authorization: "Bearer ${DATABRICKS_TOKEN}"
    body: '{"job_id": 12345, "notebook_params": {"date": "${DATE}"}}'

| Field | Type | Description |
|---|---|---|
| workspaceURL | string | Databricks workspace URL |
| headers | map | Authentication headers |
| body | string | JSON request body (Jobs API 2.1 format) |

Returned metadata: runId – used by the orchestrator to poll via GET /api/2.1/jobs/runs/get.
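A sketch of how the poll request could be assembled from the config and the returned runId (names are illustrative):

```python
def databricks_poll_request(config, run_id):
    """Build the Jobs API 2.1 runs/get request used to poll a run (sketch)."""
    base = config["workspaceURL"].rstrip("/")
    url = f"{base}/api/2.1/jobs/runs/get?run_id={run_id}"
    return url, config.get("headers", {})
```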

Template Variables

The body, input, and arguments fields support template substitution. Variables are resolved at trigger time:

| Variable | Value |
|---|---|
| ${DATE} | Current date (YYYY-MM-DD) |
| ${PIPELINE} | Pipeline ID |
| ${SCHEDULE} | Schedule ID |
| ${RUN_ID} | Run ID (UUID v4) |
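Resolution can be sketched with Python's string.Template, whose ${VAR} syntax matches the variables above. This is a sketch of the behavior, not the orchestrator's code; how pipeline and schedule IDs reach the function is an assumption.

```python
from datetime import date
from string import Template
import uuid

def render_template(text, pipeline, schedule):
    """Substitute trigger-time variables into a body/input/arguments value (sketch)."""
    return Template(text).safe_substitute(
        DATE=date.today().isoformat(),  # YYYY-MM-DD
        PIPELINE=pipeline,
        SCHEDULE=schedule,
        RUN_ID=str(uuid.uuid4()),
    )
```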

Status Polling

After a job executes, the orchestrator Lambda polls for completion using the returned metadata. Each job type implements CheckStatus():

| Type | SDK Call | Status Mapping |
|---|---|---|
| glue | GetJobRun + CloudWatch RCA verification | RUNNING -> running, SUCCEEDED -> succeeded (verified), FAILED/STOPPED -> failed |
| emr | DescribeStep | RUNNING/PENDING -> running, COMPLETED -> succeeded, FAILED/CANCELLED -> failed |
| emr-serverless | GetJobRun | RUNNING/SUBMITTED -> running, SUCCESS -> succeeded, FAILED/CANCELLED -> failed |
| step-function | DescribeExecution | RUNNING -> running, SUCCEEDED -> succeeded, FAILED/TIMED_OUT/ABORTED -> failed |
| databricks | GET /runs/get | RUNNING/PENDING -> running, TERMINATED(SUCCESS) -> succeeded, others -> failed |
| airflow | GET /dags/{id}/dagRuns | running -> running, success -> succeeded, failed -> failed |

The http and command types are fire-and-forget – they do not support status polling.
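The glue row of the table can be sketched as a pure mapping function. How unlisted transitional states (e.g. STARTING) are treated is an assumption; the table only specifies the three mappings shown.

```python
def glue_status(job_run_state):
    """Map a Glue GetJobRun state onto the orchestrator's status values (sketch).
    SUCCEEDED is additionally RCA-verified, as described below."""
    if job_run_state == "RUNNING":
        return "running"
    if job_run_state == "SUCCEEDED":
        return "succeeded"
    if job_run_state in ("FAILED", "STOPPED"):
        return "failed"
    return "running"  # assumption: treat other transitional states as still running
```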

Glue RCA Verification

AWS Glue can report SUCCEEDED via the GetJobRun API even when the Spark job failed internally (the driver process catches the exception and exits cleanly). When a Glue job reports SUCCEEDED, the orchestrator cross-checks the CloudWatch RCA (root cause analysis) log stream for GlueExceptionAnalysisJobFailed events. If the RCA log confirms a failure, the job is marked as failed with the actual failure reason.

This check requires logs:FilterLogEvents permission on the Glue log group (granted automatically when enable_glue_trigger = true). If the permission is missing or the RCA log stream does not exist, the check degrades gracefully and trusts the Glue API response.
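The decision logic described above can be sketched as a pure function, keeping the CloudWatch query itself out of scope. Here rca_events stands for the matched GlueExceptionAnalysisJobFailed events, and None models the degraded case (missing permission or absent log stream); the function name is illustrative.

```python
def verify_glue_success(api_state, rca_events):
    """Cross-check a Glue API state against RCA log findings (sketch)."""
    if api_state != "SUCCEEDED":
        return "failed" if api_state in ("FAILED", "STOPPED") else "running"
    if rca_events:  # RCA contradicts the API: the Spark job actually failed
        return "failed"
    # No RCA events, or the check degraded gracefully: trust the Glue API.
    return "succeeded"
```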

IAM Permissions

The Terraform module provides opt-in variables to grant the orchestrator Lambda permission to invoke each AWS job type:

| Variable | Default | Grants |
|---|---|---|
| enable_glue_trigger | false | glue:StartJobRun, glue:GetJobRun, logs:FilterLogEvents (scoped to /aws-glue/jobs/logs-v2) |
| enable_emr_trigger | false | elasticmapreduce:AddJobFlowSteps, elasticmapreduce:DescribeStep |
| enable_emr_serverless_trigger | false | emr-serverless:StartJobRun, emr-serverless:GetJobRun |
| enable_sfn_trigger | false | states:StartExecution, states:DescribeExecution |

Enable only the trigger types your pipelines use to maintain least-privilege IAM policies.