# Production Deployment

This guide covers deploying Spark Pipeline Framework applications to production cluster environments.

## Deployment Modes

Spark Pipeline Framework supports all standard Spark deployment modes:
| Mode | Use Case | Resource Manager |
|---|---|---|
| Local | Development, testing | None |
| YARN | Hadoop clusters | YARN ResourceManager |
| Kubernetes | Cloud-native deployments | K8s API Server |
| Standalone | Dedicated Spark clusters | Spark Master |
| Spark Connect | Remote thin client | Spark Connect Server |
## Spark Connect Deployment (Spark 3.4+)

Spark Connect enables thin-client deployments in which the Spark driver runs on a remote server rather than embedded in your application.
### Benefits of Spark Connect

**Thin Client Applications:**

- Smaller application JARs (no embedded Spark driver)
- Faster deployment and container startup
- Reduced memory footprint on client machines

**Better Resource Isolation:**

- Client application failures don't crash the Spark cluster
- Multiple clients can share a single Spark Connect server
- Easier to scale clients independently of the cluster

**Cloud-Native Architecture:**

- Works with Databricks Connect, AWS EMR, and other managed services
- Simplified authentication and connection management
- Better suited for serverless and containerized deployments
### Spark Connect Configuration

Update your pipeline configuration to use a Spark Connect server:

```hocon
spark {
  app-name = "My Pipeline"
  connect-string = "sc://spark-connect-server:15002"
  config {
    "spark.executor.memory" = "4g"
    "spark.executor.cores" = "2"
  }
}
```
### Running with Spark Connect

The application runs as a lightweight client:

```bash
# No need for spark-submit - run as a regular Java application
java -Dconfig.file=/path/to/pipeline.conf \
  -jar my-pipeline-assembly.jar
```

Or with minimal Spark dependencies:

```bash
spark-submit \
  --master local[1] \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --driver-memory 512m \
  my-pipeline-assembly.jar \
  -Dconfig.file=/path/to/pipeline.conf
```
### Databricks Connect

For Databricks, use the workspace URL as the connection string:

```hocon
spark {
  app-name = "My Pipeline"
  connect-string = "sc://your-workspace.cloud.databricks.com"
  config {
    "spark.databricks.token" = ${?DATABRICKS_TOKEN}
    "spark.databricks.cluster.id" = "your-cluster-id"
  }
}
```

Run with authentication:

```bash
export DATABRICKS_TOKEN=dapi123456789...
java -Dconfig.file=/path/to/pipeline.conf \
  -jar my-pipeline-assembly.jar
```
### Spark Connect Server Setup

To run your own Spark Connect server:

```bash
# Start Spark Connect server
$SPARK_HOME/sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.13:3.5.0 \
  --master yarn \
  --conf spark.connect.grpc.binding.port=15002
```
Or in Kubernetes:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: spark-connect
spec:
  type: LoadBalancer
  ports:
    - port: 15002
      targetPort: 15002
  selector:
    app: spark-connect
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-connect
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-connect
  template:
    metadata:
      labels:
        app: spark-connect
    spec:
      containers:
        - name: spark-connect
          image: apache/spark:3.5.0
          command:
            - /opt/spark/sbin/start-connect-server.sh
          args:
            - --master
            - k8s://https://kubernetes.default.svc
            - --conf
            - spark.connect.grpc.binding.port=15002
          ports:
            - containerPort: 15002
```
### Version Requirements

- Spark 3.4 or later is required for Spark Connect
- The framework automatically detects version compatibility at runtime
- A clear error message is raised if you attempt to use Spark Connect with earlier Spark versions
## Packaging Your Application

### Fat JAR with sbt-assembly

For production deployments, build a fat JAR containing your pipeline components:

```scala
// build.sbt
lazy val myPipeline = (project in file("."))
  .settings(
    assembly / mainClass := Some("io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner"),
    assembly / assemblyMergeStrategy := {
      case PathList("META-INF", _*) => MergeStrategy.discard
      case "reference.conf"         => MergeStrategy.concat
      case _                        => MergeStrategy.first
    }
  )
```

Build the assembly:

```bash
sbt assembly
```
### Dependencies

Your fat JAR should include:

- Your pipeline components
- `spark-pipeline-core`, `spark-pipeline-runtime`, and `spark-pipeline-runner`

Spark libraries are `provided` scope and are supplied by the cluster.
## YARN Deployment

### Client Mode

```bash
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  --conf spark.yarn.maxAppAttempts=1 \
  /path/to/my-pipeline-assembly.jar \
  -Dconfig.file=/path/to/pipeline.conf
```
### Cluster Mode

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  --files /path/to/pipeline.conf \
  --conf spark.driver.extraJavaOptions=-Dconfig.file=pipeline.conf \
  /path/to/my-pipeline-assembly.jar
```

Note: In cluster mode, config files must be distributed via `--files` and referenced by filename only.
### YARN Configuration Tips

```hocon
# pipeline.conf
spark {
  app-name = "Production Pipeline"
  config {
    "spark.yarn.queue" = "production"
    "spark.yarn.tags" = "pipeline,batch"
    "spark.dynamicAllocation.enabled" = "true"
    "spark.dynamicAllocation.minExecutors" = "5"
    "spark.dynamicAllocation.maxExecutors" = "50"
  }
}
```

Note that dynamic allocation on YARN also requires either the external shuffle service on the NodeManagers (`spark.shuffle.service.enabled=true`) or shuffle tracking (`spark.dynamicAllocation.shuffleTracking.enabled=true` on Spark 3.x).
## Kubernetes Deployment

### Prerequisites
- Kubernetes cluster with Spark Operator or spark-submit support
- Docker image with Spark and your pipeline JAR
- ServiceAccount with appropriate RBAC permissions
### Dockerfile Example

```dockerfile
FROM apache/spark:3.5.0

USER root

# Copy pipeline JAR
COPY target/scala-2.13/my-pipeline-assembly.jar /opt/spark/jars/

# Copy config files
COPY conf/ /opt/spark/conf/

USER spark
```
### spark-submit to Kubernetes

```bash
spark-submit \
  --master k8s://https://<k8s-api-server>:6443 \
  --deploy-mode cluster \
  --name my-pipeline \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --conf spark.kubernetes.container.image=my-registry/my-pipeline:latest \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.driver.request.cores=1 \
  --conf spark.kubernetes.executor.request.cores=2 \
  --conf spark.executor.instances=5 \
  --conf spark.driver.extraJavaOptions=-Dconfig.file=/opt/spark/conf/pipeline.conf \
  local:///opt/spark/jars/my-pipeline-assembly.jar
```
### Spark Operator (SparkApplication CRD)

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-pipeline
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: my-registry/my-pipeline:latest
  imagePullPolicy: Always
  mainClass: io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner
  mainApplicationFile: local:///opt/spark/jars/my-pipeline-assembly.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark
    javaOptions: "-Dconfig.file=/opt/spark/conf/pipeline.conf"
  executor:
    cores: 2
    instances: 5
    memory: "4g"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
```
## Amazon EMR

### EMR on EC2

```bash
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name="My Pipeline",ActionOnFailure=CONTINUE,Args=[--class,io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner,--deploy-mode,cluster,--driver-memory,2g,--executor-memory,4g,--conf,spark.driver.extraJavaOptions=-Dconfig.file=s3://my-bucket/config/pipeline.conf,s3://my-bucket/jars/my-pipeline-assembly.jar]'
```
### EMR on EKS

```bash
aws emr-containers start-job-run \
  --virtual-cluster-id <virtual-cluster-id> \
  --name "my-pipeline" \
  --execution-role-arn <execution-role-arn> \
  --release-label emr-6.15.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://my-bucket/jars/my-pipeline-assembly.jar",
      "entryPointArguments": [],
      "sparkSubmitParameters": "--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner --conf spark.driver.extraJavaOptions=-Dconfig.file=s3://my-bucket/config/pipeline.conf"
    }
  }'
```
### EMR Serverless

```bash
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-bucket/jars/my-pipeline-assembly.jar",
      "entryPointArguments": [],
      "sparkSubmitParameters": "--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner --conf spark.driver.extraJavaOptions=-Dconfig.file=s3://my-bucket/config/pipeline.conf"
    }
  }'
```
## Configuration Management

### Environment-Specific Configs

Organize configs by environment:

```
configs/
  base.conf        # Common settings
  dev.conf         # Development overrides
  staging.conf     # Staging overrides
  production.conf  # Production settings
```
Use HOCON includes:

```hocon
# production.conf
include "base.conf"

spark {
  app-name = "Pipeline (Production)"
  config {
    "spark.executor.memory" = "8g"
  }
}
```
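At launch time, the environment-specific file can be selected with a small wrapper script. A sketch (the `PIPELINE_ENV` variable name is illustrative, not part of the framework; the `java` command is echoed rather than executed here):

```shell
# Pick the HOCON file for the target environment, defaulting to dev.
# The configs/ layout matches the tree shown above.
PIPELINE_ENV="${PIPELINE_ENV:-dev}"
CONF_FILE="configs/${PIPELINE_ENV}.conf"
echo "java -Dconfig.file=${CONF_FILE} -jar my-pipeline-assembly.jar"
```

An orchestrator then sets `PIPELINE_ENV=production` (or similar) per deployment target, keeping one launch script for all environments.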
### Secrets Management

Never commit secrets to config files. Use environment variables:

```hocon
# pipeline.conf
pipeline {
  pipeline-components = [
    {
      instance-type = "com.mycompany.DatabaseLoader"
      instance-config {
        jdbc-url = ${JDBC_URL}
        username = ${DB_USERNAME}
        password = ${DB_PASSWORD}
      }
    }
  ]
}
```
Or use secrets managers:
- AWS: Secrets Manager, SSM Parameter Store
- Kubernetes: Secrets mounted as files/env vars
- HashiCorp Vault: Via environment injection
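Whichever injection mechanism you use, it helps to fail fast at launch when a required secret was never injected, rather than letting HOCON resolution fail mid-submit. A bash sketch (the variable names match the `DatabaseLoader` example above):

```shell
# Check that every required secret variable is set before launching.
# ${!var} is bash indirect expansion: the value of the variable named by $var.
missing=""
for var in JDBC_URL DB_USERNAME DB_PASSWORD; do
  if [ -z "${!var:-}" ]; then
    echo "missing required environment variable: $var"
    missing="yes"
  fi
done
if [ -n "$missing" ]; then
  echo "refusing to launch"
fi
```

In a real launcher the final branch would `exit` with a nonzero code so the orchestrator marks the run as a configuration failure.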
## Monitoring Production Pipelines

### MetricsHooks with Micrometer

```scala
import io.github.dwsmith1983.spark.pipeline.config.MetricsHooks
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

val registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT)
val hooks = MetricsHooks(registry)

runner.run(config, hooks)

// Expose metrics endpoint for Prometheus scraping
```
### AuditHooks for Compliance

```scala
import java.nio.file.Paths

import io.github.dwsmith1983.spark.pipeline.config.audit._

val auditSink = new FileAuditSink(Paths.get("/var/log/pipeline-audit.jsonl"))
val hooks = AuditHooks(auditSink)

runner.run(config, hooks)
```
### Logging Best Practices

Configure Log4j2 for production:

```xml
<!-- log4j2.xml -->
<Configuration>
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <JsonLayout compact="true" eventEol="true"/>
    </Console>
  </Appenders>
  <Loggers>
    <Logger name="io.github.dwsmith1983.spark.pipeline" level="INFO"/>
    <Root level="WARN">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```
## Health Checks and Alerting

### Dry-Run Validation

Before deploying config changes, validate with a dry run:

```scala
val result = runner.dryRun(config)

result match {
  case DryRunResult.Valid(components) =>
    println(s"Config valid: ${components.size} components")
  case DryRunResult.Invalid(errors) =>
    errors.foreach(e => println(s"Error: ${e.message}"))
    System.exit(1)
}
```
### Exit Codes

`SimplePipelineRunner` returns appropriate exit codes:
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Pipeline failure |
| 2 | Configuration error |
Use these for alerting in orchestration tools (Airflow, Argo, etc.).
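A wrapper script can translate these codes into orchestration actions. A sketch (the `run_pipeline` function is a stand-in for the real `java`/`spark-submit` invocation, and here simulates a configuration error by returning 2):

```shell
# Stand-in for the real pipeline launch command; returns exit code 2
# to simulate a configuration error for this sketch.
run_pipeline() { return 2; }

run_pipeline
code=$?

# Map the framework's documented exit codes to alerting actions.
case "$code" in
  0) action="success" ;;
  1) action="pipeline failure - retry or page on-call" ;;
  2) action="configuration error - fail fast, do not retry" ;;
  *) action="unknown exit code $code" ;;
esac
echo "$action"
```

The distinction matters for retry policy: a configuration error (code 2) will fail identically on every attempt, so orchestrators should not retry it, whereas a pipeline failure (code 1) may be transient.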
## Next Steps

- Scope & Design - Understand framework capabilities and limitations
- Lifecycle Hooks - Production monitoring with MetricsHooks and AuditHooks
- Configuration - Advanced HOCON configuration patterns