Production Deployment

This guide covers deploying Spark Pipeline Framework applications to production cluster environments.

Deployment Modes

Spark Pipeline Framework supports all standard Spark deployment modes:

| Mode | Use Case | Resource Manager |
|------|----------|------------------|
| Local | Development, testing | None |
| YARN | Hadoop clusters | YARN ResourceManager |
| Kubernetes | Cloud-native deployments | K8s API Server |
| Standalone | Dedicated Spark clusters | Spark Master |
| Spark Connect | Remote thin client | Spark Connect Server |

Spark Connect Deployment (Spark 3.4+)

Spark Connect enables thin client deployments where the Spark driver runs on a remote server, not embedded in your application.

Benefits of Spark Connect

Thin Client Applications:

  • Smaller application JARs (no embedded Spark driver)
  • Faster deployment and container startup
  • Reduced memory footprint on client machines

Better Resource Isolation:

  • Client application failures don't crash the Spark cluster
  • Multiple clients can share a single Spark Connect server
  • Easier to scale clients independently of the cluster

Cloud-Native Architecture:

  • Works with Databricks Connect, AWS EMR, and other managed services
  • Simplified authentication and connection management
  • Better suited for serverless and containerized deployments

Spark Connect Configuration

Update your pipeline configuration to use a Spark Connect server:

spark {
  app-name = "My Pipeline"
  connect-string = "sc://spark-connect-server:15002"

  config {
    "spark.executor.memory" = "4g"
    "spark.executor.cores" = "2"
  }
}
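The connect-string uses Spark Connect's sc://host:port scheme (the gRPC port defaults to 15002). As an illustration only, the expected shape can be checked with a small parser; this helper is hypothetical and not part of the framework:

```scala
// Hypothetical helper (not a framework API): parses a Spark Connect
// connection string of the form sc://host[:port].
object ConnectString {
  private val Pattern = "^sc://([^:/]+)(?::(\\d+))?$".r

  def parse(s: String): Option[(String, Int)] = s match {
    case Pattern(host, port) =>
      // An omitted port falls back to the conventional default, 15002.
      Some((host, Option(port).map(_.toInt).getOrElse(15002)))
    case _ => None
  }
}
```

A malformed value (for example an http:// URL) yields None, which is a convenient point to fail fast before any cluster work starts.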

Running with Spark Connect

The application runs as a lightweight client:

# No need for spark-submit - run as a regular Java application
java -Dconfig.file=/path/to/pipeline.conf \
-jar my-pipeline-assembly.jar

Or with minimal Spark dependencies:

spark-submit \
--master local[1] \
--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
--driver-memory 512m \
my-pipeline-assembly.jar \
-Dconfig.file=/path/to/pipeline.conf

Databricks Connect

For Databricks, use the workspace URL as the connection string:

spark {
  app-name = "My Pipeline"
  connect-string = "sc://your-workspace.cloud.databricks.com"

  config {
    "spark.databricks.token" = ${?DATABRICKS_TOKEN}
    "spark.databricks.cluster.id" = "your-cluster-id"
  }
}

Run with authentication:

export DATABRICKS_TOKEN=dapi123456789...
java -Dconfig.file=/path/to/pipeline.conf \
-jar my-pipeline-assembly.jar

Spark Connect Server Setup

To run your own Spark Connect server:

# Start Spark Connect server
$SPARK_HOME/sbin/start-connect-server.sh \
--packages org.apache.spark:spark-connect_2.13:3.5.0 \
--master yarn \
--conf spark.connect.grpc.binding.port=15002

Or in Kubernetes:

apiVersion: v1
kind: Service
metadata:
  name: spark-connect
spec:
  type: LoadBalancer
  ports:
    - port: 15002
      targetPort: 15002
  selector:
    app: spark-connect
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-connect
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-connect
  template:
    metadata:
      labels:
        app: spark-connect
    spec:
      containers:
        - name: spark-connect
          image: apache/spark:3.5.0
          command:
            - /opt/spark/sbin/start-connect-server.sh
          args:
            - --master
            - k8s://https://kubernetes.default.svc
            - --conf
            - spark.connect.grpc.binding.port=15002
          ports:
            - containerPort: 15002

Version Requirements

  • Spark Connect requires Spark 3.4 or later
  • The framework detects version compatibility automatically at runtime
  • A clear error message is raised if Spark Connect is used with an earlier Spark version
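The runtime gate amounts to a major/minor version comparison. Sketched in Scala (hypothetical code, not the framework's actual implementation):

```scala
// Hypothetical sketch of the version gate: Spark Connect needs Spark 3.4+.
def supportsSparkConnect(sparkVersion: String): Boolean =
  sparkVersion.split("\\.").take(2).map(_.toInt) match {
    case Array(major, minor) => major > 3 || (major == 3 && minor >= 4)
    case _                   => false // unparseable version: be conservative
  }
```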

Packaging Your Application

Fat JAR with sbt-assembly

For production deployments, build a fat JAR containing your pipeline components:

// build.sbt
lazy val myPipeline = (project in file("."))
  .settings(
    assembly / mainClass := Some("io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner"),
    assembly / assemblyMergeStrategy := {
      case PathList("META-INF", _*) => MergeStrategy.discard
      case "reference.conf"         => MergeStrategy.concat
      case _                        => MergeStrategy.first
    }
  )

Build the assembly:

sbt assembly

Dependencies

Your fat JAR should include:

  • Your pipeline components
  • spark-pipeline-core
  • spark-pipeline-runtime
  • spark-pipeline-runner

Spark libraries should be declared with provided scope; the cluster supplies them at runtime, so they stay out of the fat JAR.
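In sbt this typically looks like the following; the group ID is assumed from the package names used in this guide, and the framework version is left as a placeholder — verify both against your build:

```scala
libraryDependencies ++= Seq(
  "io.github.dwsmith1983" %% "spark-pipeline-core"    % "<version>",
  "io.github.dwsmith1983" %% "spark-pipeline-runtime" % "<version>",
  "io.github.dwsmith1983" %% "spark-pipeline-runner"  % "<version>",
  // Supplied by the cluster at runtime, so excluded from the assembly:
  "org.apache.spark" %% "spark-sql" % "3.5.0" % Provided
)
```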

YARN Deployment

Client Mode

spark-submit \
--master yarn \
--deploy-mode client \
--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
--driver-memory 2g \
--executor-memory 4g \
--executor-cores 2 \
--num-executors 10 \
--conf spark.yarn.maxAppAttempts=1 \
/path/to/my-pipeline-assembly.jar \
-Dconfig.file=/path/to/pipeline.conf

Cluster Mode

spark-submit \
--master yarn \
--deploy-mode cluster \
--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
--driver-memory 2g \
--executor-memory 4g \
--executor-cores 2 \
--num-executors 10 \
--files /path/to/pipeline.conf \
--conf spark.driver.extraJavaOptions=-Dconfig.file=pipeline.conf \
/path/to/my-pipeline-assembly.jar

Note: In cluster mode, config files must be distributed via --files and referenced by filename only.

YARN Configuration Tips

# pipeline.conf
spark {
  app-name = "Production Pipeline"
  config {
    "spark.yarn.queue" = "production"
    "spark.yarn.tags" = "pipeline,batch"
    "spark.dynamicAllocation.enabled" = "true"
    "spark.dynamicAllocation.minExecutors" = "5"
    "spark.dynamicAllocation.maxExecutors" = "50"
  }
}

Kubernetes Deployment

Prerequisites

  • Kubernetes cluster with Spark Operator or spark-submit support
  • Docker image with Spark and your pipeline JAR
  • ServiceAccount with appropriate RBAC permissions

Dockerfile Example

FROM apache/spark:3.5.0

USER root

# Copy pipeline JAR
COPY target/scala-2.13/my-pipeline-assembly.jar /opt/spark/jars/

# Copy config files
COPY conf/ /opt/spark/conf/

USER spark

spark-submit to Kubernetes

spark-submit \
--master k8s://https://<k8s-api-server>:6443 \
--deploy-mode cluster \
--name my-pipeline \
--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
--conf spark.kubernetes.container.image=my-registry/my-pipeline:latest \
--conf spark.kubernetes.namespace=spark-jobs \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.request.cores=1 \
--conf spark.kubernetes.executor.request.cores=2 \
--conf spark.executor.instances=5 \
--conf spark.driver.extraJavaOptions=-Dconfig.file=/opt/spark/conf/pipeline.conf \
local:///opt/spark/jars/my-pipeline-assembly.jar

Spark Operator (SparkApplication CRD)

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-pipeline
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: my-registry/my-pipeline:latest
  imagePullPolicy: Always
  mainClass: io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner
  mainApplicationFile: local:///opt/spark/jars/my-pipeline-assembly.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark
    javaOptions: "-Dconfig.file=/opt/spark/conf/pipeline.conf"
  executor:
    cores: 2
    instances: 5
    memory: "4g"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10

Amazon EMR

EMR on EC2

aws emr add-steps \
--cluster-id j-XXXXXXXXXXXXX \
--steps Type=Spark,Name="My Pipeline",ActionOnFailure=CONTINUE,Args=[
  --class,io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner,
  --deploy-mode,cluster,
  --driver-memory,2g,
  --executor-memory,4g,
  --conf,spark.driver.extraJavaOptions=-Dconfig.file=s3://my-bucket/config/pipeline.conf,
  s3://my-bucket/jars/my-pipeline-assembly.jar
]

EMR on EKS

aws emr-containers start-job-run \
--virtual-cluster-id <virtual-cluster-id> \
--name "my-pipeline" \
--execution-role-arn <execution-role-arn> \
--release-label emr-6.15.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://my-bucket/jars/my-pipeline-assembly.jar",
    "entryPointArguments": [],
    "sparkSubmitParameters": "--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner --conf spark.driver.extraJavaOptions=-Dconfig.file=s3://my-bucket/config/pipeline.conf"
  }
}'

EMR Serverless

aws emr-serverless start-job-run \
--application-id <application-id> \
--execution-role-arn <execution-role-arn> \
--job-driver '{
  "sparkSubmit": {
    "entryPoint": "s3://my-bucket/jars/my-pipeline-assembly.jar",
    "entryPointArguments": [],
    "sparkSubmitParameters": "--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner --conf spark.driver.extraJavaOptions=-Dconfig.file=s3://my-bucket/config/pipeline.conf"
  }
}'

Configuration Management

Environment-Specific Configs

Organize configs by environment:

configs/
  base.conf        # Common settings
  dev.conf         # Development overrides
  staging.conf     # Staging overrides
  production.conf  # Production settings

Use HOCON includes:

# production.conf
include "base.conf"

spark {
  app-name = "Pipeline (Production)"
  config {
    "spark.executor.memory" = "8g"
  }
}

Secrets Management

Never commit secrets to config files. Use environment variables:

# pipeline.conf
pipeline {
  pipeline-components = [
    {
      instance-type = "com.mycompany.DatabaseLoader"
      instance-config {
        jdbc-url = ${JDBC_URL}
        username = ${DB_USERNAME}
        password = ${DB_PASSWORD}
      }
    }
  ]
}
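At load time the config library substitutes ${VAR} references from the process environment and fails if a required variable is absent. The same fail-fast behavior, sketched as a plain Scala function over an environment map (hypothetical helper, not a framework API):

```scala
// Hypothetical sketch: resolve a required secret from an environment map,
// failing fast with a clear message when it is missing.
def requiredSecret(env: Map[String, String], key: String): String =
  env.getOrElse(
    key,
    throw new IllegalArgumentException(s"Missing required environment variable: $key")
  )
```

Failing at startup, before any Spark work begins, is preferable to discovering a missing credential halfway through a pipeline run.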

Or use secrets managers:

  • AWS: Secrets Manager, SSM Parameter Store
  • Kubernetes: Secrets mounted as files/env vars
  • HashiCorp Vault: Via environment injection

Monitoring Production Pipelines

MetricsHooks with Micrometer

import io.github.dwsmith1983.spark.pipeline.config.MetricsHooks
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

val registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT)
val hooks = MetricsHooks(registry)

runner.run(config, hooks)

// Expose metrics endpoint for Prometheus scraping
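Micrometer's PrometheusMeterRegistry renders its current state as the Prometheus text format via scrape(); one lightweight way to expose it is the JDK's built-in HTTP server. A minimal sketch — the scrape function is injected (in a real application it would be () => registry.scrape()) so the example stays dependency-free:

```scala
import com.sun.net.httpserver.HttpServer
import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets

// Serve the Prometheus text format at /metrics. `scrape` stands in for
// `() => registry.scrape()` from the MetricsHooks example above.
def startMetricsEndpoint(port: Int, scrape: () => String): HttpServer = {
  val server = HttpServer.create(new InetSocketAddress(port), 0)
  server.createContext("/metrics", exchange => {
    val body = scrape().getBytes(StandardCharsets.UTF_8)
    exchange.sendResponseHeaders(200, body.length)
    exchange.getResponseBody.write(body)
    exchange.close()
  })
  server.start()
  server
}
```

Point your Prometheus scrape config at this port; passing port 0 binds an ephemeral port, which is handy for tests.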

AuditHooks for Compliance

import io.github.dwsmith1983.spark.pipeline.config.audit._
import java.nio.file.Paths

val auditSink = new FileAuditSink(Paths.get("/var/log/pipeline-audit.jsonl"))
val hooks = AuditHooks(auditSink)

runner.run(config, hooks)

Logging Best Practices

Configure Log4j2 for production:

<!-- log4j2.xml -->
<Configuration>
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <JsonLayout compact="true" eventEol="true"/>
    </Console>
  </Appenders>
  <Loggers>
    <Logger name="io.github.dwsmith1983.spark.pipeline" level="INFO"/>
    <Root level="WARN">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>

Health Checks and Alerting

Dry-Run Validation

Before deploying config changes, validate with dry-run:

val result = runner.dryRun(config)
result match {
  case DryRunResult.Valid(components) =>
    println(s"Config valid: ${components.size} components")
  case DryRunResult.Invalid(errors) =>
    errors.foreach(e => println(s"Error: ${e.message}"))
    System.exit(1)
}

Exit Codes

SimplePipelineRunner returns appropriate exit codes:

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Pipeline failure |
| 2 | Configuration error |

Use these for alerting in orchestration tools (Airflow, Argo, etc.).

Next Steps