# Production Deployment

This guide covers deploying Spark Pipeline Framework applications to production cluster environments.

## Deployment Modes

Spark Pipeline Framework supports all standard Spark deployment modes:
| Mode | Use Case | Resource Manager |
|---|---|---|
| Local | Development, testing | None |
| YARN | Hadoop clusters | YARN ResourceManager |
| Kubernetes | Cloud-native deployments | K8s API Server |
| Standalone | Dedicated Spark clusters | Spark Master |
| Spark Connect | Remote thin client | Spark Connect Server |
## Spark Connect Deployment (Spark 3.4+)

Spark Connect enables thin-client deployments in which the Spark driver runs on a remote server rather than embedded in your application.
### Benefits of Spark Connect

**Thin Client Applications:**

- Smaller application JARs (no embedded Spark driver)
- Faster deployment and container startup
- Reduced memory footprint on client machines

**Better Resource Isolation:**

- Client application failures don't crash the Spark cluster
- Multiple clients can share a single Spark Connect server
- Easier to scale clients independently of the cluster

**Cloud-Native Architecture:**

- Works with Databricks Connect, AWS EMR, and other managed services
- Simplified authentication and connection management
- Better suited for serverless and containerized deployments
### Spark Connect Configuration

Update your pipeline configuration to use a Spark Connect server:

```hocon
spark {
  app-name = "My Pipeline"
  connect-string = "sc://spark-connect-server:15002"
  config {
    "spark.executor.memory" = "4g"
    "spark.executor.cores" = "2"
  }
}
```
### Running with Spark Connect

The application runs as a lightweight client:

```bash
# No need for spark-submit - run as a regular Java application
java -Dconfig.file=/path/to/pipeline.conf \
  -jar my-pipeline-assembly.jar
```

Or with minimal Spark dependencies:

```bash
spark-submit \
  --master local[1] \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --driver-memory 512m \
  my-pipeline-assembly.jar \
  -Dconfig.file=/path/to/pipeline.conf
```
### Databricks Connect

For Databricks, use the workspace URL as the connection string:

```hocon
spark {
  app-name = "My Pipeline"
  connect-string = "sc://your-workspace.cloud.databricks.com"
  config {
    "spark.databricks.token" = ${?DATABRICKS_TOKEN}
    "spark.databricks.cluster.id" = "your-cluster-id"
  }
}
```

Run with authentication:

```bash
export DATABRICKS_TOKEN=dapi123456789...
java -Dconfig.file=/path/to/pipeline.conf \
  -jar my-pipeline-assembly.jar
```
### Spark Connect Server Setup

To run your own Spark Connect server:

```bash
# Start Spark Connect server
$SPARK_HOME/sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.13:3.5.0 \
  --master yarn \
  --conf spark.connect.grpc.binding.port=15002
```
Or in Kubernetes:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: spark-connect
spec:
  type: LoadBalancer
  ports:
    - port: 15002
      targetPort: 15002
  selector:
    app: spark-connect
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-connect
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-connect
  template:
    metadata:
      labels:
        app: spark-connect
    spec:
      containers:
        - name: spark-connect
          image: apache/spark:3.5.0
          command:
            - /opt/spark/sbin/start-connect-server.sh
          args:
            - --master
            - k8s://https://kubernetes.default.svc
            - --conf
            - spark.connect.grpc.binding.port=15002
          ports:
            - containerPort: 15002
```
### Version Requirements

- Spark 3.4 or later is required for Spark Connect
- The framework automatically detects version compatibility at runtime
- A clear error message is raised if you attempt to use Spark Connect with earlier Spark versions
## Packaging Your Application

### Fat JAR with sbt-assembly

For production deployments, build a fat JAR containing your pipeline components:

```scala
// build.sbt
lazy val myPipeline = (project in file("."))
  .settings(
    assembly / mainClass := Some("io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner"),
    assembly / assemblyMergeStrategy := {
      case PathList("META-INF", _*) => MergeStrategy.discard
      case "reference.conf"         => MergeStrategy.concat
      case _                        => MergeStrategy.first
    }
  )
```

Build the assembly:

```bash
sbt assembly
```
### Dependencies

Your fat JAR should include:

- Your pipeline components
- `spark-pipeline-core`, `spark-pipeline-runtime`, and `spark-pipeline-runner`

Spark libraries are `provided` scope and are supplied by the cluster.
## YARN Deployment

### Client Mode

```bash
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  --conf spark.yarn.maxAppAttempts=1 \
  /path/to/my-pipeline-assembly.jar \
  -Dconfig.file=/path/to/pipeline.conf
```
### Cluster Mode

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  --files /path/to/pipeline.conf \
  --conf spark.driver.extraJavaOptions=-Dconfig.file=pipeline.conf \
  /path/to/my-pipeline-assembly.jar
```

Note: In cluster mode, config files must be distributed via `--files` and referenced by filename only.
### YARN Configuration Tips

```hocon
# pipeline.conf
spark {
  app-name = "Production Pipeline"
  config {
    "spark.yarn.queue" = "production"
    "spark.yarn.tags" = "pipeline,batch"
    "spark.dynamicAllocation.enabled" = "true"
    "spark.dynamicAllocation.minExecutors" = "5"
    "spark.dynamicAllocation.maxExecutors" = "50"
  }
}
```

Note that dynamic allocation on YARN also requires either the external shuffle service on the NodeManagers (`spark.shuffle.service.enabled=true`) or shuffle tracking (`spark.dynamicAllocation.shuffleTracking.enabled=true` on Spark 3.x).
## Kubernetes Deployment

### Prerequisites
- Kubernetes cluster with Spark Operator or spark-submit support
- Docker image with Spark and your pipeline JAR
- ServiceAccount with appropriate RBAC permissions
### Dockerfile Example

```dockerfile
FROM apache/spark:3.5.0

USER root

# Copy pipeline JAR
COPY target/scala-2.13/my-pipeline-assembly.jar /opt/spark/jars/

# Copy config files
COPY conf/ /opt/spark/conf/

USER spark
```
### spark-submit to Kubernetes

```bash
spark-submit \
  --master k8s://https://<k8s-api-server>:6443 \
  --deploy-mode cluster \
  --name my-pipeline \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --conf spark.kubernetes.container.image=my-registry/my-pipeline:latest \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.driver.request.cores=1 \
  --conf spark.kubernetes.executor.request.cores=2 \
  --conf spark.executor.instances=5 \
  --conf spark.driver.extraJavaOptions=-Dconfig.file=/opt/spark/conf/pipeline.conf \
  local:///opt/spark/jars/my-pipeline-assembly.jar
```
### Spark Operator (SparkApplication CRD)

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-pipeline
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: my-registry/my-pipeline:latest
  imagePullPolicy: Always
  mainClass: io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner
  mainApplicationFile: local:///opt/spark/jars/my-pipeline-assembly.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark
    javaOptions: "-Dconfig.file=/opt/spark/conf/pipeline.conf"
  executor:
    cores: 2
    instances: 5
    memory: "4g"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
```
## Amazon EMR

### EMR on EC2

```bash
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name="My Pipeline",ActionOnFailure=CONTINUE,Args=[--class,io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner,--deploy-mode,cluster,--driver-memory,2g,--executor-memory,4g,--conf,spark.driver.extraJavaOptions=-Dconfig.file=s3://my-bucket/config/pipeline.conf,s3://my-bucket/jars/my-pipeline-assembly.jar]'
```
### EMR on EKS

```bash
aws emr-containers start-job-run \
  --virtual-cluster-id <virtual-cluster-id> \
  --name "my-pipeline" \
  --execution-role-arn <execution-role-arn> \
  --release-label emr-6.15.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://my-bucket/jars/my-pipeline-assembly.jar",
      "entryPointArguments": [],
      "sparkSubmitParameters": "--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner --conf spark.driver.extraJavaOptions=-Dconfig.file=s3://my-bucket/config/pipeline.conf"
    }
  }'
```
### EMR Serverless

```bash
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-bucket/jars/my-pipeline-assembly.jar",
      "entryPointArguments": [],
      "sparkSubmitParameters": "--class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner --conf spark.driver.extraJavaOptions=-Dconfig.file=s3://my-bucket/config/pipeline.conf"
    }
  }'
```
## Configuration Management

### Environment-Specific Configs

Organize configs by environment:

```
configs/
  base.conf        # Common settings
  dev.conf         # Development overrides
  staging.conf     # Staging overrides
  production.conf  # Production settings
```
Use HOCON includes:

```hocon
# production.conf
include "base.conf"

spark {
  app-name = "Pipeline (Production)"
  config {
    "spark.executor.memory" = "8g"
  }
}
```
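At launch time, the environment-specific file can be selected with a small wrapper script. A sketch (the `PIPELINE_ENV` variable name is illustrative, not part of the framework; the `java` command is echoed rather than executed here):

```shell
# Pick the HOCON file for the target environment, defaulting to dev.
# The configs/ layout matches the tree shown above.
PIPELINE_ENV="${PIPELINE_ENV:-dev}"
CONF_FILE="configs/${PIPELINE_ENV}.conf"
echo "java -Dconfig.file=${CONF_FILE} -jar my-pipeline-assembly.jar"
```

An orchestrator then sets `PIPELINE_ENV=production` (or similar) per deployment target, keeping one launch script for all environments.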
### Secrets Management

Never commit secrets to config files. Use environment variables:

```hocon
# pipeline.conf
pipeline {
  pipeline-components = [
    {
      instance-type = "com.mycompany.DatabaseLoader"
      instance-config {
        jdbc-url = ${JDBC_URL}
        username = ${DB_USERNAME}
        password = ${DB_PASSWORD}
      }
    }
  ]
}
```
Or use secrets managers:
- AWS: Secrets Manager, SSM Parameter Store
- Kubernetes: Secrets mounted as files/env vars
- HashiCorp Vault: Via environment injection
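Whichever injection mechanism you use, it helps to fail fast at launch when a required secret was never injected, rather than letting HOCON resolution fail mid-submit. A bash sketch (the variable names match the `DatabaseLoader` example above):

```shell
# Check that every required secret variable is set before launching.
# ${!var} is bash indirect expansion: the value of the variable named by $var.
missing=""
for var in JDBC_URL DB_USERNAME DB_PASSWORD; do
  if [ -z "${!var:-}" ]; then
    echo "missing required environment variable: $var"
    missing="yes"
  fi
done
if [ -n "$missing" ]; then
  echo "refusing to launch"
fi
```

In a real launcher the final branch would `exit` with a nonzero code so the orchestrator marks the run as a configuration failure.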
## Monitoring Production Pipelines

### MetricsHooks with Micrometer

```scala
import io.github.dwsmith1983.spark.pipeline.config.MetricsHooks
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

val registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT)
val hooks = MetricsHooks(registry)

runner.run(config, hooks)

// Expose metrics endpoint for Prometheus scraping
```
### AuditHooks for Compliance

```scala
import java.nio.file.Paths

import io.github.dwsmith1983.spark.pipeline.config.audit._

val auditSink = new FileAuditSink(Paths.get("/var/log/pipeline-audit.jsonl"))
val hooks = AuditHooks(auditSink)

runner.run(config, hooks)
```
### Logging Best Practices

Configure Log4j2 for production:

```xml
<!-- log4j2.xml -->
<Configuration>
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <JsonLayout compact="true" eventEol="true"/>
    </Console>
  </Appenders>
  <Loggers>
    <Logger name="io.github.dwsmith1983.spark.pipeline" level="INFO"/>
    <Root level="WARN">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```
## Health Checks and Alerting

### Dry-Run Validation

Before deploying config changes, validate with a dry run:

```scala
val result = runner.dryRun(config)

result match {
  case DryRunResult.Valid(components) =>
    println(s"Config valid: ${components.size} components")
  case DryRunResult.Invalid(errors) =>
    errors.foreach(e => println(s"Error: ${e.message}"))
    System.exit(1)
}
```
### Exit Codes

`SimplePipelineRunner` returns appropriate exit codes:
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Pipeline failure |
| 2 | Configuration error |
Use these for alerting in orchestration tools (Airflow, Argo, etc.).
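A wrapper script can translate these codes into orchestration actions. A sketch (the `run_pipeline` function is a stand-in for the real `java`/`spark-submit` invocation, and here simulates a configuration error by returning 2):

```shell
# Stand-in for the real pipeline launch command; returns exit code 2
# to simulate a configuration error for this sketch.
run_pipeline() { return 2; }

run_pipeline
code=$?

# Map the framework's documented exit codes to alerting actions.
case "$code" in
  0) action="success" ;;
  1) action="pipeline failure - retry or page on-call" ;;
  2) action="configuration error - fail fast, do not retry" ;;
  *) action="unknown exit code $code" ;;
esac
echo "$action"
```

The distinction matters for retry policy: a configuration error (code 2) will fail identically on every attempt, so orchestrators should not retry it, whereas a pipeline failure (code 1) may be transient.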
## Next Steps

- Scope & Design - Understand framework capabilities and limitations
- Lifecycle Hooks - Production monitoring with MetricsHooks and AuditHooks
- Configuration - Advanced HOCON configuration patterns