Getting Started

Spark Pipeline Framework is a configuration-driven framework for building Apache Spark pipelines using HOCON config files and PureConfig.

Features

  • Type-safe configuration via PureConfig with automatic case class binding
  • SparkSession management from config files
  • Dynamic component instantiation via reflection (no compile-time coupling)
  • Lifecycle hooks for monitoring, metrics, and custom error handling
  • Dry-run validation to verify configs without executing pipelines
  • Flexible error handling with fail-fast or continue-on-error modes
  • Cross-compilation support for Spark 3.x and 4.x
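To illustrate the reflection-based instantiation mentioned above, here is a minimal, self-contained sketch of the general technique (a config-driven runner loading a class by its fully qualified name). The `PipelineComponent` trait, `LogComponent` class, and `ReflectionDemo` object are hypothetical names invented for this sketch, not part of the framework's API:

```scala
// Sketch of reflection-based instantiation: the class name would normally
// come from a config file, so there is no compile-time coupling to it.
trait PipelineComponent { def run(): Unit }

class LogComponent(msg: String) extends PipelineComponent {
  def run(): Unit = println(msg)
}

object ReflectionDemo {
  // Instantiate a component by fully qualified class name, as a
  // config-driven runner would, calling its (String) constructor.
  def instantiate(className: String, arg: String): PipelineComponent =
    Class.forName(className)
      .getDeclaredConstructor(classOf[String])
      .newInstance(arg)
      .asInstanceOf[PipelineComponent]

  def main(args: Array[String]): Unit =
    instantiate("LogComponent", "hello from reflection").run()
}
```

The framework's actual mechanism goes through `ConfigurableInstance.createFromConfig` (shown in the Quick Example below) rather than a raw constructor, but the reflective lookup follows the same idea.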

Installation

Add the dependency to your build.sbt:

// For Spark 3.5.x with Scala 2.13
libraryDependencies += "io.github.dwsmith1983" %% "spark-pipeline-runtime-spark3" % "<version>"

// For Spark 4.0.x with Scala 2.13
libraryDependencies += "io.github.dwsmith1983" %% "spark-pipeline-runtime-spark4" % "<version>"

No resolver is needed; the artifacts are published to Maven Central.

Cross-Compilation Matrix

Artifact          Spark    Scala     Java
*-spark3_2.12     3.5.7    2.12.18   17+
*-spark3_2.13     3.5.7    2.13.12   17+
*-spark4_2.13     4.0.1    2.13.12   17+

Modules

Module                   Description                            Spark Dependency
spark-pipeline-core      Traits, config models, instantiation   None
spark-pipeline-runtime   SparkSessionWrapper, DataFlow trait    Yes (provided)
spark-pipeline-runner    SimplePipelineRunner entry point       Yes (provided)
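Because spark-pipeline-core carries no Spark dependency, it can be depended on by itself where only the traits and config models are needed (for example, in a library of reusable components). This assumes the core artifact follows the same coordinates as the runtime artifacts above:

```scala
// build.sbt sketch: depend on the Spark-free core module only.
// Coordinates assumed to follow the pattern of the runtime artifacts.
libraryDependencies += "io.github.dwsmith1983" %% "spark-pipeline-core" % "<version>"
```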

Quick Example

1. Create a pipeline component

import io.github.dwsmith1983.spark.pipeline.config.ConfigurableInstance
import io.github.dwsmith1983.spark.pipeline.runtime.DataFlow
import pureconfig._
import pureconfig.generic.auto._

object MyComponent extends ConfigurableInstance {
  case class Config(inputTable: String, outputPath: String)

  override def createFromConfig(conf: com.typesafe.config.Config): MyComponent = {
    new MyComponent(ConfigSource.fromConfig(conf).loadOrThrow[Config])
  }
}

class MyComponent(conf: MyComponent.Config) extends DataFlow {
  override def run(): Unit = {
    logger.info(s"Processing ${conf.inputTable}")
    spark.table(conf.inputTable).write.parquet(conf.outputPath)
  }
}

2. Create a pipeline config file

# pipeline.conf
spark {
  app-name = "My Pipeline"
  config {
    "spark.executor.memory" = "4g"
  }
}

pipeline {
  pipeline-name = "My Data Pipeline"
  pipeline-components = [
    {
      instance-type = "com.mycompany.MyComponent"
      instance-name = "MyComponent(prod)"
      instance-config {
        input-table = "raw_data"
        output-path = "/data/processed"
      }
    }
  ]
}
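Note that the kebab-case keys in the config (`input-table`, `output-path`) bind to the camelCase fields of `MyComponent.Config` (`inputTable`, `outputPath`) via PureConfig's default kebab-case naming convention. A minimal sketch of that mapping, for illustration only (this is not PureConfig's actual implementation):

```scala
// Illustrative only: how kebab-case config keys correspond to
// camelCase case-class field names under PureConfig's default convention.
object KebabCase {
  def toCamel(key: String): String = {
    val parts = key.split("-")
    (parts.head +: parts.tail.map(_.capitalize)).mkString
  }
}
```

So `input-table` resolves to the field `inputTable`, with no manual key mapping required.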

3. Run with spark-submit

spark-submit \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --jars /path/to/my-pipeline.jar \
  /path/to/spark-pipeline-runner-spark3_2.13.jar \
  -Dconfig.file=/path/to/pipeline.conf

Next Steps