
Scope & Design

This page describes what Spark Pipeline Framework is designed for, its intentional limitations, and how it compares to alternative approaches.

Design Philosophy

Spark Pipeline Framework is built on a single principle: configuration-driven simplicity for batch and streaming Spark workloads.

The framework enables teams to build production Spark pipelines (batch and streaming) using HOCON configuration files, with minimal boilerplate and maximum clarity. It prioritizes:

  • Simplicity over flexibility - Sequential execution, not DAGs
  • Configuration over code - Pipelines defined in HOCON, not Scala/Python
  • Composition over inheritance - Small, focused components assembled via config
  • Observability by default - Built-in hooks for logging, metrics, and audit

What This Framework IS

Simple Sequential Pipelines

Spark Pipeline Framework executes components sequentially, in order. This is intentional:

Component A → Component B → Component C → Done

This model works well for:

  • Batch ETL pipelines with clear input → transform → output stages
  • Streaming pipelines with sequential data transformations
  • Data ingestion from source systems to data lakes
  • Feature engineering pipelines with ordered transformations
  • Report generation with sequential aggregations
  • Data quality pipelines with validation steps

Configuration-Driven Architecture

Pipelines are defined entirely in HOCON configuration:

```hocon
pipeline {
  pipeline-name = "Daily Sales ETL"
  pipeline-components = [
    { instance-type = "com.company.ExtractSales", ... },
    { instance-type = "com.company.TransformSales", ... },
    { instance-type = "com.company.LoadToWarehouse", ... }
  ]
}
```

Benefits:

  • No recompilation to change pipeline structure
  • Environment-specific configs via HOCON includes
  • Auditable changes via version control
  • Non-developers can modify pipeline parameters
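
For instance, an environment overlay can include a shared base file and override only what differs. A minimal sketch (the file names and layout here are illustrative, not prescribed by the framework):

```hocon
# prod.conf — hypothetical production overlay
include "base.conf"   # shared pipeline definition

pipeline {
  # Override a single parameter without touching the base config
  fail-fast = true
}
```

Because HOCON includes merge object-by-object, the overlay only needs to state the keys that change per environment.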

Production-Ready Observability

Built-in hooks provide enterprise-grade observability:

| Hook | Purpose |
| --- | --- |
| LoggingHooks | Structured JSON logging with correlation IDs |
| MetricsHooks | Micrometer integration (Prometheus, Datadog, etc.) |
| AuditHooks | Persistent audit trail with security filtering |
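
As an illustration, hooks might be attached in configuration like other components. The keys and class names below are hypothetical — consult the hooks documentation for the framework's actual wiring:

```hocon
pipeline {
  pipeline-name = "Daily Sales ETL"
  # Hypothetical hook wiring — hook classes resolved like any other component
  hooks = [
    { instance-type = "com.company.hooks.LoggingHooks" },
    { instance-type = "com.company.hooks.MetricsHooks" }
  ]
}
```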

Flexible Error Handling

Two error handling modes:

  • Fail-fast (default): Stop on first error
  • Continue-on-error: Run all components, collect failures

```hocon
pipeline {
  fail-fast = false  // Continue even if components fail
}
```

What This Framework Is NOT

Not a DAG Executor

The framework does not support:

  • Parallel component execution
  • Dependency graphs between components
  • Conditional branching based on component outputs
  • Dynamic pipeline modification at runtime

If you need DAGs, consider:

  • Apache Airflow
  • Prefect
  • Dagster
  • Argo Workflows

Not a Full Data Validation Framework

The framework offers optional schema contracts rather than comprehensive data validation. Components can declare input/output schemas via the SchemaContract trait. When validation is enabled:

  • Runtime checks verify schema compatibility between adjacent components
  • Mismatches produce clear error messages

Validation is disabled by default for backward compatibility.

See Schema Contracts for details.
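
As a sketch, turning validation on might look like the following — the `schema-validation` key is illustrative, so check the Schema Contracts page for the actual option name:

```hocon
pipeline {
  # Hypothetical flag — validation is off by default for backward compatibility
  schema-validation = true
}
```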

For advanced data quality validation, combine with:

  • Great Expectations
  • Deequ
  • Custom validation components

Not a Workflow Orchestrator

The framework runs a single pipeline to completion. It does not:

  • Schedule pipelines
  • Manage dependencies between pipelines
  • Retry failed pipelines automatically
  • Provide a web UI for monitoring

For orchestration, use:

  • Apache Airflow
  • Prefect
  • Dagster
  • AWS Step Functions
  • Kubernetes CronJobs

When to Use This Framework

Good Fit

| Scenario | Why It Works |
| --- | --- |
| Batch ETL pipelines | Sequential stages, clear flow |
| Streaming ETL pipelines | Structured Streaming with ordered transforms |
| Data migration jobs | Ordered steps, config-driven |
| Batch report generation | Predictable execution order |
| Feature engineering | Transform chains without branches |
| Data quality pipelines | Sequential validation checks |

Not a Good Fit

| Scenario | Better Alternative |
| --- | --- |
| Complex DAG workflows | Airflow, Prefect, Dagster |
| Complex event processing | Apache Flink |
| ML training pipelines | MLflow, Kubeflow, SageMaker |
| Multi-pipeline orchestration | Airflow, Argo Workflows |
| Interactive data exploration | Notebooks, Spark Shell |

Architecture Decisions

Why Sequential Execution?

  1. Simplicity: No complexity of dependency resolution or parallel scheduling
  2. Debuggability: Easy to understand execution order and trace failures
  3. Predictability: Same config always produces same execution order
  4. Resource efficiency: No need for sophisticated resource allocation

Why HOCON Configuration?

  1. Human-readable: Easier than JSON, more powerful than YAML
  2. Type-safe: PureConfig derives typed config readers at compile time and surfaces clear errors at load time
  3. Composable: Includes and substitutions for environment management
  4. Familiar: Standard in Scala/Akka ecosystem
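
Substitutions make the composability point concrete. A minimal sketch (PIPELINE_ENV is an illustrative environment variable name):

```hocon
env = "dev"
env = ${?PIPELINE_ENV}   # an environment variable, if set, overrides the default

pipeline {
  pipeline-name = "Daily Sales ETL ("${env}")"
}
```

The `${?...}` form is HOCON's optional substitution: if the variable is unset, the earlier default survives.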

Why Reflection-Based Instantiation?

  1. Decoupling: Components don't depend on each other at compile time
  2. Pluggability: Add new components without modifying framework
  3. Configuration-driven: Component selection via config, not code
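
In practice, this means a new component only needs to be on the classpath; the class name below is illustrative:

```hocon
pipeline-components = [
  # The framework resolves this class by name at runtime via reflection,
  # so adding a component requires no change to the framework itself.
  { instance-type = "com.company.MyCustomComponent" }
]
```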

Future Roadmap

The framework's core features are now complete. Future development will focus on refinements and cloud-native enhancements:

Completed Features

  • v1.1.0: Configuration validation, Secrets management, Streaming core
  • v1.2.0: Schema contracts, Checkpointing, Retry logic, Data quality hooks
  • v1.3.0: Complete streaming sources/sinks (Kafka, Kinesis, EventHubs, File, Delta, Iceberg)

Under Consideration

  • Additional streaming source/sink connectors
  • Enhanced observability integrations
  • Cloud-specific optimizations

Comparison with Alternatives

| Feature | This Framework | Airflow | Prefect | Custom Spark |
| --- | --- | --- | --- | --- |
| Execution | Sequential | DAG | DAG | Any |
| Streaming | Yes (Structured Streaming) | No | No | Yes |
| Config format | HOCON | Python | Python | Code |
| Scheduling | External | Built-in | Built-in | External |
| Learning curve | Low | Medium | Medium | Low |
| Spark integration | Native | Operator | Task | Native |
| Observability | Built-in | Built-in | Built-in | DIY |

Summary

Spark Pipeline Framework is the right choice when you need:

  • Simple, sequential batch or streaming pipelines
  • Configuration-driven architecture
  • Production-ready observability
  • Optional schema contracts between components
  • Structured Streaming integration
  • Minimal operational complexity

It is not the right choice when you need:

  • Complex DAG workflows with parallel execution
  • Complex event processing or stateful stream operations
  • Built-in scheduling and orchestration

The framework embraces the Unix philosophy: do one thing well. For sequential Spark pipelines (batch or streaming) with configuration-driven flexibility, it provides a clean, production-ready solution.