Scope & Design
This page describes what Spark Pipeline Framework is designed for, its intentional limitations, and how it compares to alternative approaches.
Design Philosophy
Spark Pipeline Framework is built on a single principle: configuration-driven simplicity for batch and streaming Spark workloads.
The framework enables teams to build production Spark pipelines (batch and streaming) using HOCON configuration files, with minimal boilerplate and maximum clarity. It prioritizes:
- Simplicity over flexibility - Sequential execution, not DAGs
- Configuration over code - Pipelines defined in HOCON, not Scala/Python
- Composition over inheritance - Small, focused components assembled via config
- Observability by default - Built-in hooks for logging, metrics, and audit
What This Framework IS
Simple Sequential Pipelines
Spark Pipeline Framework executes components one after another, in the order they appear in the configuration. This is intentional:
Component A → Component B → Component C → Done
This model works well for:
- Batch ETL pipelines with clear input → transform → output stages
- Streaming pipelines with sequential data transformations
- Data ingestion from source systems to data lakes
- Feature engineering pipelines with ordered transformations
- Report generation with sequential aggregations
- Data quality pipelines with validation steps
Configuration-Driven Architecture
Pipelines are defined entirely in HOCON configuration:
```hocon
pipeline {
  pipeline-name = "Daily Sales ETL"
  pipeline-components = [
    { instance-type = "com.company.ExtractSales", ... },
    { instance-type = "com.company.TransformSales", ... },
    { instance-type = "com.company.LoadToWarehouse", ... }
  ]
}
```
Benefits:
- No recompilation to change pipeline structure
- Environment-specific configs via HOCON includes (example below)
- Auditable changes via version control
- Non-developers can modify pipeline parameters
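As an example of the includes point above, a deployment can keep the component list in a shared base file and layer environment-specific overrides on top of it. The file names and keys below are illustrative, not prescribed by the framework:

```hocon
// environments/prod.conf -- illustrative layout, not a required file structure
include "pipeline-base.conf"   // shared pipeline-components definition

pipeline {
  pipeline-name = "Daily Sales ETL (prod)"
}

// Unresolved optional substitutions fall back to environment variables,
// so endpoints and credentials can be injected at launch time.
warehouse.jdbc-url = ${?WAREHOUSE_JDBC_URL}
```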
Production-Ready Observability
Built-in hooks provide enterprise-grade observability:
| Hook | Purpose |
|---|---|
| LoggingHooks | Structured JSON logging with correlation IDs |
| MetricsHooks | Micrometer integration (Prometheus, Datadog, etc.) |
| AuditHooks | Persistent audit trail with security filtering |
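As a rough illustration of the metrics side, a Micrometer-backed hook could time each component and count failures. The PipelineHooks trait below is a hypothetical stand-in for the framework's actual hook interface; only the Micrometer calls are real API:

```scala
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.simple.SimpleMeterRegistry

// Hypothetical hook interface -- the framework's real MetricsHooks API may differ.
trait PipelineHooks {
  def afterComponent(name: String, durationMs: Long, succeeded: Boolean): Unit
}

// Records one timer per component plus a failure counter, tagged by component name.
final class MicrometerHooks(registry: MeterRegistry) extends PipelineHooks {
  override def afterComponent(name: String, durationMs: Long, succeeded: Boolean): Unit = {
    registry.timer("pipeline.component.duration", "component", name)
      .record(java.time.Duration.ofMillis(durationMs))
    if (!succeeded)
      registry.counter("pipeline.component.failures", "component", name).increment()
  }
}

// In production, swap SimpleMeterRegistry for a Prometheus or Datadog registry.
// val hooks = new MicrometerHooks(new SimpleMeterRegistry())
```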
Flexible Error Handling
Two error handling modes:
- Fail-fast (default): Stop on first error
- Continue-on-error: Run all components, collect failures
```hocon
pipeline {
  fail-fast = false // Continue even if components fail
}
```
What This Framework is NOT
Not a DAG Executor
The framework does not support:
- Parallel component execution
- Dependency graphs between components
- Conditional branching based on component outputs
- Dynamic pipeline modification at runtime
If you need DAGs, consider:
- Apache Airflow
- Prefect
- Dagster
- Argo Workflows
Optional Schema Contracts
Components can optionally declare input/output schemas via the SchemaContract trait. When enabled:
- Runtime validation checks schema compatibility between adjacent components
- Clear error messages when schemas don't match
- Validation is disabled by default for backward compatibility
See Schema Contracts for details.
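For a sense of what a contract might look like on a component, here is a hedged sketch; the trait shape (optional Spark StructTypes for input and output) is an assumption based on the description above, so treat the Schema Contracts page as the source of truth:

```scala
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Assumed shape of the SchemaContract trait -- see the Schema Contracts page
// for the framework's actual definition.
trait SchemaContract {
  def inputSchema: Option[StructType]   // schema this component expects from its predecessor
  def outputSchema: Option[StructType]  // schema this component promises to its successor
}

// A transform that consumes raw sales rows and emits cleaned rows.
class TransformSales extends SchemaContract {
  override val inputSchema: Option[StructType] = Some(StructType(Seq(
    StructField("order_id", StringType, nullable = false),
    StructField("amount", DoubleType, nullable = true)
  )))

  override val outputSchema: Option[StructType] = Some(StructType(Seq(
    StructField("order_id", StringType, nullable = false),
    StructField("amount_usd", DoubleType, nullable = false)
  )))
}
```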
For advanced data quality validation, combine with:
- Great Expectations
- Deequ
- Custom validation components
Not a Workflow Orchestrator
The framework runs a single pipeline to completion. It does not:
- Schedule pipelines
- Manage dependencies between pipelines
- Retry failed pipelines automatically
- Provide a web UI for monitoring
For orchestration, use:
- Apache Airflow
- Prefect
- Dagster
- AWS Step Functions
- Kubernetes CronJobs
When to Use This Framework
Good Fit
| Scenario | Why It Works |
|---|---|
| Batch ETL pipelines | Sequential stages, clear flow |
| Streaming ETL pipelines | Structured Streaming with ordered transforms |
| Data migration jobs | Ordered steps, config-driven |
| Batch report generation | Predictable execution order |
| Feature engineering | Transform chains without branches |
| Data quality pipelines | Sequential validation checks |
Not a Good Fit
| Scenario | Better Alternative |
|---|---|
| Complex DAG workflows | Airflow, Prefect, Dagster |
| Complex event processing | Apache Flink |
| ML training pipelines | MLflow, Kubeflow, SageMaker |
| Multi-pipeline orchestration | Airflow, Argo Workflows |
| Interactive data exploration | Notebooks, Spark Shell |
Architecture Decisions
Why Sequential Execution?
- Simplicity: No dependency resolution or parallel scheduling to reason about
- Debuggability: Easy to understand execution order and trace failures
- Predictability: Same config always produces same execution order
- Resource efficiency: No need for sophisticated resource allocation
Why HOCON Configuration?
- Human-readable: Less verbose than JSON, with comments, includes, and substitutions
- Type-safe: PureConfig maps config onto typed case classes and surfaces missing or mistyped keys at load time (see the sketch after this list)
- Composable: Includes and substitutions for environment management
- Familiar: Standard in Scala/Akka ecosystem
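As a rough illustration of the type-safe loading point above, PureConfig can map a HOCON pipeline block onto case classes. The case classes here are hypothetical, not the framework's actual config model:

```scala
import pureconfig._
import pureconfig.generic.auto._ // derives ConfigReader instances for case classes (Scala 2)

// Hypothetical config model -- the framework's real classes will differ.
final case class ComponentConf(instanceType: String)
final case class PipelineConf(
  pipelineName: String,
  failFast: Boolean = true,
  pipelineComponents: List[ComponentConf]
)

// PureConfig maps kebab-case keys (pipeline-name, fail-fast, instance-type) onto
// camelCase fields by default, and fails with a readable error report if a key
// is missing or has the wrong type.
val conf: PipelineConf = ConfigSource.default.at("pipeline").loadOrThrow[PipelineConf]
```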
Why Reflection-Based Instantiation?
- Decoupling: Components don't depend on each other at compile time
- Pluggability: Add new components without modifying framework
- Configuration-driven: Component selection via config, not code
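The mechanism is plain JVM reflection; the snippet below is a simplified illustration of the general technique, not the framework's actual loader, and PipelineComponent stands in for whatever base trait components implement:

```scala
// Given a fully qualified class name from the config (e.g. "com.company.ExtractSales"),
// instantiate it through its no-argument constructor and cast to the expected base type.
def instantiate[T](className: String): T =
  Class.forName(className)
    .getDeclaredConstructor()
    .newInstance()
    .asInstanceOf[T]

// val extract = instantiate[PipelineComponent]("com.company.ExtractSales")
```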
Future Roadmap
The framework's core features are now complete. Future development will focus on refinements and cloud-native enhancements:
Completed Features
- v1.1.0: Configuration validation, Secrets management, Streaming core
- v1.2.0: Schema contracts, Checkpointing, Retry logic, Data quality hooks
- v1.3.0: Complete streaming sources/sinks (Kafka, Kinesis, EventHubs, File, Delta, Iceberg)
Under Consideration
- Additional streaming source/sink connectors
- Enhanced observability integrations
- Cloud-specific optimizations
Comparison with Alternatives
| Feature | This Framework | Airflow | Prefect | Custom Spark |
|---|---|---|---|---|
| Execution | Sequential | DAG | DAG | Any |
| Streaming | Yes (Structured Streaming) | No | No | Yes |
| Config format | HOCON | Python | Python | Code |
| Scheduling | External | Built-in | Built-in | External |
| Learning curve | Low | Medium | Medium | Low |
| Spark integration | Native | Operator | Task | Native |
| Observability | Built-in | Built-in | Built-in | DIY |
Summary
Spark Pipeline Framework is the right choice when you need:
- Simple, sequential batch or streaming pipelines
- Configuration-driven architecture
- Production-ready observability
- Optional schema contracts between components
- Structured Streaming integration
- Minimal operational complexity
It is not the right choice when you need:
- Complex DAG workflows with parallel execution
- Complex event processing or stateful stream operations
- Built-in scheduling and orchestration
The framework embraces the Unix philosophy: do one thing well. For sequential Spark pipelines (batch or streaming) with configuration-driven flexibility, it provides a clean, production-ready solution.