This repository contains a comprehensive sample pipeline that demonstrates a broad range of PySpark capabilities and data processing patterns.
The pipeline consists of two main components:

- Data generation (`src/data_generator.py`)
  - Generates a synthetic dataset in Parquet format
  - Includes columns covering the common PySpark data types (string, integer, float, boolean, timestamp, decimal, array, map, struct)
  - Dataset size is kept small enough to demonstrate PySpark capabilities on a local machine
- Data processing (`src/pipeline.py`)
  - Reads data from the Parquet files
  - Performs various transformations using PySpark DataFrame and RDD operations
  - Demonstrates common PySpark patterns and best practices
  - Handles data quality checks and validation
  - Writes processed results to Avro format
  - Generates journal status records in JSONL format
  - Maintains data lineage and processing metadata
Key features:

- Comprehensive Data Types: Demonstrates handling of all PySpark-supported data types
- Real-world Transformations: Includes filtering, aggregation, joins, and complex operations
- Error Handling: Implements robust error handling and data validation
- Performance Considerations: Optimized for local execution while demonstrating scalability patterns
- Data Quality: Includes data validation and quality checks
- Journaling: Maintains processing status and metadata for auditability
The pipeline is designed to run on a single machine (laptop/desktop) without requiring distributed infrastructure. All PySpark nuances and patterns are demonstrated in a sandbox environment that can be executed locally.
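For instance, a session created with a `local[*]` master runs entirely in one process on your machine, so no cluster setup is required (a minimal sketch; the application name below is illustrative, not taken from the project code):

```python
from pyspark.sql import SparkSession

# A local[*] master runs everything in-process using all local cores; no cluster needed.
# The application name is illustrative only.
spark = (
    SparkSession.builder
    .appName("pyspark-sample-pipeline")
    .master("local[*]")
    .getOrCreate()
)
print(f"Running Spark {spark.version} locally")
spark.stop()
```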
The project is organized as follows:

```
pyspark_sample/
├── README.md              # This file
├── requirements.txt       # Python dependencies
├── src/                   # Source code directory
│   ├── data_generator.py  # Synthetic data generation script
│   ├── pipeline.py        # Core pipeline processing code
│   └── main.py            # Main orchestration script
├── data/                  # Data directory
│   ├── input/             # Generated synthetic Parquet files
│   └── output/            # Processed Avro files and JSONL journal records
└── tests/                 # Unit tests and validation scripts
    └── test_pipeline.py   # Pipeline validation tests
```
Prerequisites:

- Python 3.7+
- Java 8+ (required for PySpark)
- Apache Spark 3.3+ (installed via pip)
Install the dependencies:

```bash
pip install -r requirements.txt
```

Verify the PySpark installation:

```bash
python -c "import pyspark; print(f'PySpark version: {pyspark.__version__}')"
```

Execute the main orchestration script to run the entire pipeline:
```bash
python src/main.py
```

This will:
- Check dependencies and setup directories
- Generate synthetic data
- Run the main pipeline
- Run validation tests
- Generate output reports
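As a rough, hedged sketch of what such an orchestration flow could look like (the real `src/main.py` may be structured differently; this version simply prepares the directories and shells out to the scripts listed in the project tree):

```python
# Hypothetical orchestration sketch; the real src/main.py may be organized differently.
import subprocess
import sys
from pathlib import Path

def main():
    # Check/prepare the expected directory layout
    for d in ("data/input", "data/output"):
        Path(d).mkdir(parents=True, exist_ok=True)

    # Generate data, run the pipeline, then run the validation tests, in order
    for script in ("src/data_generator.py", "src/pipeline.py", "tests/test_pipeline.py"):
        subprocess.run([sys.executable, script], check=True)

if __name__ == "__main__":
    main()
```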
You can also run each component individually:

```bash
python src/data_generator.py   # generate synthetic data
python src/pipeline.py         # run the processing pipeline
python tests/test_pipeline.py  # run validation tests
```

The data generator (`src/data_generator.py`):

- Generates 10,000 synthetic employee records
- Includes all PySpark data types:
  - Basic: string, integer, float, boolean, date, timestamp, decimal
  - Complex: array, map, struct
- Saves data as Parquet files in `data/input/` (a minimal schema sketch follows below)
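A minimal sketch of how such a schema and Parquet write might look (the column names here are hypothetical, and the real generator produces far more columns and rows):

```python
# Illustrative sketch only; column names are hypothetical, the real schema is richer.
import datetime
from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType, IntegerType,
                               DoubleType, BooleanType, TimestampType, DecimalType,
                               ArrayType, MapType)

spark = SparkSession.builder.master("local[*]").appName("data-gen-sketch").getOrCreate()

schema = StructType([
    StructField("employee_id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", DecimalType(10, 2)),
    StructField("performance_score", DoubleType()),
    StructField("is_active", BooleanType()),
    StructField("hired_at", TimestampType()),
    StructField("skills", ArrayType(StringType())),
    StructField("metadata", MapType(StringType(), StringType())),
    StructField("address", StructType([
        StructField("city", StringType()),
        StructField("zip", StringType()),
    ])),
])

rows = [(1, "Alice", Decimal("85000.00"), 4.2, True,
         datetime.datetime(2020, 3, 1, 9, 0), ["python", "spark"],
         {"source": "synthetic"}, ("Berlin", "10115"))]

# Write the synthetic records to the project's input directory as Parquet
spark.createDataFrame(rows, schema).write.mode("overwrite").parquet("data/input/employees")
spark.stop()
```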
The pipeline performs the following operations:
- Data Quality Checks
  - Removes records with critical null values
  - Validates data integrity
- Basic Transformations
  - Adds computed columns (age groups, salary categories)
  - Creates performance indicators
- Aggregations and Grouping
  - Department-level statistics
  - Average salary calculations
  - Project completion metrics
- Window Functions (see the sketch after this list)
  - Department ranking
  - Salary percentiles
  - Comparative analysis
- Complex Transformations
  - Array operations (skill analysis)
  - Map operations (metadata flattening)
  - Struct field extraction
- Filtering and Joins
  - High performer identification
  - Underperformer analysis
  - Mentor relationship modeling
- Time Series Analysis
  - Hiring trends by month/year
  - Login pattern analysis
  - Temporal aggregations
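To give a flavor of the window-function, aggregation, and array steps referenced above, here is a hedged sketch; the column names `department`, `salary`, and `skills`, as well as the input path, are assumptions rather than the project's exact identifiers:

```python
# Illustrative sketch; column names and paths are assumed, not taken from the project schema.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.master("local[*]").appName("pipeline-sketch").getOrCreate()
df = spark.read.parquet("data/input/employees")

# Window functions: rank employees by salary within each department
w = Window.partitionBy("department").orderBy(F.col("salary").desc())
ranked = df.withColumn("salary_rank", F.rank().over(w))

# Aggregations: department-level statistics
dept_stats = df.groupBy("department").agg(
    F.avg("salary").alias("avg_salary"),
    F.count("*").alias("headcount"),
)

# Array operations: explode the skills array and count skill frequency
skill_stats = (df.select(F.explode("skills").alias("skill"))
                 .groupBy("skill").count())
```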
The pipeline writes its results to `data/output/` (a short write-step sketch follows the list):

- Avro Files: Processed data in optimized format
- JSONL Journal: Processing metadata and status records
- Multiple Output Files:
  - `department_aggregations.avro`
  - `skill_statistics.avro`
  - `flattened_employee_data.avro`
  - `high_performers.avro`
  - `underperformers.avro`
  - `employee_with_mentor_flags.avro`
  - `hiring_trends.avro`
  - `login_patterns.avro`
  - `processing_journal.jsonl`
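A hedged sketch of the write step: Avro output in Spark requires the external `spark-avro` package on the classpath, and the journal-record fields and helper names below are illustrative, not the project's actual API:

```python
# Illustrative sketch; helper names, journal fields, and paths are assumptions.
# Writing Avro requires the spark-avro package, e.g. launched with:
#   --packages org.apache.spark:spark-avro_2.12:3.3.0
import json
from datetime import datetime, timezone

def save_avro(df, name):
    # Each result DataFrame is written as its own Avro dataset under data/output/
    df.write.format("avro").mode("overwrite").save(f"data/output/{name}.avro")

def append_journal(step, status, path="data/output/processing_journal.jsonl"):
    # One JSON object per line keeps the journal easy to append to and to scan
    record = {
        "step": step,
        "status": status,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage (dept_stats being any processed DataFrame):
# save_avro(dept_stats, "department_aggregations")
# append_journal("department_aggregations", "success")
```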
The pipeline includes comprehensive unit tests:
```bash
python tests/test_pipeline.py
```

Test coverage includes (an example test sketch follows the list):
- Data loading and validation
- Transformation operations
- Aggregation functions
- Filtering operations
- File I/O operations
- Schema validation
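A minimal example of what one such test might look like, using `unittest` with a local session; the actual structure of `tests/test_pipeline.py`, the input path, and the column name checked here are assumptions:

```python
# Illustrative test sketch; the real tests/test_pipeline.py may be organized differently.
import unittest
from pyspark.sql import SparkSession

class PipelineSmokeTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # A small local session is enough for unit-level validation
        cls.spark = (SparkSession.builder
                     .master("local[2]")
                     .appName("pipeline-tests")
                     .getOrCreate())

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_schema_has_expected_columns(self):
        # Path and column name are assumptions for the sake of the example
        df = self.spark.read.parquet("data/input/employees")
        self.assertIn("employee_id", df.columns)

if __name__ == "__main__":
    unittest.main()
```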
Performance-related configuration (a configuration sketch follows the list):

- Memory: Configured for 2-4 GB local execution
- Adaptive Query Execution: Enabled for optimization
- Partitioning: Automatically managed by Spark
- Serialization: Kryo serializer for better performance
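These map to standard Spark configuration keys; a hedged sketch of how they might be applied when the session is built (the exact values used in the project scripts may differ, and driver memory must be set before the session's JVM starts):

```python
# Illustrative configuration; the values in the project scripts may differ.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pyspark-sample-pipeline")
    .master("local[*]")
    .config("spark.driver.memory", "4g")           # 2-4 GB is enough for local runs
    .config("spark.sql.adaptive.enabled", "true")  # Adaptive Query Execution
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```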
Common issues and fixes:

- Java Not Found
  - Error: `JAVA_HOME is not set and no 'java' command could be found`
  - Solution: Install Java 8+ and set the `JAVA_HOME` environment variable.
- PySpark Import Error
  - Error: `ImportError: No module named 'pyspark'`
  - Solution: Install dependencies with `pip install -r requirements.txt`.
- Memory Issues
  - Error: `SparkException: Could not create executor`
  - Solution: Reduce the memory configuration in the scripts or increase available memory.
- File Not Found
  - Error: `FileNotFoundError: [Errno 2] No such file or directory`
  - Solution: Ensure data generation has been run before pipeline execution.
For debugging, you can modify the Spark configuration in the scripts to enable verbose logging:

```python
.config("spark.log.level", "DEBUG")
.config("spark.sql.adaptive.enabled", "false")  # Disable for detailed query plans
```

The pipeline is designed to be easily extensible:
- Add New Data Types: Extend the schema in `data_generator.py`
- Custom Transformations: Add methods to the `PySparkPipeline` class (see the sketch below)
- Additional Output Formats: Modify the `save_outputs` method
- Enhanced Validation: Add more data quality checks
- Real-time Processing: Adapt for streaming scenarios
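For example, a custom transformation could be added as a method on the pipeline class; the sketch below assumes hypothetical internals (a `self.df` working DataFrame and a `hire_date` column) that may not match the real `PySparkPipeline`:

```python
# Hypothetical sketch; assumes PySparkPipeline keeps its working DataFrame in self.df
# and that the data has a 'hire_date' column.
from pyspark.sql import functions as F

class PySparkPipeline:
    ...  # placeholder for the existing methods in src/pipeline.py

    def add_tenure_years(self):
        """Example custom transformation: derive tenure in years from the hire date."""
        self.df = self.df.withColumn(
            "tenure_years",
            F.floor(F.months_between(F.current_date(), F.col("hire_date")) / 12),
        )
        return self
```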
This project is for educational and demonstration purposes.