Data Engineering Take-Home Challenge

Welcome to the FastCommerce Data Engineering Challenge!

Overview

You'll build a data pipeline for an e-commerce company; the pipeline processes orders, manages inventory updates, and generates analytics. The challenge should take 4-6 hours to complete.

What's Provided

  • Sample Data: Representative datasets in Evaluation-Files/
  • Challenge Instructions: Detailed requirements in challenge_instructions.md
  • Evaluation Rubric: Scoring criteria in evaluation_rubric.md
  • Data Generator: Script to create additional test data if needed

Your Task

Design and implement a data pipeline solution using your preferred technology stack. You have complete freedom in:

  • Architecture design
  • Technology choices (Python, Node.js, SQL databases, etc.)
  • Folder structure and organization
  • Implementation approach

Getting Started

🍴 Step 1: Fork This Repository

  1. Click the "Fork" button at the top of this GitHub repository
  2. Clone your forked repository to your local machine:
    git clone https://github.com/YOUR_USERNAME/lifi-data-engineer-test.git
    cd lifi-data-engineer-test
  3. Create a new branch for your solution:
    git checkout -b solution

📋 Step 2: Complete the Challenge

  1. Read through challenge_instructions.md carefully
  2. Explore the sample data to understand the structure
  3. Design your solution architecture
  4. Implement your pipeline
  5. Document your approach and decisions

📤 Step 3: Submit Your Solution

  1. Commit your work to your solution branch:
    git add .
    git commit -m "Complete data engineering challenge"
    git push origin solution
  2. Create a pull request from your solution branch to your main branch
  3. Share the link to your forked repository for evaluation

Sample Data Overview

📊 Sample Data for Development

  • 1,000 orders spanning 30 days (your design should scale to the 10K+ orders/day requirement)
  • 50 products with full catalog information
  • 50 inventory records across multiple warehouses
  • Multiple channels: web, mobile, API
  • Various order statuses: confirmed, pending, cancelled

📁 Files Provided

Evaluation-Files/
├── orders_stream.jsonl          # 1,000 orders in JSONL format (~372KB)
├── inventory_updates.csv        # 50 inventory records (~4KB) 
├── product_catalog.json         # 50 products with metadata (~8KB)
├── broken_pipeline_code.py      # Bonus debugging challenge
├── data_generator_js.js         # Generate additional test data
└── package_json.json            # Node.js dependencies for generator

Data Formats

Orders (JSONL)

{
  "order_id": "ORD-123456",
  "customer_id": "CUST-789", 
  "timestamp": "2024-01-15T10:30:00Z",
  "channel": "web",
  "items": [
    {
      "product_id": "PROD-001",
      "quantity": 2,
      "unit_price": 29.99
    }
  ],
  "shipping_address": {
    "country": "US",
    "state": "CA",
    "city": "San Francisco"  
  },
  "status": "confirmed"
}
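
Each line in orders_stream.jsonl is an independent JSON document, so the file can be parsed record by record. Below is a minimal, illustrative Python sketch (the path and field names are taken from the sample above; nothing here is prescribed by the challenge):

import json

def read_orders(path="Evaluation-Files/orders_stream.jsonl"):
    """Yield one order dict per line, skipping blank or malformed lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # in a real pipeline, route these to a dead-letter file

def order_total(order):
    # Sum quantity * unit_price over the items array shown above
    return sum(item["quantity"] * item["unit_price"] for item in order["items"])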

Inventory (CSV)

product_id,available_quantity,warehouse_location,last_updated
PROD-001,150,WH-SF,2024-01-15T09:00:00Z
PROD-002,75,WH-NY,2024-01-15T08:45:00Z
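
The inventory file is a plain CSV with a header row, so the standard library is enough for a first pass. A small sketch (column names come from the header above; the typed fields are an assumption):

import csv
from datetime import datetime

def read_inventory(path="Evaluation-Files/inventory_updates.csv"):
    """Return inventory rows with a typed quantity and timestamp."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["available_quantity"] = int(row["available_quantity"])
            row["last_updated"] = datetime.fromisoformat(
                row["last_updated"].replace("Z", "+00:00")
            )
            rows.append(row)
    return rows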

Product Catalog (JSON)

[
  {
    "product_id": "PROD-001",
    "category": "Electronics",
    "brand": "TechCorp", 
    "price": 299.99,
    "launch_date": "2023-06-15T00:00:00.000Z",
    "description": "High-performance wireless headphones"
  }
]
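
Because the catalog is a single JSON array, it can be loaded once and indexed by product_id to enrich incoming orders. A sketch using only the fields shown above:

import json

def load_catalog(path="Evaluation-Files/product_catalog.json"):
    """Build a lookup table mapping product_id to its catalog record."""
    with open(path, encoding="utf-8") as f:
        return {p["product_id"]: p for p in json.load(f)}

def enrich_item(item, catalog):
    # Attach category and brand to an order line item; unknown products get None
    product = catalog.get(item["product_id"], {})
    return {**item, "category": product.get("category"), "brand": product.get("brand")}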

Key Requirements Summary

🔧 Part 1: Data Pipeline (40%)

  • Ingest streaming orders and batch inventory updates
  • Enrich orders with product catalog data
  • Handle schema evolution and data validation
  • Ensure exactly-once processing
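
One common way to approximate exactly-once semantics is to combine at-least-once ingestion with idempotent writes keyed on order_id. The sketch below uses SQLite purely for illustration; the storage layer, table name, and schema are your choice:

import json
import sqlite3

conn = sqlite3.connect("pipeline.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, payload TEXT)"
)

def upsert_order(conn, order):
    # Re-processing the same order_id overwrites rather than duplicates,
    # so retries after a failure are safe.
    conn.execute(
        "INSERT INTO orders (order_id, payload) VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET payload = excluded.payload",
        (order["order_id"], json.dumps(order)),
    )
    conn.commit()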

📊 Part 2: Data Quality & Monitoring (25%)

  • Implement data completeness and business rule validation
  • Build anomaly detection for unusual patterns
  • Create pipeline health monitoring
  • Set up alerting for data quality issues
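
A completeness and business-rule check can be as simple as a function that returns a list of violations per order; anything non-empty feeds your alerting. A minimal sketch (the specific rules are assumptions based on the sample schema):

def validate_order(order):
    """Return a list of rule violations; an empty list means the order passes."""
    errors = []
    required = ("order_id", "customer_id", "timestamp", "items", "status")
    errors += [f"missing field: {field}" for field in required if field not in order]
    for item in order.get("items", []):
        if item.get("quantity", 0) <= 0:
            errors.append(f"non-positive quantity for {item.get('product_id')}")
        if item.get("unit_price", 0) < 0:
            errors.append(f"negative unit_price for {item.get('product_id')}")
    if order.get("status") not in {"confirmed", "pending", "cancelled"}:
        errors.append(f"unknown status: {order.get('status')}")
    return errors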

📈 Part 3: Analytics & Reporting (25%)

  • Daily revenue by channel and region
  • Top products and customer insights
  • Inventory analysis and low-stock alerts
  • Operational metrics and performance tracking
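
For the revenue metric, any aggregation layer works; the sketch below uses pandas as one possible choice and derives region from the shipping country in the sample schema:

import pandas as pd

def daily_revenue(orders):
    """Daily revenue by channel and shipping country from confirmed orders."""
    rows = [
        {
            "date": o["timestamp"][:10],
            "channel": o["channel"],
            "country": o["shipping_address"]["country"],
            "revenue": sum(i["quantity"] * i["unit_price"] for i in o["items"]),
        }
        for o in orders
        if o["status"] == "confirmed"
    ]
    return (
        pd.DataFrame(rows)
        .groupby(["date", "channel", "country"], as_index=False)["revenue"]
        .sum()
    )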

🚀 Part 4: Production Readiness (10%)

  • Design for 10x scalability
  • Address deployment and security concerns
  • Plan for failure recovery and monitoring
  • Document operational procedures

🎯 Bonus: Code Review (+10%)

  • Debug and fix broken_pipeline_code.py
  • Improve code quality and add error handling
  • Document all issues found and fixes made

Submission Guidelines

📋 How to Submit

  1. Fork this repository and work on your solution branch
  2. Commit all your work with clear commit messages
  3. Create a pull request from solution to main in your fork
  4. Share your forked repository URL for evaluation
  5. Ensure your solution runs with clear setup instructions

📁 What to Include

  1. Working Code: Complete pipeline implementation
  2. Documentation: Architecture overview and setup instructions
  3. Analysis: Performance considerations and design decisions
  4. Tests: Validation of your solution
  5. README: Clear instructions on how to run your solution

What We're Looking For

  • Technical Excellence: Clean, scalable, maintainable code
  • Problem Solving: Complete solution addressing all requirements
  • Communication: Clear documentation and decision justification
  • Production Mindset: Considerations for scale, monitoring, and reliability

Time Management Tips

  • 1 hour: Requirements analysis and architecture design
  • 2-3 hours: Core pipeline implementation
  • 1 hour: Data quality and monitoring
  • 1 hour: Analytics queries and documentation

Questions?

If anything is unclear, make reasonable assumptions and document them. We're interested in seeing your thought process and engineering judgment.

Good luck! 🚀


This challenge tests real-world data engineering skills including pipeline design, data quality, analytics, and production readiness.
