feat: Expand Apache Arrow tutorial with advanced examples and performance benchmarks by thliang01 · Pull Request #117 · marimo-team/learn

thliang01 · 2025-07-13T04:27:39Z

📝 Summary

This pull request enhances the existing Apache Arrow integration tutorial by adding comprehensive examples demonstrating conversions between DuckDB, Arrow, and Polars/Pandas DataFrames. The tutorial now includes advanced multi-source data joining capabilities and performance benchmarks with a 1M row dataset to showcase the zero-copy benefits of Apache Arrow's columnar format.

Key additions include:

Bidirectional conversion examples between all supported formats
Real-world scenario combining heterogeneous data sources in a single query
Performance measurements demonstrating query execution speed
Improved documentation explaining Arrow's architectural advantages

This enhancement helps users understand practical applications of Apache Arrow for building high-performance data pipelines across different analytical tools.

📋 Checklist

I have included package dependencies in the notebook file using --sandbox
If adding a course, include a README.md
Keep language direct and simple.

- Create comprehensive tutorial demonstrating Apache Arrow usage with DuckDB
- Cover Arrow table creation from DuckDB queries
- Demonstrate loading Arrow tables into DuckDB for zero-copy operations
- Include examples of interoperability with Polars and Pandas DataFrames
- Add marimo notebook with interactive SQL queries and data transformations
- Configure dependencies: duckdb==1.2.1, pyarrow==19.0.1, polars==1.25.2, pandas==2.2.3

This tutorial helps users understand how to leverage Apache Arrow's columnar format for efficient data transfer between DuckDB and other data processing libraries, enabling high-performance analytical workflows.

…ance benchmarks

- Add comprehensive examples for converting between DuckDB, Arrow, and Polars/Pandas DataFrames
- Add advanced multi-source data joining example combining DuckDB tables, Polars DataFrames, and Pandas DataFrames
- Include performance demonstration with 1M row dataset showcasing zero-copy benefits
- Enhance documentation with detailed explanations of Arrow's columnar format advantages
- Demonstrate zero-copy conversions using .to_arrow(), pl.from_arrow(), and .to_pandas() methods
- Improve code organization with hidden cells for better notebook readability
- Include timing measurements to demonstrate query performance on large datasets
- Expand summary section highlighting key learning outcomes

This enhancement provides users with more comprehensive examples of Apache Arrow's capabilities, including real-world scenarios for combining heterogeneous data sources and quantifiable performance benefits of the zero-copy architecture.

thliang01 · 2025-07-13T04:31:31Z

Hey @Haleshot!

I've enhanced the Apache Arrow tutorial with:

Examples for creating Arrow tables from DuckDB queries
Demonstrations of loading Arrow tables into DuckDB with zero-copy operations
Comprehensive conversions between DuckDB, Arrow, and Polars/Pandas DataFrames
Advanced multi-source data joining across different formats
Performance benchmarks showcasing the benefits of Arrow's columnar format

Would appreciate your review, especially on the new examples and documentation clarity. Thanks!

Haleshot

Apologies for the slight delay in responding to your PR (really appreciate your enthusiasm and volunteer contribs). Cool notebook walking over working w/ Arrow on the whole! The flow is great; have posted some nits/comments as part of the review. Let me know what you think.

Thanks again for reviving and continuing the momentum of the duckdb course! 🎉

duckdb/011_working_with_apache_arrow.py

- Import `sqlglot`, `psutil`, and `altair`
- Add comprehensive performance comparisons between Arrow-based and traditional approaches demonstrating 2-10x speedup
- Add memory efficiency analysis showing 20-40% memory savings with Arrow columnar format
- Include complex query benchmarks with joins and window functions
- Add memory usage tracking during zero-copy vs copy operations
- Visualize performance differences using Altair charts
- Fix AttributeError by updating altair_chart usage syntax
- Update dependencies: duckdb 1.2.1→1.3.2, add sqlglot & psutil

The enhanced tutorial now provides concrete evidence of Apache Arrow's benefits through measurable benchmarks, helping users understand the real-world performance advantages of using Arrow's columnar format and zero-copy operations in data processing workflows.
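As a rough illustration of how a speedup figure like the ones mentioned in this commit might be measured, here is a small timing harness using only the standard library. It is a sketch under stated assumptions: `time_call` and `speedup` are hypothetical helper names, and the two workloads are plain-Python stand-ins for a "copy first" path and a "use in place" path, not the notebook's DuckDB or Arrow queries. It does not reproduce or verify the reported numbers.

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time in seconds over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    repeats: int = 5,
) -> float:
    """Return baseline time divided by candidate time (above 1 means faster)."""
    return time_call(baseline, repeats) / time_call(candidate, repeats)


# Stand-in workloads (hypothetical): copying the data before use
# versus reading it in place.
data = list(range(200_000))


def copy_then_sum() -> int:
    return sum(list(data))  # makes a full copy first


def sum_in_place() -> int:
    return sum(data)  # reads the existing list directly


ratio = speedup(copy_then_sum, sum_in_place)
print(f"speedup: {ratio:.2f}x")
```

To time real queries, each workload could be replaced with a zero-argument function that runs a DuckDB query over an Arrow table or a DataFrame. The median is used here because single runs are noisy; reporting the minimum is another common choice.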

thliang01 · 2025-07-17T10:06:45Z

@Haleshot Thanks for the review! I've addressed all points in the latest commit:

Fixed the Altair chart error
Added comprehensive performance benchmarks (2-10x speedup demonstrated)
Added memory efficiency analysis (20-40% savings shown)
Included complex queries and visualizations
Updated all dependencies

The tutorial now includes concrete metrics and benchmarks as requested. Ready for re-review when convenient!

Haleshot

LGTM! 🚀

thliang01 added 2 commits July 13, 2025 00:08

Haleshot reviewed Jul 17, 2025

thliang01 requested a review from Haleshot July 17, 2025 10:06

Haleshot approved these changes Jul 18, 2025

Haleshot merged commit a100272 into marimo-team:main Jul 18, 2025
2 checks passed

Haleshot merged 3 commits into marimo-team:main from thliang01:Working-with-Apache-Arrow
