feat: Expand Apache Arrow tutorial with advanced examples and performance benchmarks #117
Merged
Haleshot merged 3 commits into marimo-team:main from Jul 18, 2025
- Create comprehensive tutorial demonstrating Apache Arrow usage with DuckDB
- Cover Arrow table creation from DuckDB queries
- Demonstrate loading Arrow tables into DuckDB for zero-copy operations
- Include examples of interoperability with Polars and Pandas DataFrames
- Add marimo notebook with interactive SQL queries and data transformations
- Configure dependencies: duckdb==1.2.1, pyarrow==19.0.1, polars==1.25.2, pandas==2.2.3

This tutorial helps users understand how to leverage Apache Arrow's columnar format for efficient data transfer between DuckDB and other data processing libraries, enabling high-performance analytical workflows.
…ance benchmarks

- Add comprehensive examples for converting between DuckDB, Arrow, and Polars/Pandas DataFrames
- Add advanced multi-source data joining example combining DuckDB tables, Polars DataFrames, and Pandas DataFrames
- Include performance demonstration with 1M row dataset showcasing zero-copy benefits
- Enhance documentation with detailed explanations of Arrow's columnar format advantages
- Demonstrate zero-copy conversions using .to_arrow(), pl.from_arrow(), and .to_pandas() methods
- Improve code organization with hidden cells for better notebook readability
- Include timing measurements to demonstrate query performance on large datasets
- Expand summary section highlighting key learning outcomes

This enhancement provides users with more comprehensive examples of Apache Arrow's capabilities, including real-world scenarios for combining heterogeneous data sources and quantifiable performance benefits of the zero-copy architecture.
Hey @Haleshot! I've enhanced the Apache Arrow tutorial with:
Would appreciate your review, especially on the new examples and documentation clarity. Thanks!
Haleshot reviewed on Jul 17, 2025
Apologies for the slight delay in responding to your PR (really appreciate your enthusiasm and volunteer contribs). Cool notebook walking over working w/ Arrow on the whole! The flow is great; have posted some nits/comments as part of the review. Let me know what you think.
Thanks again for reviving and continuing the momentum of the duckdb course! 🎉
- Import `sqlglot`, `psutil`, and `altair`
- Add comprehensive performance comparisons between Arrow-based and traditional approaches demonstrating 2-10x speedup
- Add memory efficiency analysis showing 20-40% memory savings with Arrow columnar format
- Include complex query benchmarks with joins and window functions
- Add memory usage tracking during zero-copy vs copy operations
- Visualize performance differences using Altair charts
- Fix AttributeError by updating altair_chart usage syntax
- Update dependencies: duckdb 1.2.1→1.3.2, add sqlglot & psutil

The enhanced tutorial now provides concrete evidence of Apache Arrow's benefits through measurable benchmarks, helping users understand the real-world performance advantages of using Arrow's columnar format and zero-copy operations in data processing workflows.
@Haleshot Thanks for the review! I've addressed all points in the latest commit:
The tutorial now includes concrete metrics and benchmarks as requested. Ready for re-review when convenient!
📝 Summary
This pull request enhances the existing Apache Arrow integration tutorial by adding comprehensive examples demonstrating conversions between DuckDB, Arrow, and Polars/Pandas DataFrames. The tutorial now includes advanced multi-source data joining capabilities and performance benchmarks with a 1M row dataset to showcase the zero-copy benefits of Apache Arrow's columnar format.
Key additions include:
This enhancement helps users understand practical applications of Apache Arrow for building high-performance data pipelines across different analytical tools.
📋 Checklist
`--sandbox`, `README.md`