feat: Expand Apache Arrow tutorial with advanced examples and performance benchmarks #117

Merged
Haleshot merged 3 commits into marimo-team:main from thliang01:Working-with-Apache-Arrow
Jul 18, 2025

@thliang01
Contributor

📝 Summary

This pull request enhances the existing Apache Arrow integration tutorial by adding comprehensive examples demonstrating conversions between DuckDB, Arrow, and Polars/Pandas DataFrames. The tutorial now includes advanced multi-source data joining capabilities and performance benchmarks with a 1M row dataset to showcase the zero-copy benefits of Apache Arrow's columnar format.

Key additions include:

  • Bidirectional conversion examples between all supported formats
  • Real-world scenario combining heterogeneous data sources in a single query
  • Performance measurements demonstrating query execution speed
  • Improved documentation explaining Arrow's architectural advantages

This enhancement helps users understand practical applications of Apache Arrow for building high-performance data pipelines across different analytical tools.

📋 Checklist

  • I have included package dependencies in the notebook file using --sandbox
  • If adding a course, include a README.md
  • Keep language direct and simple.

- Create comprehensive tutorial demonstrating Apache Arrow usage with DuckDB
- Cover Arrow table creation from DuckDB queries
- Demonstrate loading Arrow tables into DuckDB for zero-copy operations
- Include examples of interoperability with Polars and Pandas DataFrames
- Add marimo notebook with interactive SQL queries and data transformations
- Configure dependencies: duckdb==1.2.1, pyarrow==19.0.1, polars==1.25.2, pandas==2.2.3

This tutorial helps users understand how to leverage Apache Arrow's columnar
format for efficient data transfer between DuckDB and other data processing
libraries, enabling high-performance analytical workflows.
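
As a rough illustration of the dependency configuration listed in this commit: marimo's `--sandbox` mode reads inline script metadata (PEP 723) from the top of the notebook file. The header below is a sketch assembled from the pinned versions above; the `requires-python` bound and the unpinned `marimo` entry are assumptions, not taken from the PR.

```python
# /// script
# requires-python = ">=3.10"   # assumption: not stated in this PR
# dependencies = [
#     "marimo",                # assumption: left unpinned in this sketch
#     "duckdb==1.2.1",
#     "pyarrow==19.0.1",
#     "polars==1.25.2",
#     "pandas==2.2.3",
# ]
# ///
```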
…ance benchmarks

- Add comprehensive examples for converting between DuckDB, Arrow, and Polars/Pandas DataFrames
- Add advanced multi-source data joining example combining DuckDB tables, Polars DataFrames, and Pandas DataFrames
- Include performance demonstration with 1M row dataset showcasing zero-copy benefits
- Enhance documentation with detailed explanations of Arrow's columnar format advantages
- Demonstrate zero-copy conversions using .to_arrow(), pl.from_arrow(), and .to_pandas() methods
- Improve code organization with hidden cells for better notebook readability
- Include timing measurements to demonstrate query performance on large datasets
- Expand summary section highlighting key learning outcomes

This enhancement provides users with more comprehensive examples of Apache Arrow's
capabilities, including real-world scenarios for combining heterogeneous data sources
and quantifiable performance benefits of the zero-copy architecture.
@thliang01
Contributor Author

Hey @Haleshot!

I've enhanced the Apache Arrow tutorial with:

  • Examples for creating Arrow tables from DuckDB queries
  • Demonstrations of loading Arrow tables into DuckDB with zero-copy operations
  • Comprehensive conversions between DuckDB, Arrow, and Polars/Pandas DataFrames
  • Advanced multi-source data joining across different formats
  • Performance benchmarks showcasing the benefits of Arrow's columnar format

Would appreciate your review, especially on the new examples and documentation clarity. Thanks!

Contributor

Haleshot left a comment

Apologies for the slight delay in responding to your PR (I really appreciate your enthusiasm and volunteer contributions). Cool notebook that walks through working with Arrow! The flow is great; I have posted some nits/comments as part of the review. Let me know what you think.

Thanks again for reviving and continuing the momentum of the duckdb course! 🎉

- Import `sqlglot`, `psutil`, and `altair`
- Add comprehensive performance comparisons between Arrow-based and
  traditional approaches demonstrating 2-10x speedup
- Add memory efficiency analysis showing 20-40% memory savings with
  Arrow columnar format
- Include complex query benchmarks with joins and window functions
- Add memory usage tracking during zero-copy vs copy operations
- Visualize performance differences using Altair charts
- Fix AttributeError by updating altair_chart usage syntax
- Update dependencies: duckdb 1.2.1→1.3.2, add sqlglot & psutil

The enhanced tutorial now provides concrete evidence of Apache Arrow's
benefits through measurable benchmarks, helping users understand the
real-world performance advantages of using Arrow's columnar format
and zero-copy operations in data processing workflows.
@thliang01
Contributor Author

@Haleshot Thanks for the review! I've addressed all points in the latest commit:

  • Fixed the Altair chart error
  • Added comprehensive performance benchmarks (2-10x speedup demonstrated)
  • Added memory efficiency analysis (20-40% savings shown)
  • Included complex queries and visualizations
  • Updated all dependencies

The tutorial now includes concrete metrics and benchmarks as requested. Ready for re-review when convenient!

thliang01 requested a review from Haleshot on July 17, 2025 10:06
Contributor

Haleshot left a comment

LGTM! 🚀

Haleshot merged commit a100272 into marimo-team:main on Jul 18, 2025
2 checks passed
2 participants