The native Rust implementation for Apache Hudi, with Python API bindings.

Kunal-Singh-Dadhwal/hudi-rs

 
 

The hudi-rs project aims to broaden the use of Apache Hudi for a diverse range of users and projects.

Source       Installation Command
PyPI         pip install hudi
Crates.io    cargo add hudi

Example usage

Note

These examples assume that a Hudi table exists at /tmp/trips_table, created by following the quick start guide.

Python

Read a Hudi table into a PyArrow table.

from hudi import HudiTableBuilder
import pyarrow as pa

hudi_table = (
    HudiTableBuilder
    .from_base_uri("/tmp/trips_table")
    .with_option("hoodie.read.as.of.timestamp", "20241122010827898")
    .build()
)
records = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])

arrow_table = pa.Table.from_batches(records)
result = arrow_table.select(["rider", "city", "ts", "fare"])
print(result)
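
The as-of timestamp passed above is a 17-digit string. A minimal sketch of building such a value from a Python datetime, assuming the option follows the yyyyMMddHHmmssSSS layout that the example value matches; to_hudi_instant is a hypothetical helper, not part of the hudi package, and time zone handling is left to the caller:

```python
from datetime import datetime


def to_hudi_instant(dt: datetime) -> str:
    """Format a datetime as a 17-digit yyyyMMddHHmmssSSS string (assumed layout)."""
    millis = dt.microsecond // 1000  # keep millisecond precision only
    return dt.strftime("%Y%m%d%H%M%S") + f"{millis:03d}"


as_of = to_hudi_instant(datetime(2024, 11, 22, 1, 8, 27, 898000))
print(as_of)  # prints 20241122010827898
```

The resulting string can be passed as the value of "hoodie.read.as.of.timestamp" in the with_option call shown above.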

Rust (DataFusion)

Add the hudi crate with the datafusion feature to your application to query a Hudi table.
cargo new my_project --bin && cd my_project
cargo add tokio@1 datafusion@42
cargo add hudi --features datafusion

Update src/main.rs with the code snippet below, then run cargo run.

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let hudi = HudiDataSource::new_with_options(
        "/tmp/trips_table",
        [("hoodie.read.as.of.timestamp", "20241122010827898")]).await?;
    ctx.register_table("trips_table", Arc::new(hudi))?;
    let df: DataFrame = ctx.sql("SELECT * from trips_table where city = 'san_francisco'").await?;
    df.show().await?;
    Ok(())
}

Work with cloud storage

Set cloud storage credentials as environment variables, e.g., AWS_*, AZURE_*, or GOOGLE_*; the relevant variables are then picked up automatically. The scheme of the target table's base URI, such as s3://, az://, or gs://, determines which storage backend is used.
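
Before opening a table, it can help to check which credential variables are actually set for a given base URI. A minimal sketch using only the Python standard library; the scheme-to-prefix mapping covers just the three schemes named above, and storage_env_vars is a hypothetical helper, not part of the hudi package:

```python
import os
from urllib.parse import urlparse

# Scheme-to-prefix mapping taken from the paragraph above (assumed, not exhaustive).
ENV_PREFIX_BY_SCHEME = {"s3": "AWS_", "az": "AZURE_", "gs": "GOOGLE_"}


def storage_env_vars(base_uri, environ=None):
    """Return the sorted names of set environment variables relevant to base_uri."""
    environ = os.environ if environ is None else environ
    scheme = urlparse(base_uri).scheme
    prefix = ENV_PREFIX_BY_SCHEME.get(scheme)
    if prefix is None:
        return []  # local path or a scheme this sketch does not cover
    return sorted(name for name in environ if name.startswith(prefix))


fake_env = {"AWS_REGION": "us-west-2", "HOME": "/root"}
print(storage_env_vars("s3://bucket/trips_table", fake_env))  # prints ['AWS_REGION']
```

An empty result for a cloud URI suggests that no matching credentials are exported in the current environment.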

Alternatively, you can pass the storage configuration as options to the HudiTableBuilder or HudiDataSource.

Python

from hudi import HudiTableBuilder

hudi_table = (
    HudiTableBuilder
    .from_base_uri("s3://bucket/trips_table")
    .with_option("aws_region", "us-west-2")
    .build()
)
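
When several options are needed, they can be kept in one dictionary and applied in a loop. A minimal sketch that relies only on the chained with_option call shown above; apply_options is a hypothetical helper, not part of the hudi package, and both option keys are the ones already used in this README:

```python
def apply_options(builder, options):
    """Call with_option once per key/value pair and return the resulting builder."""
    for key, value in options.items():
        # Reassign so this works whether with_option returns self or a new builder.
        builder = builder.with_option(key, value)
    return builder


# Storage and read options collected in one place.
options = {
    "aws_region": "us-west-2",
    "hoodie.read.as.of.timestamp": "20241122010827898",
}
```

With the hudi package installed, this could be used as apply_options(HudiTableBuilder.from_base_uri("s3://bucket/trips_table"), options).build().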

Rust (DataFusion)

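use datafusion::error::Result;
use hudi::HudiDataSource;

#[tokio::main]
async fn main() -> Result<()> {
    // Pass the storage configuration as options instead of environment variables.
    let hudi = HudiDataSource::new_with_options(
        "s3://bucket/trips_table",
        [("aws_region", "us-west-2")]
    ).await?;
    // Register `hudi` with a SessionContext as in the earlier example to query it.
    Ok(())
}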

Contributing

Check out the contributing guide for all the details about making contributions to the project.
