Rust Dataset Generator

A high-performance Rust application that generates banking transaction data as Parquet files and writes them to local disk or uploads them to AWS S3 in parallel. Specify the target size in GB as a command-line argument.

Features

  • Parallel Data Generation: Uses Tokio async runtime for concurrent file generation and uploads
  • Precise Size Tracking: Measures actual Parquet file sizes (not estimates) to accurately reach the target size
  • Arrow Format: Uses Apache Arrow for efficient in-memory data structures
  • Optimized File Sizes: Generates ~128MB Parquet files (optimal for S3 performance)
  • Configurable Size: Specify target dataset size in GB via command-line argument
  • Real-time Progress: Shows detailed progress including batch generation, file writing, and overall percentage complete
  • Banking Transaction Schema: Generates realistic transaction data with the following fields (a schema sketch follows this list):
    • transaction_id (UUID)
    • datetime (timestamp)
    • customer_id (int64)
    • order_qty (int32)
    • order_amount (float64)
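
A minimal sketch of how this schema could be declared with the Apache Arrow crate (field names and types are taken from the list above; the exact definition in src/main.rs may differ in nullability or timestamp unit):

use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

// Sketch of the transaction schema listed above; illustrative only.
fn transaction_schema() -> Arc<Schema> {
    Arc::new(Schema::new(vec![
        Field::new("transaction_id", DataType::Utf8, false), // UUID stored as a string
        Field::new("datetime", DataType::Timestamp(TimeUnit::Millisecond, None), false),
        Field::new("customer_id", DataType::Int64, false),
        Field::new("order_qty", DataType::Int32, false),
        Field::new("order_amount", DataType::Float64, false),
    ]))
}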

Prerequisites

  1. Rust: Install Rust from rustup.rs

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  2. AWS Credentials (only if using S3 output): Configure AWS credentials for S3 access. You can use one of:

    • AWS CLI: aws configure
    • Environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
    • IAM role (if running on EC2)
    • AWS credentials file: ~/.aws/credentials

    Note: By default, the tool writes to a local data/ directory. No AWS credentials are required unless you explicitly enable S3 output.

Installation

  1. Clone or navigate to the project directory:

    cd rustGenerate1TBdataset
  2. Build the project:

    cargo build --release

    This will download dependencies and compile the project. The first build may take several minutes.

Usage

Basic Usage (Local Output - Default)

By default, files are written to a local data/ directory. No AWS credentials are required.

Run the application with a target size in GB:

cargo run --release 1

This will generate 1 GB of data in the data/ directory. You can specify any size:

cargo run --release 0.5   # Generate 0.5 GB
cargo run --release 10    # Generate 10 GB
cargo run --release 1000  # Generate 1 TB (1000 GB)

Or if already built:

./target/release/rustGenerate1TBdataset 1

To specify a custom output directory:

OUTPUT_DIR=/path/to/output cargo run --release 1

S3 Output (Optional)

To write files to S3 instead of local disk, set the USE_S3 environment variable, or set S3_BUCKET and AWS_S3_PREFIX:

USE_S3=1 cargo run --release 1
# or
S3_BUCKET=my-bucket-name AWS_S3_PREFIX=my-prefix cargo run --release 1

When using S3, ensure AWS credentials are configured (see Prerequisites).
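
A rough sketch of how this output-mode selection could look in Rust (the defaults shown match the Configuration section below; the actual logic in src/main.rs may differ):

use std::env;

// Illustrative only: S3 mode is enabled when USE_S3 or S3_BUCKET is set.
fn use_s3_output() -> bool {
    env::var("USE_S3").is_ok() || env::var("S3_BUCKET").is_ok()
}

// Bucket and prefix fall back to the project defaults when not overridden.
fn s3_bucket() -> String {
    env::var("S3_BUCKET").unwrap_or_else(|_| "confessions-of-a-data-guy".to_string())
}

fn s3_prefix() -> String {
    env::var("AWS_S3_PREFIX").unwrap_or_else(|_| "transactions".to_string())
}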

Configuration

The application uses the following constants (defined in src/main.rs):

  • Target Size: Specified via command-line argument (in GB)
  • File Size: 128MB per file (optimal for S3)
  • Batch Size: 100,000 records per batch
  • Concurrent Uploads: 3 parallel uploads
  • Output Mode: Local files by default (written to the data/ directory); S3 if the USE_S3 or S3_BUCKET environment variable is set
  • Local Output Directory: data/ (configurable via OUTPUT_DIR environment variable)
  • S3 Bucket: confessions-of-a-data-guy by default (override with the S3_BUCKET environment variable; S3 mode only)
  • S3 Path: s3://confessions-of-a-data-guy/transactions/ by default (prefix configurable via AWS_S3_PREFIX; S3 mode only)

To modify these, edit the constants at the top of src/main.rs:

const TARGET_FILE_SIZE_BYTES: u64 = 128 * 1024 * 1024; // 128MB
const BATCH_SIZE: usize = 100_000;
const MAX_CONCURRENT_UPLOADS: usize = 3;
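
For reference, a small sketch of how the GB argument could be converted into a byte target (the actual parsing in src/main.rs may handle errors differently):

use std::env;

// Sketch: read the target size in GB from the first CLI argument
// and convert it to bytes. Fractional values such as 0.5 are allowed.
fn target_size_bytes() -> u64 {
    let gb: f64 = env::args()
        .nth(1)
        .expect("usage: rustGenerate1TBdataset <size_in_GB>")
        .parse()
        .expect("target size must be a number, e.g. 0.5, 10, 1000");
    (gb * 1024.0 * 1024.0 * 1024.0) as u64
}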

Output

Local Output (Default)

Files are written to the local data/ directory (or the directory specified by OUTPUT_DIR):

data/part-00000001.parquet
data/part-00000002.parquet
...

S3 Output

When S3 mode is enabled, files are uploaded to:

s3://confessions-of-a-data-guy/transactions/part-00000001.parquet
s3://confessions-of-a-data-guy/transactions/part-00000002.parquet
...
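
The eight-digit part numbering shown above can be produced with a simple format string, for example:

// Sketch of the sequential part-file naming shown above.
fn part_file_name(index: u32) -> String {
    format!("part-{:08}.parquet", index) // part_file_name(1) == "part-00000001.parquet"
}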

Progress Output

The application provides detailed progress information:

Starting data generation to local directory: data
Target size: 1.00 GB
Target file size: 128MB
Max concurrent writes: 3

Generated batch 1: 100000 records, actual Parquet size: 12 MB (cumulative: 12 MB)
Generated batch 2: 100000 records, actual Parquet size: 12 MB (cumulative: 24 MB)
...
Writing 5 batches (500000 total records) to final Parquet file...
  Written batch 1/5 (100000 records) to Parquet buffer
  ...
Parquet file finalized: 5 batches, 500000 total records, 58 MB
Writing file 1 to local disk: data/part-00000001.parquet...
✓ Written file 1: 500000 records, 58 MB | Total: 0.06 GB / 1.00 GB (5.8%)

Performance

  • Parallel Processing: Up to 3 files can be generated and written or uploaded concurrently (see the sketch after this list)
  • Memory Efficient: Uses streaming writes and buffers data efficiently
  • Network Optimized: Files are sized for optimal S3 multipart upload performance
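
A common way to cap concurrency in a Tokio program is a semaphore. Here is a minimal sketch of that pattern, assuming one task per file (not necessarily how src/main.rs structures its tasks):

use std::sync::Arc;
use tokio::sync::Semaphore;

const MAX_CONCURRENT_UPLOADS: usize = 3;

// Sketch: allow at most 3 file-generation/upload tasks in flight at once.
async fn run_files(file_indices: Vec<u32>) {
    let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_UPLOADS));
    let mut handles = Vec::new();

    for idx in file_indices {
        // Wait for a free slot before spawning the next task.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            // generate one ~128MB Parquet file and write/upload it here
            let _permit = permit; // slot is released when the task finishes
            println!("finished file {idx}");
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}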

Troubleshooting

AWS Credentials Issues

If you see authentication errors:

# Verify AWS credentials
aws s3 ls s3://confessions-of-a-data-guy/

# Or set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1

Compilation Issues

If you encounter compilation errors related to arrow-arith:

  • The code includes a patch for a known compatibility issue
  • Ensure you're using Rust 1.70+ and Cargo 1.70+
  • Try cargo clean && cargo build --release

Out of Memory

If you run out of memory, try one of the following (example values are shown after the list):

  • Reduce MAX_CONCURRENT_UPLOADS in the code
  • Reduce TARGET_FILE_SIZE_BYTES to generate smaller files
  • Reduce BATCH_SIZE to use less memory per batch
  • Generate smaller datasets by specifying a smaller GB value
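
For example, lowering the constants from the Configuration section reduces peak memory per file (illustrative values, not the defaults):

// Example of reduced values; tune to the memory available on your machine.
const TARGET_FILE_SIZE_BYTES: u64 = 64 * 1024 * 1024; // 64MB files instead of 128MB
const BATCH_SIZE: usize = 50_000;                      // half the default batch size
const MAX_CONCURRENT_UPLOADS: usize = 1;               // write one file at a time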

License

This project is provided as-is for data generation purposes.
