Rust Dataset Generator

A high-performance Rust application that generates banking transaction data as Parquet files and writes them to local disk or uploads them to AWS S3 in parallel. Specify the target size in GB as a command-line argument.

Features

  • Parallel Data Generation: Uses Tokio async runtime for concurrent file generation and uploads
  • Precise Size Tracking: Measures actual Parquet file sizes (not estimates) to accurately reach the target size
  • Arrow Format: Uses Apache Arrow for efficient in-memory data structures
  • Optimized File Sizes: Generates ~128MB Parquet files (optimal for S3 performance)
  • Configurable Size: Specify target dataset size in GB via command-line argument
  • Real-time Progress: Shows detailed progress including batch generation, file writing, and overall percentage complete
  • Banking Transaction Schema: Generates realistic transaction data with the following fields (a schema sketch follows this list):
    • transaction_id (UUID)
    • datetime (timestamp)
    • customer_id (int64)
    • order_qty (int32)
    • order_amount (float64)
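
A minimal sketch of how this schema could be declared with the Apache Arrow crate (field names and types are taken from the list above; the exact definition in src/main.rs may differ in nullability or timestamp unit):

use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

// Sketch of the transaction schema listed above; illustrative only.
fn transaction_schema() -> Arc<Schema> {
    Arc::new(Schema::new(vec![
        Field::new("transaction_id", DataType::Utf8, false), // UUID stored as a string
        Field::new("datetime", DataType::Timestamp(TimeUnit::Millisecond, None), false),
        Field::new("customer_id", DataType::Int64, false),
        Field::new("order_qty", DataType::Int32, false),
        Field::new("order_amount", DataType::Float64, false),
    ]))
}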

Prerequisites

  1. Rust: Install Rust from rustup.rs

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  2. AWS Credentials (only if using S3 output): Configure AWS credentials for S3 access. You can use one of:

    • AWS CLI: aws configure
    • Environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
    • IAM role (if running on EC2)
    • AWS credentials file: ~/.aws/credentials

    Note: By default, the tool writes to a local data/ directory. No AWS credentials are required unless you explicitly enable S3 output.

Installation

  1. Clone or navigate to the project directory:

    cd rustGenerate1TBdataset
  2. Build the project:

    cargo build --release

    This will download dependencies and compile the project. The first build may take several minutes.

Usage

Basic Usage (Local Output - Default)

By default, files are written to a local data/ directory. No AWS credentials are required.

Run the application with a target size in GB:

cargo run --release 1

This will generate 1 GB of data in the data/ directory. You can specify any size:

cargo run --release 0.5   # Generate 0.5 GB
cargo run --release 10    # Generate 10 GB
cargo run --release 1000  # Generate 1 TB (1000 GB)

Or if already built:

./target/release/rustGenerate1TBdataset 1

To specify a custom output directory:

OUTPUT_DIR=/path/to/output cargo run --release 1

S3 Output (Optional)

To write files to S3 instead of local disk, set the USE_S3 environment variable, or set S3_BUCKET and AWS_S3_PREFIX:

USE_S3=1 cargo run --release 1
# or
S3_BUCKET=my-bucket-name AWS_S3_PREFIX=my-prefix cargo run --release 1

When using S3, ensure AWS credentials are configured (see Prerequisites).
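
A rough sketch of how this output-mode selection could look in Rust (the defaults shown match the Configuration section below; the actual logic in src/main.rs may differ):

use std::env;

// Illustrative only: S3 mode is enabled when USE_S3 or S3_BUCKET is set.
fn use_s3_output() -> bool {
    env::var("USE_S3").is_ok() || env::var("S3_BUCKET").is_ok()
}

// Bucket and prefix fall back to the project defaults when not overridden.
fn s3_bucket() -> String {
    env::var("S3_BUCKET").unwrap_or_else(|_| "confessions-of-a-data-guy".to_string())
}

fn s3_prefix() -> String {
    env::var("AWS_S3_PREFIX").unwrap_or_else(|_| "transactions".to_string())
}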

Configuration

The application uses the following constants (defined in src/main.rs):

  • Target Size: Specified via command-line argument (in GB)
  • File Size: 128MB per file (optimal for S3)
  • Batch Size: 100,000 records per batch
  • Concurrent Uploads: 3 parallel uploads
  • Output Mode: Local files by default (written to the data/ directory); S3 if the USE_S3 or S3_BUCKET environment variable is set
  • Local Output Directory: data/ (configurable via OUTPUT_DIR environment variable)
  • S3 Bucket: confessions-of-a-data-guy by default (override with the S3_BUCKET environment variable; S3 mode only)
  • S3 Path: s3://confessions-of-a-data-guy/transactions/ by default (prefix configurable via AWS_S3_PREFIX; S3 mode only)

To modify these, edit the constants at the top of src/main.rs:

const TARGET_FILE_SIZE_BYTES: u64 = 128 * 1024 * 1024; // 128MB
const BATCH_SIZE: usize = 100_000;
const MAX_CONCURRENT_UPLOADS: usize = 3;
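
For reference, a small sketch of how the GB argument could be converted into a byte target (the actual parsing in src/main.rs may handle errors differently):

use std::env;

// Sketch: read the target size in GB from the first CLI argument
// and convert it to bytes. Fractional values such as 0.5 are allowed.
fn target_size_bytes() -> u64 {
    let gb: f64 = env::args()
        .nth(1)
        .expect("usage: rustGenerate1TBdataset <size_in_GB>")
        .parse()
        .expect("target size must be a number, e.g. 0.5, 10, 1000");
    (gb * 1024.0 * 1024.0 * 1024.0) as u64
}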

Output

Local Output (Default)

Files are written to the local data/ directory (or the directory specified by OUTPUT_DIR):

data/part-00000001.parquet
data/part-00000002.parquet
...

S3 Output

When S3 mode is enabled, files are uploaded to:

s3://confessions-of-a-data-guy/transactions/part-00000001.parquet
s3://confessions-of-a-data-guy/transactions/part-00000002.parquet
...
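
The eight-digit part numbering shown above can be produced with a simple format string, for example:

// Sketch of the sequential part-file naming shown above.
fn part_file_name(index: u32) -> String {
    format!("part-{:08}.parquet", index) // part_file_name(1) == "part-00000001.parquet"
}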

Progress Output

The application provides detailed progress information:

Starting data generation to local directory: data
Target size: 1.00 GB
Target file size: 128MB
Max concurrent writes: 3

Generated batch 1: 100000 records, actual Parquet size: 12 MB (cumulative: 12 MB)
Generated batch 2: 100000 records, actual Parquet size: 12 MB (cumulative: 24 MB)
...
Writing 5 batches (500000 total records) to final Parquet file...
  Written batch 1/5 (100000 records) to Parquet buffer
  ...
Parquet file finalized: 5 batches, 500000 total records, 58 MB
Writing file 1 to local disk: data/part-00000001.parquet...
✓ Written file 1: 500000 records, 58 MB | Total: 0.06 GB / 1.00 GB (5.8%)

Performance

  • Parallel Processing: Up to 3 files can be generated and written or uploaded concurrently (see the sketch after this list)
  • Memory Efficient: Uses streaming writes and buffers data efficiently
  • Network Optimized: Files are sized for optimal S3 multipart upload performance
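
A common way to cap concurrency in a Tokio program is a semaphore. Here is a minimal sketch of that pattern, assuming one task per file (not necessarily how src/main.rs structures its tasks):

use std::sync::Arc;
use tokio::sync::Semaphore;

const MAX_CONCURRENT_UPLOADS: usize = 3;

// Sketch: allow at most 3 file-generation/upload tasks in flight at once.
async fn run_files(file_indices: Vec<u32>) {
    let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_UPLOADS));
    let mut handles = Vec::new();

    for idx in file_indices {
        // Wait for a free slot before spawning the next task.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            // generate one ~128MB Parquet file and write/upload it here
            let _permit = permit; // slot is released when the task finishes
            println!("finished file {idx}");
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}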

Troubleshooting

AWS Credentials Issues

If you see authentication errors:

# Verify AWS credentials
aws s3 ls s3://confessions-of-a-data-guy/

# Or set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1

Compilation Issues

If you encounter compilation errors related to arrow-arith:

  • The code includes a patch for a known compatibility issue
  • Ensure you're using Rust 1.70+ and Cargo 1.70+
  • Try cargo clean && cargo build --release

Out of Memory

If you run out of memory, try one of the following (example values are shown after the list):

  • Reduce MAX_CONCURRENT_UPLOADS in the code
  • Reduce TARGET_FILE_SIZE_BYTES to generate smaller files
  • Reduce BATCH_SIZE to use less memory per batch
  • Generate smaller datasets by specifying a smaller GB value
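
For example, lowering the constants from the Configuration section reduces peak memory per file (illustrative values, not the defaults):

// Example of reduced values; tune to the memory available on your machine.
const TARGET_FILE_SIZE_BYTES: u64 = 64 * 1024 * 1024; // 64MB files instead of 128MB
const BATCH_SIZE: usize = 50_000;                      // half the default batch size
const MAX_CONCURRENT_UPLOADS: usize = 1;               // write one file at a time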

License

This project is provided as-is for data generation purposes.
