A high-performance Rust application that generates banking transaction data as Parquet files and writes them to local disk or uploads them to AWS S3 in parallel. The target dataset size in GB is specified as a command-line argument.
- Parallel Data Generation: Uses Tokio async runtime for concurrent file generation and uploads
- Precise Size Tracking: Measures actual Parquet file sizes (not estimates) to accurately reach the target dataset size
- Arrow Format: Uses Apache Arrow for efficient in-memory data structures
- Optimized File Sizes: Generates ~128MB Parquet files (optimal for S3 performance)
- Configurable Size: Specify target dataset size in GB via command-line argument
- Real-time Progress: Shows detailed progress including batch generation, file writing, and overall percentage complete
- Banking Transaction Schema: Generates realistic transaction data with the following fields (a schema sketch follows this list):
  - transaction_id (UUID)
  - datetime (timestamp)
  - customer_id (int64)
  - order_qty (int32)
  - order_amount (float64)
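As a reference point, here is a minimal sketch of how this schema might be declared with the Arrow Rust crate. The field names and types follow the list above; the millisecond timestamp unit and storing the UUID as a UTF-8 string are assumptions for illustration, not details taken from src/main.rs.

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

// Illustrative only: the five transaction fields listed above as an Arrow schema.
// UUIDs are stored as UTF-8 strings; timestamp unit is assumed to be milliseconds.
fn transaction_schema() -> Arc<Schema> {
    Arc::new(Schema::new(vec![
        Field::new("transaction_id", DataType::Utf8, false),
        Field::new("datetime", DataType::Timestamp(TimeUnit::Millisecond, None), false),
        Field::new("customer_id", DataType::Int64, false),
        Field::new("order_qty", DataType::Int32, false),
        Field::new("order_amount", DataType::Float64, false),
    ]))
}
```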
- Rust: Install Rust from rustup.rs:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- AWS Credentials (only if using S3 output): Configure AWS credentials for S3 access. You can use one of:
  - AWS CLI: aws configure
  - Environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
  - IAM role (if running on EC2)
  - AWS credentials file: ~/.aws/credentials
  Note: By default, the tool writes to a local data/ directory. No AWS credentials are required unless you explicitly enable S3 output.
- Clone or navigate to the project directory:
cd rustGenerate1TBdataset
- Build the project:
cargo build --release
This will download dependencies and compile the project. The first build may take several minutes.
By default, files are written to a local data/ directory. No AWS credentials are required.
Run the application with a target size in GB:
cargo run --release 1
This will generate 1 GB of data in the data/ directory. You can specify any size:
cargo run --release 0.5 # Generate 0.5 GB
cargo run --release 10 # Generate 10 GB
cargo run --release 1000 # Generate 1 TB (1000 GB)
Or if already built:
./target/release/rustGenerate1TBdataset 1
To specify a custom output directory:
OUTPUT_DIR=/path/to/output cargo run --release 1
To write files to S3 instead of local disk, set the USE_S3 environment variable, or the S3_BUCKET and AWS_S3_PREFIX environment variables:
USE_S3=1 cargo run --release 1
# or
S3_BUCKET=my-bucket-name AWS_S3_PREFIX=my-prefix cargo run --release 1
When using S3, ensure AWS credentials are configured (see Prerequisites).
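The selection between local and S3 output can be pictured as follows. This is a hedged sketch that uses only the environment variables documented above (USE_S3, S3_BUCKET, AWS_S3_PREFIX, OUTPUT_DIR) with the README's defaults; it is not the literal logic in src/main.rs.

```rust
use std::env;

// Illustrative sketch of output-mode selection from environment variables.
// Defaults mirror the README: local data/ directory, or the project's
// default bucket and prefix when S3 mode is enabled.
enum OutputMode {
    Local { dir: String },
    S3 { bucket: String, prefix: String },
}

fn output_mode() -> OutputMode {
    if env::var("USE_S3").is_ok() || env::var("S3_BUCKET").is_ok() {
        OutputMode::S3 {
            bucket: env::var("S3_BUCKET")
                .unwrap_or_else(|_| "confessions-of-a-data-guy".to_string()),
            prefix: env::var("AWS_S3_PREFIX").unwrap_or_else(|_| "transactions".to_string()),
        }
    } else {
        OutputMode::Local {
            dir: env::var("OUTPUT_DIR").unwrap_or_else(|_| "data".to_string()),
        }
    }
}
```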
The application uses the following configuration (constants are defined in src/main.rs):
- Target Size: Specified via command-line argument (in GB)
- File Size: 128MB per file (optimal for S3)
- Batch Size: 100,000 records per batch
- Concurrent Uploads: 3 parallel uploads
- Output Mode: Local files by default (to the data/ directory); S3 if the USE_S3 or S3_BUCKET environment variable is set
- Local Output Directory: data/ (configurable via the OUTPUT_DIR environment variable)
- S3 Bucket: confessions-of-a-data-guy (when S3 mode is enabled)
- S3 Path: s3://confessions-of-a-data-guy/transactions/ (when S3 mode is enabled)
To modify these, edit the constants at the top of src/main.rs:
const TARGET_FILE_SIZE_BYTES: u64 = 128 * 1024 * 1024; // 128MB
const BATCH_SIZE: usize = 100_000;
const MAX_CONCURRENT_UPLOADS: usize = 3;
Files are written to the local data/ directory (or the directory specified by OUTPUT_DIR):
data/part-00000001.parquet
data/part-00000002.parquet
...
When S3 mode is enabled, files are uploaded to:
s3://confessions-of-a-data-guy/transactions/part-00000001.parquet
s3://confessions-of-a-data-guy/transactions/part-00000002.parquet
...
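For orientation, here is a minimal sketch of what each upload could look like, assuming the project uses the official aws-sdk-s3 crate; the actual upload code, error handling, and any multipart logic in src/main.rs may differ.

```rust
use aws_sdk_s3::{primitives::ByteStream, Client};

// Illustrative only: upload one finished Parquet buffer to
// s3://<bucket>/<prefix>/part-XXXXXXXX.parquet
async fn upload_part(
    client: &Client,
    bucket: &str,
    prefix: &str,
    file_index: u64,
    bytes: Vec<u8>,
) -> Result<(), aws_sdk_s3::Error> {
    let key = format!("{}/part-{:08}.parquet", prefix.trim_end_matches('/'), file_index);
    client
        .put_object()
        .bucket(bucket)
        .key(key)
        .body(ByteStream::from(bytes))
        .send()
        .await?;
    Ok(())
}
```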
The application provides detailed progress information:
Starting data generation to local directory: data
Target size: 1.00 GB
Target file size: 128MB
Max concurrent writes: 3
Generated batch 1: 100000 records, actual Parquet size: 12 MB (cumulative: 12 MB)
Generated batch 2: 100000 records, actual Parquet size: 12 MB (cumulative: 24 MB)
...
Writing 5 batches (500000 total records) to final Parquet file...
Written batch 1/5 (100000 records) to Parquet buffer
...
Parquet file finalized: 5 batches, 500000 total records, 58 MB
Writing file 1 to local disk: data/part-00000001.parquet...
✓ Written file 1: 500000 records, 58 MB | Total: 0.06 GB / 1.00 GB (5.8%)
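The "actual Parquet size" figures above come from measuring the encoded bytes rather than estimating row sizes. A minimal sketch of that approach, assuming batches are first written to an in-memory buffer; the real writer setup and compression settings may differ.

```rust
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

// Illustrative only: encode a set of batches into an in-memory Parquet buffer
// so its real size (after encoding/compression) can be measured.
// Assumes `batches` contains at least one RecordBatch.
fn batches_to_parquet(batches: &[RecordBatch]) -> parquet::errors::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batches[0].schema(), None)?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.close()?; // write the footer before measuring
    Ok(buffer)       // buffer.len() is the actual Parquet size in bytes
}
```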
- Parallel Processing: Up to 3 files can be generated and uploaded concurrently (a concurrency sketch follows this list)
- Memory Efficient: Uses streaming writes and buffers data efficiently
- Network Optimized: Files are sized for optimal S3 multipart upload performance
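The bounded concurrency described above is commonly implemented with a Tokio semaphore. The following is a hedged sketch; the task structure and the write_file helper are illustrative placeholders, not the exact code in src/main.rs.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinSet;

// Illustrative only: cap in-flight writes/uploads at MAX_CONCURRENT_UPLOADS.
async fn write_all(files: Vec<Vec<u8>>, max_concurrent: usize) {
    let semaphore = Arc::new(Semaphore::new(max_concurrent));
    let mut tasks = JoinSet::new();
    for (i, bytes) in files.into_iter().enumerate() {
        // Wait for a free slot before spawning the next write.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        tasks.spawn(async move {
            write_file(i + 1, bytes).await;
            drop(permit); // free the slot for the next file
        });
    }
    while tasks.join_next().await.is_some() {}
}

// Hypothetical local-disk writer; an S3 upload would take its place in S3 mode.
async fn write_file(index: usize, bytes: Vec<u8>) {
    let path = format!("data/part-{:08}.parquet", index);
    tokio::fs::write(path, bytes).await.expect("failed to write file");
}
```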
If you see authentication errors:
# Verify AWS credentials
aws s3 ls s3://confessions-of-a-data-guy/
# Or set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
If you encounter compilation errors related to arrow-arith:
- The code includes a patch for a known compatibility issue
- Ensure you're using Rust 1.70+ and Cargo 1.70+
- Try cargo clean && cargo build --release
If you run out of memory:
- Reduce MAX_CONCURRENT_UPLOADS in the code
- Reduce TARGET_FILE_SIZE_BYTES to generate smaller files
- Reduce BATCH_SIZE to use less memory per batch
- Generate smaller datasets by specifying a smaller GB value
This project is provided as-is for data generation purposes.