Up to 32x faster than Apache Arrow on deserialization. 46x less CPU than SafeTensors.
Zero-copy, SIMD-aligned tensor protocol for high-performance ML infrastructure.
Most serialization formats are designed for general data or disk storage. Tenso is focused on network tensor transmission where every microsecond matters.
Traditional formats waste CPU cycles during deserialization:
- SafeTensors: 37.1% CPU usage (great for disk, but heavyweight for network transport)
- Pickle: 40.9% CPU usage + security vulnerabilities
- Arrow: Faster to serialize small tensors, but up to 32x slower to deserialize large ones
Tenso achieves true zero-copy with:
- Minimalist Header: Fixed 8-byte header eliminates JSON parsing overhead.
- 64-byte Alignment: SIMD-ready padding ensures the data body is cache-line aligned.
- Direct Memory Mapping: Deserialization returns a view directly into the existing buffer, with no copy.
Result: 0.8% CPU usage vs >40% for SafeTensors/Pickle.
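Both properties are observable from Python. A minimal sketch, assuming (per the quickstart below) that `loads()` returns a NumPy view over the packet buffer:

```python
import numpy as np
import tenso

data = np.random.rand(256, 256).astype(np.float32)
packet = tenso.dumps(data)
restored = tenso.loads(packet)

# Zero-copy: the result borrows the packet's memory rather than owning a copy.
assert not restored.flags["OWNDATA"]

# Alignment: the padded body should start on a 64-byte boundary.
assert restored.ctypes.data % 64 == 0
```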
System: Apple M4 Pro (12 CPU cores), Python 3.12.9, NumPy 2.3.5
| Format | Size | Serialize | Deserialize | Deserialize vs Tenso |
|---|---|---|---|---|
| Tenso | 64.00 MB | 3.51 ms | 0.004 ms | 1x |
| Arrow | 64.00 MB | 7.06 ms | 0.011 ms | 2.8x slower |
| SafeTensors | 64.00 MB | 8.14 ms | 2.39 ms | 597x slower |
| Pickle | 64.00 MB | 2.93 ms | 2.71 ms | 677x slower |
| MsgPack | 64.00 MB | 10.44 ms | 3.05 ms | 763x slower |
Note: The Tenso (Vect) variant is faster still, with a deserialize time below the benchmark's 0.001 ms reporting resolution (shown as 0.000 ms).
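These numbers are easy to sanity-check yourself. A minimal sketch using only the public `dumps`/`loads` calls; a 4096x4096 float32 array reproduces the 64 MB payload:

```python
import time
import numpy as np
import tenso

data = np.random.rand(4096, 4096).astype(np.float32)  # 64 MB payload
packet = tenso.dumps(data)

# Time the zero-copy deserialize path over many iterations.
iterations = 1000
start = time.perf_counter()
for _ in range(iterations):
    tenso.loads(packet)
elapsed_ms = (time.perf_counter() - start) / iterations * 1e3
print(f"deserialize: {elapsed_ms:.4f} ms")
```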
Disk I/O (file write/read):
| Format | Write | Read |
|---|---|---|
| Tenso | 29.41 ms | 36.28 ms |
| NumPy .npy | 24.83 ms | 43.08 ms |
| Pickle | 49.90 ms | 24.24 ms |
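Since a Tenso packet is plain bytes, the disk path needs no dedicated file API. A sketch (the filename is illustrative):

```python
import numpy as np
import tenso

data = np.random.rand(4096, 4096).astype(np.float32)

# Write: the packet is an ordinary bytes-like object.
with open("tensor.tenso", "wb") as f:
    f.write(tenso.dumps(data))

# Read: load the file contents and reinterpret them zero-copy.
with open("tensor.tenso", "rb") as f:
    restored = tenso.loads(f.read())
```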
Streaming reads:
| Method | Time | Throughput | vs Tenso |
|---|---|---|---|
| Tenso read_stream | 7.68 ms | 12,417 MB/s | 1x |
| Optimised Loop | 13.89 ms | 7,396 MB/s | 1.8x slower |
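A sketch of `read_stream` against a plain TCP socket. It assumes `read_stream()` accepts any binary file-like stream, mirroring the async `aread_stream` API shown below; host and port are illustrative:

```python
import socket
import tenso

# Connect to a Tenso producer and read one tensor off the wire.
sock = socket.create_connection(("127.0.0.1", 9000))
with sock.makefile("rb") as stream:
    tensor = tenso.read_stream(stream)
sock.close()
```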
CPU utilization during serialization and deserialization:
| Format | Serialize CPU% | Deserialize CPU% |
|---|---|---|
| Tenso | 117.3% | 0.8% |
| Arrow | 57.1% | 1.0% |
| SafeTensors | 67.1% | 37.1% |
| Pickle | 44.0% | 40.9% |
Note: values above 100% indicate multi-core usage.
Scaling with tensor size (vs Arrow):
| Size | Tenso Ser | Arrow Ser | Tenso Des | Arrow Des | Deser Speedup |
|---|---|---|---|---|---|
| Small | 0.130ms | 0.056ms | 0.009ms | 0.035ms | 4.1x |
| Medium | 0.972ms | 0.912ms | 0.020ms | 0.040ms | 2.0x |
| Large | 3.166ms | 3.655ms | 0.019ms | 0.222ms | 11.8x |
| XLarge | 19.086ms | 28.726ms | 0.023ms | 0.733ms | 32.0x |
- Packet Throughput: 89,183 packets/sec (over localhost TCP)
- Latency: 11.2 µs/packet
- Async Write Throughput: 88,397 MB/s (1.4M tensors/sec)
```bash
pip install tenso
```

```python
import numpy as np
import tenso

# Create a tensor
data = np.random.rand(1024, 1024).astype(np.float32)

# Serialize
packet = tenso.dumps(data)

# Deserialize (zero-copy view)
restored = tenso.loads(packet)
```

Async streaming over TCP:

```python
import asyncio
import tenso
async def handle_client(reader, writer):
    # Asynchronously read a tensor from the stream
    data = await tenso.aread_stream(reader)
    # Process and write back
    await tenso.awrite_stream(data * 2, writer)
```
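Wiring the handler into a server and calling it from a client looks like this. A sketch: the port is illustrative, and it assumes `aread_stream`/`awrite_stream` accept asyncio's StreamReader/StreamWriter, as the handler above suggests:

```python
import asyncio
import numpy as np
import tenso

async def main():
    # Serve handle_client (defined above) on localhost.
    server = await asyncio.start_server(handle_client, "127.0.0.1", 9000)
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", 9000)
        await tenso.awrite_stream(np.arange(4, dtype=np.float32), writer)
        doubled = await tenso.aread_stream(reader)  # -> [0. 2. 4. 6.]
        writer.close()
        await writer.wait_closed()

asyncio.run(main())
```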
FastAPI integration:

```python
from fastapi import FastAPI
import numpy as np
from tenso.fastapi import TensoResponse
app = FastAPI()
@app.get("/tensor")
async def get_tensor():
data = np.ones((1024, 1024), dtype=np.float32)
return TensoResponse(data) # Zero-copy streaming responseSupports fast transfers between Tenso streams and device memory for CuPy, PyTorch, and JAX using pinned host memory.
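On the consumer side, the response body is an ordinary Tenso packet. A sketch, assuming the `requests` package and an illustrative local URL:

```python
import requests
import tenso

resp = requests.get("http://127.0.0.1:8000/tensor")
tensor = tenso.loads(resp.content)  # zero-copy view over the response bytes
```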
Tenso supports fast transfers between Tenso streams and device memory for CuPy, PyTorch, and JAX using pinned host memory.

```python
import tenso.gpu as tgpu
# Read directly from a stream into a GPU tensor
torch_tensor = tgpu.read_to_device(stream, device_id=0)
```
Tenso natively supports complex data structures beyond simple dense arrays (a sketch follows this list):
- Sparse Matrices: Direct serialization for COO, CSR, and CSC formats.
- Dictionary Bundling: Pack multiple tensors into a single nested dictionary packet.
- LZ4 Compression: Optional high-speed compression for sparse or redundant data.
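A sketch of how the three features might be used. The dict and sparse calls follow the same `dumps`/`loads` pattern as above; the `compress` keyword is hypothetical, named here only for illustration:

```python
import numpy as np
import scipy.sparse as sp
import tenso

# Sparse matrices: COO/CSR/CSC serialize directly.
csr = sp.random(1000, 1000, density=0.01, format="csr")
sparse_packet = tenso.dumps(csr)

# Dictionary bundling: pack several tensors into one packet.
bundle = tenso.dumps({"weights": np.zeros((64, 64)), "bias": np.zeros(64)})

# Optional LZ4 compression (hypothetical keyword argument).
compressed = tenso.dumps(csr, compress=True)
```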
Protect your tensors against network corruption with ultra-fast 64-bit checksums:

```python
# Serialize with 64-bit checksum footer
packet = tenso.dumps(data, check_integrity=True)
# Verification is automatic during loads()
restored = tenso.loads(packet)
```
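Continuing from the block above, you can see the footer catch corruption by flipping a byte and reloading. A sketch; the exact exception raised on a checksum mismatch isn't specified here, so it catches broadly:

```python
corrupted = bytearray(packet)
corrupted[-100] ^= 0xFF  # corrupt one byte of the body

try:
    tenso.loads(bytes(corrupted))
except Exception as err:  # exact exception type is an assumption
    print("integrity check failed:", err)
```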
Tenso provides built-in support for gRPC, allowing you to pass tensors between services with minimal overhead.

```python
from tenso.grpc import tenso_msg_pb2, tenso_msg_pb2_grpc
import tenso
# In your Servicer
def Predict(self, request, context):
    data = tenso.loads(request.tensor_packet)
    result = data * 2
    return tenso_msg_pb2.PredictResponse(
        result_packet=bytes(tenso.dumps(result))
    )
```
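The client side mirrors the servicer. A sketch: the channel address, the stub class name, and the `PredictRequest` message with a `tensor_packet` field are assumptions inferred from the servicer above:

```python
import grpc
import numpy as np
import tenso
from tenso.grpc import tenso_msg_pb2, tenso_msg_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = tenso_msg_pb2_grpc.PredictorStub(channel)  # stub name is illustrative

request = tenso_msg_pb2.PredictRequest(  # message name inferred from PredictResponse
    tensor_packet=bytes(tenso.dumps(np.ones(8, dtype=np.float32)))
)
result = tenso.loads(stub.Predict(request).result_packet)
```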
Tenso uses a minimalist packet structure designed for direct memory access:

```
┌─────────────┬──────────────┬──────────────┬────────────────────────┬──────────────┐
│ HEADER │ SHAPE │ PADDING │ BODY (Raw Data) │ FOOTER │
│ 8 bytes │ Variable │ 0-63 bytes │ C-Contiguous Array │ 8 bytes* │
└─────────────┴──────────────┴──────────────┴────────────────────────┴──────────────┘
(*Optional)
```
The padding ensures the body starts at a 64-byte boundary, enabling AVX-512 vectorization and zero-copy memory mapping.
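The padding length is pure arithmetic over what precedes the body. A worked sketch: the 8-byte header and the 64-byte boundary come from the layout above, while the size of the shape block here is an illustrative assumption:

```python
def padding_for(header_len: int, shape_len: int) -> int:
    """Bytes of padding so the body begins on the next 64-byte boundary."""
    offset = header_len + shape_len
    return (-offset) % 64  # always 0-63 bytes, matching the diagram

# e.g. an 8-byte header plus an assumed 8-byte shape block -> 48 bytes of padding
assert padding_for(8, 8) == 48
assert (8 + 8 + padding_for(8, 8)) % 64 == 0
```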
- Model Serving APIs: Up to 32x faster deserialization with 46x less CPU cuts per-request overhead on inference nodes.
- Distributed Training: Efficiently pass gradients or activations between nodes (Ray, Spark).
- GPU-Direct Pipelines: Stream data from network cards to GPU memory with minimal host intervention.
- Real-time Robotics: 11.2 µs per-packet latency for high-frequency sensor fusion (LIDAR, Radar).
- High-Throughput Streaming: 89K packets/sec network transmission for real-time data pipelines.
Contributions are welcome! We are currently looking for help with:
- C++ / JavaScript Clients: Extending the protocol to other ecosystems.
Apache License 2.0 - see LICENSE file.
```bibtex
@software{tenso2025,
  author = {Khushiyant},
  title  = {Tenso: High-Performance Zero-Copy Tensor Protocol},
  year   = {2025},
  url    = {https://github.com/Khushiyant/tenso}
}
```