Find Data Fast. Build Projects Faster.
The easiest way to discover and select datasets in the SyftBox ecosystem. Every great Syft project starts with finding the right data, and syft-datasets makes that effortless.
Search, select, and generate code for your datasets in seconds
Problem: Finding and working with datasets across SyftBox datasites is tedious
Solution: Beautiful interactive interface that makes dataset discovery a breeze
import syft_datasets as syd
# 1. Browse all available datasets
syd.datasets # Shows interactive table in Jupyter
# 2. Search for what you need
crop_data = syd.datasets.search("crop")
# 3. Select datasets visually with checkboxes
# 4. Click "Generate Code" → automatic copy to clipboard
# 5. Paste and use immediately
# Selected datasets:
datasets = [syd.datasets[i] for i in [0, 1, 5]]
- Interactive Jupyter UI - Beautiful HTML tables with real-time search
- Smart Search - Find datasets by name, email, or keyword
- Visual Selection - Checkbox interface with one-click code generation
- Copy-Paste Ready - Generated code works immediately
- Cross-Datasite - Discovers datasets across all your connected datasites
- Privacy-First - Respects SyftBox security model
pip install syft-datasets
import syft_datasets as syd
# Interactive dataset browsing (run in Jupyter)
syd.datasets
# Search for specific datasets
financial_data = syd.datasets.search("financial")
andrew_datasets = syd.datasets.filter_by_email("andrew")
# Use datasets by index
first_dataset = syd.datasets[0]
first_five = syd.datasets[:5]
specific_ones = syd.datasets.get_by_indices([0, 2, 7])
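To try the search, filter, and index calls above without a running SyftBox, a small in-memory stand-in can help. The sketch below is hypothetical: `FakeDataset` and `FakeCollection` are illustration-only names, and the matching rules (case-insensitive substring match on name and email, slices returning a collection, out-of-range indices skipped) are assumptions, not the documented behavior of syft-datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeDataset:
    """Hypothetical stand-in for one dataset entry (not the real class)."""
    name: str
    email: str


class FakeCollection:
    """Hypothetical collection mimicking the calls shown above."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)

    def __getitem__(self, index):
        # Assumption: a slice yields another collection, an int yields one entry.
        if isinstance(index, slice):
            return FakeCollection(self._items[index])
        return self._items[index]

    def search(self, keyword):
        # Assumption: case-insensitive substring match on name or email.
        needle = keyword.lower()
        return FakeCollection(
            d for d in self._items
            if needle in d.name.lower() or needle in d.email.lower()
        )

    def filter_by_email(self, pattern):
        # Assumption: case-insensitive substring match on the email only.
        needle = pattern.lower()
        return FakeCollection(d for d in self._items if needle in d.email.lower())

    def get_by_indices(self, indices):
        # Assumption: indices outside the collection are silently skipped.
        return [self._items[i] for i in indices if 0 <= i < len(self._items)]


datasets = FakeCollection([
    FakeDataset("crop_yield_2023", "andrew@example.org"),
    FakeDataset("financial_reports", "maria@example.com"),
    FakeDataset("crop_prices", "maria@example.com"),
])

# Chained search and filter, as in the examples above.
crop_from_andrew = datasets.search("crop").filter_by_email("andrew")
print([d.name for d in crop_from_andrew])
```

Because each method returns a new collection, calls can be chained in any order without mutating the original list.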
Use your selected datasets with OpenAI-compatible chat:
import syft_datasets as syd
import syft_nsai as nsai
# 1. Find your data
selected_datasets = [syd.datasets[i] for i in [0, 1, 5]]
# 2. Analyze with AI
response = nsai.client.chat.completions.create(
    model=selected_datasets,
    messages=[{"role": "user", "content": "What insights can you find?"}]
)
- Real-time search box - Filter as you type
- Sortable columns - Email, dataset name, URLs
- Checkbox selection - Visual dataset picking
- One-click code generation - Automatic clipboard copy
- Connection status - See your SyftBox status at a glance
# Get dataset information
dataset = syd.datasets[0]
print(f"Dataset: {dataset.name}")
print(f"From: {dataset.email}")
print(f"URL: {dataset.syft_url}")
# Utility methods
syd.datasets.list_unique_emails() # All available emails
syd.datasets.list_unique_names() # All dataset names
len(syd.datasets) # Count datasets
syd.datasets.help() # Show help
# Find crop datasets from specific users
crop_data = syd.datasets.search("crop").filter_by_email("andrew")
# Select datasets matching patterns
ml_datasets = [ds for ds in syd.datasets
               if any(kw in ds.name.lower() for kw in ['model', 'train'])]
# Group datasets by domain
from collections import defaultdict
by_domain = defaultdict(list)
for ds in syd.datasets:
    domain = ds.email.split('@')[1]
    by_domain[domain].append(ds)
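The grouping above raises an `IndexError` if an email has no `@`. A slightly more defensive variant is sketched below, using a hypothetical `Entry` tuple in place of real dataset objects. The only assumption about datasets is that they expose an `email` attribute, as in the examples above. The `"unknown"` bucket is an illustrative choice.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a dataset object; only `.email` is needed here.
Entry = namedtuple("Entry", ["name", "email"])


def group_by_domain(entries):
    """Group entries by the domain part of their email address."""
    groups = defaultdict(list)
    for entry in entries:
        # rpartition never raises; `sep` is empty when no "@" is present.
        _, sep, domain = entry.email.rpartition("@")
        key = domain.lower() if sep else "unknown"
        groups[key].append(entry)
    return dict(groups)


entries = [
    Entry("crop_yield", "andrew@OpenMined.org"),
    Entry("crop_prices", "maria@openmined.org"),
    Entry("orphan", "not-an-email"),
]
by_domain = group_by_domain(entries)
print(sorted(by_domain))
```

Lowercasing the domain keeps `OpenMined.org` and `openmined.org` in the same group.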
- SyftBox installed and running
- Jupyter notebook environment
- Python 3.9+
Syft-datasets automatically checks:
- SyftBox filesystem access
- SyftBox app running status
- Available datasites and datasets
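For a rough idea of what a filesystem check like this involves, here is a minimal sketch. It is not the check syft-datasets performs: the function name `check_syftbox_dir`, the `~/SyftBox` location, and the `datasites/` subfolder layout are all assumptions made for illustration, and the app running status is not covered.

```python
import os
from pathlib import Path


def check_syftbox_dir(root):
    """Report whether `root` is a readable directory and list its datasites.

    Hypothetical helper: assumes one subfolder per datasite under
    `<root>/datasites`.
    """
    root = Path(root)
    accessible = root.is_dir() and os.access(root, os.R_OK)
    datasites_dir = root / "datasites"
    datasites = []
    if accessible and datasites_dir.is_dir():
        datasites = sorted(p.name for p in datasites_dir.iterdir() if p.is_dir())
    return {"filesystem_access": accessible, "datasites": datasites}


# Assumed default location; adjust to wherever SyftBox keeps its data.
status = check_syftbox_dir(Path.home() / "SyftBox")
print(status)
```

If the directory is missing, the helper reports `filesystem_access` as `False` rather than raising, which suits a status display.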
# Development setup
git clone https://github.com/OpenMined/syft-datasets.git
cd syft-datasets
pip install -e ".[dev]"
# Run tests
pytest
# Format code
ruff format
Apache 2.0 License - see LICENSE file.
Start your next Syft project right: find your data first.