atlas-profiler - CTA (Column Type Annotation)

A machine learning pipeline for spatial column type classification with rule-based validation.

Overview

This system classifies tabular columns into spatial types (latitude, longitude, BBL, BIN, zip codes, geometries, etc.) using a hybrid ML + rules approach.

Supported Column Types:

  • latitude, longitude - Geographic coordinates
  • x_coord, y_coord - Projected coordinates
  • bbl - Borough-Block-Lot (NYC property identifier)
  • bin - Building Identification Number
  • zip_code - Postal codes (worldwide)
  • borough_code - District/borough codes
  • city, state, address - Location strings
  • point, line, polygon, multi-polygon, multi-line - WKT geometries
  • non_spatial - Non-spatial identifiers

Disclaimer

This project reuses the structure and main logic of the Auctus Datamart Profiler. Please give credit to that project when building on this code.

PyPI Usage

Install from PyPI and call the exported function:

pip install atlas-profiler

from atlas_profiler import process_dataset

metadata = process_dataset("data.csv")

Key parameters for process_dataset:

metadata = process_dataset(
    data,
    geo_classifier=True,
    geo_classifier_threshold=0.5,
    include_sample=False,
    coverage=True,
    plots=False,
    indexes=True,
    load_max_size=None,
    metadata=None,
    nominatim=None,
)
  • data: Path to a dataset, a file-like object, or a pandas DataFrame.
  • geo_classifier: True to enable the default ML classifier, or pass a classifier instance.
  • geo_classifier_threshold: Confidence cutoff for classifier predictions; predictions below this are treated as non_spatial.
  • include_sample: True to include a small random CSV sample in the output metadata.
  • coverage: Compute data ranges (spatial/temporal coverage) when True.
  • plots: Include plots in the output metadata when True.
  • indexes: If True, include non-default DataFrame indexes as columns.
  • load_max_size: Target max bytes to load; larger inputs are randomly sampled (defaults to MAX_SIZE, 5000000 bytes).
  • metadata: Optional seed metadata dict (e.g., from a discovery plugin).
  • nominatim: Base URL of a Nominatim server for resolving address strings.
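
For example, profiling an in-memory DataFrame with a stricter confidence cutoff looks like this (the DataFrame contents are illustrative):

import pandas as pd
from atlas_profiler import process_dataset

# Illustrative DataFrame with one spatial and one non-spatial column
df = pd.DataFrame({
    "latitude": [40.71, 40.72, 40.73],
    "name": ["A", "B", "C"],
})

# Predictions below 0.7 confidence fall back to non_spatial
metadata = process_dataset(df, geo_classifier=True, geo_classifier_threshold=0.7)
print(metadata["columns"])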

Training (from source)

Clone the repo before running training scripts:

git clone https://github.com/VIDA-NYU/atlas-profiler.git
cd atlas-profiler

Training scripts and datasets live under training/.

Pipeline Workflow

┌─────────────────────────────────────────────────────────────────────────┐
│  1. DATA GENERATION                                                      │
│     training/generate_synthetic_cta.py                                   │
│     curated_spatial_cta.csv  ──►  LLM augmentation  ──►  synthetic_df.csv│
└───────────────────────────────────────┬─────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  2. MODEL TRAINING                                                       │
│     training/train_cta_classifier.py                                     │
│     curated + synthetic data  ──►  BGE encoder + classifier  ──►  model/ │
└───────────────────────────────────────┬─────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  3. INFERENCE + VALIDATION                                               │
│     training/inference_cta.py + training/rules_cta.py                    │
│     ML prediction  ──►  rule-based validation  ──►  final classification │
└─────────────────────────────────────────────────────────────────────────┘

Step 1: Generate Synthetic Training Data

Script: training/generate_synthetic_cta.py

Uses an LLM to augment curated examples with diverse variations:

  • Naming styles: snake_case, camelCase, abbreviations, short/ambiguous names
  • Value diversity: Worldwide locations (not limited to NYC)
  • Short name samples: Forces value-based learning (e.g., x, lt, coord)

# Generate synthetic data (default: 120 samples per class)
python training/generate_synthetic_cta.py \
    --target 120 \
    --curated-csv training/curated_spatial_cta.csv \
    --output training/synthetic_df.csv

# Custom settings
python training/generate_synthetic_cta.py \
    --target 150 \
    --max-stale 15 \
    --curated-csv training/curated_spatial_cta.csv \
    --output training/synthetic_df.csv

Inputs:

  • training/curated_spatial_cta.csv - Hand-labeled training examples

Outputs:

  • training/synthetic_df.csv - Augmented training data
  • training/synthetic_df_checkpoint.csv - Incremental checkpoint (for resuming)

Step 2: Train the CTA Classifier

Script: training/train_cta_classifier.py

Trains a transformer-based classifier using BGE-small as the encoder.

Training Modes

| Mode           | Description                              | Best For          |
|----------------|------------------------------------------|-------------------|
| classification | Standard cross-entropy loss              | Fast baseline     |
| contrastive    | Supervised contrastive learning (SupCon) | Better embeddings |
| combined       | Contrastive + classification loss        | Recommended       |
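
The exact loss composition lives in train_cta_classifier.py; as a rough sketch of one common convention (not the script's literal code), --alpha weights the contrastive term against cross-entropy:

import torch.nn.functional as F

def combined_loss(logits, embeddings, labels, alpha=0.5, temperature=0.07):
    # Sketch only: supervised_contrastive_loss is a hypothetical helper
    # standing in for the SupCon term; alpha mirrors the --alpha flag.
    ce = F.cross_entropy(logits, labels)
    supcon = supervised_contrastive_loss(embeddings, labels, temperature)
    return alpha * supcon + (1.0 - alpha) * ce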

Input Format

Uses structured tokens for emphasis:

[COL] column_name [COL] column_name [COL] column_name [VAL] val1 [VAL] val2 [VAL] val3

The column name is repeated (default: 3×) to emphasize its importance.
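
A minimal sketch of assembling such an input string (the helper name and the three-value cap are ours, not necessarily the script's):

def build_input(column_name, values, name_repeat=3, max_values=3):
    # Repeat the column name with [COL] markers, then append [VAL]-prefixed samples
    name_part = " ".join(f"[COL] {column_name}" for _ in range(name_repeat))
    value_part = " ".join(f"[VAL] {v}" for v in values[:max_values])
    return f"{name_part} {value_part}"

print(build_input("lat", ["40.71", "40.72", "40.73"]))
# [COL] lat [COL] lat [COL] lat [VAL] 40.71 [VAL] 40.72 [VAL] 40.73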

Usage

# Standard classification (fast)
python training/train_cta_classifier.py --mode classification --epochs 10

# Supervised contrastive learning (better embeddings)
python training/train_cta_classifier.py --mode contrastive --epochs 20 --temperature 0.07

# Combined training (recommended)
python training/train_cta_classifier.py --mode combined --epochs 15 --alpha 0.5

# Full configuration
python training/train_cta_classifier.py \
    --mode combined \
    --epochs 20 \
    --batch_size 32 \
    --lr 3e-5 \
    --temperature 0.07 \
    --alpha 0.5 \
    --name_repeat 3 \
    --output_dir profiler/model \
    --curated_path training/curated_spatial_cta.csv \
    --synthetic_path training/synthetic_df.csv

Key Arguments

| Argument      | Default        | Description                             |
|---------------|----------------|-----------------------------------------|
| --mode        | classification | Training mode                           |
| --epochs      | 10             | Number of training epochs               |
| --batch_size  | 16             | Batch size                              |
| --lr          | 2e-5           | Learning rate                           |
| --temperature | 0.07           | Contrastive loss temperature            |
| --alpha       | 0.5            | Contrastive loss weight (combined mode) |
| --name_repeat | 3              | Column name repetition count            |
| --output_dir  | ./model        | Model output directory                  |

Outputs (in --output_dir, default ./model/):

  • model.pt - Trained model weights
  • label_encoder.json - Class labels and config
  • config.json - Encoder configuration
  • tokenizer_config.json - Tokenizer with special tokens

Step 3: Inference with Rule-Based Validation

Pure ML Inference

Script: training/inference_cta.py

# Text input
python training/inference_cta.py --model_dir profiler/model --text "lat: 40.71, 40.72, 40.73"

# Column + values input
python training/inference_cta.py --model_dir profiler/model --column "BOROUGH" --values "Manhattan, Brooklyn, Queens"

# With confidence threshold (returns non_spatial if below)
python training/inference_cta.py --model_dir profiler/model --text "col1: 123, 456" --threshold 0.5

# Get embeddings (contrastive/combined modes only)
python training/inference_cta.py --model_dir profiler/model --text "lat: 40.71" --embedding

Hybrid Classification (ML + Rules)

Script: training/rules_cta.py

The HybridCTAClassifier combines ML predictions with rule-based validation:

Note: imports below assume training/ is on your PYTHONPATH or you run from that directory.

from rules_cta import HybridCTAClassifier
from inference_cta import CTAClassifier

# Initialize
ml_classifier = CTAClassifier("profiler/model")
hybrid = HybridCTAClassifier(ml_classifier)

# Classify
result = hybrid.classify("BBL", [1001234567, 2005678901, 3012345678])
# Returns: [{"label": "bbl", "confidence": 0.95, "source": "ml+validated"}]

Validation Logic

For sensitive spatial types, ML predictions are validated against rules:

| Type                 | Validation Rule                                 |
|----------------------|-------------------------------------------------|
| bbl                  | 10-digit number starting with 1-5 (NYC borough) |
| bin                  | 7-digit number starting with 1-5                |
| latitude             | Values in range [-90, 90]                       |
| longitude            | Values in range [-180, 180], some > 90          |
| x_coord, y_coord     | Projected coordinates > 10,000                  |
| zip_code             | Matches postal code patterns (US, UK, CA, etc.) |
| point, polygon, etc. | Valid WKT geometry format                       |

Validation outcomes:

  • Passed: Return ML prediction with source: "ml+validated"
  • Failed: Return non_spatial with source: "ml:{type}→rule_rejected"
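
As an illustration only (not the literal rules_cta.py code), checks of this shape would implement the bbl and latitude rows of the table above:

import re

def looks_like_bbl(values):
    # BBL rule: 10-digit number whose leading digit is a NYC borough code (1-5)
    return all(re.fullmatch(r"[1-5]\d{9}", str(v)) for v in values)

def looks_like_latitude(values):
    # Latitude rule: every value parses as a float within [-90, 90]
    try:
        return all(-90.0 <= float(v) <= 90.0 for v in values)
    except (TypeError, ValueError):
        return False

# A failed check downgrades the ML label to non_spatial with
# source "ml:{type}→rule_rejected".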

Standalone Rule-Based Classification

from rules_cta import RuleBasedCTA

classifier = RuleBasedCTA()

# Single column
result = classifier.classify("BBL", [1001234567, 2005678901])
# {"label": "bbl", "confidence": 0.95, "rule": "bbl_name_and_pattern"}

# Entire DataFrame
results = classifier.classify_dataframe(df, sample_size=100)

Project Structure

atlas-profiler/
├── profiler/                   # Library package
│   ├── core.py
│   ├── spatial.py
│   └── model/                  # Bundled model artifacts
│       ├── model.pt
│       ├── label_encoder.json
│       ├── config.json
│       └── tokenizer_config.json
├── training/                   # Training + inference scripts/data
│   ├── generate_synthetic_cta.py
│   ├── train_cta_classifier.py
│   ├── inference_cta.py
│   ├── rules_cta.py
│   ├── curated_spatial_cta.csv
│   └── synthetic_df.csv
├── output/
├── results/
├── README.md
└── pyproject.toml

Quick Start

Clone the repo and cd into it before running the training steps below.

# 1. Generate synthetic training data
python training/generate_synthetic_cta.py \
    --target 120 \
    --curated-csv training/curated_spatial_cta.csv \
    --output training/synthetic_df.csv

# 2. Train the model (combined mode recommended)
python training/train_cta_classifier.py \
    --mode combined \
    --epochs 15 \
    --output_dir profiler/model \
    --curated_path training/curated_spatial_cta.csv \
    --synthetic_path training/synthetic_df.csv

# 3. Run inference
python training/inference_cta.py --model_dir profiler/model --column "latitude" --values "40.71, 40.72"

Auctus Datamart Profiler Integration

This section describes how the CTA classifier integrates with the Auctus Datamart Profiler.

Original Profiler Workflow (Before Geo Classifier)

The original process_dataset() function in core.py followed this sequential workflow:

┌─────────────────────────────────────────────────────────────────────────────┐
│  process_dataset(data)                                                       │
│                                                                              │
│  ┌──────────────────┐                                                        │
│  │ 1. LOAD DATA     │  load_data() → pandas DataFrame                        │
│  │    - Read CSV    │  - Handle file size limits (MAX_SIZE = 5MB)            │
│  │    - Sample rows │  - Random sampling if file > max size                  │
│  └────────┬─────────┘                                                        │
│           │                                                                  │
│           ▼                                                                  │
│  ┌──────────────────┐                                                        │
│  │ 2. PROCESS COLS  │  For each column (sequential):                         │
│  │    (sequential)  │                                                        │
│  │                  │  process_column(array, column_meta)                    │
│  │                  │    │                                                   │
│  │                  │    ├─► identify_types()  ← regex + heuristics          │
│  │                  │    │     - Structural: INTEGER, FLOAT, TEXT, GEO_*     │
│  │                  │    │     - Semantic: LATITUDE, LONGITUDE, DATE_TIME,   │
│  │                  │    │                 ADDRESS, ADMIN, CATEGORICAL       │
│  │                  │    │                                                   │
│  │                  │    ├─► Compute numerical ranges & histograms           │
│  │                  │    ├─► Resolve addresses via Nominatim (HTTP calls)    │
│  │                  │    └─► Resolve admin areas via datamart_geo            │
│  └────────┬─────────┘                                                        │
│           │                                                                  │
│           ▼                                                                  │
│  ┌──────────────────┐                                                        │
│  │ 3. POST-PROCESS  │                                                        │
│  │                  │  - Index textual data with Lazo                        │
│  │                  │  - Pair lat/long columns (name matching)               │
│  │                  │  - Determine dataset types (spatial/temporal/etc.)     │
│  └────────┬─────────┘                                                        │
│           │                                                                  │
│           ▼                                                                  │
│  ┌──────────────────┐                                                        │
│  │ 4. COVERAGE      │  Compute spatial/temporal coverage:                    │
│  │                  │  - Lat/long pairs → geohashes + bounding boxes         │
│  │                  │  - WKT points → spatial ranges                         │
│  │                  │  - Addresses → resolved coordinates                    │
│  │                  │  - Admin areas → merged bounding boxes                 │
│  │                  │  - Datetime columns → temporal ranges                  │
│  └──────────────────┘                                                        │
└─────────────────────────────────────────────────────────────────────────────┘

Original Type Detection (identify_types)

The original identify_types() function used regex patterns and heuristics:

| Detection Method     | Types Detected                                       | Limitations                               |
|----------------------|------------------------------------------------------|-------------------------------------------|
| Column name patterns | latitude, longitude (via LATITUDE, LONGITUDE tuples) | Only exact matches like lat, long, xcoord |
| Value regex          | DATE_TIME, GEO_POINT (WKT)                           | Limited patterns                          |
| Statistical analysis | INTEGER, FLOAT, CATEGORICAL                          | No semantic understanding                 |
| Nominatim lookup     | ADDRESS                                              | Slow (HTTP calls per address)             |
| datamart_geo lookup  | ADMIN areas                                          | Requires local geo database               |
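
The column-name heuristic, for instance, boils down to membership tests against fixed name tuples, roughly as follows (the tuple contents here are illustrative, not the profiler's exact lists):

# Illustrative name tuples; the real LATITUDE/LONGITUDE tuples in the
# original profiler may contain different entries.
LATITUDE = ("latitude", "lat")
LONGITUDE = ("longitude", "long", "lon", "lng")

def name_based_guess(column_name):
    name = column_name.strip().lower()
    if name in LATITUDE:
        return "LATITUDE"
    if name in LONGITUDE:
        return "LONGITUDE"
    return None  # fall through to value regex / statistical checks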

Key limitations:

  • ❌ Failed to detect borough codes, BBL, BIN
  • ❌ No detection of projected coordinates (x_coord, y_coord)
  • ❌ Missed WKT polygons and multi-polygons
  • ❌ Sequential column processing (slow for large datasets)

Enhanced Workflow (With Geo Classifier)

The geo classifier adds a batch ML prediction phase before column processing:

┌─────────────────────────────────────────────────────────────────────────────┐
│  process_dataset(data, geo_classifier=HybridGeoClassifier)                   │
│                                                                              │
│  ┌──────────────────┐                                                        │
│  │ 1. LOAD DATA     │  (unchanged)                                           │
│  └────────┬─────────┘                                                        │
│           │                                                                  │
│           ▼                                                                  │
│  ┌──────────────────┐                                                        │
│  │ 2. BATCH ML      │  ★ NEW: Single forward pass for ALL columns            │
│  │    PREDICTION    │                                                        │
│  │                  │  geo_classifier.predict_batch([                        │
│  │                  │      (col_name, sample_values),                        │
│  │                  │      ...                                               │
│  │                  │  ])                                                    │
│  │                  │                                                        │
│  │                  │  → Returns: {col_idx: {"label", "confidence"}}         │
│  │                  │                                                        │
│  │                  │  Detected types:                                       │
│  │                  │    latitude, longitude, x_coord, y_coord,              │
│  │                  │    bbl, bin, zip_code, borough_code,                   │
│  │                  │    city, state, address, point, polygon, ...           │
│  └────────┬─────────┘                                                        │
│           │                                                                  │
│           ▼                                                                  │
│  ┌──────────────────┐                                                        │
│  │ 3. PROCESS COLS  │  ★ NOW PARALLEL (ThreadPoolExecutor)                   │
│  │    (parallel)    │                                                        │
│  │                  │  process_column(..., geo_prediction=pred)              │
│  │                  │    │                                                   │
│  │                  │    ├─► If geo_prediction exists & spatial type:        │
│  │                  │    │     Use ML result directly (skip identify_types)  │
│  │                  │    │                                                   │
│  │                  │    └─► Else: Fall back to identify_types()             │
│  └────────┬─────────┘                                                        │
│           │                                                                  │
│           ▼                                                                  │
│  ┌──────────────────┐                                                        │
│  │ 4-5. POST-PROC   │  (unchanged)                                           │
│  │    + COVERAGE    │                                                        │
│  └──────────────────┘                                                        │
└─────────────────────────────────────────────────────────────────────────────┘
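
Conceptually, the batch prediction step gathers one (column name, sample values) pair per column and classifies them in a single call, along these lines (a sketch built on the interface shown in the diagram; the sampling details are assumptions):

import pandas as pd
from profiler.spatial import GeoClassifier, HybridGeoClassifier

df = pd.DataFrame({"lat": [40.71, 40.72], "name": ["A", "B"]})  # illustrative
geo_classifier = HybridGeoClassifier(GeoClassifier())

# One (column_name, sample_values) pair per column, a single forward pass for all
columns = [(name, df[name].dropna().astype(str).head(20).tolist())
           for name in df.columns]
predictions = geo_classifier.predict_batch(columns)
# Expected shape (per the diagram): {col_idx: {"label": ..., "confidence": ...}}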

Type Mapping (GEO_CLASSIFIER_SPATIAL_MAP)

The geo classifier maps ML labels to the Auctus type system:

GEO_CLASSIFIER_SPATIAL_MAP = {
    # Coordinates
    "latitude":      (types.FLOAT, [types.LATITUDE]),
    "longitude":     (types.FLOAT, [types.LONGITUDE]),
    "x_coord":       (types.FLOAT, []),
    "y_coord":       (types.FLOAT, []),

    # Geometries
    "point":         (types.GEO_POINT, []),
    "polygon":       (types.GEO_POLYGON, []),
    "multi-polygon": (types.GEO_POLYGON, []),
    "line":          (types.GEO_POLYGON, []),

    # Addresses
    "zip_code":      (types.TEXT, [types.ADDRESS]),
    "address":       (types.TEXT, [types.ADDRESS]),

    # Administrative
    "borough_code":  (types.TEXT, [types.ADMIN]),
    "city":          (types.TEXT, [types.ADMIN]),
    "state":         (types.TEXT, [types.ADMIN]),

    # NYC-specific
    "bbl":           (types.INTEGER, [types.ID]),
    "bin":           (types.INTEGER, [types.ID]),
}

Performance Improvements

| Aspect            | Original                    | With Geo Classifier                      |
|-------------------|-----------------------------|------------------------------------------|
| Type detection    | Sequential regex per column | Single batch forward pass                |
| Column processing | Sequential                  | Parallel (ThreadPoolExecutor)            |
| Spatial types     | Limited (lat/lon only)      | 15+ types including BBL, BIN, geometries |
| Accuracy          | Heuristic-based             | ML + rule validation                     |
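
The parallel column pass can be pictured as a standard executor fan-out; this is a simplified sketch, and process_column's real signature is richer than shown here:

from concurrent.futures import ThreadPoolExecutor

def profile_columns(df, predictions):
    # Run process_column (from the profiler) over all columns concurrently,
    # handing each one its batch ML prediction when available.
    def work(idx):
        return process_column(df.iloc[:, idx].values,
                              {"name": df.columns[idx]},
                              geo_prediction=predictions.get(idx))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(work, range(df.shape[1])))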

Usage in Auctus

from profiler import process_dataset
from profiler.spatial import GeoClassifier, HybridGeoClassifier

# Initialize classifier (auto-downloads model from NYU Box)
geo_clf = HybridGeoClassifier(GeoClassifier())

# Profile dataset with geo classifier
metadata = process_dataset(
    "data.csv",
    geo_classifier=geo_clf,  # Enable ML-based type detection
    coverage=True,
    plots=True,
)

# Results include geo_classifier metadata
for col in metadata["columns"]:
    if "geo_classifier" in col:
        print(f"{col['name']}: {col['geo_classifier']}")
        # {'label': 'latitude', 'confidence': 0.97, 'source': 'ml+validated'}

Model Auto-Download

The GeoClassifier automatically downloads model files from NYU Box on first use:

GEO_MODEL_FILES = {
    "model.pt":           "https://nyu.box.com/shared/static/...",
    "config.json":        "https://nyu.box.com/shared/static/...",
    "label_encoder.json": "https://nyu.box.com/shared/static/...",
}
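
Caching behaves roughly like the sketch below; the cache location and the use of urllib are assumptions, not the actual implementation:

import os
import urllib.request

def ensure_model_files(model_dir, file_urls):
    # Fetch each model file once, skipping files already present on disk
    os.makedirs(model_dir, exist_ok=True)
    for filename, url in file_urls.items():
        path = os.path.join(model_dir, filename)
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
    return model_dir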

Environment Variables

For synthetic data generation with an LLM, set:

export PORTKEY_API_KEY="..."
export PROVIDER_API_KEY="..."
