
AI Gateway Management for Agent Manager #287

Discussed in #285

Originally posted by menakaj February 5, 2026

Problem

The Agent Manager currently lacks a comprehensive gateway management system that can:

  1. Support multiple deployment models: Organizations need flexibility to deploy gateways either on-premise (self-managed) or in the cloud (managed service), but the current architecture is tightly coupled to a single deployment approach.

  2. Manage gateway lifecycle: There is no centralized system to register, configure, monitor, and decommission gateway instances across different environments (development, staging, production).

  3. Enable environment-based organization: Gateways need to be logically grouped by deployment stage (dev/staging/prod) to support environment-specific deployment strategies and configurations.

  4. Support future gateway types: The system needs to handle both EGRESS gateways (for AI traffic) and future INGRESS gateways (for traditional APIs) without architectural changes.

  5. Abstract deployment complexity: Business logic should not be tightly coupled to specific gateway deployment mechanisms (HTTP REST, cloud APIs, etc.); such coupling makes the system inflexible and hard to evolve.

User Stories

Platform Administrator

  • As a platform administrator, I want to register on-premise gateway instances with their control plane URLs so that Agent Manager can orchestrate deployments to them.

  • As a platform administrator, I want to provision cloud-managed gateways through Agent Manager so that I don't need to interact with cloud provider APIs directly.

  • As a platform administrator, I want to organize gateways into environments (dev, staging, prod) so that I can deploy resources to all gateways in an environment with a single operation.

  • As a platform administrator, I want to monitor gateway health and status from a central location so that I can quickly identify and troubleshoot issues.

  • As a platform administrator, I want to switch between on-premise and cloud deployment modes via configuration so that I can migrate deployment models without code changes.

Development Team

  • As a developer, I want to deploy AI resources to specific environments (e.g., "deploy to all staging gateways") so that I can test changes before production.

  • As a developer, I want the system to validate gateway connectivity before allowing registration so that I don't accidentally configure unreachable gateways.

  • As a developer, I want to query gateway metrics (resource count, request rates) so that I can understand gateway utilization.

SRE/Operations

  • As an SRE, I want to decommission gateways safely with checks for active deployments so that I don't accidentally break running services.

  • As an SRE, I want to update gateway configurations (URLs, display names) so that I can respond to infrastructure changes.

  • As an SRE, I want to see which AI resources are deployed to which gateways so that I can troubleshoot deployment issues.

Existing Solutions

N/A

Proposed Solution

Overview

This proposal introduces a pluggable gateway management architecture for Agent Manager that abstracts gateway operations behind a unified interface (IGatewayAdapter). The design supports multiple deployment models (on-premise, cloud, custom) through adapter implementations, while maintaining a single, clean API for gateway lifecycle management.

Key Concepts

Central Control Plane Pattern: Agent Manager serves as the single source of truth for gateway configurations and orchestrates all gateway operations, regardless of deployment model.

Pluggable Adapter Architecture: Gateway operations (register, deploy, health check) are defined through a common interface. Different adapter implementations (on-premise, cloud) handle deployment-specific details, selected at startup via configuration.

Environment-Based Organization: Gateways are logically grouped into environments (development, staging, production) with many-to-many relationships, enabling environment-level deployment strategies.

Gateway Types: The system distinguishes between EGRESS gateways (AI/LLM traffic) and future INGRESS gateways (traditional API traffic) at the data model level.

Single Active Adapter: Only one adapter type runs at a time (either on-premise OR cloud), configured at application startup. There is no runtime switching between deployment models; changing the model means changing the configuration and restarting the application.

Design

Architecture Changes

Three-Layer Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AGENT MANAGER CORE                        │
│              (Deployment-Agnostic Business Logic)            │
│                                                              │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Gateway Management Service                         │    │
│  │  - Gateway CRUD operations                          │    │
│  │  - Environment management                           │    │
│  │  - Health monitoring                                │    │
│  │  - Metrics aggregation                              │    │
│  └────────────────────────────────────────────────────┘    │
│                         ↓                                    │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Gateway Abstraction Layer (IGatewayAdapter)       │    │
│  │  - RegisterGateway()                                │    │
│  │  - ListGateways()                                   │    │
│  │  - CheckHealth()                                    │    │
│  │  - GetMetrics()                                     │    │
│  └────────────────────────────────────────────────────┘    │
│                         ↓                                    │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Adapter Selection (Configuration-Based)            │    │
│  │  ┌──────────────┐  OR  ┌──────────────┐           │    │
│  │  │ On-Premise   │      │    Cloud     │           │    │
│  │  │   Adapter    │      │   Adapter    │           │    │
│  │  └──────────────┘      └──────────────┘           │    │
│  └────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────┘

Layer 1 - Business Logic: Environment management, gateway CRUD operations, health monitoring, and metrics aggregation. This layer is completely agnostic to the deployment model.

Layer 2 - Abstraction Interface: IGatewayAdapter interface defines all gateway operations. Business logic depends only on this interface, never on concrete implementations.

Layer 3 - Adapter Implementations: Concrete adapters handle deployment-specific details. Only one adapter is active at runtime, selected via configuration.

Adapter Pattern

The adapter pattern enables multiple deployment models without coupling business logic to deployment details:

  • IGatewayAdapter Interface: Defines 20+ methods covering gateway lifecycle, health checks, and metrics
  • AdapterFactory: Creates the appropriate adapter based on configuration (type: on-premise | cloud | custom)
  • Common Data Models: Gateway, HealthStatus, GatewayMetrics shared across all adapters
  • Dependency Injection: Single adapter instance injected into all services at startup
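The relationship between the service layer and the adapter interface can be sketched as follows. This is a minimal illustration, not the proposed interface: it shows only the four operations named in the architecture diagram, the method signatures are assumptions (a real version would likely also take a `context.Context`), and `memoryAdapter` is a hypothetical in-memory stand-in for a concrete adapter.

```go
package main

import (
	"errors"
	"sort"
)

// Simplified common data models (field sets are assumptions).
type Gateway struct{ UUID, Name, Status string }
type HealthStatus struct{ GatewayID, Status string }
type GatewayMetrics struct {
	GatewayID     string
	ResourceCount int
}

// IGatewayAdapter lists only the operations shown in the diagram.
type IGatewayAdapter interface {
	RegisterGateway(gw Gateway) (Gateway, error)
	ListGateways() ([]Gateway, error)
	CheckHealth(gatewayID string) (HealthStatus, error)
	GetMetrics(gatewayID string) (GatewayMetrics, error)
}

var ErrGatewayNotFound = errors.New("gateway not found")

// memoryAdapter is a hypothetical adapter that keeps gateways in memory.
type memoryAdapter struct{ gateways map[string]Gateway }

func newMemoryAdapter() *memoryAdapter {
	return &memoryAdapter{gateways: map[string]Gateway{}}
}

func (a *memoryAdapter) RegisterGateway(gw Gateway) (Gateway, error) {
	if gw.Name == "" {
		return Gateway{}, errors.New("gateway name is required")
	}
	gw.UUID = "gw-" + gw.Name // placeholder ID, not a real UUID
	gw.Status = "ACTIVE"
	a.gateways[gw.UUID] = gw
	return gw, nil
}

func (a *memoryAdapter) ListGateways() ([]Gateway, error) {
	out := make([]Gateway, 0, len(a.gateways))
	for _, gw := range a.gateways {
		out = append(out, gw)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out, nil
}

func (a *memoryAdapter) CheckHealth(id string) (HealthStatus, error) {
	gw, ok := a.gateways[id]
	if !ok {
		return HealthStatus{}, ErrGatewayNotFound
	}
	return HealthStatus{GatewayID: id, Status: gw.Status}, nil
}

func (a *memoryAdapter) GetMetrics(id string) (GatewayMetrics, error) {
	if _, ok := a.gateways[id]; !ok {
		return GatewayMetrics{}, ErrGatewayNotFound
	}
	return GatewayMetrics{GatewayID: id}, nil
}

// GatewayService depends only on the interface, never on a concrete adapter.
type GatewayService struct{ adapter IGatewayAdapter }

func NewGatewayService(a IGatewayAdapter) *GatewayService {
	return &GatewayService{adapter: a}
}

// RegisterAndCheck registers a gateway and immediately reads back its health.
func (s *GatewayService) RegisterAndCheck(name string) (HealthStatus, error) {
	gw, err := s.adapter.RegisterGateway(Gateway{Name: name})
	if err != nil {
		return HealthStatus{}, err
	}
	return s.adapter.CheckHealth(gw.UUID)
}
```

Calling `NewGatewayService(newMemoryAdapter()).RegisterAndCheck("prod-gateway-1")` exercises the service without any deployment-specific code, which is the property the adapter pattern is meant to provide.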

API Surface

Environment Management Endpoints

POST   /api/v1/environments
  Body: { name, displayName, description }
  Response: { uuid, name, displayName, createdAt }

GET    /api/v1/environments
  Query: ?organizationId=<uuid>
  Response: { environments: [...] }

GET    /api/v1/environments/{id}
  Response: { uuid, name, displayName, description, createdAt, updatedAt }

PUT    /api/v1/environments/{id}
  Body: { displayName, description }
  Response: { uuid, name, displayName, updatedAt }

DELETE /api/v1/environments/{id}
  Response: 204 No Content

Gateway Management Endpoints

POST   /api/v1/gateways
  Body: {
    name, displayName, gatewayType: "EGRESS" | "INGRESS",
    environmentIds: [<uuid>],
    adapterConfig: {
      // On-premise: { controlPlaneUrl: "http://gw:9090" }
      // Cloud: { region: "us-east-1", tier: "premium", autoScaling: {...} }
    }
  }
  Response: { uuid, name, status, endpoint, createdAt }

GET    /api/v1/gateways
  Query: ?type=EGRESS&environment=<uuid>&status=ACTIVE
  Response: { gateways: [...] }

GET    /api/v1/gateways/{id}
  Response: { uuid, name, type, status, endpoint, environments: [...] }

PUT    /api/v1/gateways/{id}
  Body: { displayName, adapterConfig }
  Response: { uuid, name, updatedAt }

DELETE /api/v1/gateways/{id}
  Response: 204 No Content (fails if active deployments exist)

POST   /api/v1/gateways/{id}/environments/{envId}
  Response: 201 Created

DELETE /api/v1/gateways/{id}/environments/{envId}
  Response: 204 No Content

GET    /api/v1/gateways/{id}/environments
  Response: { environments: [...] }

GET    /api/v1/environments/{id}/gateways
  Response: { gateways: [...] }

GET    /api/v1/gateways/{id}/health
  Response: { status, lastHeartbeat, responseTime, errorMessage }

GET    /api/v1/gateways/{id}/metrics
  Response: { resourceCount, providerCount, proxyCount, requestRate, errorRate }
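The safe-delete behaviour of `DELETE /api/v1/gateways/{id}` (which fails if active deployments exist) could be guarded roughly as below. This is a sketch under assumptions: the proposal does not say which status code a rejected delete returns, so `409 Conflict` is a guess, and `deploymentCounter` and the `remove` callback are hypothetical hooks into the deployment and gateway stores.

```go
package main

import (
	"errors"
	"net/http"
)

// ErrActiveDeployments signals that resources are still deployed to the gateway.
var ErrActiveDeployments = errors.New("gateway has active deployments")

// deploymentCounter reports how many active deployments target a gateway
// (hypothetical hook; the real lookup is not specified in the proposal).
type deploymentCounter func(gatewayID string) (int, error)

// DecommissionGateway removes the gateway only when nothing is deployed to it.
func DecommissionGateway(id string, count deploymentCounter, remove func(string) error) error {
	n, err := count(id)
	if err != nil {
		return err
	}
	if n > 0 {
		return ErrActiveDeployments
	}
	return remove(id)
}

// StatusFor maps the outcome to an HTTP status code.
// 409 for a blocked delete is an assumption, not part of the proposal.
func StatusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusNoContent
	case errors.Is(err, ErrActiveDeployments):
		return http.StatusConflict
	default:
		return http.StatusInternalServerError
	}
}
```

Keeping the check and the delete in one function makes it easier to later run both inside a single database transaction, so that a deployment created between the two steps cannot slip through.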

Request/Response Schemas

Create Gateway Request (On-Premise):

{
  "name": "prod-gateway-1",
  "displayName": "Production Gateway 1",
  "gatewayType": "EGRESS",
  "environmentIds": ["env-prod-uuid"],
  "adapterConfig": {
    "controlPlaneUrl": "http://gateway-1.internal:9090"
  }
}

Create Gateway Request (Cloud):

{
  "name": "cloud-gateway-us-east",
  "displayName": "Cloud Gateway US East",
  "gatewayType": "EGRESS",
  "environmentIds": ["env-prod-uuid"],
  "adapterConfig": {
    "region": "us-east-1",
    "tier": "premium",
    "autoScaling": {
      "minInstances": 2,
      "maxInstances": 10
    }
  }
}
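Because `adapterConfig` has a different shape per deployment model, one way to decode the two requests above is to keep it as raw JSON and interpret it according to the active adapter type. The following is a sketch only: the struct and function names are hypothetical, and the required-field rules (`controlPlaneUrl` for on-premise, `region` for cloud) are assumptions inferred from the examples.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// CreateGatewayRequest mirrors the request bodies shown above.
type CreateGatewayRequest struct {
	Name           string          `json:"name"`
	DisplayName    string          `json:"displayName"`
	GatewayType    string          `json:"gatewayType"`
	EnvironmentIDs []string        `json:"environmentIds"`
	AdapterConfig  json.RawMessage `json:"adapterConfig"` // decoded per adapter type
}

type OnPremiseConfig struct {
	ControlPlaneURL string `json:"controlPlaneUrl"`
}

type AutoScaling struct {
	MinInstances int `json:"minInstances"`
	MaxInstances int `json:"maxInstances"`
}

type CloudConfig struct {
	Region      string       `json:"region"`
	Tier        string       `json:"tier"`
	AutoScaling *AutoScaling `json:"autoScaling,omitempty"`
}

// ParseCreateGateway decodes the body and the adapter-specific section.
// adapterType is the adapter selected at startup ("on-premise" or "cloud").
func ParseCreateGateway(body []byte, adapterType string) (CreateGatewayRequest, any, error) {
	var req CreateGatewayRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, nil, fmt.Errorf("invalid request body: %w", err)
	}
	if req.GatewayType != "EGRESS" && req.GatewayType != "INGRESS" {
		return req, nil, fmt.Errorf("invalid gatewayType %q", req.GatewayType)
	}
	switch adapterType {
	case "on-premise":
		var c OnPremiseConfig
		if err := json.Unmarshal(req.AdapterConfig, &c); err != nil {
			return req, nil, fmt.Errorf("invalid adapterConfig: %w", err)
		}
		if c.ControlPlaneURL == "" {
			return req, nil, errors.New("adapterConfig.controlPlaneUrl is required")
		}
		return req, c, nil
	case "cloud":
		var c CloudConfig
		if err := json.Unmarshal(req.AdapterConfig, &c); err != nil {
			return req, nil, fmt.Errorf("invalid adapterConfig: %w", err)
		}
		if c.Region == "" {
			return req, nil, errors.New("adapterConfig.region is required")
		}
		return req, c, nil
	default:
		return req, nil, fmt.Errorf("unsupported adapter type %q", adapterType)
	}
}
```

With this shape, an on-premise body sent to an instance running the cloud adapter is rejected at the API boundary rather than failing later inside the adapter.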

Gateway Response:

{
  "uuid": "gw-uuid-123",
  "organizationId": "org-uuid",
  "name": "prod-gateway-1",
  "displayName": "Production Gateway 1",
  "type": "EGRESS",
  "status": "ACTIVE",
  "endpoint": "http://gateway-1.internal:9090",
  "region": "us-east-1",
  "environments": [
    {
      "uuid": "env-prod-uuid",
      "name": "production",
      "displayName": "Production"
    }
  ],
  "metadata": {},
  "createdAt": "2025-02-05T10:00:00Z",
  "updatedAt": "2025-02-05T10:00:00Z"
}

Health Status Response:

{
  "gatewayId": "gw-uuid-123",
  "status": "ACTIVE",
  "lastHeartbeat": "2025-02-05T10:05:00Z",
  "responseTime": "45ms",
  "errorMessage": null,
  "checkedAt": "2025-02-05T10:05:30Z"
}

Gateway Metrics Response:

{
  "gatewayId": "gw-uuid-123",
  "resourceCount": 15,
  "providerCount": 5,
  "proxyCount": 8,
  "mcpCount": 2,
  "requestRate": 125.5,
  "errorRate": 0.8,
  "averageLatency": "120ms",
  "timestamp": "2025-02-05T10:05:30Z"
}

Data Model Changes

New Tables

environments

CREATE TABLE environments (
    uuid UUID PRIMARY KEY,
    organization_uuid UUID NOT NULL REFERENCES organizations(uuid),
    name VARCHAR(64) NOT NULL,              -- "development", "staging", "production"
    display_name VARCHAR(128) NOT NULL,
    description TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),

    UNIQUE(organization_uuid, name)
);

CREATE INDEX idx_environments_org ON environments(organization_uuid);

Purpose: Logical grouping of gateways by deployment stage. Enables environment-level operations like "deploy to all production gateways."

Modified Tables

gateways (Extended)

-- Add new columns to existing gateways table
ALTER TABLE gateways
  ADD COLUMN gateway_type VARCHAR(16) NOT NULL DEFAULT 'INGRESS'
    CHECK (gateway_type IN ('INGRESS', 'EGRESS')),
  ADD COLUMN control_plane_url TEXT,
  ADD COLUMN status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE'
    CHECK (status IN ('ACTIVE', 'INACTIVE', 'PROVISIONING', 'ERROR')),
  ADD COLUMN region VARCHAR(64),
  ADD COLUMN adapter_config JSONB,
  ADD COLUMN endpoint TEXT;

-- Make control_plane_url required for EGRESS gateways
-- (enforced at application level, not database constraint)

CREATE INDEX idx_gateways_type ON gateways(gateway_type);
CREATE INDEX idx_gateways_status ON gateways(status);
CREATE INDEX idx_gateways_org_type ON gateways(organization_uuid, gateway_type);

New Fields:

  • gateway_type: INGRESS (future traditional APIs) or EGRESS (AI resources)
  • control_plane_url: Gateway controller endpoint (on-premise mode)
  • status: Current gateway state (ACTIVE, INACTIVE, PROVISIONING, ERROR)
  • region: Geographic region (cloud mode)
  • adapter_config: JSON blob for adapter-specific configuration
  • endpoint: Gateway API endpoint (may differ from control plane URL)
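Since the `control_plane_url` requirement for EGRESS gateways is enforced in the application rather than by a database constraint, a validation step along these lines could sit in the service layer. This is a sketch: `GatewayRecord` and `ValidateGateway` are hypothetical names, and the URL check (absolute `http` or `https` URL) is an assumption about what "valid" means here.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// GatewayRecord holds the subset of new columns that need validation.
type GatewayRecord struct {
	GatewayType     string
	Status          string
	ControlPlaneURL string
}

// ValidateGateway mirrors the CHECK constraints and adds the
// application-level rule that EGRESS gateways need a control plane URL.
func ValidateGateway(g GatewayRecord) error {
	switch g.GatewayType {
	case "INGRESS", "EGRESS":
	default:
		return fmt.Errorf("invalid gateway_type %q", g.GatewayType)
	}
	switch g.Status {
	case "ACTIVE", "INACTIVE", "PROVISIONING", "ERROR":
	default:
		return fmt.Errorf("invalid status %q", g.Status)
	}
	if g.GatewayType == "EGRESS" {
		if g.ControlPlaneURL == "" {
			return errors.New("control_plane_url is required for EGRESS gateways")
		}
		u, err := url.Parse(g.ControlPlaneURL)
		if err != nil || (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
			return fmt.Errorf("control_plane_url %q is not an absolute http(s) URL", g.ControlPlaneURL)
		}
	}
	return nil
}
```

Duplicating the enum checks in code lets the API return a descriptive validation error instead of surfacing a raw constraint violation from PostgreSQL.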

New Junction Tables

gateway_environment_mappings

CREATE TABLE gateway_environment_mappings (
    id SERIAL PRIMARY KEY,
    gateway_uuid UUID NOT NULL REFERENCES gateways(uuid) ON DELETE CASCADE,
    environment_uuid UUID NOT NULL REFERENCES environments(uuid) ON DELETE CASCADE,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),

    UNIQUE(gateway_uuid, environment_uuid)
);

CREATE INDEX idx_gem_gateway ON gateway_environment_mappings(gateway_uuid);
CREATE INDEX idx_gem_environment ON gateway_environment_mappings(environment_uuid);

Purpose: Many-to-many relationship between gateways and environments. Supports:

  • Single gateway in multiple environments (shared dev/test gateway)
  • Multiple gateways in one environment (horizontal scaling)

Component Interactions

Gateway Registration Flow (On-Premise)

User/CLI → Agent Manager API → Gateway Service → OnPremiseAdapter
                                                        ↓
                                      1. Validate control plane URL reachable
                                      2. Perform health check (GET /health)
                                      3. Create gateway record in PostgreSQL
                                      4. Create environment mappings
                                      5. Return gateway details
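The five on-premise registration steps can be sketched as one function. Everything named here is hypothetical: `gatewayStore` stands in for the PostgreSQL repositories, `memStore` is an in-memory substitute, and the probe is injected so the flow can be exercised without a network. Treating any non-200 response from `/health` as a failure is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// healthURL validates the control plane URL and appends /health (step 1).
func healthURL(controlPlaneURL string) (string, error) {
	u, err := url.Parse(controlPlaneURL)
	if err != nil {
		return "", fmt.Errorf("parse control plane URL: %w", err)
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return "", errors.New("control plane URL must be an absolute http(s) URL")
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/health"
	return u.String(), nil
}

// probeFunc performs the health check (step 2).
type probeFunc func(target string) error

// httpProbe issues a real GET with a timeout.
func httpProbe(timeout time.Duration) probeFunc {
	client := &http.Client{Timeout: timeout}
	return func(target string) error {
		resp, err := client.Get(target)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("health check returned %s", resp.Status)
		}
		return nil
	}
}

// gatewayStore abstracts steps 3 and 4 (hypothetical repository interface).
type gatewayStore interface {
	InsertGateway(name, controlPlaneURL string) (string, error)
	MapEnvironment(gatewayUUID, environmentUUID string) error
}

// memStore is an in-memory stand-in for the database.
type memStore struct {
	gateways map[string]string
	mappings map[string][]string
}

func newMemStore() *memStore {
	return &memStore{gateways: map[string]string{}, mappings: map[string][]string{}}
}

func (s *memStore) InsertGateway(name, controlPlaneURL string) (string, error) {
	id := "gw-" + name // placeholder ID, not a real UUID
	s.gateways[id] = controlPlaneURL
	return id, nil
}

func (s *memStore) MapEnvironment(gw, env string) error {
	s.mappings[gw] = append(s.mappings[gw], env)
	return nil
}

// RegisterOnPremise runs steps 1-5; nothing is stored unless the probe succeeds.
func RegisterOnPremise(name, controlPlaneURL string, envIDs []string, probe probeFunc, store gatewayStore) (string, error) {
	target, err := healthURL(controlPlaneURL)
	if err != nil {
		return "", err
	}
	if err := probe(target); err != nil {
		return "", fmt.Errorf("gateway unreachable: %w", err)
	}
	id, err := store.InsertGateway(name, controlPlaneURL)
	if err != nil {
		return "", err
	}
	for _, env := range envIDs {
		if err := store.MapEnvironment(id, env); err != nil {
			return "", err
		}
	}
	return id, nil
}
```

In a real implementation, steps 3 and 4 would probably share one database transaction so that a failed mapping does not leave a gateway without environments; `httpProbe(5 * time.Second)` would be passed as the probe.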

Gateway Registration Flow (Cloud)

User/CLI → Agent Manager API → Gateway Service → CloudAdapter
                                                        ↓
                                      1. Call cloud provider API
                                      2. Wait for gateway provisioning
                                      3. Create gateway record in PostgreSQL
                                      4. Create environment mappings
                                      5. Return gateway details with endpoint

Health Check Flow

Background Job (every 30s) → Gateway Service → Adapter.CheckHealth()
                                                        ↓
                          On-Premise: HTTP GET to control plane /health
                          Cloud: Query cloud provider status API
                                                        ↓
                                      Update gateway status in PostgreSQL
                                      Record last heartbeat timestamp
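A minimal sketch of the background job, assuming a ticker-driven loop and an in-memory state map in place of PostgreSQL. The proposal does not say which status a failed check maps to, so `ERROR` is an assumption, as is the choice to keep the previous heartbeat on failure; `checkFunc` is a hypothetical wrapper around `Adapter.CheckHealth()`.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// ErrGatewayUnreachable is a sentinel a checker could wrap its failures in.
var ErrGatewayUnreachable = errors.New("gateway unreachable")

// GatewayState is the part of the gateway record the job updates.
type GatewayState struct {
	Status        string
	LastHeartbeat time.Time
}

// applyResult computes the next state from one health check outcome:
// success sets ACTIVE and refreshes the heartbeat, failure sets ERROR
// and keeps the last successful heartbeat.
func applyResult(prev GatewayState, err error, checkedAt time.Time) GatewayState {
	if err != nil {
		return GatewayState{Status: "ERROR", LastHeartbeat: prev.LastHeartbeat}
	}
	return GatewayState{Status: "ACTIVE", LastHeartbeat: checkedAt}
}

// checkFunc wraps the adapter's health check for one gateway.
type checkFunc func(ctx context.Context, gatewayID string) error

// RunHealthLoop checks every gateway on each tick until ctx is cancelled.
// The states map is only touched by this goroutine; a real job would
// write to the database instead.
func RunHealthLoop(ctx context.Context, interval, timeout time.Duration, states map[string]GatewayState, check checkFunc) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			for id, prev := range states {
				cctx, cancel := context.WithTimeout(ctx, timeout)
				err := check(cctx, id)
				cancel()
				states[id] = applyResult(prev, err, now)
			}
		}
	}
}
```

The loop would be started once at startup, for example `go RunHealthLoop(ctx, 30*time.Second, 5*time.Second, states, check)`. Checks run sequentially here; with many gateways, a bounded worker pool would keep one slow gateway from delaying the rest.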

Environment-Based Query Flow

API Request: GET /api/v1/gateways?environment=<env-uuid>&type=EGRESS&status=ACTIVE
                ↓
Gateway Service → Database Query:
  SELECT g.* FROM gateways g
  JOIN gateway_environment_mappings gem ON g.uuid = gem.gateway_uuid
  JOIN environments e ON gem.environment_uuid = e.uuid
  WHERE e.uuid = '<env-uuid>'
    AND g.gateway_type = 'EGRESS'
    AND g.status = 'ACTIVE'
                ↓
Return filtered gateway list
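One way to turn the optional query parameters into SQL is a small builder that emits PostgreSQL-style positional placeholders, so filter values never end up in the query text. This is a sketch: `GatewayFilter` and `BuildGatewayQuery` are hypothetical names, and the environment filter takes a UUID, following the `environment=<uuid>` parameter of the gateway list endpoint.

```go
package main

import (
	"fmt"
	"strings"
)

// GatewayFilter holds the optional filters of GET /api/v1/gateways.
type GatewayFilter struct {
	EnvironmentID string // environment UUID
	Type          string // "EGRESS" or "INGRESS"
	Status        string // e.g. "ACTIVE"
}

// BuildGatewayQuery returns a parameterized query and its arguments.
// Empty filter fields are skipped; the mapping table is joined only
// when an environment filter is present.
func BuildGatewayQuery(f GatewayFilter) (string, []any) {
	var b strings.Builder
	b.WriteString("SELECT g.* FROM gateways g")

	var conds []string
	var args []any
	add := func(cond string, v any) {
		args = append(args, v)
		conds = append(conds, fmt.Sprintf(cond, len(args)))
	}

	if f.EnvironmentID != "" {
		b.WriteString(" JOIN gateway_environment_mappings gem ON g.uuid = gem.gateway_uuid")
		add("gem.environment_uuid = $%d", f.EnvironmentID)
	}
	if f.Type != "" {
		add("g.gateway_type = $%d", f.Type)
	}
	if f.Status != "" {
		add("g.status = $%d", f.Status)
	}
	if len(conds) > 0 {
		b.WriteString(" WHERE " + strings.Join(conds, " AND "))
	}
	return b.String(), args
}
```

The returned pair can be passed to `database/sql` as `db.Query(query, args...)`. Filtering on `gem.environment_uuid` directly avoids the extra join on `environments` when the caller already has the UUID; a production query would presumably also scope by organization.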

Adapter Selection Flow (Startup)

Application Startup → Load Configuration (YAML/ENV)
                                ↓
                    Gateway Adapter Type = "on-premise" | "cloud"
                                ↓
                    AdapterFactory.CreateAdapter(config)
                                ↓
                    Initialize adapter (HTTP client / cloud SDK)
                                ↓
                    Inject adapter into GatewayService
                                ↓
                    Application ready with single active adapter
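The selection step can be sketched as a factory that maps the configured type to exactly one adapter and fails fast on anything else. The types below are placeholders: `GatewayAdapter` is reduced to a single hypothetical `Kind()` method, and only the `on-premise` and `cloud` values from the flow above are handled.

```go
package main

import "fmt"

// AdapterConfig carries the gateway.adapter.type value from configuration.
type AdapterConfig struct {
	Type string
}

// GatewayAdapter is reduced to one method for illustration.
type GatewayAdapter interface {
	Kind() string
}

type onPremiseAdapter struct{}

func (onPremiseAdapter) Kind() string { return "on-premise" }

type cloudAdapter struct{}

func (cloudAdapter) Kind() string { return "cloud" }

// CreateAdapter returns the single adapter for this process.
// An unknown type is an error, so a misconfigured instance never starts.
func CreateAdapter(cfg AdapterConfig) (GatewayAdapter, error) {
	switch cfg.Type {
	case "on-premise":
		return onPremiseAdapter{}, nil
	case "cloud":
		return cloudAdapter{}, nil
	default:
		return nil, fmt.Errorf("unsupported gateway adapter type %q", cfg.Type)
	}
}
```

At startup the result would be created once and handed to the gateway service; returning an error rather than silently falling back to a default keeps a typo in the configuration from selecting the wrong deployment model.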

Configuration Schema

On-Premise Configuration:

gateway:
  adapter:
    type: on-premise
    onPremise:
      defaultTimeout: 30s
      retryPolicy:
        maxAttempts: 3
        backoffMultiplier: 2
        maxBackoff: 30s
      healthCheck:
        interval: 30s
        timeout: 5s
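To make the `retryPolicy` values concrete, the sketch below computes the wait before each retry as an exponential backoff capped at `maxBackoff`. The configuration does not define an initial delay, so `initial` is an assumed extra parameter, and reading `maxAttempts` as the total number of tries (first call included) is also an assumption.

```go
package main

import "time"

// RetryPolicy mirrors the retryPolicy block of the on-premise configuration.
type RetryPolicy struct {
	MaxAttempts       int
	BackoffMultiplier float64
	MaxBackoff        time.Duration
}

// Delays returns the wait before each retry. With MaxAttempts total
// tries there are MaxAttempts-1 waits; each wait is the previous one
// times BackoffMultiplier, capped at MaxBackoff.
func (p RetryPolicy) Delays(initial time.Duration) []time.Duration {
	var delays []time.Duration
	d := initial
	for attempt := 1; attempt < p.MaxAttempts; attempt++ {
		if d > p.MaxBackoff {
			d = p.MaxBackoff
		}
		delays = append(delays, d)
		d = time.Duration(float64(d) * p.BackoffMultiplier)
	}
	return delays
}
```

With the values above (`maxAttempts: 3`, `backoffMultiplier: 2`, `maxBackoff: 30s`) and an assumed initial delay of one second, a caller would wait 1s and then 2s between the three attempts. The cap only takes effect with more attempts or a larger initial delay.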

Cloud Configuration:

gateway:
  adapter:
    type: cloud
    cloud:
      provider: wso2-cloud
      apiEndpoint: https://api.wso2.cloud/v1
      authentication:
        type: oauth2
        clientId: ${CLOUD_CLIENT_ID}
        clientSecret: ${CLOUD_CLIENT_SECRET}
      defaultRegion: us-east-1
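The `${CLOUD_CLIENT_ID}` and `${CLOUD_CLIENT_SECRET}` placeholders imply that values are substituted from the environment when the configuration is loaded. A minimal sketch of that substitution, using the standard library's `os.Expand`, is shown below; `ExpandSecrets` is a hypothetical helper, and failing on unset variables (instead of substituting an empty string) is a design assumption.

```go
package main

import (
	"fmt"
	"os"
)

// ExpandSecrets replaces ${NAME} placeholders in value using lookup and
// reports every placeholder that could not be resolved.
func ExpandSecrets(value string, lookup func(string) (string, bool)) (string, error) {
	var missing []string
	out := os.Expand(value, func(name string) string {
		v, ok := lookup(name)
		if !ok {
			missing = append(missing, name)
		}
		return v
	})
	if len(missing) > 0 {
		return "", fmt.Errorf("unset environment variables: %v", missing)
	}
	return out, nil
}
```

In production the lookup would be `os.LookupEnv`, as in `ExpandSecrets(rawClientSecret, os.LookupEnv)`. Passing the lookup as a parameter keeps the function testable without touching the process environment.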

Database Migration Strategy

Migration 001: Add Environments

  • Create environments table
  • Add indexes

Migration 002: Extend Gateways

  • Add gateway_type, control_plane_url, status, region, adapter_config, endpoint columns
  • Add indexes on new columns

Migration 003: Gateway-Environment Mappings

  • Create gateway_environment_mappings table
  • Add foreign keys and indexes

Migration 004: Backfill Existing Gateways

  • Set default gateway_type = 'INGRESS' for existing gateways
  • Create default "production" environment
  • Map all existing gateways to production environment

Migration 005: Add Triggers

  • Add trigger for updated_at timestamp on gateways and environments

Out of Scope

Not Included in This Proposal

  1. Gateway Configuration Management: Detailed gateway configuration (routes, policies, rate limits) is handled by AI Resource Management. This proposal only covers gateway instance registration and lifecycle.

  2. xDS Protocol Details: Agent Manager uses gateway-controller's REST API. xDS implementation remains in gateway-controller.

  3. Gateway-to-Gateway Communication: This proposal does not cover service mesh or gateway federation scenarios.

  4. Multi-Tenancy Isolation: Organization-level isolation is assumed to exist. This proposal does not add new multi-tenancy mechanisms.

  5. Gateway Autoscaling: While cloud adapters support autoscaling configuration, the autoscaling logic itself is handled by cloud providers.

  6. Gateway Monitoring/Observability: Advanced monitoring (logs, traces, detailed metrics) is out of scope. Only basic health checks and resource counts are included.

  7. Gateway Authentication/Authorization: This proposal assumes gateways have existing auth mechanisms. It does not introduce new auth flows between Agent Manager and gateways.

  8. WebSocket Connections: Current implementation focuses on REST API interactions. WebSocket-based real-time updates are not included.

  9. Gateway Backup/Restore: Disaster recovery and gateway configuration backup/restore are not covered.

  10. Custom Adapter Plugin System: While the architecture supports custom adapters, a formal plugin mechanism (dynamic loading, versioning) is not included in the initial implementation.

Alternatives Considered

No response

Open Questions

No response

Milestones

| Phase | Scope | Target |
|---|---|---|
| Phase 1: Database & Models | Create environments table, extend gateways table, create gateway-environment mappings, write migrations, create Go models | Database schema deployed |
| Phase 2: Adapter Interface | Design IGatewayAdapter interface, create common types (Gateway, HealthStatus, GatewayMetrics), implement AdapterFactory | Adapter interface complete |
| Phase 3: On-Premise Adapter | Implement OnPremiseAdapter with HTTP client, gateway registration, lifecycle operations, health checks, retry logic | On-premise adapter functional |
| Phase 4: Services & APIs | Implement Environment and Gateway repositories/services, create REST controllers (17 endpoints), add validation | All APIs live |
| Phase 5: Configuration | Create YAML configuration, implement adapter selection at startup, Docker Compose setup, deployment scripts | System deployable |

Success Criteria

  • On-premise adapter communicates with gateway-controller
  • All 17 API endpoints functional
  • Environment-gateway many-to-many relationship working
  • Application starts with configured adapter type
  • End-to-end tests pass

Tasks


Risks

| Risk | Mitigation |
|---|---|
| Gateway-controller API changes | Version API calls, backward compatibility |
| On-premise connectivity issues | Retry logic, clear error messages |

Metadata

Labels

Type/Epic: Denotes an epic, which is a large body of work that encompasses multiple tasks
