Description
Discussed in #285
Originally posted by menakaj February 5, 2026
Problem
The Agent Manager currently lacks a comprehensive gateway management system that can:
- Support multiple deployment models: Organizations need flexibility to deploy gateways either on-premise (self-managed) or in the cloud (managed service), but the current architecture is tightly coupled to a single deployment approach.
- Manage gateway lifecycle: There is no centralized system to register, configure, monitor, and decommission gateway instances across different environments (development, staging, production).
- Enable environment-based organization: Gateways need to be logically grouped by deployment stage (dev/staging/prod) to support environment-specific deployment strategies and configurations.
- Support future gateway types: The system needs to handle both EGRESS gateways (for AI traffic) and future INGRESS gateways (for traditional APIs) without architectural changes.
- Abstract deployment complexity: Business logic should not be tightly coupled to specific gateway deployment mechanisms (HTTP REST, cloud APIs, etc.), because that coupling makes the system inflexible and hard to evolve.
User Stories
Platform Administrator
- As a platform administrator, I want to register on-premise gateway instances with their control plane URLs so that Agent Manager can orchestrate deployments to them.
- As a platform administrator, I want to provision cloud-managed gateways through Agent Manager so that I don't need to interact with cloud provider APIs directly.
- As a platform administrator, I want to organize gateways into environments (dev, staging, prod) so that I can deploy resources to all gateways in an environment with a single operation.
- As a platform administrator, I want to monitor gateway health and status from a central location so that I can quickly identify and troubleshoot issues.
- As a platform administrator, I want to switch between on-premise and cloud deployment modes via configuration so that I can migrate deployment models without code changes.
Development Team
- As a developer, I want to deploy AI resources to specific environments (e.g., "deploy to all staging gateways") so that I can test changes before production.
- As a developer, I want the system to validate gateway connectivity before allowing registration so that I don't accidentally configure unreachable gateways.
- As a developer, I want to query gateway metrics (resource count, request rates) so that I can understand gateway utilization.
SRE/Operations
- As an SRE, I want to decommission gateways safely with checks for active deployments so that I don't accidentally break running services.
- As an SRE, I want to update gateway configurations (URLs, display names) so that I can respond to infrastructure changes.
- As an SRE, I want to see which AI resources are deployed to which gateways so that I can troubleshoot deployment issues.
Existing Solutions
N/A
Proposed Solution
Overview
This proposal introduces a pluggable gateway management architecture for Agent Manager that abstracts gateway operations behind a unified interface (IGatewayAdapter). The design supports multiple deployment models (on-premise, cloud, custom) through adapter implementations, while maintaining a single, clean API for gateway lifecycle management.
Key Concepts
Central Control Plane Pattern: Agent Manager serves as the single source of truth for gateway configurations and orchestrates all gateway operations, regardless of deployment model.
Pluggable Adapter Architecture: Gateway operations (register, deploy, health check) are defined through a common interface. Different adapter implementations (on-premise, cloud) handle deployment-specific details, selected at startup via configuration.
Environment-Based Organization: Gateways are logically grouped into environments (development, staging, production) with many-to-many relationships, enabling environment-level deployment strategies.
Gateway Types: The system distinguishes between EGRESS gateways (AI/LLM traffic) and future INGRESS gateways (traditional API traffic) at the data model level.
Single Active Adapter: Only one adapter type runs at a time (either on-premise OR cloud), configured at application startup. No runtime switching between deployment models.
Design
Architecture Changes
Three-Layer Architecture
┌─────────────────────────────────────────────────────────────┐
│ AGENT MANAGER CORE │
│ (Deployment-Agnostic Business Logic) │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Gateway Management Service │ │
│ │ - Gateway CRUD operations │ │
│ │ - Environment management │ │
│ │ - Health monitoring │ │
│ │ - Metrics aggregation │ │
│ └────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Gateway Abstraction Layer (IGatewayAdapter) │ │
│ │ - RegisterGateway() │ │
│ │ - ListGateways() │ │
│ │ - CheckHealth() │ │
│ │ - GetMetrics() │ │
│ └────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Adapter Selection (Configuration-Based) │ │
│ │ ┌──────────────┐ OR ┌──────────────┐ │ │
│ │ │ On-Premise │ │ Cloud │ │ │
│ │ │ Adapter │ │ Adapter │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Layer 1 - Business Logic: Environment management, gateway CRUD operations, health monitoring. This layer is completely agnostic to deployment model.
Layer 2 - Abstraction Interface: IGatewayAdapter interface defines all gateway operations. Business logic depends only on this interface, never on concrete implementations.
Layer 3 - Adapter Implementations: Concrete adapters handle deployment-specific details. Only one adapter is active at runtime, selected via configuration.
Adapter Pattern
The adapter pattern enables multiple deployment models without coupling business logic to deployment details:
- IGatewayAdapter Interface: Defines 20+ methods covering gateway lifecycle, health checks, and metrics
- AdapterFactory: Creates the appropriate adapter based on configuration (type: on-premise | cloud | custom)
- Common Data Models: Gateway, HealthStatus, GatewayMetrics shared across all adapters
- Dependency Injection: Single adapter instance injected into all services at startup
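As a rough illustration of the adapter pattern described above, the following Go sketch declares an interface with only the four operations shown in the architecture diagram and exercises it through an in-memory stand-in. The method signatures, the model fields, and the `memoryAdapter` type are assumptions made for illustration; the proposal does not specify them.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Shared models named in the proposal; the fields are a hypothetical subset.
type Gateway struct {
	UUID string
	Name string
	Type string // "EGRESS" or "INGRESS"
}

type HealthStatus struct {
	Status    string
	CheckedAt time.Time
}

type GatewayMetrics struct {
	ResourceCount int
}

// IGatewayAdapter covers only the four operations shown in the diagram.
// The signatures are assumed, not taken from the proposal.
type IGatewayAdapter interface {
	RegisterGateway(ctx context.Context, gw Gateway) (Gateway, error)
	ListGateways(ctx context.Context) ([]Gateway, error)
	CheckHealth(ctx context.Context, gatewayID string) (HealthStatus, error)
	GetMetrics(ctx context.Context, gatewayID string) (GatewayMetrics, error)
}

// memoryAdapter is a hypothetical stand-in for a real adapter.
type memoryAdapter struct{ gateways map[string]Gateway }

// Compile-time check that memoryAdapter satisfies the interface.
var _ IGatewayAdapter = (*memoryAdapter)(nil)

func (a *memoryAdapter) RegisterGateway(_ context.Context, gw Gateway) (Gateway, error) {
	if gw.Name == "" {
		return Gateway{}, errors.New("gateway name is required")
	}
	gw.UUID = "gw-" + gw.Name // placeholder ID, not a real UUID
	a.gateways[gw.UUID] = gw
	return gw, nil
}

func (a *memoryAdapter) ListGateways(_ context.Context) ([]Gateway, error) {
	out := make([]Gateway, 0, len(a.gateways))
	for _, gw := range a.gateways {
		out = append(out, gw)
	}
	return out, nil
}

func (a *memoryAdapter) CheckHealth(_ context.Context, id string) (HealthStatus, error) {
	if _, ok := a.gateways[id]; !ok {
		return HealthStatus{}, fmt.Errorf("gateway %s not found", id)
	}
	return HealthStatus{Status: "ACTIVE", CheckedAt: time.Now()}, nil
}

func (a *memoryAdapter) GetMetrics(_ context.Context, id string) (GatewayMetrics, error) {
	if _, ok := a.gateways[id]; !ok {
		return GatewayMetrics{}, fmt.Errorf("gateway %s not found", id)
	}
	return GatewayMetrics{}, nil
}

// registerAndCheck is business logic written only against the interface.
func registerAndCheck(adapter IGatewayAdapter, name string) (string, error) {
	ctx := context.Background()
	gw, err := adapter.RegisterGateway(ctx, Gateway{Name: name, Type: "EGRESS"})
	if err != nil {
		return "", err
	}
	health, err := adapter.CheckHealth(ctx, gw.UUID)
	if err != nil {
		return "", err
	}
	return health.Status, nil
}

func main() {
	adapter := &memoryAdapter{gateways: map[string]Gateway{}}
	status, err := registerAndCheck(adapter, "prod-gateway-1")
	fmt.Println(status, err)
}
```

Because `registerAndCheck` takes the interface rather than a concrete type, an on-premise or cloud adapter could be substituted without touching it.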
API Surface
Environment Management Endpoints
POST /api/v1/environments
Body: { name, displayName, description }
Response: { uuid, name, displayName, createdAt }
GET /api/v1/environments
Query: ?organizationId=<uuid>
Response: { environments: [...] }
GET /api/v1/environments/{id}
Response: { uuid, name, displayName, description, createdAt, updatedAt }
PUT /api/v1/environments/{id}
Body: { displayName, description }
Response: { uuid, name, displayName, updatedAt }
DELETE /api/v1/environments/{id}
Response: 204 No Content
Gateway Management Endpoints
POST /api/v1/gateways
Body: {
name, displayName, gatewayType: "EGRESS" | "INGRESS",
environmentIds: [<uuid>],
adapterConfig: {
// On-premise: { controlPlaneUrl: "http://gw:9090" }
// Cloud: { region: "us-east-1", tier: "premium", autoScaling: {...} }
}
}
Response: { uuid, name, status, endpoint, createdAt }
GET /api/v1/gateways
Query: ?type=EGRESS&environment=<uuid>&status=ACTIVE
Response: { gateways: [...] }
GET /api/v1/gateways/{id}
Response: { uuid, name, type, status, endpoint, environments: [...] }
PUT /api/v1/gateways/{id}
Body: { displayName, adapterConfig }
Response: { uuid, name, updatedAt }
DELETE /api/v1/gateways/{id}
Response: 204 No Content (fails if active deployments exist)
POST /api/v1/gateways/{id}/environments/{envId}
Response: 201 Created
DELETE /api/v1/gateways/{id}/environments/{envId}
Response: 204 No Content
GET /api/v1/gateways/{id}/environments
Response: { environments: [...] }
GET /api/v1/environments/{id}/gateways
Response: { gateways: [...] }
GET /api/v1/gateways/{id}/health
Response: { status, lastHeartbeat, responseTime, errorMessage }
GET /api/v1/gateways/{id}/metrics
Response: { resourceCount, providerCount, proxyCount, requestRate, errorRate }
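Since the `adapterConfig` field of `POST /api/v1/gateways` has a different shape per deployment model, one possible way to handle it in Go is to defer decoding with `json.RawMessage` and decode it according to the adapter selected at startup. The struct names and the `parseCreateGateway` helper below are hypothetical; only the JSON field names come from the endpoint description.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// CreateGatewayRequest mirrors the POST /api/v1/gateways body.
// AdapterConfig stays raw until the active adapter type is known.
type CreateGatewayRequest struct {
	Name           string          `json:"name"`
	DisplayName    string          `json:"displayName"`
	GatewayType    string          `json:"gatewayType"`
	EnvironmentIDs []string        `json:"environmentIds"`
	AdapterConfig  json.RawMessage `json:"adapterConfig"`
}

// Hypothetical per-adapter config shapes (autoScaling omitted for brevity).
type OnPremiseConfig struct {
	ControlPlaneURL string `json:"controlPlaneUrl"`
}

type CloudConfig struct {
	Region string `json:"region"`
	Tier   string `json:"tier"`
}

// parseCreateGateway decodes the request, checks gatewayType, and then
// decodes adapterConfig for the given adapter type.
func parseCreateGateway(body []byte, adapterType string) (CreateGatewayRequest, any, error) {
	var req CreateGatewayRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, nil, fmt.Errorf("invalid request body: %w", err)
	}
	if req.GatewayType != "EGRESS" && req.GatewayType != "INGRESS" {
		return req, nil, errors.New("gatewayType must be EGRESS or INGRESS")
	}
	switch adapterType {
	case "on-premise":
		var cfg OnPremiseConfig
		if err := json.Unmarshal(req.AdapterConfig, &cfg); err != nil {
			return req, nil, fmt.Errorf("invalid on-premise adapterConfig: %w", err)
		}
		return req, cfg, nil
	case "cloud":
		var cfg CloudConfig
		if err := json.Unmarshal(req.AdapterConfig, &cfg); err != nil {
			return req, nil, fmt.Errorf("invalid cloud adapterConfig: %w", err)
		}
		return req, cfg, nil
	default:
		return req, nil, fmt.Errorf("unsupported adapter type %q", adapterType)
	}
}

func main() {
	body := []byte(`{
		"name": "prod-gateway-1",
		"gatewayType": "EGRESS",
		"environmentIds": ["env-prod-uuid"],
		"adapterConfig": {"controlPlaneUrl": "http://gateway-1.internal:9090"}
	}`)
	req, cfg, err := parseCreateGateway(body, "on-premise")
	fmt.Println(req.Name, cfg, err)
}
```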
Request/Response Schemas
Create Gateway Request (On-Premise):
{
"name": "prod-gateway-1",
"displayName": "Production Gateway 1",
"gatewayType": "EGRESS",
"environmentIds": ["env-prod-uuid"],
"adapterConfig": {
"controlPlaneUrl": "http://gateway-1.internal:9090"
}
}
Create Gateway Request (Cloud):
{
"name": "cloud-gateway-us-east",
"displayName": "Cloud Gateway US East",
"gatewayType": "EGRESS",
"environmentIds": ["env-prod-uuid"],
"adapterConfig": {
"region": "us-east-1",
"tier": "premium",
"autoScaling": {
"minInstances": 2,
"maxInstances": 10
}
}
}
Gateway Response:
{
"uuid": "gw-uuid-123",
"organizationId": "org-uuid",
"name": "prod-gateway-1",
"displayName": "Production Gateway 1",
"type": "EGRESS",
"status": "ACTIVE",
"endpoint": "http://gateway-1.internal:9090",
"region": "us-east-1",
"environments": [
{
"uuid": "env-prod-uuid",
"name": "production",
"displayName": "Production"
}
],
"metadata": {},
"createdAt": "2025-02-05T10:00:00Z",
"updatedAt": "2025-02-05T10:00:00Z"
}
Health Status Response:
{
"gatewayId": "gw-uuid-123",
"status": "ACTIVE",
"lastHeartbeat": "2025-02-05T10:05:00Z",
"responseTime": "45ms",
"errorMessage": null,
"checkedAt": "2025-02-05T10:05:30Z"
}
Gateway Metrics Response:
{
"gatewayId": "gw-uuid-123",
"resourceCount": 15,
"providerCount": 5,
"proxyCount": 8,
"mcpCount": 2,
"requestRate": 125.5,
"errorRate": 0.8,
"averageLatency": "120ms",
"timestamp": "2025-02-05T10:05:30Z"
}
Data Model Changes
New Tables
environments
CREATE TABLE environments (
uuid UUID PRIMARY KEY,
organization_uuid UUID NOT NULL REFERENCES organizations(uuid),
name VARCHAR(64) NOT NULL, -- "development", "staging", "production"
display_name VARCHAR(128) NOT NULL,
description TEXT,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE(organization_uuid, name)
);
CREATE INDEX idx_environments_org ON environments(organization_uuid);
Purpose: Logical grouping of gateways by deployment stage. Enables environment-level operations like "deploy to all production gateways."
Modified Tables
gateways (Extended)
-- Add new columns to existing gateways table
ALTER TABLE gateways
ADD COLUMN gateway_type VARCHAR(16) NOT NULL DEFAULT 'INGRESS'
CHECK (gateway_type IN ('INGRESS', 'EGRESS')),
ADD COLUMN control_plane_url TEXT,
ADD COLUMN status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE'
CHECK (status IN ('ACTIVE', 'INACTIVE', 'PROVISIONING', 'ERROR')),
ADD COLUMN region VARCHAR(64),
ADD COLUMN adapter_config JSONB,
ADD COLUMN endpoint TEXT;
-- Make control_plane_url required for EGRESS gateways
-- (enforced at application level, not database constraint)
CREATE INDEX idx_gateways_type ON gateways(gateway_type);
CREATE INDEX idx_gateways_status ON gateways(status);
CREATE INDEX idx_gateways_org_type ON gateways(organization_uuid, gateway_type);
New Fields:
- gateway_type: INGRESS (future traditional APIs) or EGRESS (AI resources)
- control_plane_url: Gateway controller endpoint (on-premise mode)
- status: Current gateway state (ACTIVE, INACTIVE, PROVISIONING, ERROR)
- region: Geographic region (cloud mode)
- adapter_config: JSON blob for adapter-specific configuration
- endpoint: Gateway API endpoint (may differ from control plane URL)
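The SQL comment above leaves the `control_plane_url` requirement for EGRESS gateways to the application. A minimal Go sketch of such a check might look like the following; the `GatewayRecord` type and `validateGatewayRecord` function are hypothetical, and the rule that the URL must be an absolute http(s) URL is an added assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// GatewayRecord holds the columns relevant to validation (hypothetical type).
type GatewayRecord struct {
	GatewayType     string
	Status          string
	ControlPlaneURL string
}

// Mirrors the CHECK constraint on the status column.
var validStatuses = map[string]bool{
	"ACTIVE": true, "INACTIVE": true, "PROVISIONING": true, "ERROR": true,
}

// validateGatewayRecord repeats the database CHECK constraints and adds the
// application-level rule that EGRESS gateways need a control plane URL.
func validateGatewayRecord(g GatewayRecord) error {
	if g.GatewayType != "INGRESS" && g.GatewayType != "EGRESS" {
		return fmt.Errorf("unknown gateway_type %q", g.GatewayType)
	}
	if !validStatuses[g.Status] {
		return fmt.Errorf("unknown status %q", g.Status)
	}
	if g.GatewayType == "EGRESS" {
		if g.ControlPlaneURL == "" {
			return errors.New("control_plane_url is required for EGRESS gateways")
		}
		// Assumption: only absolute http(s) URLs are acceptable.
		u, err := url.Parse(g.ControlPlaneURL)
		if err != nil || u.Host == "" || (u.Scheme != "http" && u.Scheme != "https") {
			return fmt.Errorf("control_plane_url %q is not an absolute http(s) URL", g.ControlPlaneURL)
		}
	}
	return nil
}

func main() {
	ok := GatewayRecord{GatewayType: "EGRESS", Status: "ACTIVE", ControlPlaneURL: "http://gateway-1.internal:9090"}
	missing := GatewayRecord{GatewayType: "EGRESS", Status: "ACTIVE"}
	fmt.Println(validateGatewayRecord(ok))
	fmt.Println(validateGatewayRecord(missing))
}
```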
New Junction Tables
gateway_environment_mappings
CREATE TABLE gateway_environment_mappings (
id SERIAL PRIMARY KEY,
gateway_uuid UUID NOT NULL REFERENCES gateways(uuid) ON DELETE CASCADE,
environment_uuid UUID NOT NULL REFERENCES environments(uuid) ON DELETE CASCADE,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE(gateway_uuid, environment_uuid)
);
CREATE INDEX idx_gem_gateway ON gateway_environment_mappings(gateway_uuid);
CREATE INDEX idx_gem_environment ON gateway_environment_mappings(environment_uuid);
Purpose: Many-to-many relationship between gateways and environments. Supports:
- Single gateway in multiple environments (shared dev/test gateway)
- Multiple gateways in one environment (horizontal scaling)
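To make the many-to-many semantics concrete, here is a small in-memory Go model of the junction table. It is only a sketch: `MappingStore` and its methods are hypothetical, and they imitate the `UNIQUE(gateway_uuid, environment_uuid)` constraint and the `ON DELETE CASCADE` behavior rather than talking to PostgreSQL.

```go
package main

import (
	"fmt"
	"sort"
)

// pair is one row of gateway_environment_mappings.
type pair struct{ gatewayUUID, environmentUUID string }

// MappingStore is a hypothetical in-memory stand-in for the junction table.
type MappingStore struct{ rows map[pair]bool }

func NewMappingStore() *MappingStore { return &MappingStore{rows: map[pair]bool{}} }

// Add rejects duplicates, like the UNIQUE constraint would.
func (s *MappingStore) Add(gatewayUUID, environmentUUID string) error {
	p := pair{gatewayUUID, environmentUUID}
	if s.rows[p] {
		return fmt.Errorf("gateway %s is already mapped to environment %s", gatewayUUID, environmentUUID)
	}
	s.rows[p] = true
	return nil
}

// GatewaysIn lists the gateways mapped to one environment, sorted.
func (s *MappingStore) GatewaysIn(environmentUUID string) []string {
	var out []string
	for p := range s.rows {
		if p.environmentUUID == environmentUUID {
			out = append(out, p.gatewayUUID)
		}
	}
	sort.Strings(out)
	return out
}

// EnvironmentsOf lists the environments one gateway belongs to, sorted.
func (s *MappingStore) EnvironmentsOf(gatewayUUID string) []string {
	var out []string
	for p := range s.rows {
		if p.gatewayUUID == gatewayUUID {
			out = append(out, p.environmentUUID)
		}
	}
	sort.Strings(out)
	return out
}

// DeleteGateway removes all mappings of a gateway, like ON DELETE CASCADE.
func (s *MappingStore) DeleteGateway(gatewayUUID string) {
	for p := range s.rows {
		if p.gatewayUUID == gatewayUUID {
			delete(s.rows, p)
		}
	}
}

func main() {
	s := NewMappingStore()
	_ = s.Add("gw-shared", "env-dev")  // one gateway in two environments
	_ = s.Add("gw-shared", "env-test")
	_ = s.Add("gw-2", "env-dev")       // two gateways in one environment
	fmt.Println(s.EnvironmentsOf("gw-shared"), s.GatewaysIn("env-dev"))
}
```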
Component Interactions
Gateway Registration Flow (On-Premise)
User/CLI → Agent Manager API → Gateway Service → OnPremiseAdapter
↓
1. Validate control plane URL reachable
2. Perform health check (GET /health)
3. Create gateway record in PostgreSQL
4. Create environment mappings
5. Return gateway details
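The on-premise flow above could be sketched in Go roughly as follows: probe `GET /health` on the control plane, and persist the gateway only if the probe succeeds. The `gatewayStore` map stands in for PostgreSQL, environment mappings are omitted, and a fake `http.RoundTripper` replaces the network so the example runs anywhere. All type and function names are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// roundTripFunc lets a plain function act as an http.RoundTripper,
// so the demo needs no real network.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// gatewayStore stands in for the gateways table (name -> control plane URL).
type gatewayStore struct{ byName map[string]string }

// probeHealth covers steps 1-2: the URL must be reachable and /health must return 200.
func probeHealth(ctx context.Context, client *http.Client, controlPlaneURL string) error {
	healthURL := strings.TrimRight(controlPlaneURL, "/") + "/health"
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, healthURL, nil)
	if err != nil {
		return fmt.Errorf("invalid control plane URL: %w", err)
	}
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("control plane unreachable: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check returned HTTP %d", resp.StatusCode)
	}
	return nil
}

// register persists the gateway (step 3) only after the probe succeeds.
func (s *gatewayStore) register(ctx context.Context, client *http.Client, name, controlPlaneURL string) error {
	if err := probeHealth(ctx, client, controlPlaneURL); err != nil {
		return err // nothing is stored for unreachable gateways
	}
	s.byName[name] = controlPlaneURL
	return nil
}

// demo registers one reachable and one unreachable gateway against a fake transport.
func demo() (registered int, okErr, rejectedErr error) {
	client := &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		if r.URL.Host != "gateway-1.internal:9090" {
			return nil, errors.New("connection refused")
		}
		return &http.Response{
			StatusCode: http.StatusOK,
			Body:       io.NopCloser(strings.NewReader("")),
			Request:    r,
		}, nil
	})}
	store := &gatewayStore{byName: map[string]string{}}
	ctx := context.Background()
	okErr = store.register(ctx, client, "prod-gateway-1", "http://gateway-1.internal:9090")
	rejectedErr = store.register(ctx, client, "ghost", "http://unreachable.internal:9090")
	return len(store.byName), okErr, rejectedErr
}

func main() {
	registered, okErr, rejectedErr := demo()
	fmt.Println(registered, okErr, rejectedErr)
}
```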
Gateway Registration Flow (Cloud)
User/CLI → Agent Manager API → Gateway Service → CloudAdapter
↓
1. Call cloud provider API
2. Wait for gateway provisioning
3. Create gateway record in PostgreSQL
4. Create environment mappings
5. Return gateway details with endpoint
Health Check Flow
Background Job (every 30s) → Gateway Service → Adapter.CheckHealth()
↓
On-Premise: HTTP GET to control plane /health
Cloud: Query cloud provider status API
↓
Update gateway status in PostgreSQL
Record last heartbeat timestamp
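A possible shape for the background health job in Go is a ticker loop that calls the adapter for each gateway and records the result. In the sketch below, mapping a failed check to the `ERROR` status is an assumption (the proposal does not say whether `ERROR` or `INACTIVE` applies), and `gatewayState`, `healthChecker`, `checkAll`, and `runLoop` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// gatewayState is the part of a gateway row that the health job updates.
type gatewayState struct {
	Status        string
	LastHeartbeat time.Time
}

// healthChecker abstracts Adapter.CheckHealth for this sketch.
type healthChecker func(ctx context.Context, gatewayID string) error

// checkAll runs one round of checks and updates status and heartbeat.
func checkAll(ctx context.Context, check healthChecker, states map[string]*gatewayState, now time.Time) {
	for id, st := range states {
		if err := check(ctx, id); err != nil {
			st.Status = "ERROR" // assumption: failures map to ERROR
			continue
		}
		st.Status = "ACTIVE"
		st.LastHeartbeat = now
	}
}

// runLoop repeats checkAll on every tick until the context is cancelled.
func runLoop(ctx context.Context, interval time.Duration, check healthChecker, states map[string]*gatewayState) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case t := <-ticker.C:
			checkAll(ctx, check, states, t)
		}
	}
}

// demoStatuses runs a single round with one healthy and one failing gateway.
func demoStatuses() map[string]string {
	states := map[string]*gatewayState{
		"gw-ok":   {Status: "PROVISIONING"},
		"gw-down": {Status: "ACTIVE"},
	}
	check := func(_ context.Context, id string) error {
		if id == "gw-down" {
			return errors.New("connection refused")
		}
		return nil
	}
	checkAll(context.Background(), check, states, time.Now())
	out := map[string]string{}
	for id, st := range states {
		out[id] = st.Status
	}
	return out
}

func main() {
	fmt.Println(demoStatuses())

	// Run the loop briefly; a real job would use the configured 30s interval.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	states := map[string]*gatewayState{"gw-ok": {Status: "PROVISIONING"}}
	runLoop(ctx, 10*time.Millisecond, func(context.Context, string) error { return nil }, states)
	fmt.Println(states["gw-ok"].Status)
}
```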
Environment-Based Query Flow
API Request: GET /api/v1/gateways?environment=prod&type=EGRESS
↓
Gateway Service → Database Query:
SELECT g.* FROM gateways g
JOIN gateway_environment_mappings gem ON g.uuid = gem.gateway_uuid
JOIN environments e ON gem.environment_uuid = e.uuid
WHERE e.name = 'production'
AND g.gateway_type = 'EGRESS'
AND g.status = 'ACTIVE'
↓
Return filtered gateway list
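One way the Gateway Service might assemble that query from optional filters is sketched below in Go. It follows the SQL shown in the flow (filtering on the environment name) and uses PostgreSQL-style `$n` placeholders rather than string interpolation. `GatewayFilter` and `buildGatewayQuery` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// GatewayFilter holds the optional query parameters; empty means "no filter".
type GatewayFilter struct {
	Environment string // environment name, as in the SQL of the flow
	Type        string
	Status      string
}

// buildGatewayQuery returns the SQL text and its positional arguments.
func buildGatewayQuery(f GatewayFilter) (string, []any) {
	query := "SELECT g.* FROM gateways g"
	var conditions []string
	var args []any

	// add appends one argument and a condition that references it as $n.
	add := func(column string, value any) {
		args = append(args, value)
		conditions = append(conditions, fmt.Sprintf("%s = $%d", column, len(args)))
	}

	if f.Environment != "" {
		// The joins are only needed when filtering by environment.
		query += " JOIN gateway_environment_mappings gem ON g.uuid = gem.gateway_uuid" +
			" JOIN environments e ON gem.environment_uuid = e.uuid"
		add("e.name", f.Environment)
	}
	if f.Type != "" {
		add("g.gateway_type", f.Type)
	}
	if f.Status != "" {
		add("g.status", f.Status)
	}
	if len(conditions) > 0 {
		query += " WHERE " + strings.Join(conditions, " AND ")
	}
	return query, args
}

func main() {
	query, args := buildGatewayQuery(GatewayFilter{
		Environment: "production",
		Type:        "EGRESS",
		Status:      "ACTIVE",
	})
	fmt.Println(query)
	fmt.Println(args...)
}
```

The returned pair can be passed to `database/sql` as `db.QueryContext(ctx, query, args...)`.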
Adapter Selection Flow (Startup)
Application Startup → Load Configuration (YAML/ENV)
↓
Gateway Adapter Type = "on-premise" | "cloud"
↓
AdapterFactory.CreateAdapter(config)
↓
Initialize adapter (HTTP client / cloud SDK)
↓
Inject adapter into GatewayService
↓
Application ready with single active adapter
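The startup selection could be sketched in Go as a small factory that switches on the configured type and fails fast on anything it does not recognize. The `Adapter` interface here is reduced to a single method for brevity, and the `GATEWAY_ADAPTER_TYPE` environment variable and its `on-premise` default are assumptions; the proposal only says the type comes from YAML or environment configuration.

```go
package main

import (
	"fmt"
	"os"
)

// Adapter is reduced to one method for this sketch.
type Adapter interface{ Name() string }

type onPremiseAdapter struct{}

func (onPremiseAdapter) Name() string { return "on-premise" }

type cloudAdapter struct{}

func (cloudAdapter) Name() string { return "cloud" }

// Config carries only the adapter type here.
type Config struct{ AdapterType string }

// NewAdapter creates the single adapter that will be active for the
// lifetime of the process.
func NewAdapter(cfg Config) (Adapter, error) {
	switch cfg.AdapterType {
	case "on-premise":
		return onPremiseAdapter{}, nil
	case "cloud":
		return cloudAdapter{}, nil
	default:
		return nil, fmt.Errorf("unsupported gateway adapter type %q", cfg.AdapterType)
	}
}

// loadConfig reads a hypothetical environment variable and defaults to on-premise.
func loadConfig() Config {
	t := os.Getenv("GATEWAY_ADAPTER_TYPE")
	if t == "" {
		t = "on-premise"
	}
	return Config{AdapterType: t}
}

func main() {
	adapter, err := NewAdapter(loadConfig())
	if err != nil {
		fmt.Println("startup failed:", err)
		os.Exit(1)
	}
	// The adapter would be injected into GatewayService at this point.
	fmt.Println("active adapter:", adapter.Name())
}
```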
Configuration Schema
On-Premise Configuration:
gateway:
  adapter:
    type: on-premise
    onPremise:
      defaultTimeout: 30s
      retryPolicy:
        maxAttempts: 3
        backoffMultiplier: 2
        maxBackoff: 30s
      healthCheck:
        interval: 30s
        timeout: 5s
Cloud Configuration:
gateway:
  adapter:
    type: cloud
    cloud:
      provider: wso2-cloud
      apiEndpoint: https://api.wso2.cloud/v1
      authentication:
        type: oauth2
        clientId: ${CLOUD_CLIENT_ID}
        clientSecret: ${CLOUD_CLIENT_SECRET}
      defaultRegion: us-east-1
Database Migration Strategy
Migration 001: Add Environments
- Create environments table
- Add indexes
Migration 002: Extend Gateways
- Add gateway_type, control_plane_url, status, region, adapter_config, endpoint columns
- Add indexes on new columns
Migration 003: Gateway-Environment Mappings
- Create gateway_environment_mappings table
- Add foreign keys and indexes
Migration 004: Backfill Existing Gateways
- Set default gateway_type = 'INGRESS' for existing gateways
- Create default "production" environment
- Map all existing gateways to production environment
Migration 005: Add Triggers
- Add trigger for updated_at timestamp on gateways and environments
Out of Scope
Not Included in This Proposal
- Gateway Configuration Management: Detailed gateway configuration (routes, policies, rate limits) is handled by AI Resource Management. This proposal only covers gateway instance registration and lifecycle.
- xDS Protocol Details: Agent Manager uses gateway-controller's REST API. xDS implementation remains in gateway-controller.
- Gateway-to-Gateway Communication: This proposal does not cover service mesh or gateway federation scenarios.
- Multi-Tenancy Isolation: Organization-level isolation is assumed to exist. This proposal does not add new multi-tenancy mechanisms.
- Gateway Autoscaling: While cloud adapters support autoscaling configuration, the autoscaling logic itself is handled by cloud providers.
- Gateway Monitoring/Observability: Advanced monitoring (logs, traces, detailed metrics) is out of scope. Only basic health checks and resource counts are included.
- Gateway Authentication/Authorization: This proposal assumes gateways have existing auth mechanisms. It does not introduce new auth flows between Agent Manager and gateways.
- WebSocket Connections: Current implementation focuses on REST API interactions. WebSocket-based real-time updates are not included.
- Gateway Backup/Restore: Disaster recovery and gateway configuration backup/restore are not covered.
- Custom Adapter Plugin System: While the architecture supports custom adapters, a formal plugin mechanism (dynamic loading, versioning) is not included in the initial implementation.
Alternatives Considered
No response
Open Questions
No response
Milestones
| Phase | Scope | Target |
|---|---|---|
| Phase 1: Database & Models | Create environments table, extend gateways table, create gateway-environment mappings, write migrations, create Go models | Database schema deployed |
| Phase 2: Adapter Interface | Design IGatewayAdapter interface, create common types (Gateway, HealthStatus, GatewayMetrics), implement AdapterFactory | Adapter interface complete |
| Phase 3: On-Premise Adapter | Implement OnPremiseAdapter with HTTP client, gateway registration, lifecycle operations, health checks, retry logic | On-premise adapter functional |
| Phase 4: Services & APIs | Implement Environment and Gateway repositories/services, create REST controllers (17 endpoints), add validation | All APIs live |
| Phase 5: Configuration | Create YAML configuration, implement adapter selection at startup, Docker Compose setup, deployment scripts | System deployable |
Success Criteria
- On-premise adapter communicates with gateway-controller
- All 17 API endpoints functional
- Environment-gateway many-to-many relationship working
- Application starts with configured adapter type
- End-to-end tests pass
Tasks
- [Gateway Management] Database migration and data models #291
- Gateway adapter interface implementation #292
- On-premise gateway adapter implementation #293
- Gateway service APIs implementation (repository and service layer impl) #294
Risks
| Risk | Mitigation |
|---|---|
| Gateway-controller API changes | Version API calls, backward compatibility |
| On-premise connectivity issues | Retry logic, clear error messages |