Requirements for shared-gateway multi-tenancy #1365
pdettori started this conversation in Feature Request
Context
We have built Kagenti, an open source multi-tenant AI agent orchestration platform for Kubernetes (Kind, OpenShift, HyperShift). We integrate OpenShell as our sandbox runtime for interactive and headless agent execution. Today we deploy an OpenShell gateway-per-tenant model (one gateway StatefulSet per namespace) which provides hard isolation but creates operational overhead: each tenant is a full gateway + driver + credentials-driver deployment, TLS certificate set, and ingress resource.
We'd like to understand the requirements and constraints for a shared-gateway model — a single gateway instance serving multiple tenants with logical isolation — as an eventual upstream capability. This issue documents the requirements from a downstream integrator's perspective, building on the discussion in #1145 and the "virtual gateway" concept.
We are not proposing a specific implementation — the upstream team is better positioned for that. We're describing what multi-tenancy means to us as adopters and what invariants we need to hold.
Requirements
R1: Authenticated caller identity must flow through the entire request lifecycle
Today, once OIDC authentication succeeds (PR #935), the gateway processes requests without carrying the caller's identity context into downstream operations (sandbox creation, listing, deletion, provider access). For shared multi-tenancy, the identity established at authentication must be threaded through so that authorization decisions can be made at each resource boundary.
What we need: After a user authenticates, every operation they perform should be attributable to that identity. The gateway should know "who is asking" at every decision point, not just at the front door.
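As a minimal sketch of what "threading identity through" could look like, the example below builds an identity object once at authentication and then passes it as an explicit argument to every downstream operation. All names here (`CallerIdentity`, `authenticate`, `create_sandbox`, `audit_line`, and the `tenant` claim) are hypothetical and are not OpenShell APIs; how a tenant is actually derived from a token is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallerIdentity:
    """Identity established at authentication (hypothetical shape)."""
    subject: str  # e.g. the OIDC `sub` claim
    tenant: str   # assumption: some claim or group maps to a tenant


class Unauthenticated(Exception):
    """Raised when no usable identity can be built from the token claims."""


def authenticate(claims: Optional[dict]) -> CallerIdentity:
    """Build the identity once, at the front door, from already-verified claims."""
    if not claims or "sub" not in claims or "tenant" not in claims:
        raise Unauthenticated("missing or incomplete token claims")
    return CallerIdentity(subject=claims["sub"], tenant=claims["tenant"])


def create_sandbox(caller: CallerIdentity, name: str) -> dict:
    """Downstream operation: the caller is a required argument, not ambient state."""
    return {"name": name, "owner": caller.subject, "tenant": caller.tenant}


def audit_line(caller: CallerIdentity, action: str, target: str) -> str:
    """Every operation can be attributed to the identity that requested it."""
    return (
        f"subject={caller.subject} tenant={caller.tenant} "
        f"action={action} target={target}"
    )
```

Making the identity a required parameter means an operation cannot be called without answering "who is asking", which is the property this requirement describes.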
R2: Sandbox resources must be scoped to a tenant or owner
When multiple tenants share a gateway, `sandbox list` must only return sandboxes belonging to the caller's tenant (or to the caller themselves). Today all sandboxes are visible to any authenticated user of a given gateway.
What we need:
R3: Mutating operations must enforce ownership
`sandbox delete`, `sandbox exec`, SSH connections, and policy mutations must verify the caller owns (or has delegation rights to) the target sandbox. This is related to #1354 (per-sandbox secret binding): the shared sandbox secret today allows any holder to act as any sandbox.
What we need: A sandbox operation fails with permission denied if the caller is not the owner or a delegated admin. This must hold even if the caller knows the sandbox name/ID.
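The listing rule from R2 and the ownership rule from R3 can be sketched together against an in-memory store. This is an illustration under stated assumptions, not a proposal for the gateway's implementation: `Caller`, `Sandbox`, `SandboxStore`, and the `is_tenant_admin` flag are hypothetical names, and treating an unknown sandbox the same as a foreign one is a design choice made here so that names do not leak across tenants.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class PermissionDenied(Exception):
    """Raised when the caller may not act on the target sandbox."""


@dataclass(frozen=True)
class Caller:
    subject: str
    tenant: str
    is_tenant_admin: bool = False  # hypothetical stand-in for "delegated admin"


@dataclass
class Sandbox:
    name: str
    owner: str
    tenant: str


@dataclass
class SandboxStore:
    sandboxes: Dict[str, Sandbox] = field(default_factory=dict)

    def add(self, sandbox: Sandbox) -> None:
        self.sandboxes[sandbox.name] = sandbox

    def list_for(self, caller: Caller) -> List[str]:
        """R2: only sandboxes in the caller's tenant are visible."""
        return sorted(
            sb.name for sb in self.sandboxes.values() if sb.tenant == caller.tenant
        )

    def _authorize(self, caller: Caller, name: str) -> Sandbox:
        sb = self.sandboxes.get(name)
        # Assumption: unknown and foreign sandboxes fail the same way,
        # so knowing a name or ID reveals nothing about other tenants.
        if sb is None or sb.tenant != caller.tenant:
            raise PermissionDenied(name)
        if sb.owner != caller.subject and not caller.is_tenant_admin:
            raise PermissionDenied(name)
        return sb

    def delete(self, caller: Caller, name: str) -> None:
        """R3: a mutating operation checks ownership before acting."""
        self._authorize(caller, name)
        del self.sandboxes[name]
```

The same `_authorize` check would sit in front of exec, SSH, and policy mutations; only `delete` is shown to keep the sketch short.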
R4: Provider credentials must be isolated per tenant
Providers (LLM API keys, OAuth2 clients) stored in the gateway's credential system must be scoped. Tenant A's providers are invisible and inaccessible to Tenant B. Today all providers on a gateway instance are globally visible.
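The scoping described above can be pictured as a credential store in which the tenant is part of every lookup key. The sketch below is hypothetical (`ProviderStore` and `ProviderNotFound` are illustrative names, and real credentials would not be held as plain strings in memory); it only shows that a provider registered by one tenant is neither listed nor retrievable for another.

```python
from typing import Dict, List


class ProviderNotFound(Exception):
    """Raised when a provider does not exist within the caller's tenant."""


class ProviderStore:
    def __init__(self) -> None:
        # tenant -> {provider name -> secret}; the tenant is always part of the key
        self._by_tenant: Dict[str, Dict[str, str]] = {}

    def put(self, tenant: str, name: str, secret: str) -> None:
        self._by_tenant.setdefault(tenant, {})[name] = secret

    def names(self, tenant: str) -> List[str]:
        """List only the providers registered under the given tenant."""
        return sorted(self._by_tenant.get(tenant, {}))

    def get(self, tenant: str, name: str) -> str:
        """Look up a provider; another tenant's provider looks nonexistent."""
        try:
            return self._by_tenant[tenant][name]
        except KeyError:
            raise ProviderNotFound(name) from None
```

Because lookups never cross the tenant key, two tenants can also register providers under the same name without colliding.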
What we need:
R5: Resource accounting should be possible per tenant
For chargeback and capacity planning, we need to attribute sandbox resource consumption (CPU, memory, GPU, storage, session count) to a tenant.
What we need: The gateway exposes enough metadata (labels, annotations, or an API) to aggregate resource usage per tenant. This doesn't necessarily mean the gateway enforces quotas — Kubernetes ResourceQuota on namespaces or the compute driver can handle enforcement — but the gateway's data model must support the accounting.
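To illustrate how little metadata the accounting needs, the sketch below aggregates sandbox records by a tenant label. The label key `example.com/tenant`, the record fields, and the `usage_by_tenant` helper are all assumptions made for this example; the actual metadata the gateway would expose is exactly what this requirement leaves open.

```python
from collections import defaultdict
from typing import Dict, Iterable

TENANT_LABEL = "example.com/tenant"  # hypothetical label key


def usage_by_tenant(sandboxes: Iterable[dict]) -> Dict[str, Dict[str, int]]:
    """Sum requested CPU and memory, and count sandboxes, per tenant label."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"cpu_millicores": 0, "memory_mib": 0, "sandboxes": 0}
    )
    for sb in sandboxes:
        # Sandboxes without the label are grouped so they stay visible in reports.
        tenant = sb.get("labels", {}).get(TENANT_LABEL, "unlabeled")
        entry = totals[tenant]
        entry["cpu_millicores"] += sb.get("cpu_millicores", 0)
        entry["memory_mib"] += sb.get("memory_mib", 0)
        entry["sandboxes"] += 1
    return dict(totals)
```

Enforcement stays out of this picture on purpose: the aggregation only reads metadata, which matches the point that quotas can be handled elsewhere.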
Non-requirements (for this issue)
These are explicitly not part of what we're asking for here:
Our current workaround
We deploy one full gateway stack per tenant namespace. This provides hard process isolation (compromising one gateway gives zero access to others) and avoids all the shared-state problems. The tradeoffs:
This works well for 2-10 tenants. At 50+ tenants with heterogeneous resource needs, the per-gateway overhead becomes significant and a shared model would be valuable.
Relevant upstream work
Questions for the maintainers
`tenant` or `org`, a Keycloak group, an OIDC scope)?