Tenant graduation

Tenant graduation moves a tenant that has outgrown the shared tier (a schema inside nexisomni_shared) onto its own dedicated database, optionally on a different server, without touching its Tenant-ID or any issued token. Only where the tenant’s data physically lives changes; the cut-over is orchestrated by the API itself, one tenant at a time.

This page is an operational orientation. The full runbook, including step-by-step crash recovery and the exact SQL, lives in the backend repo at NexisOmni/docs/ops/tenant-graduation.md.

What graduation changes (and what it does not)

Graduation only rewrites the central Tenant row: its encrypted ConnectionString flips to the new dedicated role connection, Mode becomes Dedicated (a PascalCase enum string on the wire), and SchemaName becomes null. Everything else holds:

  • The tenant’s tenant_id never changes, so existing JWTs stay valid. The hq-dashboard and the offline POS need no re-config and no re-login.
  • The tenant plane keeps sending its Tenant-ID header on every request exactly as before; the admin plane that triggers graduation never sends one.
  • Money remains a quoted decimal string end to end (for example "118.00"); a data copy moves rows byte for byte and does not reinterpret them.
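As a rough illustration of how narrow the change is, the sketch below models the central row as a small dataclass and applies the flip. The class name, field names, and sample values are assumptions made for this example; only Mode, SchemaName, ConnectionString, and the tenant ID come from this page.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TenantRow:
    """Hypothetical in-memory view of the central Tenant row."""
    tenant_id: str
    mode: str                    # "Shared" or "Dedicated" (PascalCase on the wire)
    schema_name: Optional[str]   # set while Shared, None once Dedicated
    connection_string: str       # stored encrypted in the real system
    is_active: bool


def flip_to_dedicated(row: TenantRow, dedicated_connection: str) -> TenantRow:
    """Return the row as it would look after graduation: three fields change."""
    if row.mode != "Shared":
        raise ValueError("only a Shared tenant can graduate")
    return replace(
        row,
        mode="Dedicated",
        schema_name=None,
        connection_string=dedicated_connection,
    )


# Placeholder values; real connection strings are encrypted at rest.
before = TenantRow("t-123", "Shared", "tenant_t_123", "<shared conn>", True)
after = flip_to_dedicated(before, "<dedicated conn>")
```

The tenant ID and activation flag pass through unchanged, which is why issued tokens and client configuration keep working.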

The move is triggered by one admin-plane call: POST /admin/tenants/{id}/graduate.

The risky moment is the copy snapshot. If a write could commit against the shared schema after the snapshot began, it would be silently left behind. Deactivating the tenant (IsActive=false) stops new tenant resolutions, but a request that resolved just before the flag flipped still holds a live connection to the shared schema. So graduation does more than flag:

  1. It sets the tenant’s own database login role to NOLOGIN.
  2. It terminates every backend session authenticated as that role and waits until none remain.
  3. Only then does the copy’s snapshot begin.
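The fence above can be pictured as three Postgres statements. This is a sketch only: the exact SQL is in the runbook, and the helper names and the `%s` placeholder style (as used by drivers such as psycopg) are assumptions. `ALTER ROLE ... NOLOGIN`, `pg_terminate_backend`, and the `pg_stat_activity` view are standard Postgres.

```python
from typing import List, Tuple


def quote_ident(name: str) -> str:
    """Quote a Postgres identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def fence_statements(role: str) -> List[Tuple[str, tuple]]:
    """Build (sql, params) pairs that fence a tenant role, in order.

    Illustrative only; the real service may phrase these differently.
    """
    return [
        # 1. Stop new logins for the role.
        (f"ALTER ROLE {quote_ident(role)} NOLOGIN", ()),
        # 2. Terminate every existing session authenticated as the role.
        (
            "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
            "WHERE usename = %s AND pid <> pg_backend_pid()",
            (role,),
        ),
        # 3. Poll this until it returns 0 before starting the copy snapshot.
        ("SELECT count(*) FROM pg_stat_activity WHERE usename = %s", (role,)),
    ]


statements = fence_statements("tenant_t_123")
```

The order matters: revoking login first means a terminated client cannot simply reconnect while the remaining sessions are being drained.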

The window this opens is transient by construction. An in-flight request at fence time sees a dropped connection with no HTTP response; a request arriving during the window fails tenant resolution and is short-circuited with 503 tenant_unavailable plus a Retry-After, never a terminal 4xx.
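On the client side, this means a 503 during the window is a signal to wait and retry, not to give up. The helper below is a minimal sketch of that decision; the function name and the default delay are assumptions, and it only parses the delta-seconds form of Retry-After.

```python
from typing import Mapping, Optional


def retry_delay_seconds(
    status: int,
    headers: Mapping[str, str],
    default: float = 5.0,
) -> Optional[float]:
    """Return how long to wait before retrying, or None if no retry applies.

    Treats any 503 as transient. The default delay is an arbitrary choice
    for this sketch, used when Retry-After is missing or not a number.
    """
    if status != 503:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return default
    try:
        return max(0.0, float(raw))
    except ValueError:
        # Retry-After may also be an HTTP-date; not parsed in this sketch.
        return default
```

A dropped connection with no HTTP response never reaches this function, so the caller has to handle that case at the transport level.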

Graduation holds the same session advisory lock as the migration fan-out (ADR-0021), so a fan-out, a second graduation, and this graduation are mutually exclusive. A concurrent attempt is rejected with 409 maintenance_in_progress.
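The mutual exclusion can be modelled in-process with a non-blocking try-lock. This is an analogy, not the service's code: the real lock is a Postgres session advisory lock, and all names below are hypothetical. The sketch only shows why a second caller gets an immediate 409 instead of queueing.

```python
import threading
from contextlib import contextmanager

# Stand-in for the single maintenance advisory lock shared by all operations.
_maintenance_lock = threading.Lock()


class MaintenanceInProgress(Exception):
    """Maps to 409 maintenance_in_progress in this sketch."""
    status = 409
    code = "maintenance_in_progress"


@contextmanager
def maintenance(operation: str):
    """Hold the maintenance lock for the duration of one operation."""
    # Non-blocking acquire: fail fast rather than wait behind another run.
    if not _maintenance_lock.acquire(blocking=False):
        raise MaintenanceInProgress(operation)
    try:
        yield
    finally:
        _maintenance_lock.release()


def try_run(operation: str) -> int:
    """Return the HTTP-style status an operation would produce."""
    try:
        with maintenance(operation):
            return 200
    except MaintenanceInProgress as exc:
        return exc.status
```

While one caller is inside `with maintenance("fan-out"):`, a call to `try_run("graduation")` returns 409; once the block exits, the same call returns 200.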

If tenant migrations are pending, run the migration fan-out first: the data copier gates on migration-head equality between source and target and refuses to copy across a model gap.
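The gate amounts to an equality check on the latest applied migration of each side. A minimal sketch, assuming each side's head is available as a string identifier; the function, exception, and migration names are hypothetical.

```python
class ModelGapError(Exception):
    """Raised when source and target are at different migration heads."""


def check_migration_heads(source_head: str, target_head: str) -> None:
    """Refuse to copy unless both sides are at the same migration head."""
    if source_head != target_head:
        raise ModelGapError(
            f"source is at {source_head!r}, target is at {target_head!r}; "
            "run the migration fan-out first"
        )


# Passes silently: both sides are at the same (made-up) migration.
check_migration_heads("20240501_AddInvoices", "20240501_AddInvoices")
```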

Under the maintenance lock, the service runs a fixed sequence. The data copy is an in-process binary COPY from the shared schema into the freshly migrated dedicated public schema; no external client tools are needed on the host.

Order  Step
1      Validate the tenant exists and is Shared; precheck target database and role names; capture the pre-cutover row values for rollback.
2      Create the dedicated login role (outside the rollback scope; tagged with a per-tenant graduation marker).
3      Deactivate the tenant and fence its role (the read-only window, made real).
4      Create, lock down, grant, and migrate the dedicated database.
5      Copy the data from the shared schema into the dedicated public.
6      Validate the dedicated side before the flip, connecting as the tenant’s own restricted role.
7      Flip the central row and restore the pre-graduation activation state.
8      Only then drop the now-unused shared schema and role, and verify the drops.

Two properties matter operationally. Validation happens before the flip, so the service never cuts over to a bad copy. And a tenant that an operator deliberately suspended comes out of graduation still suspended: graduation moves data, it does not undo a suspension.

Any failure before the flip in step 7 rolls back cleanly. It restores the captured row values (including activation state), un-fences the shared role with LOGIN, clears the terminated connection pool, and drops the partial dedicated database and role. The shared schema is untouched, so the tenant is exactly where it started.
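The control flow behind this can be sketched as "run steps, compensate on failure before the flip". Everything here is a simplified model with hypothetical names: each step is a plain callable, and the flip is assumed to be atomic (it raises only if the central row was not changed).

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_graduation(
    pre_flip: List[Step],
    flip: Callable[[], None],
    post_flip: List[Step],
    rollback: Callable[[], None],
) -> str:
    """Run the sequence; roll back if anything fails up to and including the flip."""
    try:
        for _name, action in pre_flip:
            action()
        flip()  # assumed atomic in this sketch
    except Exception:
        # Shared schema is untouched, so restoring the row and role suffices.
        rollback()
        return "rolled_back"
    # Past the flip there is no rollback: cleanup failures need manual work.
    for _name, action in post_flip:
        action()
    return "graduated"


log: List[str] = []


def record(name: str) -> Callable[[], None]:
    return lambda: log.append(name)


def failing_copy() -> None:
    raise RuntimeError("copy failed")


outcome = run_graduation(
    pre_flip=[("fence", record("fence")), ("copy", failing_copy),
              ("validate", record("validate"))],
    flip=record("flip"),
    post_flip=[("drop shared", record("drop shared"))],
    rollback=record("rollback"),
)
```

In this run the copy fails, so validation, the flip, and the drops never execute and the rollback runs once.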

By default the dedicated database is built on the tenant’s current server. To land it on an emptier server, name a configured placement server: POST /admin/tenants/{id}/graduate?server=pg-2. The name default refers to the fallback connection string. The source side (fence, copy snapshot, post-cutover drop) always stays on the current server, and the copier streams client-side, so a cross-server copy needs no server-to-server connectivity. An unknown ?server= name is rejected with a 400 before anything is touched.
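For scripting the call, a small helper can build the request URL with or without a placement server. The base URL and function name are placeholders for this sketch; the path and the server query parameter are the ones described above.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def graduate_url(base_url: str, tenant_id: str, server: Optional[str] = None) -> str:
    """Build the admin-plane graduation URL, optionally naming a placement server."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    path = f"/admin/tenants/{quote(tenant_id, safe='')}/graduate"
    url = base_url.rstrip("/") + path
    if server is not None:
        url += "?" + urlencode({"server": server})
    return url


# api.example.test is a placeholder host, not a real endpoint.
same_server = graduate_url("https://api.example.test", "42")
other_server = graduate_url("https://api.example.test/", "42", server="pg-2")
```

The request itself is a POST on the admin plane, so it carries admin credentials and no Tenant-ID header.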

The flow keeps no resumable state. If the API process dies mid-graduation, identify the state from the central Tenant row (Mode, IsActive, SchemaName) together with the pg_database and pg_roles catalogs, then recover by hand. The three cases and their exact SQL are in the runbook, but the shape is:

  • Orphan dedicated role only (died before the window opened): nothing to do. The next attempt sweeps every configured server for residue and reclaims a role that carries this tenant’s graduation marker.
  • Tenant deactivated, orphan database or role (died between the fence and the flip): the shared schema is authoritative. Drop the orphans, restore login on the shared role, reactivate the tenant, and re-run off-peak.
  • Flipped, shared schema still present (died between the flip and cleanup): the tenant is live on its dedicated database. Run the step 8 drops on the shared database by hand.
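The three cases above can be expressed as a small classifier over the row and catalog facts. This is a triage aid under stated assumptions, not the runbook: the enum and parameter names are invented here, and the sketch cannot tell a deliberately suspended tenant apart from a fenced one, so anything ambiguous falls through to UNKNOWN.

```python
from enum import Enum
from typing import Optional


class CrashState(Enum):
    ORPHAN_ROLE_ONLY = "nothing to do; next attempt reclaims the role"
    FENCED_NOT_FLIPPED = "drop orphans, restore login, reactivate, re-run"
    FLIPPED_NOT_CLEANED = "run the step 8 drops by hand"
    UNKNOWN = "consult the runbook"


def classify(
    mode: str,
    is_active: bool,
    schema_name: Optional[str],
    dedicated_db_exists: bool,
    dedicated_role_exists: bool,
    shared_schema_exists: bool,
) -> CrashState:
    """Map row and catalog facts onto the three documented crash cases."""
    if mode == "Dedicated":
        # After the flip SchemaName is null; a leftover shared schema means
        # the process died before cleanup.
        if schema_name is None and shared_schema_exists:
            return CrashState.FLIPPED_NOT_CLEANED
        return CrashState.UNKNOWN
    if mode == "Shared":
        if not is_active and (dedicated_db_exists or dedicated_role_exists):
            return CrashState.FENCED_NOT_FLIPPED
        if is_active and dedicated_role_exists and not dedicated_db_exists:
            return CrashState.ORPHAN_ROLE_ONLY
    return CrashState.UNKNOWN
```

For example, a row with Mode Shared and IsActive false alongside an orphan dedicated database classifies as FENCED_NOT_FLIPPED, where the shared schema is still authoritative.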

A client or proxy timeout does not abort the graduation: the endpoint detaches from the request’s cancellation and the operation completes server-side. Check the tenant row’s Mode and the logs for the outcome rather than retrying immediately.

  • Full runbook, per-step detail, exact recovery SQL, and required Postgres privileges: NexisOmni/docs/ops/tenant-graduation.md.
  • Design rationale: ADR-0022 (shared and dedicated tiers) in NexisOmni/docs/adr/.
  • For how the two auth planes, the Tenant-ID header, and database-per-tenant fit together, see Auth and tenancy.