Azure Deployment

Azure Container Apps (ACA) is the platform’s sole deployment target (decision D0; the earlier Vercel path was retired and deleted on 2026-07-02). One singleton control-plane app (aurora) runs permanently. Every solution created through Aurora gets its own Orbit stack of three ACA apps, provisioned automatically and pulling shared images from one container registry. The two public apps (reader and Mission Control) are fronted by per-solution Cloudflare subdomains; the agent has been internal to the environment since 2026-07-03.

Since 2026-07-05 the implementation is split by plane, not by twin (the earlier bash/TS twin model is retired — see ADR-015):

Solution plane / TypeScript — the @maxq/orbit-deploy engine (implementation/shared/orbit-deploy, export ./azure): a catalog of idempotent steps (each runnable as plan / apply / verify) composed into scenarios. Three drivers share it: the Aurora agent tools, the ops CLI (deploy.sh solution <id> [scenario]), and the createSolution() auto-deploy hook behind ORBIT_DEPLOY_ENABLED.
Platform plane / bash — infrastructure/azure/ keeps the human-run singletons: bootstrap, postgres, services, aurora, and the build-images.sh release step, all driven by config.local.yaml. provision-solution.sh remains as the legacy solution provisioner until the engine’s front door is live-proven, then it is deleted.

Topology

All resources live in one resource group in westeurope, identified by the non-secret coordinates in infrastructure/azure/config.local.yaml (copied from config.local.example.yaml, git-ignored):

Resource	Name	Role
Resource group	`rg-solutions-orbit`	Holds everything below
Container registry (ACR)	`acrorbit` (`acrorbit.azurecr.io`)	Basic SKU, admin user disabled — pull is via managed identity only
ACA managed environment	`azure.aca_env` (live: `acae-orbit`)	Hosts `aurora`, the internal platform services (registry + activity feed), and every per-solution stack
Storage account	`stororbit`	Premium FileStorage; backs the per-solution Azure Files shares
PostgreSQL Flexible Server	`psql-orbit`	The portfolio registry database (`portfolio`), Entra-only auth
User-assigned managed identity	`id-aurora`	AcrPull on `acrorbit` + Contributor on the resource group

The environment has moved twice: the design’s original westcentralus environment failed in 2026-06 (it stopped provisioning new revisions) and was replaced by orbit-aca-wcus; the whole platform was then rebuilt in westeurope on 2026-07-04 as acae-orbit. Names in older documents and the example config lag behind reality; azure.aca_env in config.local.yaml is authoritative.

Topology diagram

How the resources fit together — Cloudflare in front, one ACA environment hosting the singleton control plane, the internal registry services, and one three-app stack per solution, with ACR, Azure Files, and Postgres behind them (icons are the official Azure service icons, vendored under public/azure/):

Reading notes: only the agent mounts storage — the reader and Mission Control are deliberately stateless; the agent and the platform services (registry + activity feed) have internal ingress only (nothing outside the environment can reach them); the three public apps (aurora, reader, Mission Control) sit behind per-name Cloudflare CNAMEs with the edge-header lockdown described under Networking and domains.

Auth model: managed identity, no secrets

There is no service-principal secret anywhere. The bash path assumes an existing az session (developer az login, or az login --identity inside the aurora container); the TypeScript path uses DefaultAzureCredential, which picks up the same identities. bootstrap.sh is the one-time identity step: it creates the UAMI id-aurora and grants it AcrPull on acrorbit and Contributor on rg-solutions-orbit (GET-first, idempotent, safe to re-run).

The scripts

infrastructure/azure/ contains the whole ops surface, dispatched through deploy.sh:


deploy.sh bootstrap            # one-time: id-aurora UAMI + role assignments (AcrPull, Contributor)
deploy.sh aurora               # provision the singleton `aurora` app + Cloudflare front door
deploy.sh postgres             # provision the portfolio-registry Flexible Server
deploy.sh services             # provision svc-customer / svc-tenant / svc-solution / svc-activity
deploy.sh docs                 # provision the `docs` app (documentation site) + front door
deploy.sh solution <id> [scenario]  # per-solution stacks via the orbit-deploy ENGINE
deploy.sh provision <id>       # LEGACY bash per-solution provisioner (until the engine is live-proven)
deploy.sh verify [<id>]        # read-only health check (a solution stack, or aurora)
deploy.sh delete <id> [--yes]  # DESTRUCTIVE teardown of a solution stack (legacy bash path)

Script	Purpose
`lib.sh`	Sourced foundation: logging, the YAML config reader (`cfg` / `cfg_req`), Azure coordinate accessors, the deterministic `azure_names` helper, the ACA reconcile recipe (`aca_configure_app` and friends), and the Cloudflare / custom-domain helpers
`bootstrap.sh`	One-time UAMI + roles
`provision-aurora.sh`	The singleton `aurora` app + its Cloudflare front door (`app-aurora.yaml.tmpl`)
`provision-solution.sh <id>`	LEGACY per-solution worker: shares → links → seed job → three apps → front doors (`app-agent` / `app-orbit` / `app-agent-webapp` / `seed-job` `.yaml.tmpl`); superseded by `deploy.sh solution` and kept only until the engine is live-proven
`provision-postgres.sh`	The registry’s PostgreSQL Flexible Server
`provision-services.sh`	The platform services (three registry + the activity feed) from one `app-service.yaml.tmpl`
`provision-docs.sh`	The `docs` app (the orbit-documentation site — this site) + its `documentation.<zone>` Cloudflare front door (`app-docs.yaml.tmpl`)
`build-images.sh`	Release step: build + push images from git tags, record releases
`deploy.sh`	Dispatcher; `verify` and `delete` are implemented inline

Every provisioner is idempotent and reconciling (decision D10): GET-first, create-if-absent, update-on-drift, never an error on “already exists”. Re-running any scenario or provisioner converges.
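
The reconcile contract (GET-first, create-if-absent, update-on-drift) can be illustrated with a minimal in-memory sketch. The `ResourceClient` interface, the `reconcile` function, and the `Map`-backed fake are hypothetical names chosen for illustration; this is not the provisioners' actual code, and real drift detection would compare ARM envelopes rather than JSON strings.

```typescript
// Hypothetical sketch of a GET-first reconcile; not the actual provisioner code.
type Outcome = "created" | "updated" | "unchanged";

interface ResourceClient<T> {
  get(name: string): T | undefined; // GET-first: read the current state
  put(name: string, spec: T): void; // create or replace
}

// JSON equality is enough for flat specs written with a stable key order.
function sameSpec<T>(a: T, b: T): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

function reconcile<T>(client: ResourceClient<T>, name: string, desired: T): Outcome {
  const current = client.get(name);
  if (current === undefined) {
    client.put(name, desired); // create-if-absent
    return "created";
  }
  if (!sameSpec(current, desired)) {
    client.put(name, desired); // update-on-drift
    return "updated";
  }
  return "unchanged"; // "already exists" is never an error
}

// In-memory stand-in for Azure, for demonstration only.
function memoryClient<T>(): ResourceClient<T> {
  const store = new Map<string, T>();
  return {
    get: (n) => store.get(n),
    put: (n, s) => {
      store.set(n, s);
    },
  };
}

const client = memoryClient<{ image: string; minReplicas: number }>();
const first = reconcile(client, "orbit-demo", { image: "orbit-webapp:1.0.0", minReplicas: 0 });
const second = reconcile(client, "orbit-demo", { image: "orbit-webapp:1.0.0", minReplicas: 0 });
const third = reconcile(client, "orbit-demo", { image: "orbit-webapp:1.1.0", minReplicas: 0 });
console.log(first, second, third); // created unchanged updated
```

The second call is the convergence case: re-running with the same desired state changes nothing and reports no error.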

The solution engine (`@maxq/orbit-deploy`)

Per-solution provisioning lives in implementation/shared/orbit-deploy (src/azure/). The engine is a catalog of steps, each implementing three modes — plan (read-only diff), apply (reconcile), verify (read-only assertions) — and every step returns a structured StepResult (status ok / changed / failed / skipped, a one-line detail, machine-readable evidence, and a remediation hint on failures). A run produces a RunLog that is persisted onto the registry solution record as deployment.lastRun after every step, so progress survives a dropped chat turn and is readable by the UI and the deployment_progress agent tool.

Steps, in dependency order: preflight.images (every pinned tag must exist in ACR — the engine validates and stops, it never builds; images remain the human release step), storage.shares, storage.links, seed (conditional — skipped when both shares already have a .git clone at the root, checked over the Files data plane; reseed forces a run), app.agent, app.orbit, app.mc, frontdoor.orbit, frontdoor.mc (the TS port of the custom-domain dance below), and edge.lockdown (the per-solution Transform Rule). Teardown has its own reverse-ordered steps.

Scenarios are named step lists — this is what makes partial operations first-class:

Scenario	Steps	Typical ask
`deploy`	everything	full deploy / reconcile
`redeploy-apps`	preflight + the three apps (optionally filtered to one)	roll to newly pinned images
`frontdoor`	the two front doors + edge lockdown	fix Cloudflare / certs
`storage`	shares + links + seed	revalidate the Files integration
`verify`	the full deploy list in verify mode	health / drift report
`teardown`	destructive reverse order	delete a stack
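
One way to picture "scenarios are named step lists" is a plain lookup table. The step names below come from the dependency-ordered list in this section; the `SCENARIOS` map and `resolveSteps` helper are assumptions made for illustration, and the teardown scenario is left out because its step names are not given here. Applying the `--only` filter to any scenario's app steps is also an assumption.

```typescript
// Step names are taken from this section; the data structure itself is a sketch.
const DEPLOY_STEPS = [
  "preflight.images", "storage.shares", "storage.links", "seed",
  "app.agent", "app.orbit", "app.mc",
  "frontdoor.orbit", "frontdoor.mc", "edge.lockdown",
] as const;

type StepName = (typeof DEPLOY_STEPS)[number];
type AppKey = "agent" | "orbit" | "mc";

const SCENARIOS: Record<string, readonly StepName[]> = {
  deploy: DEPLOY_STEPS,
  "redeploy-apps": ["preflight.images", "app.agent", "app.orbit", "app.mc"],
  frontdoor: ["frontdoor.orbit", "frontdoor.mc", "edge.lockdown"],
  storage: ["storage.shares", "storage.links", "seed"],
  verify: DEPLOY_STEPS, // same list, run in verify mode
};

// Resolve a scenario to its steps, optionally narrowing app steps to the --only set.
function resolveSteps(scenario: string, only?: readonly AppKey[]): StepName[] {
  const steps = SCENARIOS[scenario];
  if (!steps) throw new Error(`unknown scenario: ${scenario}`);
  if (!only || only.length === 0) return [...steps];
  const keep = new Set<string>(only.map((k) => `app.${k}`));
  return steps.filter((s) => !s.startsWith("app.") || keep.has(s));
}

console.log(resolveSteps("redeploy-apps", ["agent"]));
// [ 'preflight.images', 'app.agent' ]
```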

An apply run aborts the remaining steps on the first failure; plan and verify always run everything so the report is complete. In apply mode the ARM SDK sends full envelopes in single createOrUpdate calls — none of the az-CLI --yaml field-dropping that forced the bash path into its placeholder-image recipe.
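
The mode-dependent abort rule can be sketched as a small synchronous runner. The `StepResult` fields mirror the ones described for the engine, but the `Step` interface, `runScenario`, and the choice to record aborted steps as `skipped` (rather than omit them) are assumptions; the real engine presumably runs asynchronously against ARM.

```typescript
// Hypothetical runner sketch; types follow the prose, behavior details are assumed.
type Mode = "plan" | "apply" | "verify";
type Status = "ok" | "changed" | "failed" | "skipped";

interface StepResult {
  step: string;
  status: Status;
  detail: string;
  evidence?: Record<string, unknown>;
  remediation?: string;
}

interface Step {
  name: string;
  run(mode: Mode): StepResult;
}

interface RunLog {
  mode: Mode;
  results: StepResult[];
}

function runScenario(steps: Step[], mode: Mode, persist: (log: RunLog) => void): RunLog {
  const log: RunLog = { mode, results: [] };
  let aborted = false;
  for (const step of steps) {
    const result: StepResult = aborted
      ? { step: step.name, status: "skipped", detail: "skipped after an earlier failure" }
      : step.run(mode);
    log.results.push(result);
    persist(log); // persisted after every step so progress survives an interruption
    // Only apply aborts; plan and verify keep going so the report is complete.
    if (mode === "apply" && result.status === "failed") aborted = true;
  }
  return log;
}

// Fake steps with fixed outcomes, for demonstration only.
const fake = (name: string, status: Status): Step => ({
  name,
  run: () => ({ step: name, status, detail: `${name}: ${status}` }),
});

const steps = [fake("storage.shares", "ok"), fake("seed", "failed"), fake("app.agent", "ok")];
let persisted = 0;
const applyLog = runScenario(steps, "apply", () => { persisted += 1; });
const verifyLog = runScenario(steps, "verify", () => {});
console.log(applyLog.results.map((r) => r.status));  // [ 'ok', 'failed', 'skipped' ]
console.log(verifyLog.results.map((r) => r.status)); // [ 'ok', 'failed', 'ok' ]
```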

The ops entry point is deploy.sh solution <id> [scenario] [--plan] [--only agent,orbit,mc] [--reseed] [--yes] [--json], which delegates to the package’s CLI (reading the same config.local.yaml); the in-product entry points are the agent deploy tools and the createSolution hook (reading the env rendered into the aurora app by provision-aurora.sh). Naming parity between the engine and the bash azure_names is enforced by orbit-deploy/scripts/parity-names.sh.

The singleton Aurora app

deploy.sh aurora (→ provision-aurora.sh) reconciles the control-plane app:

Name aurora (configurable via aurora.app_name), external ingress on port 3000, kept warm at minReplicas: 1 / maxReplicas: 3 so the dashboard never cold-starts.
Secrets as ACA secrets: the two GitHub App private keys, the cross-org provision PAT, the edge shared secret, and the Anthropic credential for the agent chat — all sourced from the git-ignored config, never baked into an image.
Cloudflare-only front door: a proxied CNAME aurora.<zone> → the app’s ACA FQDN, zone SSL Full (strict), and a header-injection Transform Rule. The app’s own src/proxy.ts middleware 403s any request missing the injected edge header, so the raw *.azurecontainerapps.io FQDN is blocked and only Cloudflare traffic passes.
The app runs with the id-aurora managed identity, which is what lets it provision entire solution stacks from inside the product.

Per-solution stacks: three apps

Every solution gets three ACA apps, all built from shared images — two external (reader + Mission Control) and one internal-only (the agent, since 2026-07-03):

App	Name	Image	Port	Ingress	Role
Reader	`orbit-<h>`	`orbit-webapp`	3000	External	Read-only solution viewer; mount-free — fetches solution data from the agent over HTTP (`ORBIT_AGENT_URL`, the in-env address)
Writer	`agent-<h>`	`maxq-orbit-agent`	3000	Internal	The authoring agent; sole owner of the repo mounts, single writer; reachable only from inside the environment as `http://agent-<h>`
Mission Control	`mc-<h>`	`maxq-orbit-agent-webapp`	3001	External	Agent console UI; stateless (no share, no mount), reaches the agent via the in-env `AGENT_URL` and proxies browser traffic through its same-origin `/agent` route

The naming scheme and the hash

A solutionId is <orgSlug>-<solSlug> and can run to ~41 characters, but ACA app names are capped at 32 characters. The scheme therefore hashes the id:


h = sha256(solutionId).hex[:12]

and derives every resource name from it. Human identity lives in resource tags and share metadata (solutionId, org, solution, customerRepo, internalRepo, plus portfolio customer / tenant tags) — e.g. az containerapp list --query "[?tags.solutionId=='acme-billing']".

Resource	Name pattern
Reader app	`orbit-<h>`
Agent app	`agent-<h>`
Mission Control app	`mc-<h>`
Customer Files share	`<solutionId>-customer`
Internal Files share	`<solutionId>-internal`
Env-storage link (customer RW)	`cust-rw-<h>`
Env-storage link (internal RW)	`int-rw-<h>`
Env-storage link (customer RO)	`cust-ro-<h>` — legacy, no longer provisioned; kept so teardown can clean old stacks
Seed job	`seed-<h>`

Two naming rules are load-bearing. ACA app names must start with a letter, which the orbit- / agent- / mc- / seed- prefixes guarantee. ManagedEnvironmentStorage link names must also start with a letter; the hash begins with a digit 62.5% of the time (10 of the 16 hex digits), so the role word comes first (cust-rw-<h>, never the hash first).

One naming home, one parity gate

The scheme’s home is the engine — azureNames(solutionId) in orbit-deploy/src/azure/names.ts (returns orbitApp, agentApp, agentWebappApp, customerShare, internalShare, linkCustRw, linkIntRw, seedJob). The bash azure_names in infrastructure/azure/lib.sh (the AZ_* shell globals) remains for the legacy provisioner and teardown, and the two must stay byte-identical — enforced by orbit-deploy/scripts/parity-names.sh, which diffs both implementations over a fixed id set (run it whenever either side changes).

Determinism is the point: reconcile, verify, and teardown never look anything up — they recompute every name from the solutionId.
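
A minimal sketch of the naming scheme, assuming the hash input is the UTF-8 bytes of the solutionId and that the returned fields map onto the name patterns in the table as shown (for example `agentWebappApp` to `mc-<h>`). It follows the documented shape of `azureNames` but is not the contents of names.ts.

```typescript
import { createHash } from "node:crypto";

interface AzureNames {
  orbitApp: string;
  agentApp: string;
  agentWebappApp: string;
  customerShare: string;
  internalShare: string;
  linkCustRw: string;
  linkIntRw: string;
  seedJob: string;
}

// Sketch: every name is a pure function of the solutionId, so nothing is looked up.
function azureNames(solutionId: string): AzureNames {
  // h = sha256(solutionId).hex[:12]; UTF-8 input encoding is an assumption.
  const h = createHash("sha256").update(solutionId, "utf8").digest("hex").slice(0, 12);
  return {
    orbitApp: `orbit-${h}`,
    agentApp: `agent-${h}`,
    agentWebappApp: `mc-${h}`,
    customerShare: `${solutionId}-customer`,
    internalShare: `${solutionId}-internal`,
    linkCustRw: `cust-rw-${h}`, // role word first: the hash may start with a digit
    linkIntRw: `int-rw-${h}`,
    seedJob: `seed-${h}`,
  };
}

const names = azureNames("acme-billing");
const hashed = [names.orbitApp, names.agentApp, names.agentWebappApp,
  names.linkCustRw, names.linkIntRw, names.seedJob];
// Hash-derived names stay far below the 32-character cap and start with a letter.
console.log(hashed.every((n) => n.length <= 32 && /^[a-z]/.test(n))); // true
```

Because the hash prefix is always 12 characters, the longest hash-derived name here (`cust-rw-<h>`) is 20 characters regardless of how long the solutionId is.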

Storage: two shares, two links, account key

Each solution gets two Azure Files shares on stororbit holding working clones of its two GitHub repos (GitHub remains the source of truth — D9):

<solutionId>-customer → mounted at /repo on the agent (ReadWrite, via link cust-rw-<h>).
<solutionId>-internal → mounted at /repo-internal on the agent (ReadWrite, via link int-rw-<h>).

The reader is mount-free (it fetches solution data from the agent over HTTP), so the historical third link — the read-only cust-ro-<h> — is no longer created; deploy.sh delete still removes it from stacks that predate the refactor. The methodology mount is not a third share either (D5): the agent reads the customer repo’s vendored copy at /repo/trajectory-methodology.

Two storage decisions to know:

D8: env-storage links carry the storage account key. ACA’s Azure Files linking has no identity-based path, so provision-solution.sh reads a key via az storage account keys list (the engine uses storageAccounts.listKeys) and passes it to az containerapp env storage set. Storage is the only place a key is used (the engine also uses it for the seed check over the Files data plane); image pull and provisioning stay identity-based.
Premium FileStorage enforces a 100 GiB minimum share quota — the bash provisioner defaults SHARE_QUOTA_GIB=100 (env-overridable).

The seed job

Before the apps come up, a one-shot ACA job seed-<h> clones both repos into the mounted shares, or runs git pull --ff-only where a clone already exists. It reuses the maxq-orbit-agent image (which already bundles git and lives in acrorbit, so the UAMI/AcrPull path covers it and no Docker Hub pull is needed) and takes the cross-org provision token as its git-token secret. The run waits for the job execution to report Succeeded before creating any app.

In the engine the seed step is conditional: it first checks, over the Files data plane and with the account key, whether each share already has a .git directory at its root, and skips the multi-minute job when both do (the legacy bash provisioner re-ran the seed on every reconcile, which was its biggest re-run cost). --reseed / the reseed tool parameter forces a run, which then runs git pull --ff-only on the existing clones.
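
The skip decision reduces to a small pure function once the two shares have been probed. `ShareProbe`, `SeedDecision`, and `decideSeed` are hypothetical names; the probing itself (listing the share root over the Files data plane) is out of scope for this sketch.

```typescript
// Hypothetical sketch of the conditional-seed decision; probing is assumed done.
interface ShareProbe {
  share: string;
  hasGitDir: boolean; // true when a .git directory exists at the share root
}

interface SeedDecision {
  run: boolean;
  reason: string;
}

function decideSeed(probes: ShareProbe[], reseed: boolean): SeedDecision {
  if (reseed) return { run: true, reason: "reseed requested" };
  const unseeded = probes.filter((p) => !p.hasGitDir).map((p) => p.share);
  if (unseeded.length > 0) {
    return { run: true, reason: `no .git at root of: ${unseeded.join(", ")}` };
  }
  return { run: false, reason: "both shares already hold a clone" };
}

const seeded: ShareProbe[] = [
  { share: "acme-billing-customer", hasGitDir: true },
  { share: "acme-billing-internal", hasGitDir: true },
];
console.log(decideSeed(seeded, false).run); // false
console.log(decideSeed(seeded, true).run);  // true
```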

Images: shared per release (D1)

Images are shared per release, not per solution. A per-solution deployment is only ACA apps + shares + env vars; every app pins an image tag from the one registry. build-images.sh is the release step:


commit  →  git tag <component>/v<version>  →  build-images.sh <target>[@<version>]  →  deploy

Builds from the immutable git tag <component>/v<version> — never the working tree — by checking the tag out into a throwaway git worktree, so the image, the recorded commit, and the tag can never drift.
Builds run server-side via az acr build (ACR Tasks; no local Docker) and push both :<version> and :latest.
Components version independently — each target resolves its own latest <component>/v* tag or an explicit @<version> pin. There is deliberately no all target.
After a successful push it writes/reconciles releases/<component>/v<version>/release.yaml — a container release record (registry / repository / tag + the tag’s commit, status draft) validated by releases/release.schema.json. See Release Registry.

Targets: aurora (→ aurora-webapp), orbit-webapp, orbit-agent (→ maxq-orbit-agent), orbit-agent-webapp (→ maxq-orbit-agent-webapp), docs (→ orbit-documentation), the registry services customer-service / tenant-service / solution-service, and activity-service. The orbit-webapp, orbit-agent, and service builds use implementation/ (not the app’s codebase/) as build context so the shared trajectory-loader / registry-kit / activity-kit packages are inside it.

Deployed tags are pinned per environment in config.local.yaml’s images: block; bumping a tag there and re-running the relevant provisioner rolls the apps forward.
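
How a `<target>[@<version>]` argument might resolve to a git tag and an image reference can be sketched as follows. The target-to-repository pairs and the registry host come from this page; everything else is an assumption: that the tag's component equals the target name, that versions are plain dotted numbers, and the helper names themselves. The service targets are omitted because their repository names are not stated here.

```typescript
// Sketch only: resolves "<target>[@<version>]" against a list of git tags.
const REPOSITORY: Record<string, string> = {
  aurora: "aurora-webapp",
  "orbit-webapp": "orbit-webapp",
  "orbit-agent": "maxq-orbit-agent",
  "orbit-agent-webapp": "maxq-orbit-agent-webapp",
  docs: "orbit-documentation",
};

// Numeric comparison so that 1.10.0 sorts after 1.2.0.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const d = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (d !== 0) return d;
  }
  return 0;
}

function resolveGitTag(arg: string, gitTags: string[]): string {
  const at = arg.indexOf("@");
  const target = at === -1 ? arg : arg.slice(0, at);
  const pinned = at === -1 ? undefined : arg.slice(at + 1);
  const prefix = `${target}/v`; // assumes the tag component equals the target name
  const versions = gitTags.filter((t) => t.startsWith(prefix)).map((t) => t.slice(prefix.length));
  if (pinned !== undefined) {
    if (!versions.includes(pinned)) throw new Error(`no git tag ${prefix}${pinned}`);
    return `${prefix}${pinned}`;
  }
  if (versions.length === 0) throw new Error(`no ${prefix}* tags`);
  const latest = [...versions].sort(compareVersions).pop() as string;
  return `${prefix}${latest}`;
}

function imageRef(gitTag: string): string {
  const cut = gitTag.indexOf("/v");
  const repository = REPOSITORY[gitTag.slice(0, cut)];
  if (repository === undefined) throw new Error(`unknown target in ${gitTag}`);
  return `acrorbit.azurecr.io/${repository}:${gitTag.slice(cut + 2)}`;
}

const tags = ["orbit-agent/v1.2.0", "orbit-agent/v1.10.0", "orbit-agent-webapp/v2.0.0"];
console.log(resolveGitTag("orbit-agent", tags));           // orbit-agent/v1.10.0
console.log(imageRef(resolveGitTag("orbit-agent", tags))); // acrorbit.azurecr.io/maxq-orbit-agent:1.10.0
```

Matching on the `<target>/v` prefix, slash included, keeps `orbit-agent` from picking up `orbit-agent-webapp` tags.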

Networking and domains

Two per-solution apps are external; the agent has been internal-only since 2026-07-03. Its ACA ingress is external: false, so requests from outside the environment are rejected by the environment proxy with a 404, and it gets no Cloudflare subdomain (the earlier “agent is public by design” stance is superseded, see ADR-003). In-env callers address the agent by its app name, http://agent-<h>, which resolves via the environment’s internal DNS and hits the ingress on :80. The internal registry services (svc-*) follow the same pattern, because the .internal.<domain> FQDN form does not resolve on this environment. When cloudflare.zone_name is set, provision-solution.sh provisions a Cloudflare front door for each public app:

App	Subdomain
Reader	`<solutionId>.<zone>`
Mission Control	`<solutionId>-mc.<zone>`
Agent	— (internal ingress, no subdomain; legacy `<solutionId>-agent.<zone>` records are cleaned up on teardown)

Two platform singletons follow the same front-door recipe: aurora.<zone> (provision-aurora.sh, with the edge-secret transform rule) and documentation.<zone> (provision-docs.sh → the docs app serving this documentation site — proxied CNAME + managed certificate + strict SSL, but no edge-secret rule: the site is public content and ships no edge guard).

(Flat suffixes, not nested subdomains — Universal SSL and ACA managed certificates cover one level. With the Cloudflare zone/token unconfigured the front-door steps report skipped and the public apps are reached on their raw ACA FQDNs. In-product, the engine gets the Cloudflare credentials from the CLOUDFLARE_API_TOKEN / CLOUDFLARE_ZONE_NAME env rendered into the aurora app — before ADR-015 the in-product path could not drive Cloudflare at all and produced front-door-less stacks.)
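
The flat-suffix rule can be shown with a short sketch that derives the two public hostnames and returns nothing when no zone is configured (the case where the front-door steps report skipped). `frontDoorHosts` and `labelsAdded` are hypothetical helpers, and `example.com` is a placeholder zone.

```typescript
// Hypothetical sketch: derive per-solution hostnames as single-label subdomains.
interface FrontDoorHosts {
  reader: string;
  missionControl: string;
}

function frontDoorHosts(solutionId: string, zone: string | undefined): FrontDoorHosts | null {
  if (!zone) return null; // zone unconfigured: front-door steps are skipped
  return {
    reader: `${solutionId}.${zone}`,
    missionControl: `${solutionId}-mc.${zone}`, // flat suffix, not mc.<solutionId>.<zone>
  };
}

// Number of DNS labels a hostname adds in front of the zone.
function labelsAdded(host: string, zone: string): number {
  return host.slice(0, host.length - zone.length - 1).split(".").length;
}

const hosts = frontDoorHosts("acme-billing", "example.com");
console.log(hosts?.missionControl); // acme-billing-mc.example.com
```

Both hostnames add exactly one label to the zone, which is what keeps them within one-level certificate coverage.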

ACA custom domains need an explicit binding

A proxied Cloudflare CNAME to an ACA FQDN alone returns HTTP 525: ACA routes and serves certificates by SNI and doesn’t recognise the custom host until it’s bound. The dance lives twice — the lib.sh helpers (used by the singleton aurora path) and the engine’s frontdoor step, a faithful TS port (asuid TXT → grey-cloud → hostname add Disabled → managed certificate → SniEnabled bind → re-proxy). The bash helpers:

ensure_dns / ensure_txt / cf_delete_dns — Cloudflare record upserts via cfapi / cf_zone_id;
aca_bind_custom_domain — writes the asuid.<host> ownership TXT, temporarily grey-clouds the CNAME for validation, runs az containerapp hostname add then hostname bind --validation-method CNAME (free, auto-renewing managed cert), idempotent once SniEnabled;
reconcile_ssl — zone SSL mode to Full (strict);
reconcile_transform_rule_hosts — one header-injection Transform Rule per scope (one host for aurora; a solution’s two public hosts share a single rule).
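
The ordering of the custom-domain dance, and the fact that it becomes a no-op once the binding is SniEnabled, can be sketched as a planning function over observed state. The `DomainState` shape, the phase names, and `remainingPhases` are assumptions for illustration; neither the bash helpers nor the engine's frontdoor step are claimed to be structured this way.

```typescript
// Hypothetical planner: which phases of the binding sequence are still needed.
type Binding = "none" | "Disabled" | "SniEnabled";

interface DomainState {
  asuidTxt: boolean;       // asuid.<host> ownership TXT exists
  cnameProxied: boolean;   // Cloudflare CNAME is orange-clouded
  binding: Binding;        // ACA hostname binding type
  hasManagedCert: boolean; // managed certificate issued for the host
}

type Phase =
  | "asuid-txt" | "grey-cloud" | "hostname-add"
  | "managed-certificate" | "sni-bind" | "re-proxy";

function remainingPhases(s: DomainState): Phase[] {
  // Already bound: at most the proxy flag needs restoring.
  if (s.binding === "SniEnabled") return s.cnameProxied ? [] : ["re-proxy"];
  const phases: Phase[] = [];
  if (!s.asuidTxt) phases.push("asuid-txt");
  if (s.cnameProxied) phases.push("grey-cloud"); // validation needs the unproxied CNAME
  if (s.binding === "none") phases.push("hostname-add");
  if (!s.hasManagedCert) phases.push("managed-certificate");
  phases.push("sni-bind", "re-proxy");
  return phases;
}

const fresh: DomainState = { asuidTxt: false, cnameProxied: true, binding: "none", hasManagedCert: false };
console.log(remainingPhases(fresh).join(" -> "));
// asuid-txt -> grey-cloud -> hostname-add -> managed-certificate -> sni-bind -> re-proxy
```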

Edge lockdown (opt-in for solutions)

Aurora is always Cloudflare-locked. The per-solution public apps are public by default; setting edge.lock_solutions: true wires the same lockdown onto the two of them (the EDGE_SHARED_SECRET ACA secret + proxy.ts, which exempts /health, plus one per-solution Transform Rule covering the two hosts). The agent is outside the lockdown’s scope entirely: it is internal, needs no edge secret, and the old reader→agent bypass header (x-orbit-bypass / EDGE_BYPASS_SECRET) is obsolete — the config keys edge.bypass_header_name / edge.bypass_secret were removed. Edge lockdown blocks direct-origin access but is not user authentication — WorkOS user auth in front of the two public subdomains is a planned follow-on.
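
A minimal sketch of the origin-side check, assuming a hypothetical header name (`x-edge-secret`; the actual injected header is not named here) and a plain function instead of framework middleware. It is not the contents of proxy.ts; it only illustrates the three rules stated above: no secret configured means no lockdown, /health is exempt, and anything else needs the matching header or gets a 403.

```typescript
import { timingSafeEqual } from "node:crypto";

// Hypothetical header name; the source does not name the injected header.
const EDGE_HEADER = "x-edge-secret";

// Constant-time comparison for equal-length inputs (length itself is not hidden).
function secretsMatch(presented: string, expected: string): boolean {
  const a = Buffer.from(presented, "utf8");
  const b = Buffer.from(expected, "utf8");
  return a.length === b.length && timingSafeEqual(a, b);
}

// Returns 403 to block the request, or null to let it through.
function edgeGuard(
  pathname: string,
  headers: Record<string, string | undefined>,
  expectedSecret: string | undefined,
): 403 | null {
  if (!expectedSecret) return null;        // lockdown not enabled for this app
  if (pathname === "/health") return null; // health probes are exempt
  const presented = headers[EDGE_HEADER];
  if (presented !== undefined && secretsMatch(presented, expectedSecret)) return null;
  return 403;                              // direct-origin traffic lacks the header
}

console.log(edgeGuard("/", {}, "s3cret"));                         // 403
console.log(edgeGuard("/", { "x-edge-secret": "s3cret" }, "s3cret")); // null
```

As the text notes, this blocks direct-origin access only; it says nothing about who the user is.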

Mission Control → agent: runtime `AGENT_URL` + the `/agent` proxy

The agent’s address is per-solution, but the shared image can’t bake it in (D1) — Next.js inlines NEXT_PUBLIC_* at build time. So mc-<h> receives the agent’s in-env address at runtime as the AGENT_URL env var (http://agent-<h>). Because the agent is internal, the browser can’t call it directly: Mission Control ships a same-origin streaming proxy route (src/app/agent/[...path]/route.ts) that forwards fetch and SSE to AGENT_URL, and the app’s force-dynamic root layout injects the literal prefix /agent as window.__AGENT_URL__ whenever AGENT_URL is set; the server side reads process.env.AGENT_URL directly. Similarly, the reader gets AGENT_WEBAPP_URL and serves a small force-dynamic route at /mission-control that 302-redirects to Mission Control — both values are computed deterministically, so they can be set before the target app exists.
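
The address handling can be sketched with two small helpers: one computing what the browser is told, the other mapping an incoming same-origin request onto the internal agent. Stripping the `/agent` prefix before forwarding is an assumption about the route's behavior, and `clientAgentBase` / `agentTarget` are hypothetical names, not the contents of route.ts.

```typescript
// Hypothetical sketch of the /agent proxy's address handling.

// What the root layout would expose to the browser: the same-origin prefix,
// and only when the server-side AGENT_URL is configured.
function clientAgentBase(agentUrl: string | undefined): string | undefined {
  return agentUrl ? "/agent" : undefined;
}

// Map an incoming request URL under /agent to the in-env agent address.
function agentTarget(requestUrl: string, agentUrl: string): string {
  const incoming = new URL(requestUrl);
  // Strip the "/agent" prefix (assumed), keeping the rest of the path and the query.
  const rest = incoming.pathname.replace(/^\/agent(?=\/|$)/, "") || "/";
  return new URL(rest + incoming.search, agentUrl).toString();
}

const target = agentTarget(
  "https://acme-billing-mc.example.com/agent/tasks/stream?since=5",
  "http://agent-0123456789ab",
);
console.log(target); // http://agent-0123456789ab/tasks/stream?since=5
```

The browser only ever sees the same-origin `/agent` prefix, so the internal address never leaves the environment.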

Scaling

App	minReplicas	maxReplicas	Notes
`aurora`	1	3	Control plane stays warm
`orbit-<h>` (reader)	0	3	Stateless; scale-to-zero
`mc-<h>` (Mission Control)	0	3	Stateless; scale-to-zero
`agent-<h>` (writer)	0	1	See below
`svc-*` (registry)	0	2	Internal-only, stateless

The agent’s bounds encode two decisions:

Hard cap at 1 replica (D4) — two replicas sharing one .git over SMB corrupt index.lock. The single-writer invariant is non-negotiable.
Scale-to-zero since 2026-07-03, made safe by a busy keep-alive inside the agent (not a queue/KEDA scaler): maxq-orbit-agent’s src/util/keepAlive.ts holds a long-poll GET /keepalive?hold=45 open against the agent’s own ingress (the AGENT_SELF_URL env var, injected by both provisioners — the in-env address http://agent-<h>) while planning turns, the task drain loop, or startup recovery are in flight. The request passes the (internal) ingress, so an in-flight request pins ACA’s default HTTP scale rule above zero and the replica is never evicted mid-task; when idle, the hold aborts and the app scales to zero after the cooldown. Any incoming request wakes it. AGENT_MIN_REPLICAS=1 (bash env override) forces always-on; with AGENT_SELF_URL unset (local dev) the keep-alive is dormant.
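
The busy keep-alive can be sketched as a reference counter that starts a hold when the first unit of work begins and aborts it when the last one ends. `BusyKeepAlive` and the injected `startHold` callback are hypothetical; in a real agent the callback would issue the long-poll GET (and repeat it while the signal is not aborted), and this is not the contents of keepAlive.ts.

```typescript
// Hypothetical sketch of a ref-counted busy keep-alive.
type StartHold = (url: string, signal: AbortSignal) => void;

class BusyKeepAlive {
  private active = 0;
  private controller: AbortController | null = null;
  private readonly url: string | null;
  private readonly startHold: StartHold;

  constructor(selfUrl: string | undefined, startHold: StartHold, holdSeconds = 45) {
    this.startHold = startHold;
    if (selfUrl) {
      const u = new URL("/keepalive", selfUrl);
      u.searchParams.set("hold", String(holdSeconds));
      this.url = u.toString();
    } else {
      this.url = null; // AGENT_SELF_URL unset (local dev): dormant
    }
  }

  // Mark a unit of work as in flight; returns an idempotent release function.
  enter(): () => void {
    this.active += 1;
    if (this.active === 1 && this.url !== null) {
      this.controller = new AbortController();
      this.startHold(this.url, this.controller.signal); // first work item opens the hold
    }
    let released = false;
    return () => {
      if (released) return;
      released = true;
      this.active -= 1;
      if (this.active === 0 && this.controller !== null) {
        this.controller.abort(); // idle: drop the hold so the app can scale to zero
        this.controller = null;
      }
    };
  }
}

const holds: { url: string; signal: AbortSignal }[] = [];
const keepAlive = new BusyKeepAlive("http://agent-0123456789ab", (url, signal) => {
  holds.push({ url, signal });
});
const endTurn = keepAlive.enter();  // e.g. a planning turn starts
const endDrain = keepAlive.enter(); // e.g. the task drain loop starts
endTurn();
const abortedWhileBusy = holds[0]?.signal.aborted;
endDrain();
console.log(holds[0]?.url); // http://agent-0123456789ab/keepalive?hold=45
```

Overlapping work shares one hold: the second `enter()` does not open another request, and the abort fires only when both have released.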

Provisioning flow

What happens when a solution is created in Aurora with auto-deploy enabled (the engine’s deploy scenario; the legacy provision-solution.sh implements the same sequence):

The bash path orders the three apps agent → reader → Mission Control; the reader’s ORBIT_AGENT_URL and Mission Control’s AGENT_URL both carry the deterministic in-env address http://agent-<h>.

az containerapp create --yaml silently drops identity, registries, secrets, and scale — an app created that way cannot pull from the private ACR. The bash provisioners therefore use the recipe in lib.sh (aca_configure_app): create on a public placeholder image, then imperatively attach the identity, wire the registry, set secrets, update --yaml the real spec (image, env, volumes), re-verify, and set scale — each step verified, transient Azure errors retried with backoff. This defect is az-CLI-specific: the engine’s ARM SDK path sends the full envelope in one createOrUpdate and needs none of the recipe — a major reason TypeScript won the solution plane (ADR-015).

Registry services: `deploy.sh postgres` and `deploy.sh services`

The portfolio registry (Customer → Tenant → Solution) deploys alongside Aurora:

deploy.sh postgres (→ provision-postgres.sh) reconciles the Flexible Server psql-orbit (Burstable Standard_B1ms, PostgreSQL 16, 32 GiB) with Microsoft Entra authentication only; no password exists anywhere. The UAMI id-aurora and the signed-in az user are the Entra admins. The server uses public network access with the allow-Azure-services firewall rule (VNet integration is the documented escalation). The database is portfolio; schemas and tables come from the services’ own boot-time migrations.
deploy.sh services (→ provision-services.sh) renders one app-service.yaml.tmpl four times into svc-customer, svc-tenant, svc-solution, and svc-activity (the activity feed, whose peer-URL placeholders stay empty and are pruned): internal ingress only, stateless, minReplicas: 0, using the same aca_configure_app recipe. Services connect to Postgres with a DefaultAzureCredential token as the pg password (requires AZURE_CLIENT_ID pointing at the UAMI). In-environment URLs use the app-name form (http://svc-customer) — the .internal.<env-domain> FQDN form does not resolve on this environment.

Order on a fresh environment: bootstrap → postgres → build-images.sh for the four services → services → aurora.

Verify, status, and teardown

deploy.sh solution <id> verify (or the verify_solution agent tool) — the engine’s verify scenario runs every deploy step’s read-only assertions: pinned images present in ACR, shares and links exist, shares seeded, all three apps Succeeded on the pinned image (image drift is a finding), the agent’s ingress internal, the front doors SniEnabled with correct DNS, and the lockdown rule present when the flag is on — each finding with a remediation (usually the narrow scenario that fixes it). The solution_status tool is the faster snapshot variant (live app states, image drift, effective URLs, registry cross-check); deploy_diagnostics surfaces the real failed ARM operations from the Activity Log.
deploy.sh verify <id> — the legacy bash check (apps/shares/links + advisory, lockdown-aware HTTP probes; asserts the agent FQDN carries .internal.). deploy.sh verify without an id checks the aurora app and its front door — that half stays current (platform plane).
deploy.sh solution <id> teardown (or the teardown_solution agent tool) runs a destructive teardown in dependency order: the three apps, the seed job, the env-storage links (including a legacy cust-ro-<h>), and then both shares as the last Azure resources, followed by the Cloudflare CNAMEs, asuid TXTs, and the solution’s Transform Rule, plus any legacy <id>-agent.<zone> records. Deleting the shares destroys the agent’s working clones: GitHub remains the source of truth, but unpushed work is lost. The CLI therefore requires retyping the solutionId (unless --yes is passed), and the agent tool requires the typed-back confirmId. deploy.sh delete <id> is the legacy bash equivalent.