Azure Deployment
Azure Container Apps (ACA) is the platform’s sole deployment target (decision
D0 — the earlier Vercel path was retired and deleted on 2026-07-02). One singleton
control-plane app (aurora) runs permanently; every solution created through
Aurora gets its own per-solution Orbit stack of three ACA apps provisioned
automatically, all pulling shared images from one container registry, with
the two public apps (reader + Mission Control) fronted by per-solution
Cloudflare subdomains and the agent kept internal to the environment
(since 2026-07-03).
Since 2026-07-05 the implementation is split by plane, not by twin (the earlier bash/TS twin model is retired — see ADR-015):
- Solution plane / TypeScript: the @maxq/orbit-deploy engine (implementation/shared/orbit-deploy, export ./azure), a catalog of idempotent steps (each runnable as plan / apply / verify) composed into scenarios. Three drivers share it: the Aurora agent tools, the ops CLI (deploy.sh solution <id> [scenario]), and the createSolution() auto-deploy hook behind ORBIT_DEPLOY_ENABLED.
- Platform plane / bash: infrastructure/azure/ keeps the human-run singletons (bootstrap, postgres, services, aurora) and the build-images.sh release step, all driven by config.local.yaml. provision-solution.sh remains as the legacy solution provisioner until the engine’s front door is live-proven, then it is deleted.
Topology
All resources live in one resource group in westeurope, identified by
the non-secret coordinates in infrastructure/azure/config.local.yaml (copied
from config.local.example.yaml, git-ignored):
| Resource | Name | Role |
|---|---|---|
| Resource group | rg-solutions-orbit | Holds everything below |
| Container registry (ACR) | acrorbit (acrorbit.azurecr.io) | Basic SKU, admin user disabled — pull is via managed identity only |
| ACA managed environment | azure.aca_env (live: acae-orbit) | Hosts aurora, the internal platform services (registry + activity feed), and every per-solution stack |
| Storage account | stororbit | Premium FileStorage; backs the per-solution Azure Files shares |
| PostgreSQL Flexible Server | psql-orbit | The portfolio registry database (portfolio), Entra-only auth |
| User-assigned managed identity | id-aurora | AcrPull on acrorbit + Contributor on the resource group |
The environment has moved twice: the design’s original westcentralus
environment went bad in 2026-06 (it stopped provisioning any new revision) and
was replaced by orbit-aca-wcus; the whole platform was then rebuilt in
westeurope (2026-07-04) as acae-orbit. Names in older documents and the
example config lag reality — azure.aca_env in config.local.yaml is
authoritative.
Topology diagram
How the resources fit together — Cloudflare in front, one ACA environment
hosting the singleton control plane, the internal registry services, and one
three-app stack per solution, with ACR, Azure Files, and Postgres behind them
(icons are the official Azure service icons, vendored under public/azure/):
Reading notes: only the agent mounts storage — the reader and Mission Control
are deliberately stateless; the agent and the platform services (registry +
activity feed) have internal
ingress only (nothing outside the environment can reach them); the three
public apps (aurora, reader, Mission Control) sit behind per-name Cloudflare
CNAMEs with the edge-header lockdown described under
Networking and domains.
Auth model: managed identity, no secrets
There is no service-principal secret anywhere. The bash path assumes an
existing az session (developer az login, or az login --identity inside the
aurora container); the TypeScript path uses DefaultAzureCredential, which
picks up the same identities. bootstrap.sh is the one-time identity step: it
creates the UAMI id-aurora and grants it AcrPull on acrorbit and
Contributor on rg-solutions-orbit (GET-first, idempotent, safe to re-run).
The scripts
infrastructure/azure/ contains the whole ops surface, dispatched through
deploy.sh:
deploy.sh bootstrap # one-time: id-aurora UAMI + role assignments + AcrPull
deploy.sh aurora # provision the singleton `aurora` app + Cloudflare front door
deploy.sh postgres # provision the portfolio-registry Flexible Server
deploy.sh services # provision svc-customer / svc-tenant / svc-solution / svc-activity
deploy.sh docs # provision the `docs` app (documentation site) + front door
deploy.sh solution <id> [scenario] # per-solution stacks via the orbit-deploy ENGINE
deploy.sh provision <id> # LEGACY bash per-solution provisioner (until the engine is live-proven)
deploy.sh verify [<id>] # read-only health check (a solution stack, or aurora)
deploy.sh delete <id> [--yes] # DESTRUCTIVE teardown of a solution stack (legacy bash path)

| Script | Purpose |
|---|---|
| lib.sh | Sourced foundation: logging, the YAML config reader (cfg / cfg_req), Azure coordinate accessors, the deterministic azure_names helper, the ACA reconcile recipe (aca_configure_app and friends), and the Cloudflare / custom-domain helpers |
| bootstrap.sh | One-time UAMI + roles |
| provision-aurora.sh | The singleton aurora app + its Cloudflare front door (app-aurora.yaml.tmpl) |
| provision-solution.sh <id> | LEGACY per-solution worker: shares → links → seed job → three apps → front doors (app-agent / app-orbit / app-agent-webapp / seed-job .yaml.tmpl); superseded by deploy.sh solution and kept only until the engine is live-proven |
| provision-postgres.sh | The registry’s PostgreSQL Flexible Server |
| provision-services.sh | The platform services (three registry + the activity feed) from one app-service.yaml.tmpl |
| provision-docs.sh | The docs app (the orbit-documentation site, which is this site) + its documentation.<zone> Cloudflare front door (app-docs.yaml.tmpl) |
| build-images.sh | Release step: build + push images from git tags, record releases |
| deploy.sh | Dispatcher; verify and delete are implemented inline |
Every provisioner is idempotent and reconciling (decision D10): GET-first, create-if-absent, update-on-drift, never an error on “already exists”. Re-running any scenario or provisioner converges.
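The GET-first rule can be illustrated with a minimal sketch. It is not the project's code: `ResourceClient`, `reconcile`, and the in-memory client are hypothetical names, and drift is detected by a plain JSON comparison, which assumes specs serialise with a stable key order.

```typescript
// Hypothetical sketch of a GET-first reconcile; none of these names come from the repo.
type Outcome = "created" | "updated" | "unchanged";

interface ResourceClient<Spec> {
  get(name: string): Spec | undefined;
  put(name: string, spec: Spec): void;
}

function reconcile<Spec>(
  client: ResourceClient<Spec>,
  name: string,
  desired: Spec,
): Outcome {
  const current = client.get(name); // GET first: never create blindly
  if (current === undefined) {
    client.put(name, desired); // create-if-absent
    return "created";
  }
  // Drift check by structural comparison (assumes stable key order).
  if (JSON.stringify(current) !== JSON.stringify(desired)) {
    client.put(name, desired); // update-on-drift
    return "updated";
  }
  return "unchanged"; // "already exists" is a success, not an error
}

// In-memory stand-in for a cloud API, for demonstration only.
function memoryClient<Spec>(): ResourceClient<Spec> {
  const store = new Map<string, Spec>();
  return {
    get: (name) => store.get(name),
    put: (name, spec) => {
      store.set(name, spec);
    },
  };
}

const demoClient = memoryClient<{ image: string; maxReplicas: number }>();
const firstRun = reconcile(demoClient, "orbit-demo", { image: "orbit-webapp:1.0.0", maxReplicas: 3 });
const secondRun = reconcile(demoClient, "orbit-demo", { image: "orbit-webapp:1.0.0", maxReplicas: 3 });
const thirdRun = reconcile(demoClient, "orbit-demo", { image: "orbit-webapp:1.1.0", maxReplicas: 3 });
console.log(firstRun, secondRun, thirdRun); // created unchanged updated
```

Running the same call twice yields `unchanged` the second time, which is the convergence property the provisioners rely on.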
The solution engine (@maxq/orbit-deploy)
Per-solution provisioning lives in implementation/shared/orbit-deploy
(src/azure/). The engine is a catalog of steps, each implementing three
modes — plan (read-only diff), apply (reconcile), verify (read-only
assertions) — and every step returns a structured StepResult (status
ok / changed / failed / skipped, a one-line detail, machine-readable
evidence, and a remediation hint on failures). A run produces a RunLog
that is persisted onto the registry solution record as deployment.lastRun
after every step, so progress survives a dropped chat turn and is readable by
the UI and the deployment_progress agent tool.
Steps, in dependency order: preflight.images (every pinned tag must
exist in ACR — the engine validates and stops, it never builds; images remain
the human release step), storage.shares, storage.links, seed
(conditional — skipped when both shares already have a .git clone at the
root, checked over the Files data plane; reseed forces a run), app.agent,
app.orbit, app.mc, frontdoor.orbit, frontdoor.mc (the TS port of the
custom-domain dance below), and edge.lockdown (the per-solution Transform
Rule). Teardown has its own reverse-ordered steps.
Scenarios are named step lists — this is what makes partial operations first-class:
| Scenario | Steps | Typical ask |
|---|---|---|
| deploy | everything | full deploy / reconcile |
| redeploy-apps | preflight + the three apps (optionally filtered to one) | roll to newly pinned images |
| frontdoor | the two front doors + edge lockdown | fix Cloudflare / certs |
| storage | shares + links + seed | revalidate the Files integration |
| verify | the full deploy list in verify mode | health / drift report |
| teardown | destructive reverse order | delete a stack |
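As an illustration of "scenarios are named step lists", the table can be read as a plain lookup from scenario name to step ids. The sketch below is an assumption about shape, not the engine's actual data structure: `SCENARIOS`, `resolveSteps`, and the `only` filter are hypothetical, and teardown is left out because it uses its own reverse-ordered steps.

```typescript
// Hypothetical sketch: step ids follow the text above, the data shape is assumed.
const DEPLOY_STEPS = [
  "preflight.images", "storage.shares", "storage.links", "seed",
  "app.agent", "app.orbit", "app.mc",
  "frontdoor.orbit", "frontdoor.mc", "edge.lockdown",
] as const;

type StepId = (typeof DEPLOY_STEPS)[number];
type AppKey = "agent" | "orbit" | "mc";

const SCENARIOS: Record<string, readonly StepId[]> = {
  deploy: DEPLOY_STEPS,
  "redeploy-apps": ["preflight.images", "app.agent", "app.orbit", "app.mc"],
  frontdoor: ["frontdoor.orbit", "frontdoor.mc", "edge.lockdown"],
  storage: ["storage.shares", "storage.links", "seed"],
  verify: DEPLOY_STEPS, // same list, executed in verify mode
};

// Resolve a scenario to its steps, optionally keeping only some apps
// (modelled on the CLI's --only agent,orbit,mc flag).
function resolveSteps(scenario: string, only?: readonly AppKey[]): StepId[] {
  const steps = SCENARIOS[scenario];
  if (!steps) throw new Error(`unknown scenario: ${scenario}`);
  if (!only) return [...steps];
  return steps.filter((step) => {
    if (!step.startsWith("app.")) return true; // non-app steps always stay
    return only.includes(step.slice("app.".length) as AppKey);
  });
}

console.log(resolveSteps("redeploy-apps", ["agent"]).join(","));
// preflight.images,app.agent
```

Keeping scenarios as data rather than code paths is what lets a partial operation such as `frontdoor` reuse exactly the same steps as a full deploy.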
An apply run aborts the remaining steps on the first failure; plan and
verify always run everything so the report is complete. In apply mode the
ARM SDK sends full envelopes in single createOrUpdate calls — none of the
az-CLI --yaml field-dropping that forced the bash path into its
placeholder-image recipe.
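The mode-dependent abort rule can be sketched as a small runner. This is a simplified, synchronous illustration under stated assumptions: the `StepResult` fields mirror the description above, but `Step`, `runScenario`, and the choice to record un-run steps as `skipped` are hypothetical (the real steps are presumably asynchronous).

```typescript
// Hypothetical sketch of the runner semantics; not the engine's implementation.
type Mode = "plan" | "apply" | "verify";
type Status = "ok" | "changed" | "failed" | "skipped";

interface StepResult {
  step: string;
  status: Status;
  detail: string;
  evidence?: Record<string, unknown>;
  remediation?: string;
}

interface Step {
  id: string;
  run(mode: Mode): StepResult;
}

function runScenario(steps: readonly Step[], mode: Mode): StepResult[] {
  const log: StepResult[] = [];
  let aborted = false;
  for (const step of steps) {
    if (aborted) {
      // Assumption: remaining steps are recorded as skipped rather than omitted.
      log.push({ step: step.id, status: "skipped", detail: "not run: an earlier step failed" });
      continue;
    }
    const result = step.run(mode);
    log.push(result);
    // Only apply aborts; plan and verify keep going so the report is complete.
    if (mode === "apply" && result.status === "failed") aborted = true;
  }
  return log;
}

// Demo: the middle step always fails.
const executed: string[] = [];
const makeStep = (id: string, status: Status): Step => ({
  id,
  run: (mode) => {
    executed.push(`${mode}:${id}`);
    return { step: id, status, detail: `${id} ran in ${mode} mode` };
  },
});
const demoSteps = [makeStep("storage.shares", "ok"), makeStep("seed", "failed"), makeStep("app.agent", "ok")];
const applyLog = runScenario(demoSteps, "apply");
const verifyLog = runScenario(demoSteps, "verify");
console.log(applyLog.map((r) => r.status).join(","));  // ok,failed,skipped
console.log(verifyLog.map((r) => r.status).join(",")); // ok,failed,ok
```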
The ops entry point is deploy.sh solution <id> [scenario] [--plan] [--only agent,orbit,mc] [--reseed] [--yes] [--json], which delegates to the package’s
CLI (reading the same config.local.yaml); the in-product entry points are
the agent deploy tools and the createSolution hook (reading the env rendered
into the aurora app by provision-aurora.sh). Naming parity between the
engine and the bash azure_names is enforced by
orbit-deploy/scripts/parity-names.sh.
The singleton Aurora app
deploy.sh aurora (→ provision-aurora.sh) reconciles the control-plane app:
- Name aurora (configurable via aurora.app_name), external ingress on port 3000, kept warm at minReplicas: 1 / maxReplicas: 3 so the dashboard never cold-starts.
- Secrets as ACA secrets: the two GitHub App private keys, the cross-org provision PAT, the edge shared secret, and the Anthropic credential for the agent chat, all sourced from the git-ignored config and never baked into an image.
- Cloudflare-only front door: a proxied CNAME aurora.<zone> → the app’s ACA FQDN, zone SSL Full (strict), and a header-injection Transform Rule. The app’s own src/proxy.ts middleware 403s any request missing the injected edge header, so the raw *.azurecontainerapps.io FQDN is blocked and only Cloudflare traffic passes.
- The app runs with the id-aurora managed identity, which is what lets it provision entire solution stacks from inside the product.
Per-solution stacks: three apps
Every solution gets three ACA apps, all built from shared images — two external (reader + Mission Control) and one internal-only (the agent, since 2026-07-03):
| App | Name | Image | Port | Ingress | Role |
|---|---|---|---|---|---|
| Reader | orbit-<h> | orbit-webapp | 3000 | External | Read-only solution viewer; mount-free — fetches solution data from the agent over HTTP (ORBIT_AGENT_URL, the in-env address) |
| Writer | agent-<h> | maxq-orbit-agent | 3000 | Internal | The authoring agent; sole owner of the repo mounts, single writer; reachable only from inside the environment as http://agent-<h> |
| Mission Control | mc-<h> | maxq-orbit-agent-webapp | 3001 | External | Agent console UI; stateless (no share, no mount), reaches the agent via the in-env AGENT_URL and proxies browser traffic through its same-origin /agent route |
The naming scheme and the hash
A solutionId is <orgSlug>-<solSlug> and can run to ~41 characters, but ACA
app names are capped at 32. The scheme therefore hashes the id:
h = sha256(solutionId).hex[:12]
and derives every resource name from it. Human identity lives in resource
tags and share metadata (solutionId, org, solution,
customerRepo, internalRepo, plus portfolio customer / tenant tags) —
e.g. az containerapp list --query "[?tags.solutionId=='acme-billing']".
| Resource | Name pattern |
|---|---|
| Reader app | orbit-<h> |
| Agent app | agent-<h> |
| Mission Control app | mc-<h> |
| Customer Files share | <solutionId>-customer |
| Internal Files share | <solutionId>-internal |
| Env-storage link (customer RW) | cust-rw-<h> |
| Env-storage link (internal RW) | int-rw-<h> |
| Env-storage link (customer RO) | cust-ro-<h> — legacy, no longer provisioned; kept so teardown can clean old stacks |
| Seed job | seed-<h> |
Two naming rules are load-bearing: ACA app names must start with a letter (the
orbit- / agent- / mc- / seed- prefixes guarantee that), and
ManagedEnvironmentStorage link names must also start with a letter — the
hash begins with a digit ~62.5% of the time, so the role word comes first
(cust-rw-<h>, never the hash first).
One naming home, one parity gate
The scheme’s home is the engine — azureNames(solutionId) in
orbit-deploy/src/azure/names.ts (returns orbitApp, agentApp,
agentWebappApp, customerShare, internalShare, linkCustRw, linkIntRw,
seedJob). The bash azure_names in infrastructure/azure/lib.sh (the
AZ_* shell globals) remains for the legacy provisioner and teardown, and the
two must stay byte-identical — enforced by
orbit-deploy/scripts/parity-names.sh, which diffs both implementations over
a fixed id set (run it whenever either side changes).
Determinism is the point: reconcile, verify, and teardown never look anything
up — they recompute every name from the solutionId.
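A minimal sketch of the scheme, assuming the id is hashed as UTF-8 text. The field names follow the list above, but the body is an illustration and not the contents of names.ts.

```typescript
import { createHash } from "node:crypto";

// Field names as listed in the text; the implementation below is a sketch.
interface AzureNames {
  orbitApp: string;
  agentApp: string;
  agentWebappApp: string;
  customerShare: string;
  internalShare: string;
  linkCustRw: string;
  linkIntRw: string;
  seedJob: string;
}

function azureNames(solutionId: string): AzureNames {
  // h = sha256(solutionId).hex[:12] (assumption: UTF-8 encoding of the id)
  const h = createHash("sha256").update(solutionId, "utf8").digest("hex").slice(0, 12);
  return {
    orbitApp: `orbit-${h}`,
    agentApp: `agent-${h}`,
    agentWebappApp: `mc-${h}`,
    customerShare: `${solutionId}-customer`,
    internalShare: `${solutionId}-internal`,
    linkCustRw: `cust-rw-${h}`, // role word first: link names must start with a letter
    linkIntRw: `int-rw-${h}`,
    seedJob: `seed-${h}`,
  };
}

const demoNames = azureNames("acme-billing");
console.log(demoNames.customerShare); // acme-billing-customer
console.log(demoNames.orbitApp.length); // 18, well under the 32-character cap
```

Because the function is pure, calling it again with the same solutionId always yields the same names, which is why no lookup table is needed.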
Storage: two shares, two links, account key
Each solution gets two Azure Files shares on stororbit holding working
clones of its two GitHub repos (GitHub remains the source of truth — D9):
- <solutionId>-customer → mounted at /repo on the agent (ReadWrite, via link cust-rw-<h>).
- <solutionId>-internal → mounted at /repo-internal on the agent (ReadWrite, via link int-rw-<h>).
The reader is mount-free (it fetches solution data from the agent over HTTP),
so the historical third link — the read-only cust-ro-<h> — is no longer
created; deploy.sh delete still removes it from stacks that predate the
refactor. The methodology mount is not a third share either (D5): the agent
reads the customer repo’s vendored copy at /repo/trajectory-methodology.
Two storage decisions to know:
- D8: env-storage links carry the storage account key. ACA’s Azure Files linking has no identity-based path, so provision-solution.sh reads a key via az storage account keys list (the TypeScript engine via storageAccounts.listKeys) and passes it to az containerapp env storage set. This is the one place a key is used; image pull and provisioning stay identity-based.
- Premium FileStorage enforces a 100 GiB minimum share quota, so the bash provisioner defaults SHARE_QUOTA_GIB=100 (env-overridable).
The seed job
Before the apps come up, a one-shot ACA job seed-<h> clones both repos into the mounted shares (or runs pull --ff-only on clones that already exist). It reuses the
maxq-orbit-agent image (it already bundles git and lives in acrorbit, so the
UAMI/AcrPull path covers it — no Docker Hub pull), takes the cross-org
provision token as its git-token secret, and the run waits for the
execution to report Succeeded before creating any app.
In the engine the seed step is conditional: it first checks — over the
Files data plane, with the account key — whether each share already has a
.git directory at its root, and skips the multi-minute job when both do
(the legacy bash provisioner re-ran the seed on every reconcile, its biggest
re-run cost). --reseed / the reseed tool parameter forces a run, which
then pull --ff-onlys the existing clones.
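The skip decision reduces to a small pure function once the `.git` checks have been made. The sketch below assumes that check result is already available as a boolean per share; `ShareState` and `decideSeed` are hypothetical names.

```typescript
// Hypothetical sketch of the conditional-seed decision; names are illustrative.
interface ShareState {
  share: string;
  hasGitDir: boolean; // result of the Files data-plane check for <share>/.git
}

interface SeedDecision {
  run: boolean;
  reason: string;
}

function decideSeed(shares: readonly ShareState[], reseed: boolean): SeedDecision {
  if (reseed) return { run: true, reason: "reseed requested" };
  const unseeded = shares.filter((s) => !s.hasGitDir).map((s) => s.share);
  if (unseeded.length === 0) {
    return { run: false, reason: "every share already holds a .git clone" };
  }
  return { run: true, reason: `no .git at the root of: ${unseeded.join(", ")}` };
}

const bothSeeded: ShareState[] = [
  { share: "acme-billing-customer", hasGitDir: true },
  { share: "acme-billing-internal", hasGitDir: true },
];
console.log(decideSeed(bothSeeded, false).run); // false
console.log(decideSeed(bothSeeded, true).run);  // true
```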
Images: shared per release (D1)
Images are shared per release, not per solution. A per-solution deployment
is only ACA apps + shares + env vars; every app pins an image tag from the one
registry. build-images.sh is the release step:
commit → git tag <component>/v<version> → build-images.sh <target>[@<version>] → deploy
- Builds from the immutable git tag <component>/v<version>, never the working tree, by checking the tag out into a throwaway git worktree, so the image, the recorded commit, and the tag can never drift.
- Builds run server-side via az acr build (ACR Tasks; no local Docker) and push both :<version> and :latest.
- Components version independently: each target resolves its own latest <component>/v* tag or an explicit @<version> pin. There is deliberately no all target.
- After a successful push it writes/reconciles releases/<component>/v<version>/release.yaml, a container release record (registry / repository / tag + the tag’s commit, status draft) validated by releases/release.schema.json. See Release Registry.
Targets: aurora (→ aurora-webapp), orbit-webapp, orbit-agent
(→ maxq-orbit-agent), orbit-agent-webapp (→ maxq-orbit-agent-webapp),
docs (→ orbit-documentation), the
registry services customer-service / tenant-service / solution-service,
and activity-service. The orbit-webapp, orbit-agent, and service builds
use implementation/ (not the app’s codebase/) as build context so the
shared trajectory-loader / registry-kit / activity-kit packages are
inside it.
Deployed tags are pinned per environment in config.local.yaml’s images:
block; bumping a tag there and re-running the relevant provisioner rolls the
apps forward.
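How a `<target>[@<version>]` argument might resolve to a git tag can be sketched as follows. build-images.sh is a bash script, so this is only an illustration of the rule: `parseTarget`, `compareVersions`, and `resolveTag` are hypothetical, and "latest" is assumed to mean the highest dotted numeric version.

```typescript
// Hypothetical sketch of target resolution; assumes dotted numeric versions.
interface ReleaseTarget {
  component: string;
  version?: string;
}

function parseTarget(target: string): ReleaseTarget {
  const at = target.indexOf("@");
  if (at === -1) return { component: target };
  return { component: target.slice(0, at), version: target.slice(at + 1) };
}

// Numeric comparison so that 1.10.0 sorts after 1.2.0.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

function resolveTag(target: string, gitTags: readonly string[]): string {
  const { component, version } = parseTarget(target);
  if (component === "all") throw new Error("there is deliberately no 'all' target");
  const prefix = `${component}/v`;
  const versions = gitTags
    .filter((tag) => tag.startsWith(prefix))
    .map((tag) => tag.slice(prefix.length));
  if (version !== undefined) {
    if (!versions.includes(version)) throw new Error(`tag ${prefix}${version} not found`);
    return `${prefix}${version}`; // explicit pin
  }
  if (versions.length === 0) throw new Error(`no ${prefix}* tags found`);
  const latest = [...versions].sort(compareVersions)[versions.length - 1];
  return `${prefix}${latest}`;
}

const demoTags = ["orbit-agent/v1.2.0", "orbit-agent/v1.10.0", "orbit-agent-webapp/v3.0.0"];
console.log(resolveTag("orbit-agent", demoTags));       // orbit-agent/v1.10.0
console.log(resolveTag("orbit-agent@1.2.0", demoTags)); // orbit-agent/v1.2.0
```

Matching on the full `<component>/v` prefix keeps `orbit-agent` from picking up `orbit-agent-webapp` tags.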
Networking and domains
Two per-solution apps are external; the agent is internal-only since
2026-07-03 (its ACA ingress is external: false — requests from outside the
environment are rejected by the environment proxy with a 404, and it gets no
Cloudflare subdomain; the earlier “agent is public by design” stance is
superseded, see ADR-003). In-env callers
address the agent by its app name, http://agent-<h>, which resolves via
the environment’s internal DNS and hits the ingress on :80 — the
.internal.<domain> FQDN form does not resolve on this environment, the
same pattern as the internal registry services (svc-*). When
cloudflare.zone_name is set, provision-solution.sh provisions a Cloudflare
front door for each public app:
| App | Subdomain |
|---|---|
| Reader | <solutionId>.<zone> |
| Mission Control | <solutionId>-mc.<zone> |
| Agent | — (internal ingress, no subdomain; legacy <solutionId>-agent.<zone> records are cleaned up on teardown) |
Two platform singletons follow the same front-door recipe: aurora.<zone>
(provision-aurora.sh, with the edge-secret transform rule) and
documentation.<zone> (provision-docs.sh → the docs app serving this
documentation site — proxied CNAME + managed certificate + strict SSL, but
no edge-secret rule: the site is public content and ships no edge guard).
(Flat suffixes, not nested subdomains — Universal SSL and ACA managed
certificates cover one level. With the Cloudflare zone/token unconfigured the
front-door steps report skipped and the public apps are reached on their raw
ACA FQDNs. In-product, the engine gets the Cloudflare credentials from the
CLOUDFLARE_API_TOKEN / CLOUDFLARE_ZONE_NAME env rendered into the aurora
app — before ADR-015 the in-product path could not drive Cloudflare at all and
produced front-door-less stacks.)
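The host naming in the table is a simple derivation, sketched here with a hypothetical `frontDoorHosts` helper. It returns null when no zone is configured, mirroring the case where the front-door steps report skipped; the zone `example.com` is a placeholder.

```typescript
// Hypothetical sketch of the per-solution host names; not the engine's code.
interface FrontDoorHosts {
  reader: string;
  missionControl: string;
}

function frontDoorHosts(solutionId: string, zone: string | undefined): FrontDoorHosts | null {
  if (!zone) return null; // no Cloudflare zone: apps stay on their raw ACA FQDNs
  return {
    reader: `${solutionId}.${zone}`,
    // Flat "-mc" suffix rather than a nested "mc.<solutionId>" subdomain,
    // so the host stays one level below the zone.
    missionControl: `${solutionId}-mc.${zone}`,
  };
}

const demoHosts = frontDoorHosts("acme-billing", "example.com");
console.log(demoHosts?.reader);         // acme-billing.example.com
console.log(demoHosts?.missionControl); // acme-billing-mc.example.com
```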
ACA custom domains need an explicit binding
A proxied Cloudflare CNAME to an ACA FQDN alone returns HTTP 525: ACA
routes and serves certificates by SNI and doesn’t recognise the custom host
until it’s bound. The dance lives twice — the lib.sh helpers (used by the
singleton aurora path) and the engine’s frontdoor step, a faithful TS port
(asuid TXT → grey-cloud → hostname add Disabled → managed certificate →
SniEnabled bind → re-proxy). The bash helpers:
- ensure_dns / ensure_txt / cf_delete_dns: Cloudflare record upserts via cfapi / cf_zone_id;
- aca_bind_custom_domain: writes the asuid.<host> ownership TXT, temporarily grey-clouds the CNAME for validation, runs az containerapp hostname add then hostname bind --validation-method CNAME (free, auto-renewing managed cert), idempotent once SniEnabled;
- reconcile_ssl: zone SSL mode to Full (strict);
- reconcile_transform_rule_hosts: one header-injection Transform Rule per scope (one host for aurora; a solution’s two public hosts share a single rule).
Edge lockdown (opt-in for solutions)
Aurora is always Cloudflare-locked. The per-solution public apps are public
by default; setting edge.lock_solutions: true wires the same lockdown onto
the two of them (the EDGE_SHARED_SECRET ACA secret + proxy.ts, which
exempts /health, plus one per-solution Transform Rule covering the two
hosts). The agent is outside the lockdown’s scope entirely: it is internal,
needs no edge secret, and the old reader→agent bypass header
(x-orbit-bypass / EDGE_BYPASS_SECRET) is obsolete — the config keys
edge.bypass_header_name / edge.bypass_secret were removed. Edge lockdown
blocks direct-origin access but is not user authentication — WorkOS user
auth in front of the two public subdomains is a planned follow-on.
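The guard's decision logic can be sketched as a pure function. This is not the contents of proxy.ts: the header name `x-edge-secret` and the `EdgeRequest` shape are assumptions for illustration, and a production check would ideally compare secrets in constant time.

```typescript
// Hypothetical sketch of the edge guard decision; header name is an assumption.
interface EdgeRequest {
  path: string;
  headers: Record<string, string | undefined>; // lower-cased header names
}

const EDGE_HEADER = "x-edge-secret"; // placeholder, not the real header name

function edgeGuardStatus(req: EdgeRequest, sharedSecret: string | undefined): 200 | 403 {
  if (!sharedSecret) return 200;          // lockdown not enabled for this app
  if (req.path === "/health") return 200; // health probes are exempt
  // Cloudflare's Transform Rule injects the header; direct-origin requests lack it.
  return req.headers[EDGE_HEADER] === sharedSecret ? 200 : 403;
}

const viaCloudflare: EdgeRequest = { path: "/", headers: { [EDGE_HEADER]: "s3cret" } };
const directOrigin: EdgeRequest = { path: "/", headers: {} };
console.log(edgeGuardStatus(viaCloudflare, "s3cret")); // 200
console.log(edgeGuardStatus(directOrigin, "s3cret"));  // 403
```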
Mission Control → agent: runtime AGENT_URL + the /agent proxy
The agent’s address is per-solution, but the shared image can’t bake it in
(D1) — Next.js inlines NEXT_PUBLIC_* at build time. So mc-<h> receives the
agent’s in-env address at runtime as the AGENT_URL env var
(http://agent-<h>). Because the agent is internal, the browser can’t call it
directly: Mission Control ships a same-origin streaming proxy route
(src/app/agent/[...path]/route.ts) that forwards fetch and SSE to
AGENT_URL, and the app’s force-dynamic root layout injects the literal
prefix /agent as window.__AGENT_URL__ whenever AGENT_URL is set; the
server side reads process.env.AGENT_URL directly. Similarly, the reader gets
AGENT_WEBAPP_URL and serves a small force-dynamic route at
/mission-control that 302-redirects to Mission Control — both values are
computed deterministically, so they can be set before the target app exists.
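The URL handling behind the `/agent` proxy can be sketched without any framework code. Both helpers are hypothetical: `agentProxyTarget` shows how a catch-all path might be mapped onto `AGENT_URL`, and `browserAgentBase` shows the value the layout would expose to the browser. The assumption that path segments arrive decoded and need re-encoding is mine, not the source's.

```typescript
// Hypothetical sketch of the /agent proxy's URL mapping; not the route's code.
function agentProxyTarget(
  agentUrl: string,
  pathSegments: readonly string[],
  search: string, // query string including the leading "?", or ""
): string {
  const base = agentUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  const path = pathSegments.map((segment) => encodeURIComponent(segment)).join("/");
  return `${base}/${path}${search}`;
}

// The browser only ever sees the same-origin prefix, never the internal address.
function browserAgentBase(agentUrl: string | undefined): string | undefined {
  return agentUrl ? "/agent" : undefined;
}

const proxied = agentProxyTarget("http://agent-0a1b2c3d4e5f/", ["api", "tasks"], "?stream=1");
console.log(proxied); // http://agent-0a1b2c3d4e5f/api/tasks?stream=1
console.log(browserAgentBase("http://agent-0a1b2c3d4e5f")); // /agent
```

Resolving the target at request time is what allows one shared image to serve every solution.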
Scaling
| App | minReplicas | maxReplicas | Notes |
|---|---|---|---|
| aurora | 1 | 3 | Control plane stays warm |
| orbit-<h> (reader) | 0 | 3 | Stateless; scale-to-zero |
| mc-<h> (Mission Control) | 0 | 3 | Stateless; scale-to-zero |
| agent-<h> (writer) | 0 | 1 | See below |
| svc-* (registry) | 0 | 2 | Internal-only, stateless |
The agent’s bounds encode two decisions:
- Hard cap at 1 replica (D4): two replicas sharing one .git over SMB corrupt index.lock. The single-writer invariant is non-negotiable.
- Scale-to-zero since 2026-07-03, made safe by a busy keep-alive inside the agent (not a queue/KEDA scaler): maxq-orbit-agent’s src/util/keepAlive.ts holds a long-poll GET /keepalive?hold=45 open against the agent’s own ingress (the AGENT_SELF_URL env var, injected by both provisioners with the in-env address http://agent-<h>) while planning turns, the task drain loop, or startup recovery are in flight. The request passes the (internal) ingress, so an in-flight request pins ACA’s default HTTP scale rule above zero and the replica is never evicted mid-task; when idle, the hold aborts and the app scales to zero after the cooldown. Any incoming request wakes it. AGENT_MIN_REPLICAS=1 (bash env override) forces always-on; with AGENT_SELF_URL unset (local dev) the keep-alive is dormant.
Provisioning flow
What happens when a solution is created in Aurora with auto-deploy enabled
(the engine’s deploy scenario; the legacy provision-solution.sh implements
the same sequence):
The bash path orders the three apps agent → reader → Mission Control; the
reader’s ORBIT_AGENT_URL and Mission Control’s AGENT_URL both carry the
deterministic in-env address http://agent-<h>.
az containerapp create --yaml silently drops identity, registries,
secrets, and scale — an app created that way cannot pull from the private
ACR. The bash provisioners therefore use the recipe in lib.sh
(aca_configure_app): create on a public placeholder image, then imperatively
attach the identity, wire the registry, set secrets, update --yaml the real
spec (image, env, volumes), re-verify, and set scale — each step verified,
transient Azure errors retried with backoff. This defect is az-CLI-specific:
the engine’s ARM SDK path sends the full envelope in one createOrUpdate and
needs none of the recipe — a major reason TypeScript won the solution plane
(ADR-015).
Registry services: deploy.sh postgres and deploy.sh services
The portfolio registry (Customer → Tenant → Solution) deploys alongside Aurora:
- deploy.sh postgres (→ provision-postgres.sh) reconciles the Flexible Server psql-orbit (Burstable Standard_B1ms, PostgreSQL 16, 32 GiB) with Microsoft Entra authentication only; no password exists anywhere. The UAMI id-aurora and the signed-in az user are the Entra admins. Public network access is enabled with the allow-Azure-services firewall rule (VNet integration is the documented escalation). The database is portfolio; schemas and tables come from the services’ own boot-time migrations.
- deploy.sh services (→ provision-services.sh) renders one app-service.yaml.tmpl four times into svc-customer, svc-tenant, svc-solution, and svc-activity (the activity feed, whose peer-URL placeholders stay empty and are pruned): internal ingress only, stateless, minReplicas: 0, using the same aca_configure_app recipe. Services connect to Postgres with a DefaultAzureCredential token as the pg password (requires AZURE_CLIENT_ID pointing at the UAMI). In-environment URLs use the app-name form (http://svc-customer); the .internal.<env-domain> FQDN form does not resolve on this environment.
Order on a fresh environment: bootstrap → postgres → build-images.sh
for the four services → services → aurora.
Verify, status, and teardown
- deploy.sh solution <id> verify (or the verify_solution agent tool): the engine’s verify scenario runs every deploy step’s read-only assertions: pinned images present in ACR, shares and links exist, shares seeded, all three apps Succeeded on the pinned image (image drift is a finding), the agent’s ingress internal, the front doors SniEnabled with correct DNS, and the lockdown rule present when the flag is on. Each finding comes with a remediation (usually the narrow scenario that fixes it). The solution_status tool is the faster snapshot variant (live app states, image drift, effective URLs, registry cross-check); deploy_diagnostics surfaces the real failed ARM operations from the Activity Log.
- deploy.sh verify <id>: the legacy bash check (apps/shares/links + advisory, lockdown-aware HTTP probes; asserts the agent FQDN carries .internal.). deploy.sh verify without an id checks the aurora app and its front door; that half stays current (platform plane).
- deploy.sh solution <id> teardown (or the teardown_solution agent tool): destructive teardown in dependency order: the three apps, the seed job, the env-storage links (including a legacy cust-ro-<h>), both shares last, then the Cloudflare CNAMEs, asuid TXTs, and the solution’s Transform Rule, plus any legacy <id>-agent.<zone> records. Deleting the shares destroys the agent’s working clones (GitHub remains the source of truth, but unpushed work is lost), so the CLI requires retyping the solutionId (unless --yes) and the agent tool requires the typed-back confirmId. deploy.sh delete <id> is the legacy bash equivalent.
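The teardown order and the confirmation gate can be sketched as data plus a small check. `TEARDOWN_ORDER`, `TeardownConfirmation`, and `confirmTeardown` are hypothetical names; the sketch only illustrates the "retype the id unless --yes" rule described above.

```typescript
// Hypothetical sketch of the teardown order and confirmation gate.
const TEARDOWN_ORDER = [
  "apps",               // reader, agent, Mission Control
  "seed job",
  "env-storage links",  // including a legacy cust-ro-<h> if present
  "shares",             // destroys the working clones, so storage goes last
  "Cloudflare records", // CNAMEs and asuid TXTs
  "Transform Rule",
] as const;

interface TeardownConfirmation {
  yes?: boolean;    // the CLI's --yes flag
  typedId?: string; // what the operator (or the agent tool's confirmId) typed back
}

function confirmTeardown(solutionId: string, confirmation: TeardownConfirmation): boolean {
  if (confirmation.yes) return true;
  // Exact match only: a near-miss must not delete a stack.
  return confirmation.typedId === solutionId;
}

console.log(confirmTeardown("acme-billing", { typedId: "acme-billing" })); // true
console.log(confirmTeardown("acme-billing", { typedId: "acme-biling" }));  // false
console.log(TEARDOWN_ORDER.join(" -> "));
```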