Orbit Agent

The hand: a headless, HTTP-accessible multi-agent service (implementation/maxq-orbit-agent/) that authors a Trajectory solution. One container manages one solution; a team of Claude-powered agents works the repo under the orchestration of a Team Lead, and the container — not the agents — owns the entire git lifecycle.

The agent is also the sole owner of the solution repository mounts and the platform’s single writer. Since the 2026-07-02 cutover, the read-only Orbit Webapp has no repo mount at all: the agent loads the whole solution tree into memory and serves it over HTTP, and the webapp consumes that API. Every change to the solution flows through this one process, which is what makes the in-memory model, the event stream, and the branch lifecycle safe without any cross-process locking.

Single-writer contract

One container, one solution. Horizontal scaling means more containers, each with its own repos. There is no shared state across instances; on Azure the agent is pinned to maxReplicas: 1.
Git is infrastructure, not a tool. Agents never see credentials and never call git. The orchestrator commits after every task and merges when a request completes.
Non-blocking HTTP. POST /requests/tasks returns 202 with a request id immediately; work happens in the background. Clients follow progress over SSE and read results back through the API.
Strictly serial execution. Exactly one worker task runs at any moment, across all requests — see TaskProcessor below.
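
From a client's point of view, the non-blocking contract can be sketched as below. This is illustrative only: `submitTaskRequest` and the `FetchLike` type are hypothetical names, and the only details taken from this page are the route, the `{ prompt, metadata? }` body, and the `202` + `{ requestId }` response.

```typescript
// Minimal fetch-shaped dependency so the sketch can run without a server.
// (Hypothetical type; a real client would pass the global `fetch`.)
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ status: number; json(): Promise<unknown> }>;

// Submit a task request and return its id. The agent answers 202 immediately;
// the work itself happens in the background.
async function submitTaskRequest(
  fetchFn: FetchLike,
  baseUrl: string,
  prompt: string,
  metadata?: Record<string, unknown>,
): Promise<string> {
  const res = await fetchFn(`${baseUrl}/requests/tasks`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(metadata ? { prompt, metadata } : { prompt }),
  });
  if (res.status !== 202) {
    throw new Error(`expected 202 Accepted, got ${res.status}`);
  }
  const body = (await res.json()) as { requestId?: unknown };
  if (typeof body.requestId !== "string") {
    throw new Error("response did not contain a requestId");
  }
  return body.requestId;
}
```

With the id in hand, a client follows progress on `GET /events/subscribe` and reads the result from `GET /requests/:requestId/summary` once the request is finalised.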

Tech stack

Verified in codebase/package.json:

Concern	Choice
HTTP server	Hono 4 on `@hono/node-server` (Node ≥ 20)
AI runtime	`@anthropic-ai/claude-agent-sdk` — every agent run is an SDK session
Solution loading	`@maxq/trajectory-loader` (shared `file:` package in `implementation/shared/trajectory-loader/`)
Validation / config	`zod` (env schema in `src/config.ts`)
Logging	`pino`
YAML	`yaml`
Ids	`nanoid`

Default model is claude-opus-4-8 (DEFAULT_MODEL overridable per deployment; the registry also supports a per-role model override). Either ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN must be set — config refuses to boot otherwise.
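
The boot-time rule above can be sketched as follows. The real schema lives in `src/config.ts` and uses `zod`; this dependency-free stand-in only illustrates the either-or credential check and the model default. `loadConfig`, `AgentConfig`, and the precedence between the two credentials are assumptions, not the actual implementation.

```typescript
// Hypothetical, dependency-free stand-in for the zod env schema in src/config.ts.
interface AgentConfig {
  defaultModel: string;
  auth: { kind: "api-key" | "oauth-token"; value: string };
}

function loadConfig(env: Record<string, string | undefined>): AgentConfig {
  const apiKey = env["ANTHROPIC_API_KEY"]?.trim();
  const oauthToken = env["CLAUDE_CODE_OAUTH_TOKEN"]?.trim();

  // Refuse to boot when neither credential is present.
  if (!apiKey && !oauthToken) {
    throw new Error("Set ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN");
  }

  return {
    // DEFAULT_MODEL overrides the built-in default per deployment.
    defaultModel: env["DEFAULT_MODEL"]?.trim() || "claude-opus-4-8",
    // Assumption: the API key wins when both are set; this page does not say.
    auth: apiKey
      ? { kind: "api-key", value: apiKey }
      : { kind: "oauth-token", value: oauthToken as string },
  };
}
```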

The three mounts

The agent works across the platform’s three-repo split. All three roots are env-driven (src/config.ts), and since 2026-07-02 all three are mounted into the agent only:

Mount	Env var	Default	Access	Contents
`/repo`	`SOLUTION_REPOSITORY_PATH`	`/repo`	read-write	Customer repo: `workspace/` (with `workspace/solution-definition/` as the solution root), `repositories/`, and the vendored, version-pinned `trajectory-methodology/` copy
`/repo-internal`	`SOLUTION_INTERNAL_REPOSITORY_PATH`	`<customer>-internal` (derived by naming convention)	read-write	Internal repo: `agents/` (personas), `catalog/` (registries), `templates/`, and the `.orbit/` audit trail
`/methodology`	`TRAJECTORY_METHODOLOGY_PATH`	`/methodology`	read-only	Global methodology release registry — consulted only for the methodology upgrade operation; normal authoring resolves the vendored copy inside the customer repo

Derived paths worth knowing: the solution root the in-memory model walks is <customer>/workspace/solution-definition; the source-repositories root served by the code explorer defaults to <customer>/repositories (override: ORBIT_REPOSITORIES_PATH); persona folders resolve to <internal>/agents/<id>/; the tracked audit trail is <internal>/.orbit/.
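
As a rough illustration of how those derived paths hang off the three roots, here is a small sketch. `derivePaths` and `DerivedPaths` are hypothetical names; the env vars, defaults, and folder layout are the ones listed above, and plain string joins are used instead of `node:path` to keep the example self-contained.

```typescript
interface DerivedPaths {
  customerRoot: string;
  internalRoot: string;
  methodologyRoot: string;
  solutionRoot: string;
  repositoriesRoot: string;
  auditRoot: string;
  personaDir(agentId: string): string;
}

function derivePaths(env: Record<string, string | undefined>): DerivedPaths {
  // Drop trailing slashes so joins below never produce "//".
  const strip = (p: string): string => {
    const s = p.replace(/\/+$/, "");
    return s === "" ? "/" : s;
  };

  const customerRoot = strip(env["SOLUTION_REPOSITORY_PATH"] || "/repo");
  // Default follows the "<customer>-internal" naming convention.
  const internalRoot = strip(
    env["SOLUTION_INTERNAL_REPOSITORY_PATH"] || `${customerRoot}-internal`,
  );
  const methodologyRoot = strip(env["TRAJECTORY_METHODOLOGY_PATH"] || "/methodology");

  return {
    customerRoot,
    internalRoot,
    methodologyRoot,
    // Root that the in-memory solution model walks.
    solutionRoot: `${customerRoot}/workspace/solution-definition`,
    // Code-explorer root, overridable with ORBIT_REPOSITORIES_PATH.
    repositoriesRoot: strip(env["ORBIT_REPOSITORIES_PATH"] || `${customerRoot}/repositories`),
    // Tracked audit trail.
    auditRoot: `${internalRoot}/.orbit`,
    personaDir: (agentId: string) => `${internalRoot}/agents/${agentId}`,
  };
}
```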

Each repo has its own git identity: GIT_REMOTE_URL / GIT_TOKEN for the customer repo and INTERNAL_GIT_REMOTE_URL / INTERNAL_GIT_TOKEN for the internal one (the latter fall back to the customer values for single-credential dev setups). The methodology repo is read-only — no git subsystem at all.

Request lifecycle: plan, enqueue, drain, merge

A task request flows through a strict division of responsibility:

TeamLead (src/orchestrator/TeamLead.ts) only plans and enqueues. acceptTaskRequest runs one planning turn against the team-lead persona using the SDK’s structured output (a JSON-Schema plan: summary + a tasks array of assignee/title/brief), persists the planSummary, creates all the plan’s tasks as pending in one batched index write (TaskList.createTasks), then calls processor.kick(). Malformed structured output gets exactly one corrective retry; an empty-but-valid plan raises NoRouteError (“cannot route request”) rather than silently accepting work that will never run.
TaskProcessor (src/orchestrator/TaskProcessor.ts) owns execution — the drain loop, the git branch lifecycle, request finalisation, and the solution-model refresh hooks.
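
The planning contract (structured plan, one corrective retry, `NoRouteError` on an empty plan) can be sketched as below. This is a simplified stand-in, not the TeamLead code: `parsePlan`, `planRequest`, and the synchronous `runTurn` callback are hypothetical (the real planning turn is an asynchronous SDK call), and treating an unknown assignee as malformed output is an assumption.

```typescript
interface PlannedTask { assignee: string; title: string; brief: string }
interface Plan { summary: string; tasks: PlannedTask[] }
type ParseResult = { ok: true; plan: Plan } | { ok: false; reason: string };

class NoRouteError extends Error {
  constructor() { super("cannot route request"); }
}

// Shape-check the structured output (a stand-in for JSON-Schema validation).
function parsePlan(raw: unknown, workerIds: readonly string[]): ParseResult {
  const bad = (reason: string): ParseResult => ({ ok: false, reason });
  const isText = (v: unknown): v is string => typeof v === "string" && v.trim() !== "";

  if (typeof raw !== "object" || raw === null) return bad("plan is not an object");
  const { summary, tasks } = raw as { summary?: unknown; tasks?: unknown };
  if (!isText(summary)) return bad("summary must be a non-empty string");
  if (!Array.isArray(tasks)) return bad("tasks must be an array");

  const parsed: PlannedTask[] = [];
  for (let i = 0; i < tasks.length; i++) {
    const { assignee, title, brief } = (tasks[i] ?? {}) as Record<string, unknown>;
    if (!isText(assignee) || !isText(title) || !isText(brief)) {
      return bad(`task ${i} needs assignee, title and brief`);
    }
    // Assumption: only registry workers are valid assignees.
    if (!workerIds.includes(assignee)) return bad(`task ${i} names unknown worker "${assignee}"`);
    parsed.push({ assignee, title, brief });
  }
  return { ok: true, plan: { summary, tasks: parsed } };
}

// One planning turn plus at most one corrective retry.
function planRequest(
  runTurn: (correction?: string) => unknown,
  workerIds: readonly string[],
): Plan {
  let result = parsePlan(runTurn(), workerIds);
  if (!result.ok) {
    result = parsePlan(runTurn(`Previous plan was rejected: ${result.reason}`), workerIds);
  }
  if (!result.ok) throw new Error(`planning failed: ${result.reason}`);
  // An empty-but-valid plan is rejected rather than accepted as a no-op.
  if (result.plan.tasks.length === 0) throw new NoRouteError();
  return result.plan;
}
```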

There are exactly two LLM boundaries in the pipeline besides the workers themselves, both via structured output: planning (above) and aggregation — on the success path the processor calls back into the TeamLead (aggregateRequest, wired via processor.setAggregator(lead) to break the class cycle) to enrich the request summary with a narrative, highlights, open questions, and per-file rationale. The deterministic backbone (git change manifest, per-task rollup, metrics) is always written even when enrichment fails; the failure path writes a purely deterministic summary with no LLM call. The result lands at .orbit/requests/<id>/summary.yaml and is served by GET /requests/:requestId/summary (404 until finalised).

The agent roster

Which agents exist is decided by a hardcoded TypeScript array, not by what persona folders sit on disk: AGENT_DEFS in src/agents/registry.ts. The current roster:

Id	Kind	Role	Stage digest
`team-lead`	support	Team Lead (the orchestrator: plans, chats, aggregates)	—
`solution-manager`	worker	Solution Manager	—
`as-is-analyst`	worker	As-Is Analyst	`as-is`
`assessor`	worker	Assessor	—

Key facts about the model:

The registry is the routing source of truth. Registry.workerIds() returns only kind: "worker" entries, and that list is what the planner is allowed to assign tasks to. kind: "support" is inert for routing — team-lead is the only support agent any code path invokes.
Personas live on disk, decoupled from the registry. Each registered id maps to <internal>/agents/<id>/CLAUDE.md + memory.md, loaded at runtime and prepended as the system prompt. Memory is writable, so roles accumulate knowledge across runs.
The two layers can drift silently. A fully-authored persona folder that is missing from AGENT_DEFS is unroutable; a registered id with no folder runs with an empty system prompt (no crash). A boot-time drift guard (verifyAgentPersonas, called from index.ts) logs both traps loudly — an error for registered-but-no-CLAUDE.md, a warning for folder-but-unregistered. It is advisory only and never blocks startup.

Adding an agent means updating all three layers: the AGENT_DEFS entry (registry), the on-disk persona folder (and its upstream in solution-template-internal), and the advisory team-lead docs. Authoring a persona folder alone does nothing.
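
The relationship between the registry and the persona folders can be sketched as follows. The roster mirrors the table above; `checkPersonaDrift` is a hypothetical, pure stand-in for `verifyAgentPersonas` that returns its findings instead of logging them, and the folder map is assumed to come from a directory listing of `<internal>/agents/`.

```typescript
type AgentKind = "worker" | "support";
interface AgentDef { id: string; kind: AgentKind; role: string; stage?: string }

// Roster as listed on this page.
const AGENT_DEFS: readonly AgentDef[] = [
  { id: "team-lead", kind: "support", role: "Team Lead" },
  { id: "solution-manager", kind: "worker", role: "Solution Manager" },
  { id: "as-is-analyst", kind: "worker", role: "As-Is Analyst", stage: "as-is" },
  { id: "assessor", kind: "worker", role: "Assessor" },
];

// Only workers are routable; support agents never receive planned tasks.
function workerIds(defs: readonly AgentDef[]): string[] {
  return defs.filter((d) => d.kind === "worker").map((d) => d.id);
}

interface DriftReport { errors: string[]; warnings: string[] }

// `personaFolders` maps each folder name under <internal>/agents/ to whether
// it contains a CLAUDE.md. Advisory only: the caller logs and keeps booting.
function checkPersonaDrift(
  defs: readonly AgentDef[],
  personaFolders: ReadonlyMap<string, boolean>,
): DriftReport {
  const registered = new Set(defs.map((d) => d.id));
  return {
    // Registered id without a CLAUDE.md: it would run with an empty system prompt.
    errors: defs
      .filter((d) => personaFolders.get(d.id) !== true)
      .map((d) => `${d.id}: registered but no CLAUDE.md`),
    // Folder without a registry entry: the persona is unroutable.
    warnings: Array.from(personaFolders.keys())
      .filter((id) => !registered.has(id))
      .map((id) => `${id}: persona folder is not registered`),
  };
}
```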

The TaskProcessor

src/orchestrator/TaskProcessor.ts is the single, process-wide consumer of the task queue. Planning is concurrent (each request’s team-lead turn runs independently), but worker execution is strictly serial across all requests: one drain loop, guarded by an in-memory running flag, pulls the oldest pending task, runs it to completion, and only then pulls the next. That serial invariant is what makes the shared working tree, the per-request branch lifecycle, and the single in-memory solution model safe.

Design points (all load-bearing):

Kick-on-enqueue, no polling timer. Producers — request enqueue, task continue, startup recovery — call kick(). There is deliberately no interval loop.
The double-check in kick() closes a race. After the drain loop finishes and clears running, the processor re-checks taskList.hasPending() and re-kicks. Without it, a task enqueued in the window between the loop’s final empty-check and the flag clearing would be stranded pending forever.
runTask never throws. A failure is recorded on the task and published as task.failed; the drain loop moves on. If a task fails, the request’s still-pending sibling tasks are cancelled so the request finalises immediately.
Resume is detected, not a separate code path. A task only carries an sdkSessionId after a prior run, so its presence means “resume”: the SDK reloads the prior conversation and the worker continues where it left off. POST /requests/tasks/:taskId/continue flips a failed/cancelled task back to pending and kicks; the processor does the rest.
Workers get a scoped access set. Each run’s additional directories span the repos: read-write on the customer workspace/ and the task folder; read-only on the vendored methodology, catalog/, and templates/. The prompt hands workers absolute named roots (WORKSPACE_ROOT, SOLUTION_METHODOLOGY_ROOT, CATALOG_ROOT, TEMPLATES_ROOT).
Scale-to-zero keep-alive. Background work runs detached from any HTTP request, so on Azure Container Apps the HTTP scaler would otherwise evict the replica mid-task. While work is in flight, the drain loop (plus planning and recovery) holds a keep-alive (src/util/keepAlive.ts) that long-polls the agent’s own ingress with GET /keepalive?hold=45 against AGENT_SELF_URL. On Azure that is the in-env address http://agent-<h>; the request still passes through the internal ingress, so ACA’s HTTP scale rule sees the app as busy. The keep-alive is dormant in local dev, where AGENT_SELF_URL is unset.
Startup recovery. Because execution is serial, a crash leaves at most one task in_progress. recover() commits its partial work to the request branch, marks it failed (“interrupted by restart”), finalises the request (branch kept), returns the tree to base, reloads the solution model, and drains anything still queued.
In-process, not cross-process. The running flag is plain memory; TaskList serialises every tasklist.json read-modify-write with an in-process lock. This is correct precisely because the agent is a single Node process — running two agent processes against the same .orbit/ would need a file lock that deliberately doesn’t exist.
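
The kick-on-enqueue loop, the double-check in `kick()`, and the never-throwing task runner can be sketched together as below. This is a minimal model under stated assumptions, not the real TaskProcessor: `SerialProcessorSketch` and `QueuedTask` are hypothetical, the queue is a plain array rather than `tasklist.json`, and request-level behaviour (branch lifecycle, cancelling sibling tasks on failure, keep-alive) is left out.

```typescript
interface QueuedTask {
  id: string;
  run: () => Promise<void>;
}

class SerialProcessorSketch {
  private running = false;
  private readonly pending: QueuedTask[] = [];
  readonly outcomes: string[] = [];

  // Producers enqueue and kick; there is no polling timer.
  enqueue(task: QueuedTask): void {
    this.pending.push(task);
    this.kick();
  }

  kick(): void {
    if (this.running) return; // a drain loop is already active
    this.running = true;
    void this.drain().then(() => {
      this.running = false;
      // Double-check: a task may have been enqueued after the loop's final
      // empty-check but before the flag was cleared. Re-kick so it is not stranded.
      if (this.pending.length > 0) this.kick();
    });
  }

  // Oldest pending task first, one at a time, until the queue is empty.
  private async drain(): Promise<void> {
    let task = this.pending.shift();
    while (task !== undefined) {
      await this.runTask(task);
      task = this.pending.shift();
    }
  }

  // Never throws: a failure is recorded and the loop moves on.
  private async runTask(task: QueuedTask): Promise<void> {
    try {
      await task.run();
      this.outcomes.push(`${task.id}:completed`);
    } catch {
      this.outcomes.push(`${task.id}:failed`);
    }
  }
}
```

The `running` flag here is plain memory, which is only sound in a single process; that matches the in-process design point above.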

Per-request git branches

The processor owns two independent git lifecycles, one per writable repo (src/git/GitSubsystem.ts, two instances wired in index.ts):

Customer repo — branch per request. On a request’s first task the processor lazily creates a branch named after the request id from the base branch (checking out base and pulling first). Every task ends with a commit of workspace/ — message <taskId>: <title>, or <taskId> (failed): <title> on failure, so the tree is always clean before the next branch switch and partial work stays inspectable. Task commit SHAs are persisted onto the task records and branches are pushed for visibility (skipped in local-only mode). When all tasks of a request complete, the branch is merged into base, pushed, and deleted; on any task failure, cancellation, or merge conflict the request is marked failed and the branch is kept unmerged for inspection or /continue, with the working tree returned to base.

The base branch is captured on first clean boot and persisted at .orbit/base-branch — so a restart that lands mid-request (tree still on a request branch) doesn’t mistake that branch for base.

Internal repo — linear main. The .orbit/ audit trail (task store, logs, request summaries) plus any persona/catalog changes are committed to the internal repo’s main after every task and at request finalisation (commitAudit). No branches: the internal working tree never switches under the running processor, which is exactly what makes tracking the live .orbit/ safe. Audit commits are best-effort and never block task progress.
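
The customer-repo rules reduce to two small decisions: how a task commit is labelled and what happens to the request branch at the end. The sketch below captures both as pure functions. `commitMessage` and `finalise` are hypothetical names, and the `completed` status name is assumed from the `task.completed` event; no git commands are run here.

```typescript
type TaskStatus = "pending" | "in_progress" | "completed" | "failed" | "cancelled";

// Every task ends with a commit, including failed ones, so the tree is clean
// before the next branch switch and partial work stays inspectable.
function commitMessage(taskId: string, title: string, failed: boolean): string {
  return failed ? `${taskId} (failed): ${title}` : `${taskId}: ${title}`;
}

type Finalisation =
  | { outcome: "done"; branch: "merged-pushed-deleted" }
  | { outcome: "failed"; branch: "kept-unmerged" };

// Returns null while the request still has work in flight.
function finalise(
  statuses: readonly TaskStatus[],
  mergeConflict: boolean,
): Finalisation | null {
  // Any failure or cancellation fails the request; the branch is kept for
  // inspection or /continue and the working tree returns to base.
  if (statuses.some((s) => s === "failed" || s === "cancelled")) {
    return { outcome: "failed", branch: "kept-unmerged" };
  }
  if (!statuses.every((s) => s === "completed")) return null;
  // All tasks completed: merge into base unless the merge itself conflicts.
  return mergeConflict
    ? { outcome: "failed", branch: "kept-unmerged" }
    : { outcome: "done", branch: "merged-pushed-deleted" };
}
```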

The methodology digest

Every worker used to re-read ~74K tokens of Trajectory methodology per task pickup. The digest optimisation replaces that with a small generated, per-stage methodology digest (~8.2K tokens) injected into the cached system prompt:

Digests are generated deterministically from the methodology prose and JSON Schemas, and vendored with the pinned methodology copy inside the customer repo at trajectory-methodology/.generated/agent-digest/<version>/<stage>.md — so a solution’s digest always matches its pinned methodology version.
A registry entry’s optional stage field opts an agent in: Agent.run loads the matching digest (src/agents/methodologyDigest.ts; memoised, and gracefully null when none is vendored) and places it at the very start of its own system prompt and of every loaded sub-agent persona, because an identical prefix maximises prompt-cache reuse. Currently only as-is-analyst sets stage: "as-is", and only the as-is digest exists.
Measured reality: the static methodology slice shrinks ~89% (74K → 8.2K) and moves from uncached tool-result reads into the cached prefix, but the live end-to-end gain is a modest ~9–13% total cost (with orchestrator per-turn cache reads ~30% lower). Real runs are dominated by work context — code walked, artefacts read and written — which the digest doesn’t touch.
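
A minimal sketch of the digest lookup and prompt assembly described above. It assumes an injected `readFile` that returns `null` for a missing file; `digestPath`, `createDigestLoader`, and `buildSystemPrompt` are hypothetical names, and the persona-then-memory order and blank-line separator are assumptions. Only the vendored path layout and the digest-first rule come from this page.

```typescript
// Returns the file contents, or null when the file does not exist.
type ReadFile = (path: string) => string | null;

function digestPath(customerRoot: string, version: string, stage: string): string {
  return `${customerRoot}/trajectory-methodology/.generated/agent-digest/${version}/${stage}.md`;
}

// Memoised per stage, including misses, so each digest file is read at most once.
function createDigestLoader(
  customerRoot: string,
  version: string,
  readFile: ReadFile,
): (stage?: string) => string | null {
  const cache = new Map<string, string | null>();
  return (stage?: string): string | null => {
    if (stage === undefined) return null; // agent has not opted in
    if (!cache.has(stage)) {
      cache.set(stage, readFile(digestPath(customerRoot, version, stage)));
    }
    return cache.get(stage) ?? null;
  };
}

// Digest first, so agents sharing a stage also share an identical prompt prefix.
function buildSystemPrompt(digest: string | null, persona: string, memory: string): string {
  return [digest, persona, memory]
    .filter((part): part is string => typeof part === "string" && part !== "")
    .join("\n\n");
}
```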

Serving the solution

The agent doesn’t just write the solution — it serves it. src/solution/SolutionModel.ts holds the whole raw solution tree in memory (loaded with @maxq/trajectory-loader from workspace/solution-definition/) and exposes it over HTTP. Cache identity is the (instance, version) pair: version is a per-process monotonic counter and instance a per-boot UUID, so a restart can never alias a stale cache. GET /solution/tree is pre-serialised once per version, gzipped, and ETagged with "<instance>:<version>" for If-None-Match → 304 round trips.
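
The ETag round trip can be sketched as below, assuming a snapshot that has already been serialised once for its version. `TreeSnapshot`, `etagFor`, and `serveTree` are hypothetical names, and gzip is omitted; the `"<instance>:<version>"` format is the one described above. The comparison strips a `W/` prefix because `If-None-Match` uses weak comparison in HTTP.

```typescript
interface TreeSnapshot { instance: string; version: number; body: string }
interface TreeResponse {
  status: 200 | 304;
  headers: Record<string, string>;
  body: string | null;
}

// Instance-qualified validator: a restart changes `instance`, so a client
// cache from a previous boot can never match by accident.
function etagFor(snapshot: TreeSnapshot): string {
  return `"${snapshot.instance}:${snapshot.version}"`;
}

function serveTree(snapshot: TreeSnapshot, ifNoneMatch?: string): TreeResponse {
  const etag = etagFor(snapshot);
  // If-None-Match may carry several validators; compare weakly (ignore "W/").
  const candidates = (ifNoneMatch ?? "")
    .split(",")
    .map((v) => v.trim().replace(/^W\//, ""));
  if (candidates.includes(etag) || candidates.includes("*")) {
    return { status: 304, headers: { ETag: etag }, body: null };
  }
  return {
    status: 200,
    headers: { ETag: etag, "Content-Type": "application/json" },
    body: snapshot.body,
  };
}
```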

Reloads are event-driven, serialised, and coalescing; there is deliberately no filesystem watcher. The TaskProcessor fires fire-and-forget refreshes at every settled moment: after each per-task commit (task-commit, so readers see live mid-request progress), after a successful merge (request-merged, carrying the changed-file list), on the failure path (request-failed, the tree flips back to base), and unconditionally after recovery’s base checkout (recovery). Out-of-band edits to the repo therefore don’t appear until a manual POST /solution/reload. Every reload emits a solution.updated SSE event with { version, reason, requestId?, changedFiles? }, which is what drives the webapp’s live refresh. The full pipeline is covered on the data flow page.
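
One way to get serialised, coalescing reloads is sketched below. It is an interpretation, not the SolutionModel code: `ReloadCoordinatorSketch` is hypothetical, and "coalescing" is assumed to mean that triggers arriving during a running reload collapse into a single follow-up reload carrying the latest trigger's details. The event payload shape is the one listed above.

```typescript
interface SolutionUpdated {
  version: number;
  reason: string;
  requestId?: string;
  changedFiles?: string[];
}
type ReloadTrigger = Omit<SolutionUpdated, "version">;

class ReloadCoordinatorSketch {
  private version = 0; // per-process monotonic counter
  private inFlight = false;
  private queued: ReloadTrigger | null = null;
  private readonly load: () => Promise<void>;
  private readonly publish: (event: SolutionUpdated) => void;

  constructor(load: () => Promise<void>, publish: (event: SolutionUpdated) => void) {
    this.load = load;
    this.publish = publish;
  }

  // Fire-and-forget: callers never await the reload.
  requestReload(trigger: ReloadTrigger): void {
    if (this.inFlight) {
      // Coalesce: remember only the latest trigger seen during the running reload.
      this.queued = trigger;
      return;
    }
    this.inFlight = true;
    void this.run(trigger);
  }

  private takeQueued(): ReloadTrigger | null {
    const next = this.queued;
    this.queued = null;
    return next;
  }

  // Serialised: at most one load runs at a time; a queued trigger runs afterwards.
  private async run(first: ReloadTrigger): Promise<void> {
    let current: ReloadTrigger | null = first;
    while (current !== null) {
      try {
        await this.load();
        this.version += 1;
        this.publish({ version: this.version, ...current });
      } catch {
        // A failed load keeps the previous version; nothing is published.
      }
      current = this.takeQueued();
    }
    this.inFlight = false;
  }
}
```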

The agent also hosts the code explorer’s data (/sources/*, over the customer repo’s repositories/ folder with a size cap, extension allowlist, and path-escape guards); the webapp’s /api/sources/* routes are thin proxies to these.
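
The three guards on `/sources/file` (path escape, extension allowlist, size cap) can be illustrated with a pure function. Everything specific here is an assumption: the function name, the allowlist contents, the byte limit, and the status codes are placeholders, and the path is resolved with string handling only, so symlinks are not considered.

```typescript
// Hypothetical values; the real allowlist and cap are not listed on this page.
const ALLOWED_EXTENSIONS = new Set([".ts", ".md", ".json", ".yaml"]);
const MAX_FILE_BYTES = 512 * 1024;

type Guarded =
  | { ok: true; path: string }
  | { ok: false; status: 400 | 403 | 413 | 415; reason: string };

function guardSourcePath(root: string, requested: string, sizeBytes: number): Guarded {
  // Resolve "." and ".." segment by segment, without touching the filesystem.
  const segments: string[] = [];
  for (const part of requested.replace(/\\/g, "/").split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      // Climbing above the root is an escape attempt.
      if (segments.length === 0) {
        return { ok: false, status: 403, reason: "path escapes the repositories root" };
      }
      segments.pop();
    } else {
      segments.push(part);
    }
  }
  if (segments.length === 0) return { ok: false, status: 400, reason: "no file requested" };

  const name = segments[segments.length - 1] ?? "";
  const dot = name.lastIndexOf(".");
  const extension = dot > 0 ? name.slice(dot).toLowerCase() : "";
  if (!ALLOWED_EXTENSIONS.has(extension)) {
    return { ok: false, status: 415, reason: `extension "${extension}" is not allowed` };
  }
  if (sizeBytes > MAX_FILE_BYTES) {
    return { ok: false, status: 413, reason: "file exceeds the size cap" };
  }
  return { ok: true, path: `${root.replace(/\/+$/, "")}/${segments.join("/")}` };
}
```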

HTTP API

All routes verified in src/routes/ and src/index.ts. Everything except /health sits behind the optional edgeGuard middleware — a Cloudflare front-door lockdown that is dormant unless EDGE_SHARED_SECRET is set. Since 2026-07-03 the deployed agent doesn’t need it: its ACA ingress is internal (external: false), so requests from outside the environment are rejected by the environment proxy before they reach the app, and the edge lockdown (edge.lock_solutions) applies only to the two public apps. The old reader→agent bypass header (x-orbit-bypass / EDGE_BYPASS_SECRET) is obsolete.

Method	Path	Purpose
GET	`/health`	Liveness probe (always exempt from the edge guard).
GET	`/keepalive?hold=45`	Long-poll target for the agent’s own scale-to-zero guard (max hold 60s).
POST	`/requests/tasks`	Submit a task request (`{ prompt, metadata? }`). Returns `202` + `{ requestId }`.
GET	`/requests/tasks`	The task index (`tasklist.json`).
POST	`/requests/tasks/:taskId/continue`	Resume a failed/cancelled task via its stored SDK session (`409` if none).
GET	`/requests/tasks/:taskId/log`	SSE tail of the task’s `log.jsonl` (one `log.line` event per SDK message; 500ms size-poll).
GET	`/requests/tasks/:taskId/details`	Static task folder contents: `task.json` metadata, `task.md` brief, `result/plan.md`, `result/summary.md`.
GET	`/requests/:requestId/summary`	The finalised request summary (`summary.yaml` as JSON); `404` while in flight.
POST	`/requests/chats`	Stream a chat turn with the Team Lead over SSE (`chat.session` / `chat.delta` / `chat.done` / `chat.error`).
GET	`/requests/chats`	Chat session index, most recent first.
GET	`/requests/chats/:sessionId`	One chat session + its messages.
PATCH	`/requests/chats/:sessionId`	Rename a chat (`{ title }`).
DELETE	`/requests/chats/:sessionId`	Delete a chat session.
GET	`/solution`	Solution identity summary (id, shortName, name, customer) — consumed by Mission Control.
GET	`/solution/version`	`{ version, instance, loadedAt, headSha }`; `503 { loading: true }` before first load.
GET	`/solution/tree`	The whole raw solution tree; gzip + instance-qualified ETag with `304` support.
POST	`/solution/reload`	Manual reload escape hatch (e.g. after out-of-band edits). Async `202`.
GET	`/sources/tree`	Source-repositories tree for the code explorer.
GET	`/sources/file?path=`	One source file, with path-escape and size guards.
GET	`/events/subscribe`	SSE stream of all events; `Last-Event-ID` (or `?since=`) replays from the ring buffer.

Events published on the bus (src/events/types.ts): request.accepted, request.rejected, request.planning, request.in_progress, request.done, request.failed, task.created, task.started, task.progress, task.completed, task.failed, task.summary, chat.delta, chat.done, agents.memory.updated, and solution.updated. The in-memory EventBus keeps a ring buffer of recent events for SSE reconnect replay.
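
The replay behaviour behind `Last-Event-ID` can be sketched with a fixed-size ring buffer keyed by a sequential event id. This is an assumption-laden model: `RingBufferBus` is a hypothetical name, numeric sequential ids and the default capacity are invented for the example, and live fan-out to SSE subscribers is left out.

```typescript
interface BusEvent { id: number; type: string; data: unknown }

class RingBufferBus {
  private readonly slots: (BusEvent | undefined)[];
  private readonly capacity: number;
  private lastId = 0;

  constructor(capacity = 1000) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.slots = new Array<BusEvent | undefined>(capacity).fill(undefined);
  }

  // Ids are sequential, so the slot index is derived from the id and the
  // newest event overwrites the oldest once the buffer is full.
  publish(type: string, data: unknown): BusEvent {
    const event: BusEvent = { id: ++this.lastId, type, data };
    this.slots[(event.id - 1) % this.capacity] = event;
    return event;
  }

  // Events newer than `lastEventId` that are still in the buffer, oldest first.
  // Anything older than the buffer window is gone and cannot be replayed.
  replaySince(lastEventId: number): BusEvent[] {
    const oldestKept = Math.max(1, this.lastId - this.capacity + 1);
    const replay: BusEvent[] = [];
    for (let id = Math.max(oldestKept, lastEventId + 1); id <= this.lastId; id++) {
      const event = this.slots[(id - 1) % this.capacity];
      if (event !== undefined) replay.push(event);
    }
    return replay;
  }
}
```

A reconnecting client that sends `Last-Event-ID: 4` would receive what `replaySince(4)` returns and then continue with live events.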

Boot sequence

src/index.ts, in order: ensure both git clones (customer + internal) → idempotent repo scaffold (bootstrap.ts, copies missing persona/template seeds without ever overwriting) → wire EventBus + Registry → run the persona drift guard → eager-but-async solution model load (/health answers immediately; /solution/* serves 503 until the first load lands) → construct the TaskProcessor (resolving the persisted base branch) and the TeamLead, wiring the aggregator back-reference → fire-and-forget crash recovery → mount middleware (HTTP logging, edge guard, CORS) and the seven route groups.

The service expects to run behind a trusted boundary: there is no user auth on the HTTP surface itself. On Azure that boundary is the ACA environment — since 2026-07-03 the agent has internal-only ingress (no public FQDN, no Cloudflare subdomain), and its only clients are the reader webapp’s server side and Mission Control (whose browser traffic arrives via Mission Control’s same-origin /agent proxy).