Orbit Agent
The hand: a headless, HTTP-accessible multi-agent service (implementation/maxq-orbit-agent/)
that authors a Trajectory solution. One container manages one solution; a team of
Claude-powered agents works the repo under the orchestration of a Team Lead, and the
container — not the agents — owns the entire git lifecycle.
The agent is also the sole owner of the solution repository mounts and the platform’s single writer. Since the 2026-07-02 cutover, the read-only Orbit Webapp has no repo mount at all: the agent loads the whole solution tree into memory and serves it over HTTP, and the webapp consumes that API. Every change to the solution flows through this one process, which is what makes the in-memory model, the event stream, and the branch lifecycle safe without any cross-process locking.
Single-writer contract
- One container, one solution. Horizontal scaling means more containers, each with its
own repos. There is no shared state across instances; on Azure the agent is pinned to
maxReplicas: 1.
- Git is infrastructure, not a tool. Agents never see credentials and never call git. The orchestrator commits after every task and merges when a request completes.
- Non-blocking HTTP. POST /requests/tasks returns 202 with a request id immediately; work happens in the background. Clients follow progress over SSE and read results back through the API.
- Strictly serial execution. Exactly one worker task runs at any moment, across all requests; see TaskProcessor below.
Tech stack
Verified in codebase/package.json:
| Concern | Choice |
|---|---|
| HTTP server | Hono 4 on @hono/node-server (Node ≥ 20) |
| AI runtime | @anthropic-ai/claude-agent-sdk — every agent run is an SDK session |
| Solution loading | @maxq/trajectory-loader (shared file: package in implementation/shared/trajectory-loader/) |
| Validation / config | zod (env schema in src/config.ts) |
| Logging | pino |
| YAML | yaml |
| Ids | nanoid |
Default model is claude-opus-4-8 (DEFAULT_MODEL overridable per deployment; the registry
also supports a per-role model override). Either ANTHROPIC_API_KEY or
CLAUDE_CODE_OAUTH_TOKEN must be set — config refuses to boot otherwise.
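The credential rule can be illustrated with a small dependency-free sketch. This is not the zod schema from src/config.ts: the function name loadAuthConfig, the returned shape, and the choice to prefer the API key when both variables are set are all assumptions made for illustration.

```typescript
// Hypothetical stand-in for the credential part of the env schema in src/config.ts.
// The real service validates with zod; plain checks keep this sketch dependency-free.
type Env = Record<string, string | undefined>;

interface AuthConfig {
  model: string;
  credential: { kind: "api-key" | "oauth-token"; value: string };
}

function loadAuthConfig(env: Env): AuthConfig {
  const apiKey = env.ANTHROPIC_API_KEY?.trim();
  const oauthToken = env.CLAUDE_CODE_OAUTH_TOKEN?.trim();

  // Refuse to boot when neither credential is present.
  if (!apiKey && !oauthToken) {
    throw new Error("Set ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN");
  }

  return {
    // DEFAULT_MODEL overrides the documented default per deployment.
    model: env.DEFAULT_MODEL?.trim() || "claude-opus-4-8",
    // Assumption: the API key wins when both credentials are set.
    credential: apiKey
      ? { kind: "api-key", value: apiKey }
      : { kind: "oauth-token", value: oauthToken as string },
  };
}
```

Failing at load time keeps a misconfigured container from accepting requests it could never execute.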
The three mounts
The agent works across the platform’s three-repo split. All three roots are env-driven
(src/config.ts), and since 2026-07-02 all three mount into the agent only:
| Mount | Env var | Default | Access | Contents |
|---|---|---|---|---|
| /repo | SOLUTION_REPOSITORY_PATH | /repo | read-write | Customer repo: workspace/ (with workspace/solution-definition/ as the solution root), repositories/, and the vendored, version-pinned trajectory-methodology/ copy |
| /repo-internal | SOLUTION_INTERNAL_REPOSITORY_PATH | `<customer>-internal` (derived by naming convention) | read-write | Internal repo: agents/ (personas), catalog/ (registries), templates/, and the .orbit/ audit trail |
| /methodology | TRAJECTORY_METHODOLOGY_PATH | /methodology | read-only | Global methodology release registry, consulted only for the methodology upgrade operation; normal authoring resolves the vendored copy inside the customer repo |
Derived paths worth knowing: the solution root the in-memory model walks is
<customer>/workspace/solution-definition; the source-repositories root served by the code
explorer defaults to <customer>/repositories (override: ORBIT_REPOSITORIES_PATH); persona
folders resolve to <internal>/agents/<id>/; the tracked audit trail is <internal>/.orbit/.
Each repo has its own git identity: GIT_REMOTE_URL / GIT_TOKEN for the customer repo and
INTERNAL_GIT_REMOTE_URL / INTERNAL_GIT_TOKEN for the internal one (the latter fall back to
the customer values for single-credential dev setups). The methodology repo is read-only — no
git subsystem at all.
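The derived paths and the credential fallback described above can be sketched as two pure functions. The names resolvePaths and gitIdentities are hypothetical, and reading the naming convention as "append -internal to the customer root" is an assumption; the actual resolution lives in src/config.ts.

```typescript
// Sketch only: function and field names are illustrative, not the real config module.
type Env = Record<string, string | undefined>;

// Strip trailing slashes so joins never produce "//".
const trimSlash = (p: string): string => (p.length > 1 ? p.replace(/\/+$/, "") : p);

function resolvePaths(env: Env) {
  const customerRoot = trimSlash(env.SOLUTION_REPOSITORY_PATH ?? "/repo");
  // Assumption: the naming convention appends "-internal" to the customer root.
  const internalRoot = trimSlash(
    env.SOLUTION_INTERNAL_REPOSITORY_PATH ?? `${customerRoot}-internal`,
  );
  return {
    customerRoot,
    internalRoot,
    methodologyRoot: trimSlash(env.TRAJECTORY_METHODOLOGY_PATH ?? "/methodology"),
    // Root walked by the in-memory solution model.
    solutionRoot: `${customerRoot}/workspace/solution-definition`,
    // Root served by the code explorer, with its documented override.
    repositoriesRoot: trimSlash(env.ORBIT_REPOSITORIES_PATH ?? `${customerRoot}/repositories`),
    auditTrail: `${internalRoot}/.orbit`,
    personaDir: (agentId: string): string => `${internalRoot}/agents/${agentId}`,
  };
}

interface GitIdentity {
  remoteUrl?: string;
  token?: string;
}

function gitIdentities(env: Env): { customer: GitIdentity; internal: GitIdentity } {
  const customer: GitIdentity = { remoteUrl: env.GIT_REMOTE_URL, token: env.GIT_TOKEN };
  return {
    customer,
    // Internal values fall back to the customer values for single-credential dev setups.
    internal: {
      remoteUrl: env.INTERNAL_GIT_REMOTE_URL ?? customer.remoteUrl,
      token: env.INTERNAL_GIT_TOKEN ?? customer.token,
    },
  };
}
```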
Request lifecycle: plan, enqueue, drain, merge
A task request flows through a strict division of responsibility:
- TeamLead (src/orchestrator/TeamLead.ts) only plans and enqueues. acceptTaskRequest runs one planning turn against the team-lead persona using the SDK’s structured output (a JSON-Schema plan: summary + a tasks array of assignee/title/brief), persists the planSummary, creates all the plan’s tasks as pending in one batched index write (TaskList.createTasks), then calls processor.kick(). Malformed structured output gets exactly one corrective retry; an empty-but-valid plan raises NoRouteError (“cannot route request”) rather than silently accepting work that will never run.
- TaskProcessor (src/orchestrator/TaskProcessor.ts) owns execution: the drain loop, the git branch lifecycle, request finalisation, and the solution-model refresh hooks.
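The planning contract (one corrective retry, then an explicit error for an empty plan) can be sketched as follows. PlanTurn is a hypothetical stand-in for the SDK structured-output call, and planRequest and isPlan are illustrative names rather than the code in TeamLead.ts.

```typescript
// Sketch of the planning contract. PlanTurn is a hypothetical signature that
// stands in for one structured-output planning turn.
interface PlannedTask {
  assignee: string;
  title: string;
  brief: string;
}
interface Plan {
  summary: string;
  tasks: PlannedTask[];
}

class NoRouteError extends Error {
  constructor() {
    super("cannot route request");
    this.name = "NoRouteError";
  }
}

type PlanTurn = (prompt: string, correction?: string) => Promise<unknown>;

// Structural check mirroring the plan shape: summary + tasks of assignee/title/brief.
function isPlan(value: unknown): value is Plan {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { summary?: unknown; tasks?: unknown };
  return (
    typeof v.summary === "string" &&
    Array.isArray(v.tasks) &&
    v.tasks.every(
      (t) =>
        typeof t === "object" &&
        t !== null &&
        ["assignee", "title", "brief"].every(
          (k) => typeof (t as Record<string, unknown>)[k] === "string",
        ),
    )
  );
}

async function planRequest(prompt: string, turn: PlanTurn): Promise<Plan> {
  const first = await turn(prompt);
  // Exactly one corrective retry for malformed structured output.
  const output = isPlan(first)
    ? first
    : await turn(prompt, "Previous output did not match the plan schema.");
  if (!isPlan(output)) throw new Error("planner returned malformed output twice");
  // A valid plan with no tasks would never run, so reject it explicitly.
  if (output.tasks.length === 0) throw new NoRouteError();
  return output;
}
```

Only after a plan passes both checks would the tasks be written as pending and the processor kicked.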
There are exactly two LLM boundaries in the pipeline besides the workers themselves, both
via structured output: planning (above) and aggregation — on the success path the
processor calls back into the TeamLead (aggregateRequest, wired via
processor.setAggregator(lead) to break the class cycle) to enrich the request summary with a
narrative, highlights, open questions, and per-file rationale. The deterministic backbone
(git change manifest, per-task rollup, metrics) is always written even when enrichment fails;
the failure path writes a purely deterministic summary with no LLM call. The result lands at
.orbit/requests/<id>/summary.yaml and is served by GET /requests/:requestId/summary
(404 until finalised).
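A minimal sketch of that split between the deterministic backbone and the optional enrichment. The field names and the buildSummary function are assumptions; only the behaviour (backbone always written, no LLM call on the failure path) comes from the description above.

```typescript
// Illustrative shapes: the real summary.yaml schema is not reproduced here.
interface Backbone {
  requestId: string;
  changedFiles: string[];
  taskRollup: string[];
}
interface Enrichment {
  narrative: string;
  highlights: string[];
  openQuestions: string[];
  fileRationale: Record<string, string>;
}
type Summary = Backbone & { status: "done" | "failed"; enrichment?: Enrichment };

async function buildSummary(
  backbone: Backbone,
  succeeded: boolean,
  aggregate: (b: Backbone) => Promise<Enrichment>,
): Promise<Summary> {
  // Failure path: purely deterministic, the aggregator is never called.
  if (!succeeded) return { ...backbone, status: "failed" };
  try {
    return { ...backbone, status: "done", enrichment: await aggregate(backbone) };
  } catch {
    // Enrichment failed: the backbone is still written.
    return { ...backbone, status: "done" };
  }
}
```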
The agent roster
Which agents exist is decided by a hardcoded TypeScript array, not by what persona folders
sit on disk: AGENT_DEFS in src/agents/registry.ts. The current roster:
| Id | Kind | Role | Stage digest |
|---|---|---|---|
| team-lead | support | Team Lead (the orchestrator: plans, chats, aggregates) | — |
| solution-manager | worker | Solution Manager | — |
| as-is-analyst | worker | As-Is Analyst | as-is |
| assessor | worker | Assessor | — |
Key facts about the model:
- The registry is the routing source of truth. Registry.workerIds() returns only kind: "worker" entries, and that list is what the planner is allowed to assign tasks to. kind: "support" is inert for routing; team-lead is the only support agent any code path invokes.
- Personas live on disk, decoupled from the registry. Each registered id maps to <internal>/agents/<id>/CLAUDE.md + memory.md, loaded at runtime and prepended as the system prompt. Memory is writable, so roles accumulate knowledge across runs.
- The two layers can drift silently. A fully-authored persona folder that is missing from AGENT_DEFS is unroutable; a registered id with no folder runs with an empty system prompt (no crash). A boot-time drift guard (verifyAgentPersonas, called from index.ts) logs both traps loudly: an error for registered-but-no-CLAUDE.md, a warning for folder-but-unregistered. It is advisory only and never blocks startup.
Adding an agent means updating all the layers: the AGENT_DEFS entry (registry), the
on-disk persona folder (and its upstream in solution-template-internal), and the advisory
team-lead docs. Authoring a persona folder alone does nothing.
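The relationship between the registry and the persona folders can be sketched as below. The AgentDef shape and the signature of verifyAgentPersonas are assumptions; here the guard takes a map from folder id to "has CLAUDE.md" so the sketch stays free of filesystem access.

```typescript
// Sketch of the registry and the advisory drift guard; shapes are illustrative.
type AgentKind = "worker" | "support";
interface AgentDef {
  id: string;
  kind: AgentKind;
  role: string;
  stage?: string;
}

const AGENT_DEFS: AgentDef[] = [
  { id: "team-lead", kind: "support", role: "Team Lead" },
  { id: "solution-manager", kind: "worker", role: "Solution Manager" },
  { id: "as-is-analyst", kind: "worker", role: "As-Is Analyst", stage: "as-is" },
  { id: "assessor", kind: "worker", role: "Assessor" },
];

// Only workers are routable by the planner.
const workerIds = (defs: AgentDef[]): string[] =>
  defs.filter((d) => d.kind === "worker").map((d) => d.id);

interface DriftReport {
  errors: string[];
  warnings: string[];
}

// personaFolders maps each folder id found on disk to whether it contains CLAUDE.md.
function verifyAgentPersonas(
  defs: AgentDef[],
  personaFolders: Map<string, boolean>,
): DriftReport {
  const registered = new Set(defs.map((d) => d.id));
  return {
    // Registered but no CLAUDE.md: the agent would run with an empty system prompt.
    errors: defs
      .filter((d) => personaFolders.get(d.id) !== true)
      .map((d) => `registered agent "${d.id}" has no CLAUDE.md`),
    // Folder exists but is unregistered: the persona is unroutable.
    warnings: Array.from(personaFolders.keys())
      .filter((id) => !registered.has(id))
      .map((id) => `persona folder "${id}" is not in AGENT_DEFS`),
  };
}
```

Because the guard is advisory, a caller would log the report and continue booting rather than throw.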
The TaskProcessor
src/orchestrator/TaskProcessor.ts is the single, process-wide consumer of the task queue.
Planning is concurrent (each request’s team-lead turn runs independently), but worker
execution is strictly serial across all requests: one drain loop, guarded by an in-memory
running flag, pulls the oldest pending task, runs it to completion, and only then pulls
the next. That serial invariant is what makes the shared working tree, the per-request branch
lifecycle, and the single in-memory solution model safe.
Design points (all load-bearing):
- Kick-on-enqueue, no polling timer. Producers (request enqueue, task continue, startup recovery) call kick(). There is deliberately no interval loop.
- The double-check in kick() closes a race. After the drain loop finishes and clears running, the processor re-checks taskList.hasPending() and re-kicks. Without it, a task enqueued in the window between the loop’s final empty-check and the flag clearing would be stranded pending forever.
- runTask never throws. A failure is recorded on the task and published as task.failed; the drain loop moves on. If a task fails, the request’s still-pending sibling tasks are cancelled so the request finalises immediately.
- Resume is detected, not a separate code path. A task only carries an sdkSessionId after a prior run, so its presence means “resume”: the SDK reloads the prior conversation and the worker continues where it left off. POST /requests/tasks/:taskId/continue flips a failed/cancelled task back to pending and kicks; the processor does the rest.
- Workers get a scoped access set. Each run’s additional directories span the repos: read-write on the customer workspace/ and the task folder; read-only on the vendored methodology, catalog/, and templates/. The prompt hands workers absolute named roots (WORKSPACE_ROOT, SOLUTION_METHODOLOGY_ROOT, CATALOG_ROOT, TEMPLATES_ROOT).
- Scale-to-zero keep-alive. Background work runs detached from any HTTP request, so on Azure Container Apps the HTTP scaler would otherwise evict the replica mid-task. The drain loop (plus planning and recovery) holds a keep-alive (src/util/keepAlive.ts) that long-polls the agent’s own ingress while work is in flight: GET /keepalive?hold=45 against AGENT_SELF_URL, which on Azure is the in-env address http://agent-<h>. The request still passes the internal ingress, so ACA’s HTTP scale rule sees the app busy. Dormant in local dev where AGENT_SELF_URL is unset.
- Startup recovery. Because execution is serial, a crash leaves at most one task in_progress. recover() commits its partial work to the request branch, marks it failed (“interrupted by restart”), finalises the request (branch kept), returns the tree to base, reloads the solution model, and drains anything still queued.
- In-process, not cross-process. The running flag is plain memory; TaskList serialises every tasklist.json read-modify-write with an in-process lock. This is correct precisely because the agent is a single Node process: running two agent processes against the same .orbit/ would need a file lock that deliberately doesn’t exist.
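The first three design points can be sketched in a few dozen lines. This is a simplified model under stated assumptions, not the code in TaskProcessor.ts: tasks live in a plain array instead of TaskList, the work callback stands in for an SDK session, and git, events, and keep-alive are left out.

```typescript
// Simplified model of the serial drain loop; all shapes here are illustrative.
type Status = "pending" | "in_progress" | "completed" | "failed" | "cancelled";
interface Task {
  id: string;
  requestId: string;
  status: Status;
}

class TaskProcessor {
  private running = false;
  private readonly tasks: Task[];
  private readonly work: (task: Task) => Promise<void>;

  constructor(tasks: Task[], work: (task: Task) => Promise<void>) {
    this.tasks = tasks;
    this.work = work;
  }

  hasPending(): boolean {
    return this.tasks.some((t) => t.status === "pending");
  }

  // Producers call kick() after enqueueing; there is no interval loop.
  kick(): void {
    if (this.running) return;
    this.running = true;
    void this.drain().then(() => {
      this.running = false;
      // Double-check: a task enqueued after the loop's final empty-check
      // (while running was still true) must not be stranded.
      if (this.hasPending()) this.kick();
    });
  }

  private async drain(): Promise<void> {
    for (;;) {
      // Array order is insertion order, so find() yields the oldest pending task.
      const next = this.tasks.find((t) => t.status === "pending");
      if (!next) return;
      await this.runTask(next);
    }
  }

  // Never throws: a failure is recorded and pending siblings are cancelled.
  private async runTask(task: Task): Promise<void> {
    task.status = "in_progress";
    try {
      await this.work(task);
      task.status = "completed";
    } catch {
      task.status = "failed";
      for (const t of this.tasks) {
        if (t.requestId === task.requestId && t.status === "pending") t.status = "cancelled";
      }
    }
  }
}
```

Because drain() awaits each task before looking for the next, at most one task is ever in_progress, which is the invariant startup recovery relies on.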
Per-request git branches
The processor owns two independent git lifecycles, one per writable repo
(src/git/GitSubsystem.ts, two instances wired in index.ts):
Customer repo — branch per request. On a request’s first task the processor lazily
creates a branch named after the request id from the base branch (checking out base and
pulling first). Every task ends with a commit of workspace/ — message <taskId>: <title>,
or <taskId> (failed): <title> on failure, so the tree is always clean before the next
branch switch and partial work stays inspectable. Task commit SHAs are persisted onto the
task records and branches are pushed for visibility (skipped in local-only mode). When all
tasks of a request complete, the branch is merged into base, pushed, and deleted; on any task
failure, cancellation, or merge conflict the request is marked failed and the branch is
kept unmerged for inspection or /continue, with the working tree returned to base.
The base branch is captured on first clean boot and persisted at .orbit/base-branch —
so a restart that lands mid-request (tree still on a request branch) doesn’t mistake that
branch for base.
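The commit-message convention and the finalisation rule for the customer repo reduce to two small pure functions. The names taskCommitMessage and decideBranch are hypothetical, and the merge-conflict flag is passed in as a parameter here, whereas in practice it would only be known after attempting the merge.

```typescript
// Sketch of the customer-repo conventions; no git calls are made here.
type TaskStatus = "completed" | "failed" | "cancelled";

function taskCommitMessage(taskId: string, title: string, failed: boolean): string {
  return failed ? `${taskId} (failed): ${title}` : `${taskId}: ${title}`;
}

interface BranchDecision {
  requestStatus: "done" | "failed";
  action: "merge-push-delete" | "keep-unmerged";
}

function decideBranch(statuses: TaskStatus[], mergeConflict = false): BranchDecision {
  const allCompleted = statuses.length > 0 && statuses.every((s) => s === "completed");
  // Any failure, cancellation, or merge conflict keeps the branch for inspection or /continue.
  return allCompleted && !mergeConflict
    ? { requestStatus: "done", action: "merge-push-delete" }
    : { requestStatus: "failed", action: "keep-unmerged" };
}
```

In both outcomes the working tree ends up back on base, which is what lets the next request branch from a clean state.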
Internal repo — linear main. The .orbit/ audit trail (task store, logs, request
summaries) plus any persona/catalog changes are committed to the internal repo’s main after
every task and at request finalisation (commitAudit). No branches: the internal working
tree never switches under the running processor, which is exactly what makes tracking the
live .orbit/ safe. Audit commits are best-effort and never block task progress.
The methodology digest
Every worker used to re-read ~74K tokens of Trajectory methodology per task pickup. The digest optimisation replaces that with a small generated, per-stage methodology digest (~8.2K tokens) injected into the cached system prompt:
- Digests are generated deterministically from the methodology prose and JSON Schemas, and vendored with the pinned methodology copy inside the customer repo at trajectory-methodology/.generated/agent-digest/<version>/<stage>.md, so a solution’s digest always matches its pinned methodology version.
- A registry entry’s optional stage field opts an agent in: Agent.run loads the matching digest (src/agents/methodologyDigest.ts, memoised, gracefully null when none is vendored) and prepends it first to its own system prompt and to every loaded sub-agent persona, since an identical prefix maximises prompt-cache reuse. Currently only as-is-analyst sets stage: "as-is", and only the as-is digest exists.
- Measured reality: the static methodology slice shrinks ~89% (74K → 8.2K) and moves from uncached tool-result reads into the cached prefix, but the live end-to-end gain is a modest ~9–13% total cost (with orchestrator per-turn cache reads ~30% lower). Real runs are dominated by work context (code walked, artefacts read and written), which the digest doesn’t touch.
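A minimal sketch of the memoised, null-tolerant digest lookup and the prefix ordering. The file read is injected as a callback so the example stays self-contained; createDigestLoader, buildSystemPrompt, and the blank-line separator are assumptions rather than the contents of methodologyDigest.ts.

```typescript
// ReadFile is an injected, hypothetical reader that returns null when the file is absent.
type ReadFile = (path: string) => string | null;

function digestPath(customerRoot: string, version: string, stage: string): string {
  return `${customerRoot}/trajectory-methodology/.generated/agent-digest/${version}/${stage}.md`;
}

function createDigestLoader(
  customerRoot: string,
  version: string,
  read: ReadFile,
): (stage: string | undefined) => string | null {
  // Memoise hits and misses alike, so a missing digest is only probed once.
  const cache = new Map<string, string | null>();
  return (stage) => {
    if (!stage) return null; // agent has not opted in
    if (!cache.has(stage)) cache.set(stage, read(digestPath(customerRoot, version, stage)));
    return cache.get(stage) ?? null;
  };
}

// The digest goes first so every agent sharing a stage has an identical prompt prefix.
function buildSystemPrompt(digest: string | null, persona: string): string {
  return digest ? `${digest}\n\n${persona}` : persona;
}
```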
Serving the solution
The agent doesn’t just write the solution — it serves it. src/solution/SolutionModel.ts
holds the whole raw solution tree in memory (loaded with @maxq/trajectory-loader from
workspace/solution-definition/) and exposes it over HTTP. Cache identity is the
(instance, version) pair: version is a per-process monotonic counter and instance a
per-boot UUID, so a restart can never alias a stale cache. GET /solution/tree is
pre-serialised once per version, gzipped, and ETagged with "<instance>:<version>" for
If-None-Match → 304 round trips.
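The cache-identity check can be sketched independently of the HTTP framework. The function names are illustrative; the comparison strips a W/ prefix and accepts a comma-separated list, following the weak-comparison rule that HTTP defines for If-None-Match.

```typescript
// Sketch of the (instance, version) ETag and the conditional-request decision.
interface CacheIdentity {
  instance: string; // per-boot UUID
  version: number; // per-process monotonic counter
}

const etagFor = ({ instance, version }: CacheIdentity): string => `"${instance}:${version}"`;

function treeResponseStatus(
  identity: CacheIdentity,
  ifNoneMatch: string | undefined,
): { status: 200 | 304; etag: string } {
  const etag = etagFor(identity);
  // If-None-Match may list several validators, possibly weak (W/"...").
  const candidates = (ifNoneMatch ?? "")
    .split(",")
    .map((s) => s.trim().replace(/^W\//, ""));
  const hit = candidates.some((c) => c === etag || c === "*");
  return { status: hit ? 304 : 200, etag };
}
```

A restart resets the version counter but changes the instance, so a client holding "old-instance:3" gets a full 200 response even if the new process also reaches version 3.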
Reloads are event-driven, serialised, and coalescing: there is deliberately no
filesystem watcher. The TaskProcessor fires fire-and-forget refreshes at every settled
moment: after each per-task commit (task-commit, so readers see live mid-request progress),
after a successful merge (request-merged, carrying the changed-file list), on the failure
path (request-failed, the tree flips back to base), and unconditionally after recovery’s
base checkout (recovery). Out-of-band edits to the repo therefore don’t appear until a
manual POST /solution/reload. Every reload emits a solution.updated SSE event with
{ version, reason, requestId?, changedFiles? }, which is what drives the webapp’s live
refresh. The full pipeline is covered on the data flow page.
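One way to get serialised, coalescing reloads is sketched below. The class name and the policy of keeping only the most recent queued reason are assumptions; the source only states that reloads are fire-and-forget, never overlap, and emit solution.updated with a version and reason.

```typescript
// Sketch of a coalescing reloader; load() stands in for re-reading the solution tree.
type ReloadReason = "task-commit" | "request-merged" | "request-failed" | "recovery";
interface SolutionUpdated {
  version: number;
  reason: ReloadReason;
}

class CoalescingReloader {
  private version = 0;
  private busy = false;
  private queued: ReloadReason | null = null;
  private readonly load: () => Promise<void>;
  private readonly emit: (event: SolutionUpdated) => void;

  constructor(load: () => Promise<void>, emit: (event: SolutionUpdated) => void) {
    this.load = load;
    this.emit = emit;
  }

  // Fire-and-forget: callers never await the reload.
  request(reason: ReloadReason): void {
    if (this.busy) {
      // Coalesce: requests arriving mid-reload collapse into one follow-up pass.
      this.queued = reason;
      return;
    }
    this.busy = true;
    void this.run(reason);
  }

  private async run(reason: ReloadReason): Promise<void> {
    let current: ReloadReason | null = reason;
    while (current) {
      try {
        await this.load();
        this.version += 1;
        this.emit({ version: this.version, reason: current });
      } catch {
        // Assumption: a failed load keeps serving the previous version.
      }
      current = this.queued;
      this.queued = null;
    }
    this.busy = false;
  }
}
```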
The agent also hosts the code explorer’s data (/sources/*, over the customer repo’s
repositories/ folder with a size cap, extension allowlist, and path-escape guards); the
webapp’s /api/sources/* routes are thin proxies to these.
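A path-escape guard with an extension allowlist might look like the following. The allowlist contents and the function name are hypothetical, the size cap is omitted, and this lexical check does not cover symlinks; the real guards live in the agent's /sources routes.

```typescript
// Hypothetical allowlist for illustration only.
const ALLOWED_EXTENSIONS = [".ts", ".md", ".json", ".yaml"];

// Returns the resolved path under root, or null when the request must be rejected.
function resolveSourcePath(root: string, requested: string): string | null {
  // Reject NUL bytes and absolute paths outright.
  if (requested.indexOf("\0") !== -1 || requested.startsWith("/")) return null;

  const segments: string[] = [];
  for (const part of requested.split(/[\\/]+/)) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      // Popping past the root would escape the repositories folder.
      if (segments.length === 0) return null;
      segments.pop();
    } else {
      segments.push(part);
    }
  }
  if (segments.length === 0) return null;

  const name = segments[segments.length - 1];
  const dot = name.lastIndexOf(".");
  const ext = dot === -1 ? "" : name.slice(dot).toLowerCase();
  if (ALLOWED_EXTENSIONS.indexOf(ext) === -1) return null;

  return `${root}/${segments.join("/")}`;
}
```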
HTTP API
All routes verified in src/routes/ and src/index.ts. Everything except /health sits
behind the optional edgeGuard middleware — a Cloudflare front-door lockdown that is dormant
unless EDGE_SHARED_SECRET is set. Since 2026-07-03 the deployed agent doesn’t need it: its
ACA ingress is internal (external: false), so requests from outside the environment are
rejected by the environment proxy before they reach the app, and the edge lockdown
(edge.lock_solutions) applies only to the two public apps. The old reader→agent bypass
header (x-orbit-bypass / EDGE_BYPASS_SECRET) is obsolete.
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness probe (always exempt from the edge guard). |
| GET | /keepalive?hold=45 | Long-poll target for the agent’s own scale-to-zero guard (max hold 60s). |
| POST | /requests/tasks | Submit a task request ({ prompt, metadata? }). Returns 202 + { requestId }. |
| GET | /requests/tasks | The task index (tasklist.json). |
| POST | /requests/tasks/:taskId/continue | Resume a failed/cancelled task via its stored SDK session (409 if none). |
| GET | /requests/tasks/:taskId/log | SSE tail of the task’s log.jsonl (one log.line event per SDK message; 500ms size-poll). |
| GET | /requests/tasks/:taskId/details | Static task folder contents: task.json metadata, task.md brief, result/plan.md, result/summary.md. |
| GET | /requests/:requestId/summary | The finalised request summary (summary.yaml as JSON); 404 while in flight. |
| POST | /requests/chats | Stream a chat turn with the Team Lead over SSE (chat.session / chat.delta / chat.done / chat.error). |
| GET | /requests/chats | Chat session index, most recent first. |
| GET | /requests/chats/:sessionId | One chat session + its messages. |
| PATCH | /requests/chats/:sessionId | Rename a chat ({ title }). |
| DELETE | /requests/chats/:sessionId | Delete a chat session. |
| GET | /solution | Solution identity summary (id, shortName, name, customer) — consumed by Mission Control. |
| GET | /solution/version | { version, instance, loadedAt, headSha }; 503 { loading: true } before first load. |
| GET | /solution/tree | The whole raw solution tree; gzip + instance-qualified ETag with 304 support. |
| POST | /solution/reload | Manual reload escape hatch (e.g. after out-of-band edits). Async 202. |
| GET | /sources/tree | Source-repositories tree for the code explorer. |
| GET | /sources/file?path= | One source file, with path-escape and size guards. |
| GET | /events/subscribe | SSE stream of all events; Last-Event-ID (or ?since=) replays from the ring buffer. |
Events published on the bus (src/events/types.ts): request.accepted, request.rejected,
request.planning, request.in_progress, request.done, request.failed, task.created,
task.started, task.progress, task.completed, task.failed, task.summary,
chat.delta, chat.done, agents.memory.updated, and solution.updated. The in-memory
EventBus keeps a ring buffer of recent events for SSE reconnect replay.
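The replay behaviour can be sketched with a bounded array. Numeric, monotonically increasing event ids and the default capacity are assumptions; the source only states that recent events are kept in a ring buffer and replayed from Last-Event-ID (or ?since=).

```typescript
// Sketch of an in-memory bus with bounded replay; ids and capacity are illustrative.
interface BusEvent {
  id: number;
  type: string;
  data: unknown;
}
type Listener = (event: BusEvent) => void;

class EventBus {
  private nextId = 1;
  private readonly buffer: BusEvent[] = [];
  private readonly listeners = new Set<Listener>();
  private readonly capacity: number;

  constructor(capacity = 1000) {
    this.capacity = capacity;
  }

  publish(type: string, data: unknown): BusEvent {
    const event: BusEvent = { id: this.nextId++, type, data };
    this.buffer.push(event);
    // Drop the oldest event once the buffer is full.
    if (this.buffer.length > this.capacity) this.buffer.shift();
    this.listeners.forEach((listener) => listener(event));
    return event;
  }

  // Replays buffered events newer than lastEventId, then streams live ones.
  subscribe(listener: Listener, lastEventId?: number): () => void {
    if (lastEventId !== undefined) {
      for (const event of this.buffer) {
        if (event.id > lastEventId) listener(event);
      }
    }
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }
}
```

A client that reconnects after the buffer has wrapped only receives what is still retained, so it should treat replay as best-effort and re-read state through the API if it needs certainty.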
Boot sequence
src/index.ts, in order: ensure both git clones (customer + internal) → idempotent repo
scaffold (bootstrap.ts, copies missing persona/template seeds without ever overwriting) →
wire EventBus + Registry → run the persona drift guard → eager-but-async solution model
load (/health answers immediately; /solution/* serves 503 until the first load lands) →
construct the TaskProcessor (resolving the persisted base branch) and the TeamLead, wiring
the aggregator back-reference → fire-and-forget crash recovery → mount middleware
(HTTP logging, edge guard, CORS) and the seven route groups.
The service expects to run behind a trusted boundary: there is no user auth on the HTTP
surface itself. On Azure that boundary is the ACA environment — since 2026-07-03 the agent
has internal-only ingress (no public FQDN, no Cloudflare subdomain), and its only clients
are the reader webapp’s server side and Mission Control (whose browser traffic arrives via
Mission Control’s same-origin /agent proxy).