The Harness
The techniques the frontier labs use.
Shipped, tested, and measured.
DeepSeek, OpenAI, and Anthropic build their agents on a known set of techniques — reasoning loops, memory compaction, parallel tools, verification gates. This page is Salsa's honest ledger of every one of them: what's shipped, what's partial, what's planned — each backed by a real file, a real test, and a real number.
How to read this
Three states. No wishful thinking.
Every capability below carries one of three honest states. Marketing copy cannot promote a capability past its real state — the state is set by the code and its tests, not by the pitch.
Shipped: Implemented in the codebase with a named test that fails without it, and a KPI we measure. Cited with file:line.
Partial: A real, working core exists, but a meaningful piece of the frontier version is not yet in place. We name exactly what's missing.
Planned: On the roadmap, not yet in the codebase. Listed so the ledger is complete and honest, and so the gap is visible.
Current scorecard
Where the harness stands today
This scorecard changes as work lands. When a Planned item ships with its test, it moves up — and this page is the record of that.
Capability group 01
Reasoning & control loop
The agent's core loop: think, act, observe, repeat — with hard bounds so it can never run away, and cached replay so it never redoes work.
ReAct reasoning loop
Shipped: Interleaved reasoning and tool calls with a bounded max_turns stop condition, so the loop always terminates.
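A minimal sketch of a turn-bounded loop, assuming the model and the tool runner can be treated as plain callbacks. The `Step`, `Outcome`, and `run_loop` names are illustrative and are not taken from engine.rs; only the `max_turns` bound comes from the text above.

```rust
/// What the model decides to do on a given turn (hypothetical type).
enum Step {
    CallTool(String),
    Finish(String),
}

/// Why the loop stopped.
#[derive(Debug, PartialEq)]
enum Outcome {
    Answer(String),
    TurnLimit,
}

/// Runs think -> act -> observe until the model finishes or `max_turns` is hit.
/// `think` stands in for the model and `act` for tool execution.
fn run_loop(
    max_turns: usize,
    mut think: impl FnMut(&[String]) -> Step,
    mut act: impl FnMut(&str) -> String,
) -> Outcome {
    let mut observations: Vec<String> = Vec::new();
    for _ in 0..max_turns {
        match think(&observations) {
            Step::Finish(answer) => return Outcome::Answer(answer),
            Step::CallTool(call) => observations.push(act(&call)),
        }
    }
    // The turn budget is spent: stop instead of spinning.
    Outcome::TurnLimit
}
```

Because the only way to keep going is another pass through a `for` loop over `0..max_turns`, termination does not depend on the model ever choosing to finish.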
Idempotent tool replay
Shipped: A per-run dedup cache replays the cached result when the model repeats an identical successful tool call, so there is no double execution and no wasted spend.
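One way such a replay cache could look, assuming a call is identified by its tool name plus its serialized arguments. `ReplayCache` and its key shape are assumptions for illustration, not the structure used in engine.rs.

```rust
use std::collections::HashMap;

/// Per-run cache keyed by (tool name, serialized arguments).
struct ReplayCache {
    results: HashMap<(String, String), String>,
}

impl ReplayCache {
    fn new() -> Self {
        Self { results: HashMap::new() }
    }

    /// Returns the cached result for an identical earlier call, or runs the tool.
    /// Only successful results are stored, so a failed call can be retried.
    fn call(
        &mut self,
        tool: &str,
        args: &str,
        exec: impl FnOnce() -> Result<String, String>,
    ) -> Result<String, String> {
        let key = (tool.to_string(), args.to_string());
        if let Some(hit) = self.results.get(&key) {
            return Ok(hit.clone());
        }
        let out = exec()?;
        self.results.insert(key, out.clone());
        Ok(out)
    }
}
```

Storing only `Ok` results matches the "identical successful tool call" wording: an error is never replayed, so the model can try the same call again after a transient failure.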
Self-healing loop control
Shipped: Degenerate loops are detected and bounded (dedup entry caps, turn caps) so a stuck model is contained rather than left to spin.
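A rough sketch of how repeated calls might be detected while the tracking table itself stays bounded. The `LoopGuard` type, the `max_repeats` threshold, and the policy of not tracking new calls once the table is full are all assumptions made for illustration; the text above only states that entries and turns are capped.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Verdict {
    Proceed,
    Stuck,
}

/// Counts identical calls, with a hard cap on how many distinct calls are tracked.
struct LoopGuard {
    seen: HashMap<String, u32>,
    max_entries: usize,
    max_repeats: u32,
}

impl LoopGuard {
    fn new(max_entries: usize, max_repeats: u32) -> Self {
        Self { seen: HashMap::new(), max_entries, max_repeats }
    }

    /// Records one call and reports whether the model looks stuck on it.
    fn observe(&mut self, call: &str) -> Verdict {
        if !self.seen.contains_key(call) && self.seen.len() >= self.max_entries {
            // Table is full: do not track new calls, so memory stays bounded.
            return Verdict::Proceed;
        }
        let count = self.seen.entry(call.to_string()).or_insert(0);
        *count += 1;
        if *count > self.max_repeats {
            Verdict::Stuck
        } else {
            Verdict::Proceed
        }
    }
}
```

A guard like this would sit alongside the turn cap rather than replace it: the repeat counter catches a model looping on one call early, and the turn cap bounds everything else.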
Tree of Thoughts
Planned: Branching multi-path exploration with backtracking. The current loop is linear ReAct; ToT search is not yet in the codebase.
Capability group 02
Memory & context
Context is a budget, not an afterthought. Salsa caps session tokens and compacts on overflow so long runs stay coherent and affordable.
Context compaction & token budget
Shipped: A hard session cap of 8,000 tokens triggers summarization when exceeded, keeping the working context inside the model's effective window.
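The cap-then-summarize behavior can be sketched as below. Only the 8,000-token figure comes from the text; the four-characters-per-token estimate, the `keep_recent` parameter, and the injected `summarize` callback are assumptions standing in for a real tokenizer and a real summarization call.

```rust
/// Session cap from the text above.
const SESSION_TOKEN_CAP: usize = 8_000;

/// Hypothetical estimator: roughly four characters per token.
fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

fn total_tokens(messages: &[String]) -> usize {
    messages.iter().map(|m| estimate_tokens(m)).sum()
}

/// If the session exceeds `cap`, folds the oldest messages into one summary
/// and keeps the most recent `keep_recent` messages verbatim.
fn compact(
    messages: Vec<String>,
    cap: usize,
    keep_recent: usize,
    summarize: impl Fn(&[String]) -> String,
) -> Vec<String> {
    if total_tokens(&messages) <= cap || messages.len() <= keep_recent {
        return messages;
    }
    let split = messages.len() - keep_recent;
    let mut compacted = vec![summarize(&messages[..split])];
    compacted.extend_from_slice(&messages[split..]);
    compacted
}
```

A single pass does not guarantee the result fits under the cap if the retained messages are themselves large, so a production version would presumably re-check the total after compaction.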
Persistent memory (GraphRAG)
Shipped: Cross-session knowledge is stored and retrieved through the GraphRAG store (vector + graph), so the agent carries context between runs.
Multi-tier memory (short / mid / long)
Partial: Short-term (session budget) and long-term (GraphRAG) tiers exist. The mid-term rolling-summary tier, a compacted digest between the two, is not yet in place.
Prompt / context caching
Planned: Provider-side prompt caching (reusing a cached system+tools prefix across turns) is not yet wired. It's the highest-leverage cost win on the roadmap.
Capability group 03
Advanced tool use
Tools run in parallel under a concurrency bound, every call passes a permission check first, and the model can search its own toolbox.
Parallel tool execution
Shipped: Independent tool calls run concurrently under a Semaphore-bounded pool, then join: fast, without unbounded fan-out.
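A sketch of bounded fan-out using only the standard library, assuming tools can be modeled as a blocking function over a string. The shipped executor is presumably async; the hand-rolled `Semaphore` and `run_parallel` below are stand-ins meant to show the bound-then-join shape, not the code in tool_executor.rs.

```rust
use std::sync::{Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built from std primitives.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

/// Returns its permit when dropped, even if the tool panics.
struct Permit<'a>(&'a Semaphore);

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.0.permits.lock().unwrap() += 1;
        self.0.available.notify_one();
    }
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) -> Permit<'_> {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
        Permit(self)
    }
}

/// Runs every call concurrently with at most `limit` in flight, then joins.
/// Results come back in the same order as `calls`.
fn run_parallel<F>(calls: &[String], limit: usize, tool: F) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    let sem = Semaphore::new(limit.max(1));
    thread::scope(|s| {
        let mut handles = Vec::new();
        for call in calls {
            let (sem, tool) = (&sem, &tool);
            handles.push(s.spawn(move || {
                let _permit = sem.acquire();
                tool(call.as_str())
            }));
        }
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect::<Vec<String>>()
    })
}
```

The sketch spawns one thread per call and lets the semaphore limit how many run the tool body at once. A real pool would likely cap the number of threads or tasks as well.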
Permission-gated tool calls
Shipped: Every tool invocation passes an async permission check returning an explicit decision before it can execute. This is enforced in code, not prompt text.
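The gate-before-execute shape might look like the following. The shipped check is described as async; this sketch is synchronous to stay dependency-free, and the `Decision`, `PermissionPolicy`, and `AllowList` names are hypothetical rather than taken from permission.rs.

```rust
/// The explicit outcome of a permission check.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny(String),
}

/// Anything that can decide whether a tool call may run.
trait PermissionPolicy {
    fn check(&self, tool: &str, args: &str) -> Decision;
}

/// Example policy: only tools named in the list are allowed.
struct AllowList(Vec<&'static str>);

impl PermissionPolicy for AllowList {
    fn check(&self, tool: &str, _args: &str) -> Decision {
        if self.0.iter().any(|allowed| *allowed == tool) {
            Decision::Allow
        } else {
            Decision::Deny(format!("tool '{}' is not permitted", tool))
        }
    }
}

/// The only path to `exec` runs through the policy; a denial never executes.
fn gated_call(
    policy: &dyn PermissionPolicy,
    tool: &str,
    args: &str,
    exec: impl FnOnce(&str) -> String,
) -> Result<String, String> {
    match policy.check(tool, args) {
        Decision::Allow => Ok(exec(args)),
        Decision::Deny(reason) => Err(reason),
    }
}
```

Taking the tool body as a closure that only `gated_call` invokes is what makes the check structural: there is no code path that reaches `exec` without a `Decision::Allow`.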
Tool search / discovery
Shipped: The model calls tool_search to find tools by keyword against the full catalog; a match is revealed so its native schema rejoins the next turn. It is the reveal mechanism behind Progressive Discovery, so it now cuts per-turn payload rather than adding to it.
Progressive Discovery (reveal-set tool-loading)
Shipped: Each turn offers only a small always-on core plus the tool_search meta-tool. When tool_search matches a tool it is revealed for the rest of the conversation and its full schema rejoins subsequent turns as a native provider tool, keeping parallel calls, tool_choice, and the permission gate intact. One code path for Local / Cloud / Desktop / CLI / subagents, gated by SALSA_PROGRESSIVE_TOOLS (default ON).
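The reveal-set idea can be sketched as a catalog that tracks which tools have been surfaced so far. Everything here except the `tool_search` name is an assumption for illustration: the `Tool` and `ToolCatalog` types, the substring matching, and the example tools are not drawn from progressive.rs.

```rust
use std::collections::BTreeSet;

struct Tool {
    name: &'static str,
    description: &'static str,
}

struct ToolCatalog {
    all: Vec<Tool>,
    core: BTreeSet<&'static str>,
    revealed: BTreeSet<&'static str>,
}

impl ToolCatalog {
    fn new(all: Vec<Tool>, core: &[&'static str]) -> Self {
        Self {
            all,
            core: core.iter().copied().collect(),
            revealed: BTreeSet::new(),
        }
    }

    /// Keyword search over the full catalog. Every match is revealed and
    /// stays revealed for the rest of the conversation.
    fn tool_search(&mut self, keyword: &str) -> Vec<&'static str> {
        let needle = keyword.to_lowercase();
        let hits: Vec<&'static str> = self
            .all
            .iter()
            .filter(|t| {
                t.name.to_lowercase().contains(&needle)
                    || t.description.to_lowercase().contains(&needle)
            })
            .map(|t| t.name)
            .collect();
        self.revealed.extend(hits.iter().copied());
        hits
    }

    /// Tool names offered on the next turn: core, tool_search, and revealed tools.
    fn offered(&self) -> Vec<&'static str> {
        let mut names: BTreeSet<&'static str> =
            self.core.union(&self.revealed).copied().collect();
        names.insert("tool_search");
        names.into_iter().collect()
    }
}
```

With a catalog of `read_file` (core), `git_commit`, and `http_get`, the first turn would offer only `read_file` and `tool_search`; after a search for "git", `git_commit` joins every later turn while `http_get` stays out of the payload.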
Strict tool contracts
Shipped: Every provider funnels tool calls through one dispatch choke point, so the model's raw arguments are validated against each tool's JSON Schema before the tool body runs. Malformed calls are rejected, not executed, with a structured error the model can self-correct from. The gate fails open on a schema we authored wrong and closed on the model's bad args; compiled schemas are cached once per tool (read-mostly, keyed by name) so the parallel fan-out is never serialized. Gated by SALSA_STRICT_TOOL_CONTRACTS (default ON).
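To make the fail-open and fail-closed split concrete, here is a sketch that replaces real JSON Schema with a tiny "required field and type" check. The `Value`, `Kind`, `Compiled`, and `ContractGate` names are hypothetical, and a real gate would use a JSON Schema validator. The parts that mirror the text are the ordering (validate before dispatch), the two failure modes, and the read-mostly cache keyed by tool name.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Str,
    Num,
    Bool,
}

/// A "compiled" schema: each required field and its expected kind.
struct Compiled(Vec<(String, Kind)>);

/// Compiles an authored schema; an unknown type name is an authoring error.
fn compile(spec: &[(&str, &str)]) -> Option<Compiled> {
    let mut fields = Vec::new();
    for (name, ty) in spec {
        let kind = match *ty {
            "string" => Kind::Str,
            "number" => Kind::Num,
            "boolean" => Kind::Bool,
            _ => return None,
        };
        fields.push((name.to_string(), kind));
    }
    Some(Compiled(fields))
}

struct ContractGate {
    cache: RwLock<HashMap<String, Option<Arc<Compiled>>>>,
}

impl ContractGate {
    fn new() -> Self {
        Self { cache: RwLock::new(HashMap::new()) }
    }

    /// Compiles once per tool name; later calls take only the read lock.
    fn schema_for(&self, tool: &str, spec: &[(&str, &str)]) -> Option<Arc<Compiled>> {
        if let Some(hit) = self.cache.read().unwrap().get(tool) {
            return hit.clone();
        }
        let compiled = compile(spec).map(Arc::new);
        self.cache.write().unwrap().insert(tool.to_string(), compiled.clone());
        compiled
    }

    /// Ok(()) means "dispatch the tool"; Err is the structured error for the model.
    fn validate(
        &self,
        tool: &str,
        spec: &[(&str, &str)],
        args: &HashMap<String, Value>,
    ) -> Result<(), String> {
        let schema = match self.schema_for(tool, spec) {
            Some(schema) => schema,
            None => return Ok(()), // fail open: the schema itself is broken
        };
        for (field, kind) in &schema.0 {
            let actual = match args.get(field) {
                Some(Value::Str(_)) => Kind::Str,
                Some(Value::Num(_)) => Kind::Num,
                Some(Value::Bool(_)) => Kind::Bool,
                None => return Err(format!("{}: missing required field '{}'", tool, field)),
            };
            if actual != *kind {
                // fail closed: the model sent the wrong type
                return Err(format!(
                    "{}: field '{}' expected {:?}, got {:?}",
                    tool, field, kind, actual
                ));
            }
        }
        Ok(())
    }
}
```

The `RwLock` lets concurrent tool calls share cached schemas through the read lock. The write lock is taken only the first time a tool name is seen, which is the property that keeps a parallel fan-out from queuing behind validation.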
Capability group 04
Multi-agent & verification
An orchestrator delegates to workers, a supervisor judges the output, and the patent-pending DSP protocol cryptographically verifies every action.
Orchestrator → workers
Shipped: The agent can spawn sub-agents as tools, delegating scoped subtasks and collecting their results: the orchestrator-workers pattern.
LLM-as-judge (supervisor)
Shipped: A supervisor reviews plan-step output and gates progression: a second model judging the first, in code.
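Gating progression on a judge's verdict can be sketched as below, with the worker and the supervisor both reduced to callbacks. `Verdict` and `run_plan` are illustrative names, and stopping at the first failed step is an assumed policy; the shipped supervisor may retry or revise instead.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail(String),
}

/// Runs plan steps in order. Each output must pass the judge before the
/// next step starts; the first failure stops the plan.
fn run_plan(
    steps: &[&str],
    worker: impl Fn(&str) -> String,
    judge: impl Fn(&str, &str) -> Verdict,
) -> Result<Vec<String>, (usize, String)> {
    let mut accepted = Vec::new();
    for (index, &step) in steps.iter().enumerate() {
        let output = worker(step);
        match judge(step, &output) {
            Verdict::Pass => accepted.push(output),
            Verdict::Fail(reason) => return Err((index, reason)),
        }
    }
    Ok(accepted)
}
```

Returning the failing step's index along with the judge's reason gives the orchestrator enough to decide what to do next, without this function having to know the recovery policy.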
DSP verification gates
Shipped: Every agent action is encoded as a signed DSP frame and passes five verification gates (magic, version, codebook, epoch, Ed25519 signature) before execution.
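The five gates can be pictured as an ordered chain in which the first failure rejects the frame. This sketch is not the dsp-protocol crate: the `Frame` layout, field widths, the equality rule for the epoch gate, and the injected `verify_signature` function are all assumptions. Real Ed25519 verification needs a cryptography library, so it is passed in here rather than implemented.

```rust
/// Which gate rejected the frame.
#[derive(Debug, PartialEq)]
enum Gate {
    Magic,
    Version,
    Codebook,
    Epoch,
    Signature,
}

/// Hypothetical frame layout, for illustration only.
struct Frame {
    magic: [u8; 4],
    version: u8,
    codebook_id: u16,
    epoch: u64,
    payload: Vec<u8>,
    signature: Vec<u8>,
}

struct Verifier {
    magic: [u8; 4],
    version: u8,
    known_codebooks: Vec<u16>,
    current_epoch: u64,
    /// Stand-in for an Ed25519 check: (payload, signature) -> valid?
    verify_signature: fn(&[u8], &[u8]) -> bool,
}

impl Verifier {
    /// Runs the five gates in order and reports the first one that fails.
    fn verify(&self, frame: &Frame) -> Result<(), Gate> {
        if frame.magic != self.magic {
            return Err(Gate::Magic);
        }
        if frame.version != self.version {
            return Err(Gate::Version);
        }
        if !self.known_codebooks.contains(&frame.codebook_id) {
            return Err(Gate::Codebook);
        }
        if frame.epoch != self.current_epoch {
            return Err(Gate::Epoch);
        }
        if !(self.verify_signature)(&frame.payload, &frame.signature) {
            return Err(Gate::Signature);
        }
        Ok(())
    }
}
```

Ordering the cheap field comparisons ahead of the signature check means a malformed or stale frame is rejected before any cryptographic work is done.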
Consensus / multi-model review
Partial: A consensus loop exists for cross-checking model output. Broad multi-model voting across the full agent surface is not yet the default.
Capability group 05
Routing, safety & observability
The right model for the job, a sandbox that fails closed, and enough telemetry to prove what happened.
Smart routing & cost tiering
Shipped: A SmartRouter scores task complexity and picks a model tier; when a task type shows a >30% escalation rate, it auto-promotes to a stronger tier.
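A sketch of the promote-on-escalation rule, assuming three tiers and a complexity score in the range 0 to 1. Only the 30% threshold comes from the text; the tier names, the complexity cut-offs, and the `Router` type are placeholders rather than the contents of router/engine.rs.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Small,
    Medium,
    Large,
}

impl Tier {
    fn promote(self) -> Tier {
        match self {
            Tier::Small => Tier::Medium,
            Tier::Medium | Tier::Large => Tier::Large,
        }
    }
}

#[derive(Default)]
struct Outcomes {
    runs: u32,
    escalations: u32,
}

struct Router {
    history: HashMap<String, Outcomes>,
}

impl Router {
    fn new() -> Self {
        Self { history: HashMap::new() }
    }

    /// Records whether a finished task had to be escalated to a stronger model.
    fn record(&mut self, task_type: &str, escalated: bool) {
        let outcomes = self.history.entry(task_type.to_string()).or_default();
        outcomes.runs += 1;
        if escalated {
            outcomes.escalations += 1;
        }
    }

    /// Picks a tier from complexity, then promotes it one step if this task
    /// type has escalated on more than 30% of its recorded runs.
    fn pick(&self, task_type: &str, complexity: f64) -> Tier {
        let base = if complexity < 0.3 {
            Tier::Small
        } else if complexity < 0.7 {
            Tier::Medium
        } else {
            Tier::Large
        };
        match self.history.get(task_type) {
            // Integer form of escalations / runs > 0.30, avoiding float edge cases.
            Some(o) if o.escalations * 10 > o.runs * 3 => base.promote(),
            _ => base,
        }
    }
}
```

With ten recorded runs, four escalations (40%) would promote a low-complexity task from the small tier to the medium one, while exactly three (30%) would not, since the rule is strictly greater than 30%.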
VM sandbox isolation
Shipped: Autonomous actions execute in an isolated VM reachable only over VSock 4100. If the sandbox is down, the task fails; it never falls back to the host.
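Fail-closed behavior is mostly about what the code does not contain. In the sketch below, the `Sandbox` trait, `ExecError`, and `FakeSandbox` are hypothetical stand-ins with no VSock transport modeled. The point is that `execute` has no branch that runs the command anywhere but the sandbox.

```rust
#[derive(Debug, PartialEq)]
enum ExecError {
    SandboxUnavailable,
    TaskFailed(String),
}

/// Hypothetical view of the sandbox as seen by the executor.
trait Sandbox {
    fn is_up(&self) -> bool;
    fn run(&self, command: &str) -> Result<String, String>;
}

/// Runs a command in the sandbox or fails. There is deliberately no
/// host-side fallback branch in this function.
fn execute(sandbox: &dyn Sandbox, command: &str) -> Result<String, ExecError> {
    if !sandbox.is_up() {
        return Err(ExecError::SandboxUnavailable);
    }
    sandbox.run(command).map_err(ExecError::TaskFailed)
}

/// Test double used only to exercise the two paths.
struct FakeSandbox {
    up: bool,
}

impl Sandbox for FakeSandbox {
    fn is_up(&self) -> bool {
        self.up
    }

    fn run(&self, command: &str) -> Result<String, String> {
        Ok(format!("sandboxed: {}", command))
    }
}
```

Calling `execute` with a `FakeSandbox { up: false }` returns `ExecError::SandboxUnavailable` and never reaches `run`, which is the behavior the "0 host-side fallback exec" measure describes.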
Meta-prompt optimization (APO)
Shipped: Incoming prompts are optimized/rewritten before dispatch, improving instruction quality without the user rewriting anything.
Deep observability (OpenTelemetry)
Partial: Structured tracing spans and Prometheus metrics are in place. A full OpenTelemetry export pipeline (traces to an external collector) is not yet wired.
Eval harness & outcome tracking
Partial: Router outcomes and QA gates feed back into routing decisions. A comprehensive, standalone offline eval suite over golden trajectories is still growing.
Trajectory reduction
Planned: Compressing long tool-call histories into a minimal replayable trace (beyond the session summary) is on the roadmap for very long autonomous runs.
The full ledger
Every capability, one table
The single source of truth. When a row's state changes, this table changes with it.
| Capability | State | Source | Measured by |
|---|---|---|---|
| ReAct reasoning loop | Shipped | engine.rs:110 | Bounded turn cap per run |
| Idempotent tool replay | Shipped | engine.rs:444 | Duplicate calls = 0 re-exec |
| Self-healing loop control | Shipped | engine.rs:1927 | Bounded entries & turns |
| Context compaction / budget | Shipped | budget.rs:12 | 8K session token cap |
| Persistent memory (GraphRAG) | Shipped | exec-core GraphRAG | Knowledge across sessions |
| Parallel tool execution | Shipped | tool_executor.rs:394 | Bounded concurrency |
| Permission-gated tools | Shipped | permission.rs:40 | 100% checked pre-exec |
| Orchestrator → workers | Shipped | agent_tool.rs | Scoped delegation |
| LLM-as-judge (supervisor) | Shipped | plan_executor/supervisor.rs | Steps gated on quality |
| DSP verification gates | Shipped | dsp-protocol crate | 5-gate verify ~30µs |
| Smart routing & cost tiering | Shipped | router/engine.rs:184 | Escalate on >30% rate |
| Meta-prompt optimization | Shipped | query_engine_stream/mod.rs | Normalized prompts |
| VM sandbox isolation | Shipped | atlas-core; vsock-server | 0 host-side fallback exec |
| Tool search / discovery | Shipped | tool_search_tool.rs | Matched tools revealed as native |
| Progressive Discovery tool-loading | Shipped | progressive.rs; tool_executor.rs | Per-turn schema payload ≤ 60% |
| Multi-tier memory | Partial | budget.rs + GraphRAG | Mid-term tier missing |
| Strict tool contracts | Partial | openai.rs (sanitize) | Uniform strict-mode missing |
| Consensus / multi-model review | Partial | services/consensus_loop.rs | Not default-on |
| Deep observability (OTel) | Partial | tracing + prometheus | OTel export missing |
| Eval harness / outcome tracking | Partial | router feedback; qa_gate.rs | Golden-set incomplete |
| Tree of Thoughts | Planned | — | Branching search test |
| Prompt / context caching | Planned | — | Cached prefix reuse |
| Trajectory reduction | Planned | — | Minimal replayable trace |
Why this page exists
Proof mints trust.
Anyone can list the techniques the frontier labs use and claim to have them. Salsa's promise is different: you should stop checking our work only once every claim is verifiable. So this page never runs ahead of the code.
Every Shipped row points at a file you could open and a test that would fail if the capability regressed. Every Partial row names exactly what's missing. Every Planned row admits the gap. That's the deal — and it's also the codebase's north star: close the Partials, ship the Planned, and this ledger fills in.
An agent you can audit.
Request access to the Salsa beta and hold the harness to its word.