The Harness

The techniques the frontier labs use.
Shipped, tested, and measured.

DeepSeek, OpenAI, and Anthropic build their agents on a known set of techniques — reasoning loops, memory compaction, parallel tools, verification gates. This page is Salsa's honest ledger of every one of them: what's shipped, what's partial, what's planned — each backed by a real file, a real test, and a real number.

A capability only appears here as Shipped when there is code you can read and a test that fails without it. No aspirational checkmarks. This is the north star the codebase is measured against.

How to read this

Three states. No wishful thinking.

Every capability below carries one of three honest states. Marketing copy cannot promote a capability past its real state — the state is set by the code and its tests, not by the pitch.

Shipped

Implemented in the codebase with a named test that fails without it, and a KPI we measure. Cited with file:line.

Partial

A real, working core exists, but a meaningful piece of the frontier version is not yet in place. We name exactly what's missing.

Planned

On the roadmap, not yet in the codebase. Listed so the ledger is complete and honest — and so the gap is visible.

Current scorecard

Where the harness stands today

Shipped & test-backed

Partial — core working

Planned — on the roadmap

This scorecard changes as work lands. When a Planned item ships with its test, it moves up — and this page is the record of that.

Capability group 01

Reasoning & control loop

The agent's core loop: think, act, observe, repeat. Hard bounds keep it from running away, and cached replay means an identical successful tool call is never executed twice.

ReAct reasoning loop

Shipped

Interleaved reasoning and tool calls with a bounded max_turns stop condition, so the loop always terminates.

crates/exec-core/src/query_engine/engine.rs:110
test: tests_integration.rs
KPI: every run bounded by an explicit turn cap
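As a rough illustration of a bounded think/act/observe loop, here is a minimal sketch. It is not the code in engine.rs: the `Step` and `Outcome` types, the `run_loop` function, and the closure-based model and tool stand-ins are all hypothetical names chosen for this example. Only the idea of a `max_turns` stop condition comes from the text above.

```rust
/// What the (hypothetical) model decides on each turn.
#[derive(Debug, Clone, PartialEq)]
pub enum Step {
    /// Call a tool with an input string.
    Act { tool: String, input: String },
    /// Stop with a final answer.
    Finish(String),
}

/// Why the loop stopped.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Answered { answer: String, turns: usize },
    TurnCapReached { turns: usize },
}

/// Runs think -> act -> observe until the model finishes or `max_turns`
/// is reached. `think` sees the observations so far; `act` runs one tool call.
pub fn run_loop(
    max_turns: usize,
    mut think: impl FnMut(&[String]) -> Step,
    mut act: impl FnMut(&str, &str) -> String,
) -> Outcome {
    let mut observations: Vec<String> = Vec::new();
    for turn in 1..=max_turns {
        match think(&observations) {
            Step::Finish(answer) => return Outcome::Answered { answer, turns: turn },
            Step::Act { tool, input } => observations.push(act(&tool, &input)),
        }
    }
    // The cap is the only exit for a model that never finishes.
    Outcome::TurnCapReached { turns: max_turns }
}
```

Because the `for` range is finite, a model that keeps asking for tools forever still returns `TurnCapReached` after exactly `max_turns` iterations.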

Idempotent tool replay

Shipped

A per-run dedup cache replays the cached result when the model repeats an identical successful tool call — no double execution, no wasted spend.

engine.rs:444 (dedup cache)
test: test_duplicate_successful_tool_call_replays_cached_result
KPI: duplicate calls cost 0 re-execution
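One way such a per-run dedup cache could look is sketched below. The `DedupCache` type and its `call` method are hypothetical, and two behaviors are assumptions made for the example: failed calls are never cached (so the model can retry them), and the cache simply stops inserting once it reaches `max_entries`.

```rust
use std::collections::HashMap;

/// Per-run cache keyed by (tool name, serialized arguments).
pub struct DedupCache {
    entries: HashMap<(String, String), String>,
    max_entries: usize,
}

impl DedupCache {
    pub fn new(max_entries: usize) -> Self {
        Self { entries: HashMap::new(), max_entries }
    }

    /// Runs `exec` only when no identical successful call is cached.
    /// Errors are returned to the caller and never stored (assumption).
    pub fn call(
        &mut self,
        tool: &str,
        args: &str,
        exec: impl FnOnce() -> Result<String, String>,
    ) -> Result<String, String> {
        let key = (tool.to_string(), args.to_string());
        if let Some(hit) = self.entries.get(&key) {
            return Ok(hit.clone()); // replay, no second execution
        }
        let out = exec()?;
        // Bounded growth: once full, new results are returned but not cached.
        if self.entries.len() < self.max_entries {
            self.entries.insert(key, out.clone());
        }
        Ok(out)
    }
}
```

Keying on the exact argument string means only byte-identical calls are treated as duplicates; anything looser would need argument normalization, which this sketch does not attempt.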

Self-healing loop control

Shipped

Degenerate loops are detected and bounded (dedup entry caps, turn caps) so a stuck model is contained rather than left to spin.

engine.rs:1927 (per-run state), dedup_max_entries()
test: tests_integration.rs
KPI: bounded entries & turns per run

Tree of Thoughts

Planned

Branching multi-path exploration with backtracking. The current loop is linear ReAct; ToT search is not yet in the codebase.

Roadmap — will ship with a branching-search test

Capability group 02

Memory & context

Context is a budget, not an afterthought. Salsa caps session tokens and compacts on overflow so long runs stay coherent and affordable.

Context compaction & token budget

Shipped

A hard session cap of 8,000 tokens triggers summarization when exceeded, keeping the working context inside the model's effective window.

context/budget.rs:12 (SESSION_CAP = 8_000), :85 enforce_session_budget
test: enforce_session_budget_triggers (budget.rs:119)
KPI: session context held under an 8K token cap
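To make the cap-then-summarize idea concrete, here is a minimal sketch. The constant value and the name `enforce_session_budget` come from the citation above, but the signature, the bytes-per-token estimate, and the "keep the newest half, summarize the rest" policy are assumptions for illustration only. The real budget.rs may work quite differently.

```rust
/// Session cap cited in the text above (`SESSION_CAP = 8_000`).
pub const SESSION_CAP: usize = 8_000;

/// Crude stand-in for a tokenizer: about 4 bytes per token (assumption).
pub fn estimate_tokens(text: &str) -> usize {
    (text.len() + 3) / 4
}

/// Estimated token total for a message history.
pub fn total_tokens(history: &[String]) -> usize {
    history.iter().map(|m| estimate_tokens(m)).sum()
}

/// If the history exceeds `cap`, replaces the oldest messages with a single
/// summary and keeps the newest messages that fit in half the cap.
/// Returns true when compaction happened. The summary is assumed to be short;
/// this sketch does not re-check the total afterwards.
pub fn enforce_session_budget(
    history: &mut Vec<String>,
    cap: usize,
    summarize: impl Fn(&[String]) -> String,
) -> bool {
    if total_tokens(history) <= cap {
        return false;
    }
    // Walk backwards from the newest message, keeping what fits in cap / 2.
    let mut kept = 0;
    let mut tail_tokens = 0;
    for msg in history.iter().rev() {
        let t = estimate_tokens(msg);
        if tail_tokens + t > cap / 2 {
            break;
        }
        tail_tokens += t;
        kept += 1;
    }
    let split = history.len() - kept;
    let summary = summarize(&history[..split]);
    let tail = history.split_off(split);
    history.clear();
    history.push(summary);
    history.extend(tail);
    true
}
```

The `summarize` closure stands in for whatever model call produces the digest; passing it in keeps the budget logic testable without a model.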

Persistent memory (GraphRAG)

Shipped

Cross-session knowledge is stored and retrieved through the GraphRAG store (vector + graph), so the agent carries context between runs.

exec-core GraphRAG stack (desktop feature)
RAG search/ingest integration path
KPI: knowledge survives across sessions

Multi-tier memory (short / mid / long)

Partial

Short-term (session budget) and long-term (GraphRAG) tiers exist. The mid-term rolling-summary tier — a compacted digest between the two — is not yet in place.

budget.rs (short) + GraphRAG (long)
Missing: mid-term summary tier

Prompt / context caching

Planned

Provider-side prompt caching (reusing a cached system+tools prefix across turns) is not yet wired. It's the highest-leverage cost win on the roadmap.

Roadmap — pairs with Progressive Discovery below

Capability group 03

Advanced tool use

Tools run in parallel under a concurrency bound, every call passes a permission check first, and the model can search its own toolbox.

Parallel tool execution

Shipped

Independent tool calls run concurrently under a Semaphore-bounded pool, then join — fast without unbounded fan-out.

tool_executor.rs:394 (Semaphore), :612 (join_all)
test: query_engine concurrent-tool coverage
KPI: bounded concurrency, no unbounded spawn
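The bounded fan-out pattern can be sketched with the standard library alone. Note the substitution: the citation above names an async Semaphore and join_all, whereas this example uses OS threads and a small Mutex/Condvar counting semaphore so it stays dependency-free. `Semaphore`, `Permit`, and `run_bounded` are hypothetical names for this illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built on Mutex + Condvar.
pub struct Semaphore {
    permits: Mutex<usize>,
    freed: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), freed: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.freed.wait(p).unwrap();
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.freed.notify_one();
    }
}

/// Returns its permit when dropped, even if the tool body panics.
struct Permit(Arc<Semaphore>);
impl Drop for Permit {
    fn drop(&mut self) {
        self.0.release();
    }
}

/// Runs each call on a thread, with at most `limit` running at once.
/// Results come back in input order, plus the peak concurrency observed.
pub fn run_bounded<F>(calls: Vec<F>, limit: usize) -> (Vec<String>, usize)
where
    F: FnOnce() -> String + Send + 'static,
{
    let sem = Arc::new(Semaphore::new(limit.max(1)));
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::with_capacity(calls.len());
    for call in calls {
        // Acquire before spawning, so thread creation is bounded too.
        sem.acquire();
        let permit = Permit(Arc::clone(&sem));
        let (active, peak) = (Arc::clone(&active), Arc::clone(&peak));
        handles.push(thread::spawn(move || {
            let _permit = permit;
            let now = active.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(now, Ordering::SeqCst);
            let out = call();
            active.fetch_sub(1, Ordering::SeqCst);
            out
        }));
    }
    // Join every handle: the std-thread analogue of join_all.
    let results = handles
        .into_iter()
        .map(|h| h.join().expect("tool thread panicked"))
        .collect();
    (results, peak.load(Ordering::SeqCst))
}
```

Taking the permit in the dispatching thread, before `thread::spawn`, is the design choice that matters here: it limits how many threads exist, not just how many are doing work.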

Permission-gated tool calls

Shipped

Every tool invocation passes an async permission check returning an explicit decision before it can execute — enforced in code, not prompt text.

query_engine/permission.rs:40 (check trait)
test: permission decision coverage
KPI: 100% of tool calls checked pre-exec
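A gate of this kind might be shaped roughly as follows. This is a synchronous sketch (the text above describes an async check), and `Decision`, `PermissionCheck`, `AllowList`, and `execute_gated` are hypothetical names, not the trait in permission.rs.

```rust
/// Explicit outcome of a permission check.
#[derive(Debug, Clone, PartialEq)]
pub enum Decision {
    Allow,
    Deny(String),
}

/// Anything that can decide whether a tool call may run.
pub trait PermissionCheck {
    fn check(&self, tool: &str, args: &str) -> Decision;
}

/// Example policy: only tools on an explicit list are allowed.
pub struct AllowList(pub Vec<String>);

impl PermissionCheck for AllowList {
    fn check(&self, tool: &str, _args: &str) -> Decision {
        if self.0.iter().any(|t| t == tool) {
            Decision::Allow
        } else {
            Decision::Deny(format!("tool `{}` is not permitted", tool))
        }
    }
}

/// The only path to a tool body: the check runs first, and a Deny
/// returns before `body` is ever invoked.
pub fn execute_gated(
    gate: &dyn PermissionCheck,
    tool: &str,
    args: &str,
    body: impl FnOnce(&str) -> String,
) -> Result<String, String> {
    match gate.check(tool, args) {
        Decision::Allow => Ok(body(args)),
        Decision::Deny(reason) => Err(reason),
    }
}
```

Because the tool body is only reachable through the `Allow` arm, the check cannot be skipped by prompt content; it can only be changed by changing the policy passed in.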

Tool search / discovery

Shipped

The model calls tool_search to find tools by keyword against the full catalog. Each match is revealed, so its native schema is included from the next turn onward. This is the reveal mechanism behind Progressive Discovery, which means tool search now reduces the per-turn payload rather than adding to it.

query_engine/tools/tool_search_tool.rs (all_definitions + revealed_handle().reveal())
test: progressive_discovery_reveal_set (tests_integration.rs)
KPI: matched tools rejoin as native tools without re-sending the full catalog

Progressive Discovery (reveal-set tool-loading)

Shipped

Each turn offers only a small always-on core plus the tool_search meta-tool. When tool_search matches a tool it is revealed for the rest of the conversation and its full schema rejoins subsequent turns as a native provider tool — keeping parallel calls, tool_choice, and the permission gate intact. One code path for Local / Cloud / Desktop / CLI / subagents, gated by SALSA_PROGRESSIVE_TOOLS (default ON).

crates/exec-core/src/query_engine/progressive.rs (filter_progressive + RevealedTools); tool_executor.rs:definitions(); tools/tool_search_tool.rs (reveal)
test: progressive_discovery_reveal_set (tests_integration.rs)
KPI: per-turn tool-schema payload ≤ 60% of the full 43-schema dump; permission gate still covers revealed tools
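The reveal-set mechanism described above can be sketched in a few lines. The names `filter_progressive`, `RevealedTools`, and `reveal` appear in the citation, but the signatures, the `ToolDef` struct, and the substring matching used here are assumptions made for illustration.

```rust
use std::collections::BTreeSet;

/// Hypothetical catalog entry; a real one would carry a full schema.
pub struct ToolDef {
    pub name: &'static str,
    pub description: &'static str,
    pub always_on: bool,
}

/// Tools revealed so far in this conversation.
#[derive(Default)]
pub struct RevealedTools {
    names: BTreeSet<String>,
}

impl RevealedTools {
    pub fn reveal(&mut self, name: &str) {
        self.names.insert(name.to_string());
    }
    pub fn contains(&self, name: &str) -> bool {
        self.names.contains(name)
    }
}

/// Tools offered this turn: the always-on core, the tool_search meta-tool,
/// and anything revealed earlier in the conversation.
pub fn filter_progressive<'a>(
    catalog: &'a [ToolDef],
    revealed: &RevealedTools,
) -> Vec<&'a ToolDef> {
    catalog
        .iter()
        .filter(|t| t.always_on || t.name == "tool_search" || revealed.contains(t.name))
        .collect()
}

/// Keyword search over the full catalog. Every match is revealed,
/// so it is offered as a normal tool on later turns.
pub fn tool_search(
    catalog: &[ToolDef],
    revealed: &mut RevealedTools,
    keyword: &str,
) -> Vec<&'static str> {
    let kw = keyword.to_lowercase();
    let mut hits = Vec::new();
    for t in catalog {
        if t.name.to_lowercase().contains(&kw) || t.description.to_lowercase().contains(&kw) {
            revealed.reveal(t.name);
            hits.push(t.name);
        }
    }
    hits
}
```

The per-turn payload is whatever `filter_progressive` returns, so it grows only with the tools a conversation has actually searched for rather than with the size of the catalog.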

Strict tool contracts

Shipped

Every provider funnels tool calls through one dispatch choke point, so the model's raw arguments are validated against each tool's JSON Schema before the tool body runs. Malformed calls are rejected with a structured error the model can self-correct from — not executed. The gate fails open on a schema we authored wrong and closed on the model's bad args; compiled schemas are cached once per tool (read-mostly, keyed by name) so the parallel fan-out is never serialized. Gated by SALSA_STRICT_TOOL_CONTRACTS (default ON).

crates/exec-core/src/query_engine/tool_contract.rs (reject_if_invalid + validate_cached); tool_executor.rs execute_batch_parallel (dispatch hook)
test: strict_tool_contract_rejects_malformed_args (tests_integration.rs) + 10 tool_contract unit tests
KPI: 0 schema-invalid tool calls reach dispatch; schema compiled once per tool (O(1) amortized on the hot path)
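The fail-open/fail-closed split and the per-tool cache can be illustrated with a deliberately simplified sketch. It does not use JSON Schema: the toy `name:kind` schema notation, the `Value` and `Kind` enums, and the `ContractGate` type are all invented for this example. Only the name `reject_if_invalid` and the behaviors (reject bad arguments, let a broken schema through, compile once per tool) come from the text above. The real cache is described as read-mostly for parallel use; this sketch uses `&mut self` for brevity.

```rust
use std::collections::HashMap;

/// Toy argument value (stand-in for a JSON value).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Str,
    Num,
    Bool,
}

/// A "compiled" schema: each required field and its expected kind.
pub type Compiled = Vec<(String, Kind)>;

/// Compiles a toy schema such as "path:string,limit:number".
/// Returns None when the schema itself is malformed.
pub fn compile(schema: &str) -> Option<Compiled> {
    schema
        .split(',')
        .map(|field| {
            let (name, kind) = field.split_once(':')?;
            let kind = match kind.trim() {
                "string" => Kind::Str,
                "number" => Kind::Num,
                "boolean" => Kind::Bool,
                _ => return None,
            };
            Some((name.trim().to_string(), kind))
        })
        .collect()
}

#[derive(Default)]
pub struct ContractGate {
    /// Keyed by tool name; each schema is compiled at most once.
    cache: HashMap<String, Option<Compiled>>,
}

impl ContractGate {
    /// Ok means the call may be dispatched; Err is a message the model can read.
    pub fn reject_if_invalid(
        &mut self,
        tool: &str,
        schema: &str,
        args: &HashMap<String, Value>,
    ) -> Result<(), String> {
        let compiled = self
            .cache
            .entry(tool.to_string())
            .or_insert_with(|| compile(schema));
        let fields = match compiled {
            Some(fields) => fields,
            None => return Ok(()), // fail open: the schema is our bug, not the model's
        };
        for (name, kind) in fields.iter() {
            let ok = matches!(
                (args.get(name), kind),
                (Some(Value::Str(_)), Kind::Str)
                    | (Some(Value::Num(_)), Kind::Num)
                    | (Some(Value::Bool(_)), Kind::Bool)
            );
            if !ok {
                // Fail closed: missing or mistyped arguments never reach the tool.
                return Err(format!(
                    "invalid_arguments: tool `{}` requires `{}` of kind {:?}",
                    tool, name, kind
                ));
            }
        }
        Ok(())
    }
}
```

Caching the `Option` rather than only successful compiles means a malformed schema is also examined just once, after which the gate keeps failing open for that tool.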

Capability group 04

Multi-agent & verification

An orchestrator delegates to workers, a supervisor judges the output, and the patent-pending DSP protocol cryptographically verifies every action.

Orchestrator → workers

Shipped

The agent can spawn sub-agents as tools, delegating scoped subtasks and collecting their results — the orchestrator-workers pattern.

query_engine/tools/agent_tool.rs
test: agent-spawn foreground coverage
KPI: scoped delegation with collected results

LLM-as-judge (supervisor)

Shipped

A supervisor reviews plan-step output and gates progression — a second model judging the first, in code.

plan_executor/supervisor.rs; issues/dispatch/qa_gate.rs
test: supervisor review coverage
KPI: steps gated on judged quality

DSP verification gates

Shipped

Every agent action is encoded as a signed DSP frame and passes five verification gates (magic, version, codebook, epoch, Ed25519 signature) before execution.

dsp-protocol crate (patent pending GA-2026-DSP-002)
test: DSP frame verify coverage
KPI: 5-gate verify in ~30µs

Consensus / multi-model review

Partial

A consensus loop exists for cross-checking model output. Broad multi-model voting across the full agent surface is not yet the default.

services/consensus_loop.rs
Missing: default-on multi-model voting

Capability group 05

Routing, safety & observability

The right model for the job, a sandbox that fails closed, and enough telemetry to prove what happened.

Smart routing & cost tiering

Shipped

A SmartRouter scores task complexity and picks a model tier; when a task type shows a >30% escalation rate, it auto-promotes to a stronger tier.

router/engine.rs:184 (SmartRouter), :128 (escalation > 0.3)
test: router escalation coverage
KPI: cheap tier by default, escalate on evidence
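The escalation rule can be sketched as a small amount of bookkeeping per task type. The 0.3 threshold is taken from the citation above; everything else is assumed for illustration, including the `Router` and `Tier` types, the two-tier model, and the 0.7 complexity cutoff. This is not the SmartRouter in router/engine.rs.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Tier {
    Cheap,
    Strong,
}

/// Threshold from the text above: promote when the rate exceeds 30%.
pub const ESCALATION_THRESHOLD: f64 = 0.3;

/// Hypothetical complexity cutoff for going straight to the strong tier.
pub const COMPLEXITY_CUTOFF: f64 = 0.7;

#[derive(Default)]
pub struct Router {
    /// task type -> (total runs, runs that had to escalate)
    stats: HashMap<String, (u32, u32)>,
}

impl Router {
    /// Records one finished run and whether it needed a stronger model.
    pub fn record(&mut self, task_type: &str, escalated: bool) {
        let s = self.stats.entry(task_type.to_string()).or_insert((0, 0));
        s.0 += 1;
        if escalated {
            s.1 += 1;
        }
    }

    /// Fraction of recorded runs that escalated; 0.0 with no history.
    pub fn escalation_rate(&self, task_type: &str) -> f64 {
        match self.stats.get(task_type) {
            Some(&(runs, esc)) if runs > 0 => f64::from(esc) / f64::from(runs),
            _ => 0.0,
        }
    }

    /// Cheap by default; strong for complex tasks or task types whose
    /// history shows the cheap tier keeps failing.
    pub fn pick(&self, task_type: &str, complexity: f64) -> Tier {
        if complexity >= COMPLEXITY_CUTOFF
            || self.escalation_rate(task_type) > ESCALATION_THRESHOLD
        {
            Tier::Strong
        } else {
            Tier::Cheap
        }
    }
}
```

For example, a task type with 2 escalations in 10 runs (20%) stays on the cheap tier, while 4 in 12 (about 33%) crosses the threshold and is promoted.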

VM sandbox isolation

Shipped

Autonomous actions execute in an isolated VM reachable only over VSock 4100. If the sandbox is down, the task fails — it never falls back to the host.

atlas-core sandbox; vsock-server (port 4100)
test: sandbox fail-closed coverage
KPI: 0 host-side fallback executions

Meta-prompt optimization (APO)

Shipped

Incoming prompts are optimized/rewritten before dispatch, improving instruction quality without the user rewriting anything.

routes/chat/query_engine_stream/mod.rs (meta-prompt)
test: prompt-optimization coverage
KPI: normalized prompts before model call

Deep observability (OpenTelemetry)

Partial

Structured tracing spans and Prometheus metrics are in place. A full OpenTelemetry export pipeline (traces to an external collector) is not yet wired.

tracing + prometheus across daemon
Missing: OTel SDK export

Eval harness & outcome tracking

Partial

Router outcomes and QA gates feed back into routing decisions. A comprehensive, standalone offline eval suite over golden trajectories is still growing.

router outcome feedback; dispatch/qa_gate.rs
Missing: full golden-trajectory eval set

Trajectory reduction

Planned

Compressing long tool-call histories into a minimal replayable trace (beyond the session summary) is on the roadmap for very long autonomous runs.

Roadmap — extends context compaction

The full ledger

Every capability, one table

The single source of truth. When a row's state changes, this table changes with it.

Capability	State	Source	Measured by
ReAct reasoning loop	Shipped	`engine.rs:110`	Bounded turn cap per run
Idempotent tool replay	Shipped	`engine.rs:444`	Duplicate calls = 0 re-exec
Self-healing loop control	Shipped	`engine.rs:1927`	Bounded entries & turns
Context compaction / budget	Shipped	`budget.rs:12`	8K session token cap
Persistent memory (GraphRAG)	Shipped	exec-core GraphRAG	Knowledge across sessions
Parallel tool execution	Shipped	`tool_executor.rs:394`	Bounded concurrency
Permission-gated tools	Shipped	`permission.rs:40`	100% checked pre-exec
Orchestrator → workers	Shipped	`agent_tool.rs`	Scoped delegation
LLM-as-judge (supervisor)	Shipped	`plan_executor/supervisor.rs`	Steps gated on quality
DSP verification gates	Shipped	dsp-protocol crate	5-gate verify ~30µs
Smart routing & cost tiering	Shipped	`router/engine.rs:184`	Escalate on >30% rate
Meta-prompt optimization	Shipped	`query_engine_stream/mod.rs`	Normalized prompts
VM sandbox isolation	Shipped	atlas-core; vsock-server	0 host-side fallback exec
Tool search / discovery	Shipped	`tool_search_tool.rs`	Matched tools revealed as native
Progressive Discovery tool-loading	Shipped	`progressive.rs`; `tool_executor.rs`	Per-turn schema payload ≤ 60%
Multi-tier memory	Partial	budget.rs + GraphRAG	Mid-term tier missing
Strict tool contracts	Shipped	`tool_contract.rs`; `tool_executor.rs`	0 schema-invalid calls reach dispatch
Consensus / multi-model review	Partial	`services/consensus_loop.rs`	Not default-on
Deep observability (OTel)	Partial	tracing + prometheus	OTel export missing
Eval harness / outcome tracking	Partial	router feedback; qa_gate.rs	Golden-set incomplete
Tree of Thoughts	Planned	—	Branching search test
Prompt / context caching	Planned	—	Cached prefix reuse
Trajectory reduction	Planned	—	Minimal replayable trace

Why this page exists

Proof mints trust.

Anyone can list the techniques the frontier labs use and claim to have them. Salsa's promise is different: you should stop checking our work only once you can verify it yourself. So this page never runs ahead of the code.

Every Shipped row points at a file you could open and a test that would fail if the capability regressed. Every Partial row names exactly what's missing. Every Planned row admits the gap. That's the deal — and it's also the codebase's north star: close the Partials, ship the Planned, and this ledger fills in.

No checkmark without a test

No claim past the code's real state

Every gap named, not hidden

The page updates when work lands

An agent you can audit.

Request access to the Salsa beta and hold the harness to its word.

Request beta access
