v2.0 — Self-improving AI
Open Aya uses a CAISI-inspired evaluation framework to measure capability, cost, latency, auditability, and workflow lift across baseline models, the Aya Pipeline, and reasoner routes. The goal is not to claim AGI; the goal is to prove whether an AI operating layer completes organizational work better than fragmented AI tools.
28 integrated apps. Voice-native, vision-aware, local-first with optional cloud sync. Multi-agent routing across planner, executor, memory, verifier, and critic strategies — every step auditable through a public benchmark harness, not a brochure.
Open Aya OS
Live system facts. Every number on this card is generated at request time from the runtime registry or the public eval database — there is no separate marketing source to drift.
Model layer
Agent layer
6 routed strategies: planner, executor, memory_retriever, verifier, router, self_critic
Strategy-Auction routing implemented as system-prompt routing rules
Memory layer
Supabase + browser IndexedDB (local-first)
Kinds: short-term turn cache · long-term Auto-Dream consolidation · GraphRAG knowledge edges
Tool layer
6 built-in tools across 28 apps
Web search · Code execution (Code Lab) · File store (Spatial Files) · Calendar / Notes / Word Processor · …
Local-first status
Yes — runs in-browser; data stays on device by default
Cloud sync status
Optional — Supabase auth + persistence when signed in
Apps in registry
28
Generated from lib/app-registry.ts
Routed agents
6
Strategy-auction policies, system-prompt routed
Eval score (avg)
—
Across 0 completed runs
Last eval run
no runs yet
UTC server time
Avg latency / task
—
Wall-clock, includes network hop
Audit mode
Public — every eval result writes a reasoning trace to /api/aya/inspect and aggregates to /api/aya/audit
A/B comparison — pass rate by route
baseline
—
Claude Sonnet 4.6, no spine (control)
aya_pipeline
—
Claude Sonnet 4.6 + 7-stage cognitive spine
aya_reasoner
—
Claude Opus 4.6, extended thinking (10k)
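The per-route pass rates above aggregate the `route` and `passed` fields that every receipt carries. A minimal sketch of that aggregation in TypeScript, assuming a flat list of receipt rows (the row type and function name are illustrative, not the harness's actual code):

```typescript
// Illustrative row shape: just the two receipt fields the A/B chart needs.
interface EvalRow {
  route: "baseline" | "aya_pipeline" | "aya_reasoner";
  passed: boolean;
}

// Compute pass rate per route from completed eval rows.
function passRateByRoute(rows: EvalRow[]): Record<string, number> {
  const totals: Record<string, { passed: number; total: number }> = {};
  for (const { route, passed } of rows) {
    totals[route] ??= { passed: 0, total: 0 };
    totals[route].total += 1;
    if (passed) totals[route].passed += 1;
  }
  const rates: Record<string, number> = {};
  for (const [route, { passed, total }] of Object.entries(totals)) {
    rates[route] = passed / total;
  }
  return rates;
}
```

Routes with zero completed runs simply don't appear in the result, which matches the "—" placeholders shown before any runs exist.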
Pass rates compare three routes across all completed runs: the Claude Sonnet 4.6 baseline, Aya's 7-stage cognitive pipeline on anthropic/claude-sonnet-4.6, and the anthropic/claude-opus-4.6 reasoner (extended thinking, 10k token budget). The baseline runs the same conversation tier without the cognitive spine, so the A/B delta isolates the wrapping, not a model upgrade.

Open Aya OS — Public Eval Harness
Seven public tests. One canonical JSON shape. Independent judge models for open-ended scoring. A/B comparison against a raw baseline. Every claim on this site links back to a row in the public eval database — and you can rerun any of them from this page.
Each link below is a real GET request that runs the test end-to-end and returns the canonical receipt JSON. No auth, no rate-limit gates beyond a 60-second per-task timeout.
Abstract pattern induction over small input/output grid pairs. Tests skill-acquisition and generalization, not memorized priors.
GET /api/evaluate?task=arc.reflect_h.1

Open-ended dialogue judged on coherence, ambiguity-handling, and personality consistency. An independent judge model (gemini-3-flash) scores responses against a public rubric.
GET /api/evaluate?task=turing.trolley_empty_track

OS-command grounding: given a natural-language request, return the canonical tool / app the runtime should invoke.
GET /api/evaluate?task=os.find_app.1

Cross-turn recall: a fact is stated, the model is asked unrelated questions, then must retrieve the original fact precisely.
GET /api/evaluate?task=mem.recall.1

Wall-clock latency for a tier_1 task across all three routes (baseline / aya_pipeline / aya_reasoner). Reported alongside the answer in every receipt.
GET /api/evaluate?task=os.find_app.1&route=baseline

Verifies which strategies the pipeline invoked: planner, executor, memory_retriever, verifier, self_critic. Returned as agents_used in the receipt.
GET /api/evaluate?task=plan.recipe.1

Confirms every receipt includes a reasoning trace fetchable via /api/aya/inspect. Pass ?trace=full to inline the trace.
GET /api/evaluate?task=plan.recipe.1&trace=full

Every endpoint above returns this exact JSON. Stable. Versioned. Greppable.
{
"task_id": "plan.recipe.1",
"category": "multi_step_reasoning",
"answer": "Step 1 ... Step 2 ... FINAL ANSWER: ...",
"agents_used": [
"planner",
"executor",
"verifier"
],
"confidence": 0.82,
"latency_ms": 1380,
"cost_estimate": 0.0041,
"memory_used": false,
"audit_trace": true,
"route": "aya_pipeline",
"passed": true,
"score": 0.95,
"inspect_url": "/api/aya/inspect?result_id=..."
}

Type any prompt. Pick a route. We post it to /api/evaluate and stream back the receipt. No login.
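The request shape is just query parameters on /api/evaluate, as the examples above show (task, optional route, optional trace). A minimal sketch of building those URLs; the helper name is illustrative:

```typescript
// Build an /api/evaluate URL from the query parameters used above.
function buildEvalUrl(
  task: string,
  opts: { route?: "baseline" | "aya_pipeline" | "aya_reasoner"; trace?: "full" } = {},
): string {
  const params = new URLSearchParams({ task });
  if (opts.route) params.set("route", opts.route);
  if (opts.trace) params.set("trace", opts.trace);
  return `/api/evaluate?${params.toString()}`;
}

// buildEvalUrl("os.find_app.1", { route: "baseline" })
//   → "/api/evaluate?task=os.find_app.1&route=baseline"
```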
cost_estimate in USD.
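For consumers scripting against the harness, the canonical receipt can be written down as a TypeScript type with a structural check before use. Field names mirror the JSON above; the type and guard names are illustrative, not exported by the harness:

```typescript
// The canonical receipt shape, transcribed from the JSON above.
interface EvalReceipt {
  task_id: string;
  category: string;
  answer: string;
  agents_used: string[];
  confidence: number;
  latency_ms: number;
  cost_estimate: number; // USD
  memory_used: boolean;
  audit_trace: boolean;
  route: "baseline" | "aya_pipeline" | "aya_reasoner";
  passed: boolean;
  score: number;
  inspect_url: string;
}

// Structural guard: checks a handful of discriminating fields.
function isEvalReceipt(x: unknown): x is EvalReceipt {
  if (typeof x !== "object" || x === null) return false;
  const r = x as Record<string, unknown>;
  return (
    typeof r.task_id === "string" &&
    typeof r.answer === "string" &&
    Array.isArray(r.agents_used) &&
    typeof r.latency_ms === "number" &&
    typeof r.passed === "boolean" &&
    typeof r.score === "number"
  );
}
```

Guarding on a few discriminating fields rather than all thirteen keeps the check tolerant of additive, versioned changes to the receipt.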