v2.0 — Self-improving AI

Open Aya OS — the agentic, in-browser cognitive operating system.

Open Aya uses a CAISI-inspired evaluation framework to measure capability, cost, latency, auditability, and workflow lift across baseline models, the Aya Pipeline, and reasoner routes. The goal is not to claim AGI; the goal is to prove whether an AI operating layer completes organizational work better than fragmented AI tools.

29 integrated apps. Voice-native, vision-aware, local-first with optional cloud sync. Multi-agent routing across six strategies: planner, executor, memory retriever, verifier, router, and self-critic. Every step is auditable through a public benchmark harness, not a brochure.

Open Aya OS

Intelligence Card

Live system facts. Every number on this card is generated at request time from the runtime registry or the public eval database — there is no separate marketing source to drift.


Model layer

anthropic/claude-sonnet-4.6 (conversation tier)
anthropic/claude-opus-4.6 (extended thinking, 10k budget)
google/gemini-3-flash (multimodal)
anthropic/claude-opus-4.6 (SWE-Bench leader)

Agent layer

6 routed strategies: planner, executor, memory_retriever, verifier, router, self_critic

Strategy-Auction routing implemented as system-prompt routing rules

Memory layer

Supabase + browser IndexedDB (local-first)

Kinds: short-term turn cache · long-term Auto-Dream consolidation · GraphRAG knowledge edges

Tool layer

6 built-in tools across 29 apps

Web search · Code execution (Code Lab) · File store (Spatial Files) · Calendar / Notes / Word Processor · …

Local-first status

Yes — runs in-browser; data stays on device by default

Cloud sync status

Optional — Supabase auth + persistence when signed in

Apps in registry

Generated from lib/app-registry.ts

Routed agents

Strategy-auction policies, system-prompt routed

Eval score (avg)

51.4%

Across 21 completed runs

Last eval run

Jun 28, 2026, 04:00 AM

UTC server time

Avg latency / task

2,236 ms

Wall-clock, includes network hop

Audit mode

Public — every eval result writes a reasoning trace to /api/aya/inspect and aggregates to /api/aya/audit

A/B comparison — pass rate by route

baseline

—

Claude Sonnet 4.6, no spine (control)

aya_pipeline

—

Claude Sonnet 4.6 + 7-stage cognitive spine

aya_reasoner

—

Claude Opus 4.6, extended thinking (10k)

What you can verify, right now, without an account.

Public eval API. /api/evaluate accepts a prompt and returns the canonical result shape (task_id, category, answer, agents_used, confidence, latency_ms, cost_estimate, memory_used, audit_trace).
Public status JSON. /api/aya/status lists every capability flag with an honest functional / claimed marker — no inference required.
Public audit aggregates. /api/aya/audit publishes the A/B verdict between baseline Claude Sonnet 4.6, Aya's 7-stage cognitive pipeline on anthropic/claude-sonnet-4.6, and the anthropic/claude-opus-4.6 reasoner (extended thinking, 10k budget) across all completed runs.
Three live demos. Reasoning, memory, and agent routing run a canned task end-to-end and show the full reasoning trace.
No claim without a receipt. Every superlative on this site links to a reproducible run with a JSON trace. Where data isn't available yet, we say so plainly instead of rounding up.
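
The result shape listed above can be written down as a type so a client can check a response before trusting it. This is a minimal sketch: the field names come from the list above, but every field type is an assumption, and `missingResultFields` and `isEvalResult` are hypothetical helpers, not part of the published API. The request format of /api/evaluate is not described here, so the sketch only covers checking a parsed response.

```typescript
// Field names are taken from the page; the types are assumptions.
interface EvalResult {
  task_id: string;
  category: string;
  answer: string;
  agents_used: string[];
  confidence: number;
  latency_ms: number;
  cost_estimate: number;
  memory_used: boolean;
  audit_trace: unknown; // trace structure is not documented on this page
}

const RESULT_FIELDS = [
  "task_id", "category", "answer", "agents_used", "confidence",
  "latency_ms", "cost_estimate", "memory_used", "audit_trace",
] as const;

// Hypothetical helper: names of documented fields missing from a parsed response.
function missingResultFields(payload: unknown): string[] {
  if (typeof payload !== "object" || payload === null) {
    return [...RESULT_FIELDS];
  }
  const record = payload as Record<string, unknown>;
  return RESULT_FIELDS.filter((field) => !(field in record));
}

// Hypothetical type guard: checks key presence only, not the value types.
function isEvalResult(payload: unknown): payload is EvalResult {
  return missingResultFields(payload).length === 0;
}
```

A client would parse the JSON body, call `isEvalResult` on it, and report `missingResultFields` if the check fails.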

What we are not yet, and how you'll know when we are.

Open Aya OS is not AGI and does not claim to be. ARC-AGI alignment refers to architecture (multi-strategy reasoning, verifier loops, cost-per-task accounting) — not to a published score.
The strategy auction is currently implemented as deterministic system-prompt routing rules, not as six independent learned policies. The /eval harness measures the lift this routing actually provides over a Claude Sonnet 4.6 baseline running the same conversation tier without the cognitive spine — so the A/B delta isolates the wrapping, not a model upgrade.
“Self-improving” refers to per-user memory consolidation (Auto-Dream) and TinyAdapter parameter drift, not to weight updates of the underlying base model.
Pass rates on /receipts are computed from real, persisted eval runs. If a tier shows “no data”, no run of that tier has completed yet.
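
To make "deterministic routing rules" concrete, here is a minimal sketch of first-match rule routing over the six strategy names. The keyword patterns, the rule order, and the `executor` fallback are all assumptions for illustration. The real rules live in system prompts and are not published on this page.

```typescript
type Strategy =
  | "planner" | "executor" | "memory_retriever"
  | "verifier" | "router" | "self_critic";

interface RoutingRule {
  strategy: Strategy;
  matches: (task: string) => boolean;
}

// Hypothetical rules, evaluated top to bottom. The same input always
// yields the same strategy, which is what makes the routing deterministic.
const RULES: RoutingRule[] = [
  { strategy: "memory_retriever", matches: (t) => /\b(remember|recall|last time)\b/i.test(t) },
  { strategy: "verifier", matches: (t) => /\b(verify|check|prove)\b/i.test(t) },
  { strategy: "planner", matches: (t) => /\b(plan|schedule|steps)\b/i.test(t) },
];

// First matching rule wins; the fallback strategy is an assumption.
function route(task: string): Strategy {
  for (const rule of RULES) {
    if (rule.matches(task)) {
      return rule.strategy;
    }
  }
  return "executor";
}
```

Nothing in this sketch is learned or auctioned. That is the distinction the paragraph above draws between rule-based routing and six independently trained policies.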

Open Aya OS, scored against itself.

Every architectural claim Aya makes lives on this page as a number, with a reproducible run behind it. ARC-style reasoning, Turing-style conversation, functional-OS commands — all A/B-compared against the raw base model. No marketing copy, no vibes. Receipts.

A/B: does the wrapping help?

The same task corpus, run three ways. Baseline is Claude Sonnet 4.6 with a generic system prompt — the control runs the same conversation tier as the pipeline so the A/B delta isolates exactly what the cognitive wrapping adds, not a model upgrade. Aya Pipeline runs anthropic/claude-sonnet-4.6 with the seven-stage Cognitive Spine. Aya Reasoner routes through anthropic/claude-opus-4.6 with extended thinking enabled (10k token budget). If the wrapping is real, you should see lift over the baseline.

Route	Pass rate	Latency	Tokens
baseline (Claude Sonnet 4.6)	—	—	—
aya_pipeline (anthropic/claude-sonnet-4.6)	—	—	—
aya_reasoner (anthropic/claude-opus-4.6)	—	—	—
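
The lift in question is a difference in pass rates between a route and the baseline on the same task corpus. A minimal sketch, assuming pass rate is passed tasks divided by scored tasks and lift is reported in percentage points; `RouteStats`, `passRate`, and `liftOverBaseline` are hypothetical names, not the harness's own.

```typescript
// Hypothetical per-route tally.
interface RouteStats {
  passed: number; // tasks scored as a pass
  scored: number; // tasks that received a score
}

// Returns null when a route has no scored tasks, matching the "no data" case.
function passRate(stats: RouteStats): number | null {
  return stats.scored === 0 ? null : stats.passed / stats.scored;
}

// Lift of a candidate route over the baseline, in percentage points.
// Null when either side has no data, so an empty route never reads as zero lift.
function liftOverBaseline(baseline: RouteStats, candidate: RouteStats): number | null {
  const base = passRate(baseline);
  const cand = passRate(candidate);
  if (base === null || cand === null) {
    return null;
  }
  return (cand - base) * 100;
}
```

Returning null rather than 0 for an empty route keeps a missing run from being mistaken for a measured result.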

Time-horizons

Route	50% horizon	80% horizon	Scored attempts
baseline (Claude Sonnet 4.6)	—
aya_pipeline (anthropic/claude-sonnet-4.6)	—
aya_reasoner (anthropic/claude-opus-4.6)	—

By eval tier

Tier 1

Functional OS

—

no runs yet

Voice/text commands that should produce an OS effect (open app, start timer, dictate note).

Tier 2

Cognitive Agent

—

no runs yet

Memory, planning, multi-step reasoning across the assistant's tools and skills.

Tier 3

ARC-Style Reasoning

—

no runs yet

Few-shot grid puzzles, abstract pattern induction, novel-rule generalization.

Turing

Conversation Quality

—

no runs yet

Trick questions, false-memory traps, irony, multi-turn coherence — judged by an independent model.

Mixed

End-to-End

What we measured. What we did not.

The point of receipts is to be honest about scope. Here is what this page can and cannot tell you about Aya right now.

What the receipts prove

Aya routes anthropic/claude-sonnet-4.6 for conversation and anthropic/claude-opus-4.6 for extended-thinking reasoning; the eval baseline runs Claude Sonnet 4.6 without the cognitive spine as the fixed control.
Every score has a reproducible run id and a full reasoning trace stored in Supabase.
A/B comparison against the unwrapped base model is real and ongoing.
Scoring is automated (exact_match, contains, numeric, grid_match) or judged by an independent google/gemini-3-flash judge with a published rubric.
Latency and token cost of every step are recorded, including the cost-of-thought premium the wrapping adds.
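
The four automated scorer names above suggest simple comparison functions. The following is a sketch of what each might do, not the harness's actual code: the trimming, case-insensitive matching, numeric tolerance, and the grid representation as `number[][]` are all assumptions.

```typescript
type Grid = number[][];

// exact_match: equal after trimming surrounding whitespace (assumed).
function exactMatch(answer: string, expected: string): boolean {
  return answer.trim() === expected.trim();
}

// contains: expected text appears anywhere, ignoring case (assumed).
function contains(answer: string, expected: string): boolean {
  return answer.toLowerCase().includes(expected.toLowerCase());
}

// numeric: parsed value is within a small tolerance of the target (assumed).
function numeric(answer: string, expected: number, tolerance = 1e-6): boolean {
  const value = Number.parseFloat(answer);
  return Number.isFinite(value) && Math.abs(value - expected) <= tolerance;
}

// grid_match: same dimensions and every cell identical.
function gridMatch(answer: Grid, expected: Grid): boolean {
  if (answer.length !== expected.length) {
    return false;
  }
  return answer.every((row, r) => {
    const other = expected[r];
    return (
      other !== undefined &&
      row.length === other.length &&
      row.every((cell, c) => cell === other[c])
    );
  });
}
```

Each function returns a plain boolean, so a pass rate is the share of tasks where the chosen scorer returns true.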

What the receipts do NOT prove

The 6-agent Strategy Auction is currently implemented as system-prompt routing across the same base model, not 6 independently trained policies.
Cross-device persistence ships when production Supabase auth ships; local-first is the default today.
The ARC-AGI ceiling tracks the underlying reasoner (anthropic/claude-opus-4.6 with extended thinking, which has published low-single-digit performance on ARC-AGI-2). The wrapping does not change that ceiling; it is a UX layer, not a reasoning multiplier.
