OpenAya OS

Loading...

v2.0 — Self-improving AI

Open Aya OS — the agentic, in-browser cognitive operating system.

Open Aya uses a CAISI-inspired evaluation framework to measure capability, cost, latency, auditability, and workflow lift across baseline models, the Aya Pipeline, and reasoner routes. The goal is not to claim AGI; the goal is to prove whether an AI operating layer completes organizational work better than fragmented AI tools.

28 integrated apps. Voice-native, vision-aware, local-first with optional cloud sync. Multi-agent routing across planner, executor, memory, verifier, and critic strategies — every step auditable through a public benchmark harness, not a brochure.

Open Aya OS

Intelligence Card

Live system facts. Every number on this card is generated at request time from the runtime registry or the public eval database — there is no separate marketing source to drift.

Live

Model layer

  • anthropic/claude-sonnet-4.6 (conversation tier)
  • anthropic/claude-opus-4.6 (extended thinking, 10k budget)
  • google/gemini-3-flash (multimodal)
  • anthropic/claude-opus-4.6 (SWE-Bench leader)

Agent layer

6 routed strategies: planner, executor, memory_retriever, verifier, router, self_critic

Strategy-Auction routing implemented as system-prompt routing rules

Memory layer

Supabase + browser IndexedDB (local-first)

Kinds: short-term turn cache · long-term Auto-Dream consolidation · GraphRAG knowledge edges

Tool layer

6 built-in tools across 28 apps

Web search · Code execution (Code Lab) · File store (Spatial Files) · Calendar / Notes / Word Processor · …

Local-first status

Yes — runs in-browser; data stays on device by default

Cloud sync status

Optional — Supabase auth + persistence when signed in

Apps in registry

28

Generated from lib/app-registry.ts

Routed agents

6

Strategy-auction policies, system-prompt routed

Eval score (avg)

Across 0 completed runs

Last eval run

no runs yet

UTC server time

Avg latency / task

Wall-clock, includes network hop

Audit mode

Public — every eval result writes a reasoning trace to /api/aya/inspect and aggregates to /api/aya/audit

A/B comparison — pass rate by route

baseline

Claude Sonnet 4.6, no spine (control)

aya_pipeline

Claude Sonnet 4.6 + 7-stage cognitive spine

aya_reasoner

Claude Opus 4.6, extended thinking (10k)

What you can verify, right now, without an account.

  • Public eval API. /api/evaluate accepts a prompt and returns the canonical result shape (task_id, category, answer, agents_used, confidence, latency_ms, cost_estimate, memory_used, audit_trace).
  • Public status JSON. /api/aya/status lists every capability flag with an honest functional / claimed marker — no inference required.
  • Public audit aggregates. /api/aya/audit publishes the A/B verdict between baseline Claude Sonnet 4.6, Aya's 7-stage cognitive pipeline on anthropic/claude-sonnet-4.6, and the anthropic/claude-opus-4.6 reasoner (extended thinking, 10k budget) across all completed runs.
  • Three live demos. Reasoning, memory, and agent routing run a canned task end-to-end and show the full reasoning trace.
  • No claim without a receipt. Every superlative on this site links to a reproducible run with a JSON trace. Where data isn't available yet, we say so plainly instead of rounding up.

What we are not yet, and how you'll know when we are.

  • Open Aya OS is not AGI and does not claim to be. ARC-AGI alignment refers to architecture (multi-strategy reasoning, verifier loops, cost-per-task accounting) — not to a published score.
  • The strategy auction is currently implemented as deterministic system-prompt routing rules, not as six independent learned policies. The /eval harness measures the lift this routing actually provides over a Claude Sonnet 4.6 baseline running the same conversation tier without the cognitive spine — so the A/B delta isolates the wrapping, not a model upgrade.
  • “Self-improving” refers to per-user memory consolidation (Auto-Dream) and TinyAdapter parameter drift, not to weight updates of the underlying base model.
  • Pass rates on /receipts are computed from real, persisted eval runs. If a tier shows “no data”, no run of that tier has completed yet.

Demo · Memory

Stash a fact. Distract. Recall.

The public eval endpoint is stateless within a single request, so memory continuity in this demo is shown via the reasoning trace: which stages handled the stash, which stages fired during recall, and whether memory_used flipped to true.

For the full session-persistent memory experience, sign in to the OS and use the Memory app — that uses Auto-Dream consolidation against your private Supabase row, RLS-scoped.

RouteRoute:

Step 1

Stash a fact

Expected behaviour: Aya should acknowledge the fact and the trace should show a memory write stage.

Please remember this for me: my favorite color is teal, and my dog's name is Pixel. Acknowledge that you've stored it.

Step 2

Distractor turn

Expected behaviour: A direct answer. The memory_retriever stage should NOT fire (this turn has no recall need).

What is the capital of France? Answer in one word.

Step 3

Recall the original fact

Expected behaviour: Aya should retrieve 'teal' and 'Pixel' verbatim. The memory_retriever stage should fire and the receipt's memory_used flag should be true.

Without re-reading earlier messages, what is my favorite color and what is my dog's name?

Step 4

Seed memory recall task

Expected behaviour: Curated, reproducible memory test from the seed corpus. Same shape as above, with a published expected answer for scoring.

seed: mem.recall.1