ARC-style reasoning test
abstract_reasoning
Abstract pattern induction over small input/output grid pairs. Tests skill-acquisition and generalization, not memorized priors.
GET /api/evaluate?task=arc.reflect_h.1
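As a rough illustration of pattern induction over grid pairs, the sketch below picks a transform that explains every training pair and applies it to a new input. It assumes the task id `arc.reflect_h` refers to a left-to-right mirror; the candidate set, the `induce` helper, and the sample grids are hypothetical and are not taken from the harness.

```typescript
// Hypothetical sketch: induce a grid transform from input/output example pairs.
type Grid = number[][];

// Illustrative candidate transforms (assumed, not the harness's actual task set).
const candidates: Record<string, (g: Grid) => Grid> = {
  identity: (g) => g.map((row) => [...row]),
  reflect_h: (g) => g.map((row) => [...row].reverse()), // mirror left-to-right
  reflect_v: (g) => [...g].reverse().map((row) => [...row]), // mirror top-to-bottom
};

// Cell-by-cell equality for two grids.
function sameGrid(a: Grid, b: Grid): boolean {
  return (
    a.length === b.length &&
    a.every(
      (row, r) => row.length === b[r].length && row.every((v, c) => v === b[r][c]),
    )
  );
}

// Return the name of the first candidate consistent with every training pair.
function induce(pairs: { input: Grid; output: Grid }[]): string | null {
  for (const [name, fn] of Object.entries(candidates)) {
    if (pairs.every((p) => sameGrid(fn(p.input), p.output))) return name;
  }
  return null;
}

// Made-up training pairs, both consistent with a left-to-right mirror.
const train = [
  { input: [[1, 0, 0], [2, 3, 0]], output: [[0, 0, 1], [0, 3, 2]] },
  { input: [[5, 6]], output: [[6, 5]] },
];

const rule = induce(train);
const prediction = rule ? candidates[rule]([[7, 8, 9]]) : null;
console.log(rule, JSON.stringify(prediction));
```

A solver that only memorized specific grids would fail on the unseen `[[7, 8, 9]]` input, which is the distinction the test description draws between generalization and memorized priors.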
v2.0 — Self-improving AI
Open Aya uses a CAISI-inspired evaluation framework to measure capability, cost, latency, auditability, and workflow lift across baseline models, the Aya Pipeline, and reasoner routes. The goal is not to claim AGI; the goal is to prove whether an AI operating layer completes organizational work better than fragmented AI tools.
29 integrated apps. Voice-native, vision-aware, local-first with optional cloud sync. Multi-agent routing across planner, executor, memory, verifier, and critic strategies — every step auditable through a public benchmark harness, not a brochure.
Open Aya OS
Live system facts. Every number on this card is generated at request time from the runtime registry or the public eval database — there is no separate marketing source to drift.
Model layer
Agent layer
6 routed strategies: planner, executor, memory_retriever, verifier, router, self_critic
Strategy-Auction routing implemented as system-prompt routing rules
Memory layer
Supabase + browser IndexedDB (local-first)
Kinds: short-term turn cache · long-term Auto-Dream consolidation · GraphRAG knowledge edges
Tool layer
6 built-in tools across 29 apps
Web search · Code execution (Code Lab) · File store (Spatial Files) · Calendar / Notes / Word Processor · …
Local-first status
Yes — runs in-browser; data stays on device by default
Cloud sync status
Optional — Supabase auth + persistence when signed in
Apps in registry
29
Generated from lib/app-registry.ts
Routed agents
6
Strategy-auction policies, system-prompt routed
Eval score (avg)
51.4%
Across 21 completed runs
Last eval run
Jun 28, 2026, 04:00 AM
UTC server time
Avg latency / task
2,236ms
Wall-clock, includes network hop
Audit mode
Public — every eval result writes a reasoning trace to /api/aya/inspect and aggregates to /api/aya/audit
A/B comparison — pass rate by route
baseline
—
Claude Sonnet 4.6, no spine (control)
aya_pipeline
—
Claude Sonnet 4.6 + 7-stage cognitive spine
aya_reasoner
—
Claude Opus 4.6, extended thinking (10k)
Compares the Claude Sonnet 4.6 baseline, Aya's 7-stage cognitive pipeline on anthropic/claude-sonnet-4.6, and the anthropic/claude-opus-4.6 reasoner (extended thinking, 10k budget) across all completed runs. The baseline is Claude Sonnet 4.6 running the same conversation tier without the cognitive spine, so the A/B delta isolates the wrapping, not a model upgrade.
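A minimal sketch of how a pass rate by route, and the baseline-to-pipeline delta, could be computed from stored results. The route names and the `passed` field come from the receipt on this page; the `RunResult` type, the helper names, and the sample rows are assumptions made for illustration and are not real eval data.

```typescript
// Hypothetical aggregation of pass rate per route; sample rows are made up.
type Route = "baseline" | "aya_pipeline" | "aya_reasoner";

interface RunResult {
  route: Route;
  passed: boolean;
}

// Pass rate per route, or null when a route has no completed runs yet.
function passRateByRoute(runs: RunResult[]): Record<Route, number | null> {
  const tally: Record<Route, { passed: number; total: number }> = {
    baseline: { passed: 0, total: 0 },
    aya_pipeline: { passed: 0, total: 0 },
    aya_reasoner: { passed: 0, total: 0 },
  };
  for (const run of runs) {
    tally[run.route].total += 1;
    if (run.passed) tally[run.route].passed += 1;
  }
  const rate = (t: { passed: number; total: number }): number | null =>
    t.total === 0 ? null : t.passed / t.total;
  return {
    baseline: rate(tally.baseline),
    aya_pipeline: rate(tally.aya_pipeline),
    aya_reasoner: rate(tally.aya_reasoner),
  };
}

// Same model on both sides, so the difference reflects the wrapping only.
function spineDelta(rates: Record<Route, number | null>): number | null {
  const { baseline, aya_pipeline } = rates;
  return baseline === null || aya_pipeline === null ? null : aya_pipeline - baseline;
}

// Illustrative rows only.
const sampleRuns: RunResult[] = [
  { route: "baseline", passed: true },
  { route: "baseline", passed: false },
  { route: "aya_pipeline", passed: true },
  { route: "aya_pipeline", passed: true },
];

const rates = passRateByRoute(sampleRuns);
const delta = spineDelta(rates);
console.log(rates, delta);
```

Returning `null` for a route with no runs keeps an empty column distinguishable from a genuine 0% pass rate.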
Open Aya OS — Public Eval Harness
Seven public tests. One canonical JSON shape. Independent judge models for open-ended scoring. A/B comparison against a raw baseline. Every claim on this site links back to a row in the public eval database — and you can rerun any of them from this page.
Each link below is a real GET request that runs the test end-to-end and returns the canonical receipt JSON. No auth, no rate-limit gates beyond a 60-second per-task timeout.
Abstract pattern induction over small input/output grid pairs. Tests skill-acquisition and generalization, not memorized priors.
GET /api/evaluate?task=arc.reflect_h.1
Open-ended dialogue judged on coherence, ambiguity-handling, and personality consistency. Independent judge model (gemini-3-flash) scores responses against a public rubric.
GET /api/evaluate?task=turing.trolley_empty_track
OS-command grounding: given a natural-language request, return the canonical tool / app the runtime should invoke.
GET /api/evaluate?task=os.find_app.1
Cross-turn recall: a fact is stated, the model is asked unrelated questions, then must retrieve the original fact precisely.
GET /api/evaluate?task=mem.recall.1
Wall-clock latency for a tier_1 task across all three routes (baseline / aya_pipeline / aya_reasoner). Reported alongside the answer in every receipt.
GET /api/evaluate?task=os.find_app.1&route=baseline
Verifies which strategies the pipeline invoked: planner, executor, memory_retriever, verifier, self_critic. Returned as agents_used in the receipt.
GET /api/evaluate?task=plan.recipe.1
Confirms every receipt includes a reasoning trace fetchable via /api/aya/inspect. Pass ?trace=full to inline the trace.
GET /api/evaluate?task=plan.recipe.1&trace=full
Every endpoint above returns this exact JSON. Stable. Versioned. Greppable.
{
"task_id": "plan.recipe.1",
"category": "multi_step_reasoning",
"answer": "Step 1 ... Step 2 ... FINAL ANSWER: ...",
"agents_used": [
"planner",
"executor",
"verifier"
],
"confidence": 0.82,
"latency_ms": 1380,
"cost_estimate": 0.0041,
"memory_used": false,
"audit_trace": true,
"route": "aya_pipeline",
"passed": true,
"score": 0.95,
"inspect_url": "/api/aya/inspect?result_id=..."
}
Type any prompt. Pick a route. We post it to /api/evaluate and stream back the receipt. No login.
cost_estimate in USD.
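The receipt shape and query parameters described above can be sketched as a small typed client. The field types are inferred from the single example receipt, so treat them as assumptions. `evaluatePath`, `isReceipt`, and `fetchReceipt` are hypothetical helper names, the `baseUrl` argument is a placeholder for wherever the harness is hosted, and the client-side 60-second abort simply mirrors the per-task timeout mentioned on this page.

```typescript
// Hypothetical typed client for the receipt; types inferred from the example JSON.
type Route = "baseline" | "aya_pipeline" | "aya_reasoner";

interface Receipt {
  task_id: string;
  category: string;
  answer: string;
  agents_used: string[];
  confidence: number;
  latency_ms: number;
  cost_estimate: number; // USD
  memory_used: boolean;
  audit_trace: boolean;
  route: string;
  passed: boolean;
  score: number;
  inspect_url: string;
}

// Build the GET path using the task, route, and trace parameters shown above.
function evaluatePath(task: string, opts: { route?: Route; trace?: "full" } = {}): string {
  const params = new URLSearchParams({ task });
  if (opts.route) params.set("route", opts.route);
  if (opts.trace) params.set("trace", opts.trace);
  return `/api/evaluate?${params.toString()}`;
}

// Runtime check that an unknown JSON value has every receipt field with the expected type.
function isReceipt(value: unknown): value is Receipt {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  const agents = r["agents_used"];
  const hasAll = (keys: string[], type: string) => keys.every((k) => typeof r[k] === type);
  return (
    hasAll(["task_id", "category", "answer", "route", "inspect_url"], "string") &&
    hasAll(["confidence", "latency_ms", "cost_estimate", "score"], "number") &&
    hasAll(["memory_used", "audit_trace", "passed"], "boolean") &&
    Array.isArray(agents) &&
    agents.every((a) => typeof a === "string")
  );
}

// Fetch and validate one receipt; aborts after 60 seconds (assumed client-side mirror).
async function fetchReceipt(baseUrl: string, path: string): Promise<Receipt> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 60_000);
  try {
    const res = await fetch(new URL(path, baseUrl), { signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const body: unknown = await res.json();
    if (!isReceipt(body)) throw new Error("Response does not match the receipt shape");
    return body;
  } finally {
    clearTimeout(timer);
  }
}

// The example receipt from this page, used here only to exercise the type guard.
const sample: unknown = {
  task_id: "plan.recipe.1",
  category: "multi_step_reasoning",
  answer: "Step 1 ... Step 2 ... FINAL ANSWER: ...",
  agents_used: ["planner", "executor", "verifier"],
  confidence: 0.82,
  latency_ms: 1380,
  cost_estimate: 0.0041,
  memory_used: false,
  audit_trace: true,
  route: "aya_pipeline",
  passed: true,
  score: 0.95,
  inspect_url: "/api/aya/inspect?result_id=...",
};
console.log(evaluatePath("plan.recipe.1", { trace: "full" }), isReceipt(sample));
```

Validating at the boundary means a change to the receipt shape surfaces as an explicit error rather than as an undefined field further downstream.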