v2.0 — Self-improving AI
Open Aya uses a CAISI-inspired evaluation framework to measure capability, cost, latency, auditability, and workflow lift across baseline models, the Aya Pipeline, and reasoner routes. The goal is not to claim AGI; the goal is to prove whether an AI operating layer completes organizational work better than fragmented AI tools.
28 integrated apps. Voice-native, vision-aware, local-first with optional cloud sync. Multi-agent routing across planner, executor, memory, verifier, and critic strategies — every step auditable through a public benchmark harness, not a brochure.
Open Aya OS
Live system facts. Every number on this card is generated at request time from the runtime registry or the public eval database — there is no separate marketing source to drift.
Model layer
Agent layer
6 routed strategies: planner, executor, memory_retriever, verifier, router, self_critic
Strategy-Auction routing implemented as system-prompt routing rules
Memory layer
Supabase + browser IndexedDB (local-first)
Kinds: short-term turn cache · long-term Auto-Dream consolidation · GraphRAG knowledge edges
Tool layer
6 built-in tools across 28 apps
Web search · Code execution (Code Lab) · File store (Spatial Files) · Calendar / Notes / Word Processor · …
Local-first status
Yes — runs in-browser; data stays on device by default
Cloud sync status
Optional — Supabase auth + persistence when signed in
Apps in registry
28
Generated from lib/app-registry.ts
Routed agents
6
Strategy-auction policies, system-prompt routed
Eval score (avg)
—
Across 0 completed runs
Last eval run
no runs yet
UTC server time
Avg latency / task
—
Wall-clock, includes network hop
Audit mode
Public — every eval result writes a reasoning trace to /api/aya/inspect and aggregates to /api/aya/audit
A/B comparison — pass rate by route
baseline
—
Claude Sonnet 4.6, no spine (control)
aya_pipeline
—
Claude Sonnet 4.6 + 7-stage cognitive spine
aya_reasoner
—
Claude Opus 4.6, extended thinking (10k)
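The per-route pass rates above aggregate the `route` and `passed` fields that every receipt carries. A minimal sketch of that aggregation in TypeScript, assuming a flat list of receipt rows (the row type and function name are illustrative, not the harness's actual code):

```typescript
// Illustrative row shape: just the two receipt fields the A/B chart needs.
interface EvalRow {
  route: "baseline" | "aya_pipeline" | "aya_reasoner";
  passed: boolean;
}

// Compute pass rate per route from completed eval rows.
function passRateByRoute(rows: EvalRow[]): Record<string, number> {
  const totals: Record<string, { passed: number; total: number }> = {};
  for (const { route, passed } of rows) {
    totals[route] ??= { passed: 0, total: 0 };
    totals[route].total += 1;
    if (passed) totals[route].passed += 1;
  }
  const rates: Record<string, number> = {};
  for (const [route, { passed, total }] of Object.entries(totals)) {
    rates[route] = passed / total;
  }
  return rates;
}
```

Routes with zero completed runs simply don't appear in the result, which matches the "—" placeholders shown before any runs exist.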
Pass rates compare three routes across all completed runs: the Claude Sonnet 4.6 baseline, Aya's 7-stage cognitive pipeline on anthropic/claude-sonnet-4.6, and the anthropic/claude-opus-4.6 reasoner (extended thinking, 10k token budget). The baseline runs the same conversation tier without the cognitive spine, so the A/B delta isolates the wrapping, not a model upgrade.

Open Aya OS — Public Eval Harness
Seven public tests. One canonical JSON shape. Independent judge models for open-ended scoring. A/B comparison against a raw baseline. Every claim on this site links back to a row in the public eval database — and you can rerun any of them from this page.
Each link below is a real GET request that runs the test end-to-end and returns the canonical receipt JSON. No auth, no rate-limit gates beyond a 60-second per-task timeout.
Abstract pattern induction over small input/output grid pairs. Tests skill-acquisition and generalization, not memorized priors.
GET /api/evaluate?task=arc.reflect_h.1

Open-ended dialogue judged on coherence, ambiguity-handling, and personality consistency. An independent judge model (gemini-3-flash) scores responses against a public rubric.
GET /api/evaluate?task=turing.trolley_empty_track

OS-command grounding: given a natural-language request, return the canonical tool / app the runtime should invoke.
GET /api/evaluate?task=os.find_app.1

Cross-turn recall: a fact is stated, the model is asked unrelated questions, then must retrieve the original fact precisely.
GET /api/evaluate?task=mem.recall.1

Wall-clock latency for a tier_1 task across all three routes (baseline / aya_pipeline / aya_reasoner). Reported alongside the answer in every receipt.
GET /api/evaluate?task=os.find_app.1&route=baseline

Verifies which strategies the pipeline invoked: planner, executor, memory_retriever, verifier, self_critic. Returned as agents_used in the receipt.
GET /api/evaluate?task=plan.recipe.1

Confirms every receipt includes a reasoning trace fetchable via /api/aya/inspect. Pass ?trace=full to inline the trace.
GET /api/evaluate?task=plan.recipe.1&trace=full

Every endpoint above returns this exact JSON. Stable. Versioned. Greppable.
{
"task_id": "plan.recipe.1",
"category": "multi_step_reasoning",
"answer": "Step 1 ... Step 2 ... FINAL ANSWER: ...",
"agents_used": [
"planner",
"executor",
"verifier"
],
"confidence": 0.82,
"latency_ms": 1380,
"cost_estimate": 0.0041,
"memory_used": false,
"audit_trace": true,
"route": "aya_pipeline",
"passed": true,
"score": 0.95,
"inspect_url": "/api/aya/inspect?result_id=..."
}

Type any prompt. Pick a route. We post it to /api/evaluate and stream back the receipt. No login.
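The request shape is just query parameters on /api/evaluate, as the examples above show (task, optional route, optional trace). A minimal sketch of building those URLs; the helper name is illustrative:

```typescript
// Build an /api/evaluate URL from the query parameters used above.
function buildEvalUrl(
  task: string,
  opts: { route?: "baseline" | "aya_pipeline" | "aya_reasoner"; trace?: "full" } = {},
): string {
  const params = new URLSearchParams({ task });
  if (opts.route) params.set("route", opts.route);
  if (opts.trace) params.set("trace", opts.trace);
  return `/api/evaluate?${params.toString()}`;
}

// buildEvalUrl("os.find_app.1", { route: "baseline" })
//   → "/api/evaluate?task=os.find_app.1&route=baseline"
```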
cost_estimate in USD.
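For consumers scripting against the harness, the canonical receipt can be written down as a TypeScript type with a structural check before use. Field names mirror the JSON above; the type and guard names are illustrative, not exported by the harness:

```typescript
// The canonical receipt shape, transcribed from the JSON above.
interface EvalReceipt {
  task_id: string;
  category: string;
  answer: string;
  agents_used: string[];
  confidence: number;
  latency_ms: number;
  cost_estimate: number; // USD
  memory_used: boolean;
  audit_trace: boolean;
  route: "baseline" | "aya_pipeline" | "aya_reasoner";
  passed: boolean;
  score: number;
  inspect_url: string;
}

// Structural guard: checks a handful of discriminating fields.
function isEvalReceipt(x: unknown): x is EvalReceipt {
  if (typeof x !== "object" || x === null) return false;
  const r = x as Record<string, unknown>;
  return (
    typeof r.task_id === "string" &&
    typeof r.answer === "string" &&
    Array.isArray(r.agents_used) &&
    typeof r.latency_ms === "number" &&
    typeof r.passed === "boolean" &&
    typeof r.score === "number"
  );
}
```

Guarding on a few discriminating fields rather than all thirteen keeps the check tolerant of additive, versioned changes to the receipt.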