Agent Elo — Competitive Agent Arena

Let your agent be usable by other agents and humans. Let the best agent win.
8 February 2026 · Deep Market Assessment

I. Thesis

What: A ranking and routing layer for AI agents. Agents register as callable services (via MCP/tool-use), get used by both humans and other agents, and earn an Elo rating across dimensions that matter: taste, efficiency, depth, reliability. The best agents get called more. Bad agents stop getting called. Natural selection for software.

Who buys: Two-sided. Supply side: agent builders (indie devs, the vibecoder community, Eric's own agents). Demand side: other agents needing capabilities, and humans needing the best agent for a task. The marketplace charges a take rate on agent-to-agent calls plus a premium for ranked routing.

How it works: MCP is the interop standard; agents expose tools/capabilities as MCP servers.[1] Agent Elo wraps this with a registry, a routing layer, and an Elo system that tracks real usage outcomes. When Agent A calls Agent B to do research, Agent B's Elo updates based on the quality of its output (rated by Agent A or the human downstream). Over time, the leaderboard becomes the canonical way to discover and compose agents.
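The rating mechanics described above reduce to the standard Elo update rule. A minimal sketch in Python, assuming each quality rating is collapsed to a win/loss/tie outcome between two agents (the K-factor of 32 and 1500 baseline are conventional defaults, not a spec):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that agent A beats agent B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's output was judged better, 0.0 if worse, 0.5 for a tie.
    """
    ea = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - ea))
    return new_a, new_b

# Example: a 1500-rated agent beats a 1600-rated agent.
# The underdog gains more than it would against a peer.
a, b = elo_update(1500, 1600, score_a=1.0)
```

With this rule the winner gains exactly what the loser drops, so rating points are conserved across every comparison; per-dimension ratings (taste, efficiency, depth, reliability) would just be four parallel instances of the same update.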

Phase 0: Dog-Food Signal 🐶
  • Eric is building agents right now — Donna (relationship assistant), avet (community vetting), OpenClaw (self-hosted agent infra).[2] He is living the problem firsthand: how do agents discover each other? How does a human know which is best? How do agents compose capabilities?
  • Conrad set up OpenClaw on EC2 independently — proving that agent builders are emerging in Eric's network.
  • Jason Chan uses competing "Poke" — agent comparison is already happening organically.
  • This is the strongest PMF signal: the founder is the user. He already has supply (his agents) and demand (his network asking "how do you find it?").

II. Founder Context

| Dimension | Rating | Assessment |
| --- | --- | --- |
| Technical depth | Strong | Builds full-stack agents, self-hosts on Mac mini, manages MCP/WhatsApp/Telegram integrations. Comfortable with Claude, OpenRouter, Supabase, Vercel. Has shipped Donna end-to-end. |
| Network (supply side) | Strong | 77+ tracked contacts. Agent creator directory project with David Li (VR). Conrad, Penny Yip, BennyKok, Emmanuel all technical + agent-curious. Vibecoder community access. |
| Network (demand side) | Emerging | 7 Donna pilot users. Jason, Bruce, Edward all interested. But demand is early: no paying agent-to-agent users yet. |
| Bandwidth | Constrained | 7+ active projects. Build deficit day 5. ~10hr/week on Sourcy retainer. This is the bottleneck. Agent Elo competes for deep work time against Blackring, Donna pilot shipping, Wenhao validation. |
| Capital | Modest | HK$16K/mo from Sourcy retainer. No external funding. Ilona call exploring €25K-250K for Blackring, not this project. Bootstrap constraint is real. |
| Unfair advantage | Has one | Already building multiple agents (supply), already has agent users (demand), already connected to the agent creator community. The Donna/avet/OpenClaw stack = built-in contestants for the arena. |

III. Market Sizing

"Agent Elo" sits at the intersection of three emerging categories: AI agent platforms, API/model marketplaces, and AI evaluation/benchmarking. No research firm tracks "competitive agent marketplaces" as a segment — this is pre-category formation.[3]

  • AI Agent Platforms: US$47B by 2030 (est.)
  • API Marketplaces: US$8.3B by 2028 (est.)
  • AI Eval/Benchmark: ~US$500M (emerging, est. 2026)
  • Agent Elo Slice: ~US$1-3B (routing + ranking layer)

Market Sizing Logic

| Layer | Size | Basis | Source |
| --- | --- | --- | --- |
| Global: AI Agent Platforms | US$5.6B → $47.1B | 2024 → 2030, ~43% CAGR. Includes all autonomous agent infrastructure. | Grand View Research, MarketsandMarkets estimates [3] |
| Segment: API Marketplaces | US$4.5B → $8.3B | 2024 → 2028. RapidAPI = ~US$1B valuation. This is the closest revenue analog. | Verified Market Research [4] |
| Segment: AI Model Hubs | US$4.5B (HF valuation) | Hugging Face's valuation benchmarks what a model/agent discovery platform can be worth. | Hugging Face Series D, Aug 2023 [5] |
| Segment: LLM Routing | ~US$10-50M ARR | OpenRouter, Martian, Not Diamond — all routing inference to the best model. Pre-revenue to early revenue. | Industry estimates [6] |
| Addressable: Eric's reach | US$0 | Zero paying users. 7 pilot users for Donna. ~20 agent-curious contacts. This is pre-revenue, pre-product. | CRM data |
⚠️ Honest Assessment: Pre-Category
  • "Competitive agent marketplace with Elo ranking" is not a recognized market segment. No research firm tracks it.
  • The US$47B AI agent figure includes everything from enterprise copilots to customer service bots — Agent Elo would capture a thin middleware slice.
  • The realistic addressable market for a solo founder in 2026 is US$0 to US$100K ARR — the question is whether this becomes a category or stays a research curiosity.

IV. Competitive Landscape

4a. Direct Competitors & Adjacent Players

| Company | What They Do | Model | Status | Why They're Not Agent Elo |
| --- | --- | --- | --- | --- |
| LMSys Chatbot Arena [7] | Elo leaderboard for LLMs via blind voting | Free research project (UC Berkeley) | 12M+ votes. Canonical. Unfunded. | Ranks models, not agents. No agent-to-agent calls. No marketplace. No routing. |
| OpenRouter [6] | Routes inference to the cheapest/best model | 5-20% markup on API calls | Growing. Used by Eric + Donna. | Routes models, not agents. No Elo. No quality feedback loop. |
| OpenHub.ai [8] | Decentralized AI market economy for agents | Protocol-native marketplace | Early. Docs-stage. | Closest competitor. But protocol-focused, not taste/quality-focused. No Elo mechanism yet. |
| Magentic Marketplace [9] | Research environment for studying agentic markets | Open-source (Microsoft) | Academic. Oct 2025 paper. | Research, not product. Studies agent economics but doesn't operationalize it. |
| Hugging Face [5] | Model hub + community + leaderboards | Freemium SaaS ($4.5B valuation) | ~US$70M ARR (est. 2024) | Hosts models and datasets. No agent composition. No Elo for agents. No routing. |
| CrewAI [10] | Multi-agent orchestration framework | Open-source + enterprise ($18M Series A) | Well-funded. Growing. | Orchestration, not marketplace. Agents are internal to your system, not competing with others. |
| LangChain / LangSmith [11] | Agent framework + observability | Open-source + SaaS ($25M Series A) | Dominant framework. | Framework, not marketplace. No external agent discovery or ranking. |
| GPT Store (OpenAI) [12] | Marketplace for custom GPTs | Platform (no creator revenue share until late 2024) | Widely considered underwhelming. | See "Failed Examples" below. |

4b. Playbook Dissection — Who Won Adjacent Markets

| Company | Model | Revenue / Scale | Playbook | Transferability to Eric |
| --- | --- | --- | --- | --- |
| LMSys Arena | Free community Elo | 12M+ votes, 0 revenue | Blind A/B voting. Academic credibility. Became the benchmark for LLMs. No monetization. | Mixed. Proves Elo works for AI. But they chose not to monetize. Can Eric build the monetized version? |
| Hugging Face | Freemium hub | ~US$70M ARR, $4.5B valuation | Open-source model hosting → community → enterprise SaaS. 7+ years. Network effects from model downloads. | Low. Took 7 years + massive VC funding ($400M+ raised). Community flywheel requires scale Eric doesn't have. |
| RapidAPI | API marketplace | ~US$45M ARR, $1B valuation (2022) | Aggregated APIs → single interface → developer adoption. 35K+ APIs listed. Usage-based pricing. | Instructive. Closest marketplace analog. But required $300M+ funding and years of supply aggregation. Valuation reportedly dropped post-2022. |
| OpenRouter | Model routing | ~US$10-30M ARR (est.) | Unified API for all LLM providers. 5-20% margin on top. Developer-friendly. Low friction. | High. Small team, bootstrap-friendly. Routes to the best model per task. Agent Elo could be "OpenRouter for agents": same playbook, different layer. |
| Not Diamond | AI model routing | US$3M seed (2024) | Uses ML to route queries to the optimal model. "Best model for every prompt." Quality-based routing. | High. Directly validates quality-based routing as a venture category. Same thesis, different layer (models vs agents). |
| Zapier | Integration marketplace | US$230M ARR (2024), profitable | No-code integrations. 7,000+ apps. Marketplace effects. Took 12+ years. | Instructive. Shows integration marketplaces can be massive. But 12 years + no AI = different era. |

4c. Failed Examples — The Startup Graveyard

🪦 GPT Store (OpenAI, Jan 2024)
  • What happened: OpenAI launched the GPT Store in Jan 2024 as a marketplace for custom GPTs.[12] No revenue sharing for creators until late 2024. Poor discovery. Most GPTs saw fewer than 100 users.
  • Why it failed: No quality signal. No ranking by actual utility. Flooded with low-effort clones. Creators had no incentive (no revenue). Discovery was broken — no way to find the "best" GPT for a task.
  • Lesson for Agent Elo: This is exactly your thesis. The GPT Store failed because it had no Elo. No quality ranking. No competitive pressure. No agent-to-agent composition. Agent Elo is the fix.
🪦 ChatGPT Plugins (OpenAI, Mar 2023 → Discontinued mid-2024)
  • What happened: First attempt at agent-as-service. Plugins let GPT call external APIs. Shut down in favor of GPTs.
  • Why it failed: Too complex for users. Poor reliability. Models often chose wrong plugins. No quality feedback loop.
  • Lesson: Agent composition needs reliable quality signals, not just interop. Tool-use without ranking = chaos.
🪦 Crypto Agent Marketplaces (SingularityNET, Fetch.ai, Autonolas)
  • What happened: Promised decentralized AI agent economies. SingularityNET raised ~$36M (2017). Fetch.ai raised ~$15M.[13] Token-driven incentive models.
  • Why they stalled: Crypto overhead (gas fees, wallet friction) killed adoption. Speculation-driven, not utility-driven. Most "agents" were simple API wrappers. No real quality measurement.
  • Lesson: Decentralization ≠ quality. Agent Elo should avoid crypto rails unless there's a clear utility reason. Keep it simple: API keys, usage-based billing, Elo ranking. Don't add blockchain complexity.
🪦 Fixie.ai → Pivoted (2023-2024)
  • What happened: Started as an agent platform/marketplace. Raised seed funding. Pivoted to enterprise agent infrastructure after marketplace model failed to gain traction.
  • Lesson: Pure agent marketplaces struggle without demand-side pull. Enterprise contracts are more reliable than marketplace network effects in early stages.

V. Unit Economics (Benchmarked)

Revenue Side

| Metric | Benchmark (Winner) | Benchmark (Average) | Agent Elo Estimate | Source |
| --- | --- | --- | --- | --- |
| Take rate | 20-30% (App Store) | 10-15% (API marketplaces) | 10-15% on routed calls | Industry standard [4] |
| ARPU (agent builder) | ~US$200/mo (HF Pro) | ~US$20-50/mo | ~US$0 (free tier) → US$50-200/mo (pro) | HF pricing [5] |
| ARPU (agent consumer) | ~US$20/mo (OpenRouter avg) | ~US$5-10/mo | Usage-based, ~US$10-50/mo | OpenRouter estimates [6] |
| Paid conversion | 5-8% (dev tools) | 2-4% | 2-5% | Industry benchmarks |
| Gross margin | 70-85% (SaaS) | 60-70% | 60-80% | Depends on proxy vs routing model |

Cost Side (COGS Breakdown)

| Cost Component | Per-Unit Cost | Assumption | Source |
| --- | --- | --- | --- |
| Elo computation | ~US$0.001/match | Simple rating update per interaction. CPU-bound, negligible. | Standard Elo algorithm |
| LLM judge (quality eval) | ~US$0.003-0.01/eval | Claude Haiku or GPT-4o-mini to rate output quality. ~500 tokens/eval. | Anthropic/OpenAI pricing [14] |
| API gateway/proxy | ~US$0.0001/request | If proxying calls through Agent Elo's infra. Cloudflare Workers or similar. | CF Workers pricing |
| Registry hosting | ~US$50-200/mo | Database + API for the agent registry. Supabase or PlanetScale. | Supabase pricing |
| Leaderboard/frontend | ~US$0-20/mo | Static site on Vercel. Minimal cost. | Vercel free tier |
💀 The Death Metric: LLM Judge Cost at Scale
  • If every agent-to-agent call triggers an LLM judge evaluation, at 1M calls/day that's US$3K-10K/day in eval costs alone.
  • Mitigation: Sample-based evaluation (eval 10% of calls, not 100%). Use lightweight models (Haiku, mini). Let user ratings supplement LLM judges.
  • If eval costs are 10x higher than estimated (longer prompts, bigger models), gross margin drops from ~70% to ~30%. This is the cost that could blow up viability.
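The sampling mitigation is easy to sanity-check with a one-function cost model. The per-eval prices and the 1M calls/day volume are this section's own estimates; the function itself is just illustrative arithmetic:

```python
def daily_judge_cost(calls_per_day: int, cost_per_eval: float, sample_rate: float) -> float:
    """LLM-judge spend per day when only a fraction of calls is evaluated."""
    return calls_per_day * sample_rate * cost_per_eval

# Full evaluation at 1M calls/day spans roughly US$3K-10K/day:
full_low = daily_judge_cost(1_000_000, 0.003, 1.0)   # low end of the eval price range
full_high = daily_judge_cost(1_000_000, 0.01, 1.0)   # high end

# Sampling 10% of calls cuts the bill by 10x at the cost of noisier ratings:
sampled = daily_judge_cost(1_000_000, 0.01, 0.10)
```

The open design question is how low the sample rate can go before Elo updates become too noisy to rank agents with similar quality — a statistics problem, not an infra one.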

Break-Even Scenarios

  • Optimistic: 500 agents, 6 mo
  • Realistic: 2K agents, 18 mo
  • Pessimistic: never (platform risk)

Break-even = monthly infra costs (~US$200-500) covered by take-rate revenue. At a 10% take rate, that means ~US$5K/mo in agent-to-agent call volume flowing through the platform. With 500 active agents averaging US$10/mo of volume each, GMV is US$5K/mo and the take is US$500/mo: enough to cover infra at the high end, with nothing left over. Real break-even needs either higher volume or a premium tier.
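The break-even arithmetic above can be written down directly (all inputs are this section's own estimates):

```python
def monthly_take(active_agents: int, avg_volume_usd: float, take_rate: float) -> float:
    """Platform revenue from the take rate on agent-to-agent call volume (GMV)."""
    gmv = active_agents * avg_volume_usd
    return gmv * take_rate

# Base case: 500 agents x US$10/mo volume at a 10% take -> ~US$500/mo.
base_case = monthly_take(500, 10.0, 0.10)

# GMV required to cover ~US$500/mo of infra at a 10% take -> ~US$5K/mo.
needed_gmv = 500 / 0.10
```

Doubling either the agent count or the average per-agent volume would push the take to ~US$1K/mo, which is the first level at which margin exists at all.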


VI. Live Signals

Category Formation Signals (Bullish)

| Signal | Source | Implication |
| --- | --- | --- |
| Anthropic launches MCP (Nov 2024), rapidly adopted by Cursor, Windsurf, Claude Desktop [1] | Anthropic blog, GitHub | Interop is standardizing. Agents can now call other agents as tools. This is the prerequisite for Agent Elo. |
| Microsoft publishes Magentic Marketplace paper (Oct 2025) studying agent-to-agent economics [9] | Microsoft Research | Big tech is studying this exact problem. Validates the category. But also means incumbents may build it. |
| OpenAI launches ChatGPT Agent (Jan 2026): agentic mode for browsing, code, actions [15] | OpenAI blog | Consumer expectations shifting to agentic. More agents = more need for ranking/routing. |
| CATArena paper (2025) validates tournament-based agent Elo [16] | arXiv | Academic proof that competitive ranking works for agents, not just models. |
| OpenHub.ai publishes protocol docs for a decentralized agent economy [8] | OpenHub docs | Early-stage competitor/validator. Shows builders are converging on the agent marketplace concept. |

Risk Signals (Watch)

| Signal | Implication |
| --- | --- |
| GPT Store remains underwhelming 12+ months after launch | Agent marketplaces are hard. Discovery + quality + monetization all need to work simultaneously. |
| Magentic Marketplace finds "first-proposal bias" creates a 10-30x speed advantage over quality [9] | In agent markets, fast beats good by default. Elo needs to counter this: reward depth and taste, not just speed. |
| Every major platform (OpenAI, Anthropic, Google) is building its own agent ecosystem | Platform risk. If Anthropic builds MCP routing + ranking natively, Agent Elo gets subsumed. |

VII. GTM Assessment (Eric-Specific)

What Eric Can Actually Do

| Capability | GTM Action | Effort |
| --- | --- | --- |
| Already building Donna, avet, OpenClaw | Register own agents as first supply. Dog-food the ranking system. | Low (already exists) |
| Agent creator directory project (with VR/David Li) | Pivot from "directory" to "ranked arena." Same audience, stronger value prop. | Medium (needs product pivot) |
| Conrad set up OpenClaw on EC2 | OpenClaw users = natural first agents to register. Every OpenClaw instance = potential arena contestant. | Low (distribution channel exists) |
| Vibecoder/agent builder community access | Launch as "leaderboard for your agent." Builders compete for rank. Vanity + distribution incentive. | Medium (needs community activation) |
| MCP expertise (Donna already uses MCP) | Build the MCP-native agent registry. Technical credibility. | Medium (needs build time) |

Minimum Viable Test

🧪 Week 1 MVP: Agent Leaderboard
  • Build: Static leaderboard site. Register agents by MCP server URL. Run them against a standard task set (research query, scheduling, data extraction). Rate output quality (LLM judge + human vote). Compute Elo.
  • Seed: Register Donna + avet + Poke + 2-3 public MCP servers from the community.
  • Launch: Post to HN, X, agent builder Discord channels. "Agent Elo: Which AI agent is actually the best?"
  • Cost: ~US$50-100 in LLM judge calls. Vercel hosting = free. 1-2 days of deep work.
  • Success metric: 50+ agents registered in first month. 500+ human votes on the leaderboard. One viral comparison.
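The MVP loop above — same task set, pairwise judging, Elo over the results — fits in a few dozen lines. A sketch under stated assumptions: the agent names, the `toy_judge` stub (a stand-in for an LLM judge or human vote), and the 1500 starting rating are all illustrative, not a real API:

```python
import itertools

def expected(ra: float, rb: float) -> float:
    """Elo win probability for the first agent."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def run_leaderboard(agents, judge, tasks, k: float = 32.0):
    """Round-robin: compare every agent pair on every task, fold into Elo.

    judge(task, a, b) returns 1.0 if a's output wins, 0.0 if b's wins, 0.5 for a tie.
    Returns (name, rating) pairs sorted best-first.
    """
    ratings = {name: 1500.0 for name in agents}
    for task in tasks:
        for a, b in itertools.combinations(agents, 2):
            score_a = judge(task, a, b)
            ea = expected(ratings[a], ratings[b])
            ratings[a] += k * (score_a - ea)
            ratings[b] += k * (ea - score_a)
    return sorted(ratings.items(), key=lambda kv: -kv[1])

# Toy stand-in for the judge: "donna" wins every comparison it appears in,
# other pairs tie. A real judge would call an LLM or tally human votes.
def toy_judge(task, a, b):
    if "donna" in (a, b):
        return 1.0 if a == "donna" else 0.0
    return 0.5

board = run_leaderboard(["donna", "avet", "poke"], toy_judge, tasks=range(5))
# board[0] is ("donna", <rating above 1500>)
```

Swapping `toy_judge` for an LLM call plus a human-vote tally is the whole Week 1 build; the ranking logic itself never changes.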

Phase Plan

| Phase | What | When | Success = |
| --- | --- | --- | --- |
| 0. Leaderboard | Static Elo ranking site for agents. LLM judge + human voting. Public. | Feb 2026 | 50+ agents, 500+ votes |
| 1. Registry | MCP-native agent registry. Agents register, expose capabilities, get discovered. | Mar-Apr 2026 | 100+ agents, 10+ agent-to-agent calls/day |
| 2. Routing | Agent Elo routes requests to the highest-ranked agent for a task type. "OpenRouter for agents." | Q2 2026 | 1K+ calls/day, first revenue (take rate) |
| 3. Marketplace | Full marketplace. Agents earn from being called. Builders get paid. Elo drives distribution. | Q3-Q4 2026 | US$5K/mo GMV, 500+ active agents |
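The Phase 2 routing step is conceptually a lookup: among agents registered for a task type, pick the one with the highest Elo. A minimal sketch; the registry entries, agent names, and task-type labels are invented for illustration and not part of any existing schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentEntry:
    """One row of a hypothetical agent registry."""
    name: str
    task_types: set = field(default_factory=set)
    elo: float = 1500.0

# Illustrative registry contents (not real services).
REGISTRY = [
    AgentEntry("donna", {"scheduling", "relationship"}, elo=1580.0),
    AgentEntry("avet", {"research", "vetting"}, elo=1510.0),
    AgentEntry("deep-research-bot", {"research"}, elo=1630.0),
]

def route(task_type: str) -> AgentEntry:
    """Return the highest-Elo agent that advertises this task type."""
    candidates = [a for a in REGISTRY if task_type in a.task_types]
    if not candidates:
        raise LookupError(f"no agent registered for {task_type!r}")
    return max(candidates, key=lambda a: a.elo)
```

Production routing would add per-dimension ratings (route to the "depth" leader for research, the "efficiency" leader for extraction) and occasional exploration of lower-ranked agents so new entrants can earn rating, but the core selection rule stays this simple.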

Government Grants

Limited applicability for this project.


VIII. Red Team — Challenging the Thesis

Bull Case: Agent Elo Wins

  • MCP adoption accelerates → interop is solved → need quality layer on top
  • GPT Store failure proves marketplaces need Elo → Agent Elo is the fix
  • Eric dog-foods it with Donna/avet → real usage from day 1
  • Vibecoder community = free supply-side growth
  • LMSys proved Elo works for AI models → extend to agents
  • "OpenRouter for agents" is a legible pitch
  • Low COGS = can bootstrap to profitability
  • Network effects compound: more agents → better rankings → more users → more agents

Bear Case: Agent Elo Fails

  • Platform risk: Anthropic/OpenAI build native agent ranking into MCP/GPT ecosystem
  • Cold start: no agents registered = no rankings = no users = no agents (chicken-and-egg)
  • Magentic research shows speed beats quality 10-30x in agent markets → Elo may not matter
  • Agent quality is subjective — "taste" is hard to quantify into Elo
  • Eric's bandwidth is already maxed: 7+ projects, build deficit day 5
  • LMSys never monetized → maybe Elo is a public good, not a business
  • Crypto agent marketplaces spent $50M+ and failed → maybe premature
  • Agent composition is still fragile — MCP reliability issues could kill UX
⚠️ The Biggest Risk: Platform Subsumption
  • Anthropic owns MCP. If they add a native "agent quality score" or "recommended agents" layer, Agent Elo's entire value proposition gets absorbed.
  • OpenAI is already building the GPT Store. A quality-ranked version is an obvious next step.
  • Counter-argument: Platform-native rankings will be biased toward their own models/agents. An independent, cross-platform Elo system has value precisely because it's neutral. Like how LMSys is trusted because it's academic, not owned by any provider.
  • What needs to be true: Agent Elo must be perceived as neutral and cross-platform to survive platform risk. The moment it looks like a feature, not a platform, it gets cloned.

Steel-Man: The Strongest Counter-Argument

"This is too early. There aren't enough agents in the wild to rank. MCP adoption is months old. The 'agentic economy' is a research paper, not a market. Eric should focus on building one great agent (Donna) and worry about ranking agents after there are hundreds of them to rank."

My response: This is probably right for now. The timing question is the crux. Agent Elo in Feb 2026 is a leaderboard experiment. Agent Elo in late 2026 — after MCP has matured, after hundreds of vibecoded agents exist, after the GPT Store's failure has been fully digested — could be the right product at the right time. The play is: plant the flag now (leaderboard), build credibility, expand when the market catches up.


IX. Verdict

Is this a good opportunity for Eric at this time?

Conditionally yes — as a flag-planting side project, not a primary focus.

The thesis is sound. MCP standardization + agent proliferation + GPT Store failure = clear demand for a quality/routing layer. Eric has the unfair advantage: he's building agents, he's connected to agent builders, he understands MCP deeply. The "OpenRouter for agents" pitch is legible and fundable.

But the timing is early. There aren't enough agents to rank yet. The market is pre-category. Eric's bandwidth is already stretched across 7+ projects with a 5-day build deficit. Adding another primary focus would be destructive.

The one thing that would change the answer: If MCP adoption hits an inflection point (1,000+ public MCP servers, major frameworks integrating agent-to-agent calls as default), this becomes urgent. Watch for that signal.

Recommended path:

1. This week: Don't build Agent Elo. Ship Donna. Crack the ring BLE. Protect Monday deep work.

2. This month: Merge the "Agent Creator Directory" project (with VR/David Li) into Agent Elo. Same audience, stronger thesis. Build a static leaderboard as a weekend project. Register Donna + a few public agents. Post to HN.

3. Q2 2026: If the leaderboard gets traction (50+ agents, viral comparison), invest more build time. Add MCP-native registry. Start routing.

4. If it doesn't get traction: No loss. The leaderboard took 1-2 days to build. Agent creator community connections still valuable for Donna distribution.

The minimum viable version: A public webpage that runs 5-10 agents against the same task, rates their output with an LLM judge, and publishes an Elo leaderboard. One page. One afternoon. Plant the flag.


References

[1] Anthropic — Model Context Protocol (MCP) — Open standard for agent-tool interop. Launched Nov 2024. Adopted by Cursor, Claude Desktop, Windsurf.
[2] Eric's CRM — projects.json — Donna (relationship assistant), avet (agentic vetting), OpenClaw (self-hosted agent infra). All active/exploring.
[3] Grand View Research — AI Agents Market — US$5.6B (2024) → US$47.1B (2030), ~43% CAGR. Broadest relevant market. "Agent Elo" would be a thin middleware slice of this.
[4] Verified Market Research — API Marketplace Software Market — US$4.5B → US$8.3B by 2028. RapidAPI is the canonical example. Closest revenue analog to an agent marketplace.
[5] Hugging Face — Series D Announcement — US$4.5B valuation (Aug 2023). ~US$70M ARR (est. 2024). Model hub + community leaderboards. Shows marketplace model works for AI assets.
[6] OpenRouter — Unified API routing to 200+ LLM models. 5-20% markup. Bootstrap-friendly model. Closest playbook analog for Agent Elo.
[7] LMSys Chatbot Arena — 12M+ votes. Canonical Elo leaderboard for LLMs. Free, academic (UC Berkeley). Proves Elo works for AI quality ranking.
[8] OpenHub.ai — Protocol Documentation — Decentralized AI market economy. Agents as first-class economic participants. Early stage. Closest direct competitor.
[9] Magentic Marketplace — Microsoft Research (Oct 2025) — Open-source env for studying agentic markets. Key finding: frontier models show severe first-proposal bias (10-30x speed advantage over quality).
[10] CrewAI — Multi-agent orchestration framework. US$18M Series A (2024). Internal agent composition, not marketplace.
[11] LangChain / LangSmith — Dominant agent framework + observability. US$25M Series A (2023). Framework, not marketplace.
[12] OpenAI — GPT Store Launch (Jan 2024) — Agent marketplace. Widely considered underwhelming. No creator revenue share initially. Poor discovery. No quality ranking. The canonical failure case for Agent Elo's thesis.
[13] SingularityNET — Crypto-based AI agent marketplace. ~US$36M raised (2017 ICO). Token-driven. Stalled on adoption. Speculation > utility.
[14] Anthropic API Pricing — Claude Haiku: US$0.25/MTok input, US$1.25/MTok output. Used for LLM judge cost estimates.
[15] OpenAI — ChatGPT Agent (Jan 2026) — Agentic mode for ChatGPT. Browsing, code execution, tool use in unified loop. Signals consumer shift toward agent-first interaction.
[16] CATArena — Tournament-Based Agent Evaluation — Iterative competitive tournaments for LLM agents. Proves Elo-style ranking reveals learning ability and strategy quality beyond static benchmarks.