What: A free, consumer-grade AI agent grading tool. You connect your agent (OpenClaw, custom bot, GPT, Claude, any assistant). claw.degree runs a standardized test battery. You get a report card — a score, strengths, weaknesses, and a sharable badge.
Model: The HubSpot Website Grader playbook — free tool captures leads, scores go viral, data accumulates into a moat, upsell into monitoring/improvement tools.1
To whom: AI agent builders, OpenClaw deployers, GPT/Claude wrapper developers, enterprise teams evaluating their AI assistants before shipping.
Price: Free tier (grading + report card), Pro US$29–49/mo (monitoring + historical), Enterprise US$199–499/mo (CI/CD + team).
| Layer | Size | Source |
|---|---|---|
| Global — AI Observability | US$1.2B (2024) → US$8.7B (2033), 24.6% CAGR | MarketIntelo2 |
| Segment — Agentic AI Monitoring | US$550M (2025) → US$2.05B (2030), 30.1% CAGR | Mordor Intelligence3 |
| Segment — LLM Observability | US$511M (2024) → US$8.1B (2034), 31.8% CAGR | Market.us4 |
| Broader — AI Agent Platforms | US$10B+ (2025) → US$23.6B (2029), 41.1% CAGR | Technavio5 |
| Agent Proliferation | 1B+ agents by 2029 (40× vs 2025). 217B actions/day. | IDC6 |
| Enterprise Adoption | 40% of US enterprises have deployed agents. Custom GPT usage up 19× YTD. | OpenAI Enterprise Report7 |
claw.degree is not competing for the US$8.7B observability market. It's the entry point: the free grading tool that captures builders, then upsells. The addressable market is the slice of agent builders who convert to a paid monitoring or certification tier:
Conservative addressable TAM: If 1% of the ~25M active AI agent builders6 use a paid tier at US$39/mo average → US$117M ARR. If 0.1% → US$11.7M ARR. Both are venture-scale outcomes from a free tool.
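Spelled out, using only the figures above as inputs (a back-of-envelope sketch, not a financial model):

```typescript
// TAM back-of-envelope from the figures above (illustrative only).
const builders = 25_000_000;   // ~25M active AI agent builders
const arpuMonthly = 39;        // US$ blended paid-tier ARPU per month

const arrAt = (penetration: number): number => builders * penetration * arpuMonthly * 12;

console.log(arrAt(0.01));   // 117,000,000 → US$117M ARR at 1% paid penetration
console.log(arrAt(0.001));  //  11,700,000 → US$11.7M ARR at 0.1%
```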
| Company | Funding | Model | Price | Why Not claw.degree |
|---|---|---|---|---|
| Weights & Biases | US$250M, $1.25B val8 | MLOps platform | Enterprise | Full MLOps stack. Overkill for “is my chatbot good?” |
| Patronus AI | US$40M Series A9 | Enterprise agent monitoring | Enterprise | Percival monitors production agents. Not a grading tool. |
| Braintrust | US$39M (Series A1)10 | AI observability & eval | $0–$249/mo | CI/CD-native. Requires codebase integration. Dev tool, not consumer. |
| Arize | — | Enterprise ML observability & monitoring | Enterprise | Compliance-focused. Drift detection. Not agent grading. |
| Langfuse | Acquired by ClickHouse11 | Open-source LLM obs. | Free (OSS) | Self-host, configure, instrument code. Not “paste URL, get score.” |
| Company | Focus | Users | Gap vs claw.degree |
|---|---|---|---|
| Zenval12 | 100+ built-in evals, HELM/MMLU | Early | Developer platform, not consumer. No viral loop. |
| LangWatch13 | Agent testing + prompt mgmt | Thousands | Engineering tool. Requires SDK integration. |
| Evalion14 | Voice/text agent testing | Early | Domain-specific (call centers). Not general agent grading. |
| MetricsLM15 | IEEE CertifAIEd compliance | 200+ businesses | Compliance/certification for enterprise. Heavy process. |
| Seekr16 | AI model certification, gov/military | $1.2B val | US Army contracts. Enterprise/gov. Not consumer. |
| Platform | What It Does | Gap |
|---|---|---|
| Chatbot Arena / LMSYS17 | Crowdsourced LLM comparison (Elo rating). 240K+ votes. | Ranks models, not agents. Can’t test YOUR specific agent. |
| HAL (Holistic Agent Leaderboard)18 | Multi-dimension agent eval (cost, reliability, security) | Academic. Requires benchmark setup. Not consumer-grade. |
| AgentBench | Academic multi-environment agent benchmark | Research tool. Tests base models, not deployed agents. |
The “free grading tool → lead gen → upsell” playbook is one of the most proven PLG strategies in SaaS history. Here are the companies that did it:
| Tool | Company | Scale | What They Graded | Outcome |
|---|---|---|---|---|
| Website Grader | HubSpot | 2M+ URLs graded1 | Website performance, SEO, mobile, security | Legendary lead gen. Drove early HubSpot growth to IPO ($35B+ mkt cap) |
| PageSpeed Insights | Google | Billions of tests | Web page speed and Core Web Vitals | Industry standard. Drives adoption of Google's web tools. |
| SSL Labs Test | Qualys | Industry standard | SSL/TLS configuration quality (A–F grade) | Became the de facto SSL score. Free tool drives enterprise sales. |
| GTmetrix | GTmetrix | Millions of users | Website speed & performance score | Freemium → PRO plans. Sustainable indie business. |
| BuiltWith | BuiltWith | Industry standard | Technology stack detection | Free lookup → $295–$995/mo for leads. ~AU$14M revenue. |
An agent “degree” needs measurable, repeatable dimensions. Here’s the proposed test battery, grounded in what the research says matters:18
| Dimension | What It Measures | Method | Difficulty |
|---|---|---|---|
| Instruction Following | Does the agent do what you told it to? | Structured prompts with expected outcomes | EASY |
| Latency | How fast does it respond? | Timed request/response cycles | EASY |
| Tool Usage | Does it use tools correctly? (MCP, function calling) | Provide test tools, verify correct invocation | MEDIUM |
| Consistency | Same question 10× → same quality? | Repeated queries, variance analysis | EASY |
| Hallucination Rate | Does it make things up? | Fact-checking against known-answer questions | MEDIUM |
| Safety & Guardrails | Can it be jailbroken? Does it refuse harmful requests? | Adversarial prompt battery | MEDIUM |
| Personality Consistency | Does it maintain its persona across turns? | LLM-as-judge across conversation19 | HARD |
| Cost Efficiency | Tokens used per task (proxy for API cost) | Token counting per test interaction | EASY |
Output format: A single-page report card — overall score (A–F or 0–100), dimension breakdown, specific failing test cases, actionable recommendations. Sharable URL. Embeddable badge: "claw.degree certified — A-".
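To make the battery concrete, here is a minimal sketch of what the grading loop and report card could look like in TypeScript (the stack already in use). Every name and threshold in it (the agent endpoint shape, gradeLatency, gradeConsistency, the score mappings) is illustrative, not a spec:

```typescript
// Illustrative sketch only: grades two of the "EASY" dimensions (Latency, Consistency)
// against a hypothetical agent that accepts { prompt } over HTTP and returns { text }.
interface DimensionScore {
  dimension: string;
  score: number;      // 0–100
  details: string;
}

interface ReportCard {
  agent: string;
  overall: number;    // simple average, for the sketch
  grades: DimensionScore[];
}

async function ask(endpoint: string, prompt: string): Promise<{ text: string; ms: number }> {
  const start = Date.now();
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  const { text } = (await res.json()) as { text: string };
  return { text, ms: Date.now() - start };
}

// Latency: median round-trip over a few identical calls, mapped to 0–100.
async function gradeLatency(endpoint: string, runs = 5): Promise<DimensionScore> {
  const times: number[] = [];
  for (let i = 0; i < runs; i++) times.push((await ask(endpoint, "Reply with: ready")).ms);
  times.sort((a, b) => a - b);
  const median = times[Math.floor(times.length / 2)];
  const score = Math.max(0, Math.min(100, Math.round(110 - median / 100))); // 1s ≈ 100, 11s ≈ 0
  return { dimension: "Latency", score, details: `median ${median}ms over ${runs} runs` };
}

// Consistency: same question N times; low variance in answer length is a cheap
// first-pass proxy before any LLM-as-judge scoring.
async function gradeConsistency(endpoint: string, runs = 10): Promise<DimensionScore> {
  const lengths: number[] = [];
  for (let i = 0; i < runs; i++) {
    lengths.push((await ask(endpoint, "In one sentence, what can you help with?")).text.length);
  }
  const mean = lengths.reduce((a, b) => a + b, 0) / runs;
  const cv = Math.sqrt(lengths.reduce((a, b) => a + (b - mean) ** 2, 0) / runs) / mean;
  const score = Math.round(Math.max(0, 100 - cv * 200));
  return { dimension: "Consistency", score, details: `length CV ${cv.toFixed(2)}` };
}

export async function grade(agent: string, endpoint: string): Promise<ReportCard> {
  const grades = [await gradeLatency(endpoint), await gradeConsistency(endpoint)];
  const overall = Math.round(grades.reduce((a, g) => a + g.score, 0) / grades.length);
  return { agent, overall, grades };
}
```

The harder dimensions (hallucination, safety, persona) would layer LLM-as-judge calls on top of the same loop, which is where the per-eval costs in the unit economics below come from.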
| Metric | Benchmark (HubSpot Grader) | Benchmark (Dev Tools) | claw.degree Est. |
|---|---|---|---|
| Free users graded Y1 | 1M+ URLs (HubSpot first 18 mo)1 | — | 10K–100K agents |
| Email capture rate | 100% (required) | 60–80% | 100% (required for report) |
| Free → Paid conversion | 2–5% (SaaS benchmark) | 1–3% (dev tools) | 2% |
| ARPU (Pro) | — | Braintrust $249/mo10 | US$39/mo |
| ARPU (Cert badge) | — | MetricsLM custom15 | US$99–199/yr |
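Plugging the claw.degree column into one illustrative Year-1 scenario (mid-points only; the inputs come from the table above, the combination is an assumption):

```typescript
// Illustrative Year-1 scenario from the benchmark table (mid-points, not a forecast).
const agentsGradedY1 = 50_000;   // mid-point of the 10K–100K estimate
const freeToPaid = 0.02;         // 2% free → paid conversion
const proArpuMonthly = 39;       // US$/mo Pro ARPU

const payingUsers = agentsGradedY1 * freeToPaid;    // 1,000
const proArr = payingUsers * proArpuMonthly * 12;   // US$468,000 ARR from Pro alone
```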
| Component | Per-Unit Cost | Assumption | Source |
|---|---|---|---|
| LLM-as-judge (GPT-4o mini) | US$0.0003/eval | ~1K tokens per dimension judgment | OpenAI pricing20 |
| Full eval (8 dimensions) | US$0.025–0.05 | 8 dimensions × multi-query + structured output | Calculated |
| Reasoning judge (o4-mini) | US$0.003/eval | For harder dimensions (hallucination, safety) | OpenAI pricing20 |
| Infrastructure | US$50–200/mo | Vercel/Railway + Supabase (existing stack) | Current infra |
| Domain (claw.degree) | US$8–63/yr | Registration via Namecheap or Domain Cost Club | Registrar pricing21 |
At US$0.05/eval and 1,000 free evals/day, eval COGS is US$1,500/mo (plus US$50–200/mo of infrastructure). Even under the pessimistic assumptions, gross margin is 92%+. The cost structure is almost pure software: no hardware, no human review at the base tier.
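A quick check on what that margin claim implies, using the cost table's own figures (a sketch, not a budget):

```typescript
// COGS and margin check from the cost table (sketch only).
const evalCost = 0.05;                  // US$ per full 8-dimension eval (pessimistic end)
const freeEvalsPerMonth = 1_000 * 30;   // 1,000 free evals/day
const infra = 200;                      // high end of the US$50–200/mo infra estimate

const cogs = evalCost * freeEvalsPerMonth + infra;   // US$1,700/mo all-in

// Revenue a 92% gross margin implies: cogs / (1 - 0.92)
const revenueNeeded = cogs / 0.08;            // ≈ US$21,250/mo
const proSubsNeeded = revenueNeeded / 39;     // ≈ 545 Pro subscribers at US$39/mo
```

In other words, the 92%+ margin claim holds once Pro revenue clears roughly US$21K/mo, on the order of 545 subscribers at the US$39 price point.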
No specific “agent grading tool” has failed because none have been built yet in the consumer-grade form claw.degree proposes. But adjacent failures are instructive:
| Pattern | Example | What Happened | Lesson |
|---|---|---|---|
| Static benchmarks become irrelevant | GLUE, SuperGLUE | Models saturated the benchmark. Score became meaningless. | DYNAMIC TESTS: Must evolve the test battery as models improve. |
| Leaderboard gaming | Various LLM leaderboards | Companies optimize for benchmarks, not real-world quality. | REAL TASKS: Test with realistic scenarios, not synthetic tasks. |
| Enterprise-only → too narrow | Truera (acq. by Snowflake) | Good tech, tiny market at the time. Acqui-hired. | GO BROAD: Consumer-grade tool captures more surface area. |
| Open-source captures the floor | Langfuse11 | 21.6K GitHub stars. Acquired by ClickHouse. Free tier kills paid alternatives. | RISK: Must differentiate from Langfuse's free eval features. |
| Cost-blind evaluation | CLEAR framework findings18 | Leading agents show 50× cost variation for similar accuracy. No benchmark reports cost. | INCLUDE COST: Cost efficiency as a test dimension is a differentiator. |
| Asset | Relevance |
|---|---|
| Donna (own AI PA) | First test subject. Dog-food on day 1. “Here’s Donna’s score” is the launch tweet. |
| OpenClaw ecosystem | 173K+ GitHub stars.22 2M visitors in first week. Natural distribution channel. |
| Agent Elo research | Already mapped the agent ranking thesis. claw.degree IS the evaluation layer. |
| Existing infra | Supabase, Vercel, Node.js — the stack is already there. |
| Builder network | Conrad, Philip, Tom, Jason, Penny — all building or testing agents. 10+ beta users on day 1. |
| @ericsanio Twitter | AI-age thinking audience. Agent grading content is on-brand. |
| Phase | Timeline | Action | Success Metric |
|---|---|---|---|
| 0. Dog-food | Week 1 | Grade Donna. Grade Conrad’s agent. Grade Eugene’s WA bot. Fix scoring until it’s credible. | 3+ agents graded. Scores feel accurate to builders. |
| 1. MVP Launch | Week 2–3 | Deploy claw.degree. Single page: paste API endpoint → get score. Share on Twitter. | 100 agents graded. 50+ emails captured. |
| 2. OpenClaw Community | Week 3–4 | Post in OpenClaw Discord/GitHub. “Grade your OpenClaw agent.” | 1K agents graded. 10+ organic shares. |
| 3. HN/Reddit Launch | Month 2 | Show HN: “I built a Website Grader for AI agents.” | 10K agents graded. 5K emails. First paid conversions. |
| 4. Badge & Cert | Month 3 | Launch “claw.degree certified” badge. Embeddable on agent pages. | Paid tier: $99–199/yr for cert. 50+ paying. |
| 5. Agent Elo Feed | Month 4+ | Scores feed into Agent Elo leaderboard. Cross-pollination. | Two products, one data moat. |
SG PSG/EDG: Unlikely to apply — this is a global SaaS, not a local enterprise deployment. HK ITSF: Possible for R&D component (AI evaluation methodology). Not a primary GTM lever — the free tool IS the growth engine.
The strongest objection: “Agent builders don’t need a score. They need their agent to work.” The argument is that testing/evaluation is a means to an end, and most builders will just iterate by using their agent, not by running it through a grading tool. This is the same reason most developers don’t write tests — they ship and fix.
Counter: Most developers don’t write tests, true. But most developers DO run their site through PageSpeed Insights at least once. The bar isn’t “regular usage” — it’s “check once, get hooked.” HubSpot Website Grader didn’t need repeat users to generate 2M+ leads. The free, one-time grading IS the product for 95% of users. The 2–5% who want monitoring become paying customers.
If the scores have no credibility. If the first 100 agent builders grade their agents and say “this score doesn’t reflect reality,” word spreads fast in developer communities. The fix: launch with only objectively measurable dimensions (latency, instruction following, consistency, cost). Add subjective scoring later. Underpromise, overdeliver on accuracy.
Eric already researched Agent Elo — an agent ranking/marketplace concept.23 claw.degree is not a competitor to Agent Elo. It’s the evaluation infrastructure that feeds it.
| Concept | Agent Elo | claw.degree |
|---|---|---|
| Question answered | “Which agent is best for this task?” | “How good is MY agent?” |
| User | Agent consumer (person choosing an agent) | Agent builder (person improving their agent) |
| Revenue model | Marketplace commission / premium listing | Freemium SaaS (grading → monitoring → cert) |
| Data flow | Consumes claw.degree scores for ranking | Produces quality scores for each agent |
| Timing | Needs agent density (later) | Works from agent 1 (now) |
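The shared artifact between the two products is just the score record. A sketch of what both could read from the same table, with illustrative (hypothetical) field names:

```typescript
// Hypothetical shared score record: claw.degree writes it, Agent Elo reads it.
interface AgentScore {
  agentId: string;
  gradedAt: string;                     // ISO timestamp
  overall: number;                      // 0–100
  dimensions: Record<string, number>;   // e.g. { latency: 92, consistency: 78 }
  badgeUrl?: string;                    // embeddable "claw.degree certified" badge
}
```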
Build it. It’s a weekend MVP with a proven playbook.
claw.degree is the HubSpot Website Grader for AI agents. The playbook has 18 years of proof behind it (2M+ URLs graded, drove HubSpot to IPO). The timing is right: 1B+ agents projected by 2029, 40% of enterprises deploying agents, and nobody offers a consumer-grade "paste your agent, get a score" tool.
The unit economics are exceptional: US$0.05/eval COGS, 92%+ gross margin, near-zero infra cost using Eric’s existing stack. The dog-food signal is strong — Eric builds agents (Donna, avet, Sourcy WA bot) and genuinely wants to know how good they are. The domain is available for US$8–63/yr.
What makes this special: it’s the missing evaluation layer that the Agent Elo research already identified. claw.degree grades agents → scores feed Agent Elo rankings → one data moat, two products. And unlike Agent Elo (which needs agent density), claw.degree works from agent #1.
The one risk: score credibility. If the first 100 builders say the score is BS, it’s dead. Mitigant: launch with only objectively measurable dimensions (latency, instruction following, consistency, tool accuracy). No subjective scoring until the credibility is established.
Minimum viable test: Build claw.degree this weekend. Grade Donna. Grade Conrad’s agent. Share scores on Twitter. If 100 people grade their agents in the first week — you have signal. Total cost: US$8 (domain) + US$0 (existing infra) + a weekend.
STRONG SIDE PROJECT — BUILD NOW