Cyber Joins the Crypto AI Benchmark Alliance (CAIBA)
AI is quickly becoming the go-to starting point for crypto users. Whether you're chasing the next viral memecoin, bridging assets, or checking if a contract is safe, chances are you've asked AI for help. But relying on AI without rigorous benchmarks is like navigating crypto blindfolded. One bad answer can lead to exploited protocols, misrouted funds, or drained wallets.
In industries where accuracy is mission-critical, like law and medicine, benchmarks are built to keep AI honest. They provide builders with clear standards and tools for improvement. With its high-stakes transactions and rapid pace of innovation, crypto requires the same rigor.
To address this critical need, Cyber has joined forces with 13 other leading projects to launch the Crypto AI Benchmark Alliance (CAIBA). CAIBA is an open, community-driven initiative to establish transparent, reliable benchmarks for crypto‑specific AI tasks and to help the entire industry raise the bar together.
Why Benchmarks Are Essential in Crypto
Across industries, the push for AI evaluation is gaining serious momentum. LMArena recently raised $100 million to build a dedicated benchmarking platform.
Sectors like law and healthcare have already recognized the need for rigorous testing. Legal professionals rely on benchmarks like Harvey’s BigLaw Bench to assess legal reasoning, while clinicians use Stanford’s MedHELM to evaluate AI performance on high-stakes medical tasks. Similarly, platforms like Vals.ai have emerged to test LLMs against task-specific challenges in finance, healthcare, math, and academia.
The need for domain-specific evaluation is clear. A recent study by Vals.ai tested 22 top AI models on finance-specific tasks and found that even the best performers averaged below 50% accuracy. General-purpose models struggled with domain complexity — frequently hallucinating, misreading questions, or failing to use tools correctly.
With over $100 billion locked in DeFi (DefiLlama) and AI already being used to automate trading, governance, and onchain analysis, there’s no room for hallucinations or half-truths in crypto. If our industry is going to lean on AI, it needs benchmarks built for it. CAIBA is here to solve this problem.
What CAIBA Is and How It Works
CAIBA is an alliance that publishes industry-specific benchmarks, plus the tools and frameworks developers need to build more accurate crypto AI models and agents.
The effort is larger than testing alone. By bringing together protocols, data providers, researchers, and auditors, CAIBA promotes transparency and fairness while guarding against any single project skewing results.
Shunyu Yao’s influential essay, The Second Half of AI, argues that “evaluation is the last unsolved piece of the intelligence puzzle.” CAIBA takes that view to heart by turning real crypto workflows into multi-step challenges that test agents on three pillars of fluency:
| Pillar | What it measures |
| --- | --- |
| Knowledge | Answering practical questions about protocols, tokens, and onchain data |
| Planning | Charting multi-step tasks such as crosschain swaps or restaking flows |
| Action | Using wallets, explorers, and APIs safely and reliably |
Models and agents receive a numerical score for each pillar, and those scores feed a live leaderboard that highlights which ones truly grasp crypto’s complexities. By enabling teams to collect data and run evaluations at scale, CAIBA helps builders pinpoint where their apps and models fall short, leading to improvements in the areas that matter most to users.
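To make the pillar scoring concrete, here is a minimal sketch in Python of how per-task results might roll up into pillar scores and a leaderboard entry. The pillar names come from the table above; the equal weights, 0–100 scale, and the `LeaderboardEntry` and `score_agent` names are illustrative assumptions, not CAIBA’s published grading system.

```python
from dataclasses import dataclass

# Pillars taken from the table above; the weights and scale below are
# illustrative assumptions, not CAIBA's actual grading system.
PILLARS = ("knowledge", "planning", "action")

@dataclass
class LeaderboardEntry:
    agent: str
    knowledge: float  # 0-100, share of knowledge tasks answered correctly
    planning: float   # 0-100, share of multi-step plans judged valid
    action: float     # 0-100, share of tool/wallet actions executed safely
    overall: float    # weighted aggregate used to rank agents

def score_agent(agent: str, results: dict[str, list[bool]],
                weights: dict[str, float] | None = None) -> LeaderboardEntry:
    """Turn per-task pass/fail results into pillar scores and an overall score."""
    weights = weights or {p: 1 / len(PILLARS) for p in PILLARS}
    pillar_scores = {
        p: 100 * sum(results[p]) / len(results[p]) if results.get(p) else 0.0
        for p in PILLARS
    }
    overall = sum(weights[p] * pillar_scores[p] for p in PILLARS)
    return LeaderboardEntry(agent, pillar_scores["knowledge"],
                            pillar_scores["planning"], pillar_scores["action"],
                            round(overall, 1))

# Example: an agent that is strong on knowledge but weak on execution
entry = score_agent("example-agent", {
    "knowledge": [True, True, True, False],
    "planning":  [True, False, True, True],
    "action":    [True, False, False, False],
})
print(entry)
```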
To ensure accountability, CAIBA publishes its grading systems and public datasets on open-source platforms like GitHub and Hugging Face, under permissive licenses where licensing allows. As with GAIA and Vals.ai’s benchmarks, some question-and-answer sets are kept private to prevent overfitting and protect confidentiality. When distribution is restricted, that data is overseen by a rotating council of protocols, auditors, and researchers.
Why Cyber Is Joining
Since 2021, Cyber has focused on bringing social interactions onchain. As AI reshapes how people discover and share information, Cyber is building two products that depend on reliable AI:
- A crypto search engine to surface real-time, trustworthy data
- An agentic framework that lets developers build smarter AI and copilots
By joining CAIBA, Cyber commits to contributing real-world workflows and representative tasks so that benchmarks reflect how people actually use crypto today. Cyber will also test its upcoming products against the same open standards as everyone else. As one voice among many, Cyber is doing its part to make AI in crypto safer for everyone.
CAIA: The First Benchmark for Crypto AI Agents
Launched alongside the alliance itself, the benchmark for Crypto AI Agents (CAIA) is CAIBA’s inaugural evaluation. CAIA builds on general-purpose benchmarks like GAIA and adds domain-specific adaptations to test whether AI agents can perform real, analyst-level tasks in crypto.
The benchmark evaluates agents across three core crypto workflows. Scoring well on CAIA indicates that an agent has the practical skills of a junior crypto analyst: high-performing agents can parse onchain data, explain tokenomics, and navigate projects with context and accuracy, much like a human would.
Workflows Evaluated and Representative Tasks
| Crypto Workflow | Capabilities Tested | Representative Task |
| --- | --- | --- |
| Onchain Analysis | Parsing contract ABIs and transaction logs, and using data indexers | Fetch the daily swap volume (USD) for the ETH/USDC 0.05% pool on Uniswap V3 on Ethereum mainnet for January 2, 2025 (UTC). (See the sketch below.) |
| Project Discovery | Identifying projects from minimal context | List three major interoperability protocols that compete with Hyperlane and provide their official documentation or website links. |
| Tokenomics Diagnostics | Breaking down supply and vesting terms in plain language | Report the total supply of the $EIGEN token as of April 30, 2025 (UTC). |
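To give a sense of what the onchain-analysis tasks involve, below is a minimal Python sketch of one way an agent (or a builder reproducing a task) might answer the Uniswap V3 volume question by querying a subgraph. The endpoint placeholder, the pool address, and the query shape follow the commonly used Uniswap V3 subgraph schema and are assumptions for illustration; CAIA does not prescribe any particular tooling.

```python
import requests
from datetime import datetime, timezone

# Assumptions: SUBGRAPH_URL is a placeholder for a Uniswap V3 subgraph endpoint
# (The Graph's decentralized network requires an API key), and POOL is the
# commonly cited USDC/WETH 0.05% pool address on Ethereum mainnet.
SUBGRAPH_URL = "https://<your-graph-gateway>/subgraphs/uniswap-v3-mainnet"
POOL = "0x88e6a0c2ddd26feeb64f039a2c41296fcb3f5640"

# PoolDayData entities are keyed by the Unix timestamp of the day's start (UTC).
day_start = int(datetime(2025, 1, 2, tzinfo=timezone.utc).timestamp())

query = """
query ($pool: String!, $date: Int!) {
  poolDayDatas(where: { pool: $pool, date: $date }) {
    date
    volumeUSD
  }
}
"""

resp = requests.post(
    SUBGRAPH_URL,
    json={"query": query, "variables": {"pool": POOL, "date": day_start}},
    timeout=30,
)
resp.raise_for_status()

day_data = resp.json()["data"]["poolDayDatas"]
if day_data:
    volume = float(day_data[0]["volumeUSD"])
    print(f"Daily swap volume on 2025-01-02 (UTC): ${volume:,.0f}")
else:
    print("No PoolDayData found for that pool and date.")
```

Note that in this sketch the day bucket is keyed by the UTC start-of-day timestamp, which is why the date is computed explicitly in UTC; getting day boundaries and data sources right is exactly the kind of detail these tasks probe.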
CAIA evaluates both foundation models (like GPT-4o, Claude 3.7, Gemini 2.5, and DeepSeek-R1) and crypto-native agents. Model scores are published on a public leaderboard, and those that meet a performance threshold receive a Crypto-Ready badge that signals reliability to builders and users alike.
Roadmap
CAIBA will continue expanding its evaluation coverage with three additional benchmarks already planned for 2025:
- Crypto Named Entity Recognition (CNER): Inspired by traditional Named Entity Recognition, this measures how well models identify protocols, tokens, wallets, and contracts to reduce false positives in crypto data.
- Blockchain-Use Benchmark: Based on the Mind2Web framework, this evaluates how effectively agents follow natural-language instructions to complete tasks on live crypto websites, testing real-world usability.
- Crypto LM Arena: Modeled after crowdsourced evaluation platforms, this uses community voting to assess the usefulness and accuracy of AI responses and highlight the most effective models.
Together, these represent a foundation for holding crypto AI to a higher standard. CAIBA will grow into a complete platform where builders test and improve their agents, and users compare models with confidence. If crypto is to trust AI, standards must be built now because the tools of tomorrow depend on the work done today.
Help Shape the Standard
CAIBA is open to everyone:
- Projects & researchers – join the alliance, contribute datasets, or submit an agent.
- Developers – propose new tasks that track emerging primitives.
- Everyday users – share the questions you wish AI could answer.
Crypto keeps evolving; let’s make sure AI keeps up. Learn more or get involved at caiba.ai or by contacting @James_dai on Telegram.