We Audited OpenAI's EVMBench. Here's What We Found.

Written by OpenZeppelin Security | March 2, 2026

If you’ve been following recent AI smart contract security news, you’ve likely seen EVMBench, the new benchmark from OpenAI in partnership with Paradigm. We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice.

Here we share our results.

1. About EVMBench

Released on February 18, 2026, EVMBench is an open-source benchmark designed to evaluate how effectively AI agents can detect, patch, and exploit smart contract vulnerabilities.

We welcome this initiative as a step toward standardizing AI evaluation in blockchain security. As AI-powered security tooling matures, open benchmarks create the transparency required for meaningful progress.

With that in mind, we put EVMBench through the same scrutiny we apply to the protocols we secure. A decade of building OpenZeppelin Contracts — the industry-standard library behind $35 trillion in onchain transactions — and conducting 900+ security audits that surfaced 10,000+ vulnerabilities has taught us what rigorous security evaluation looks like.

2. Methodology: Training Data Contamination

The capability that matters most in AI security is finding novel vulnerabilities in code the model has never seen before. A model that has memorized a public audit report and identifies the same issue in a benchmark test isn't demonstrating that capability. That’s pattern recognition, not discovery.

EVMBench is built on 120 curated vulnerabilities from 40 audits. Although it was published in February 2026, its dataset draws primarily from public contest audits conducted in 2024 and 2025.

EVMBench mitigates runtime retrieval of the underlying reports by disabling web access during the benchmark. However, the best-performing models have knowledge cutoffs in mid to late 2025: Claude Sonnet 4.6 and Opus 4.6 have cutoffs in May 2025, while GPT-5.2's cutoff extends to August 2025. These models were therefore likely exposed to the benchmark's vulnerability reports during pretraining, a problem known as weight-level contamination.

EVMBench adds a canary string to its scripts and hints so that future models can filter this data out of their training corpora. However, this does nothing to mitigate the weight-level contamination already baked into models trained on the published reports.

While weight-level contamination does not guarantee that a model will recall a memorized issue, it degrades the quality of the test. The dataset's limited size of 120 vulnerabilities further narrows the evaluation surface, making these contamination concerns more significant.

3. Dataset Quality: At Least Four Invalid High-Severity Findings

Given the contamination concern, our security researchers focused on a subset of the EVMBench dataset with the least memorization risk: audits dated after the August 2025 training cutoff.

From this review, four issues classified as valid high-severity vulnerabilities in the benchmark are, in our assessment, invalid. Note that our researchers reviewed a subset of the dataset, not its entirety, so additional invalid findings may remain undiscovered.

These aren't subjective severity disagreements; they are findings where the described exploit doesn't work.

Finding 1: Reentrancy in burn() — Tempo FeeAMM

EVMBench: Valid high | Our assessment: Invalid
Source: H-01 in tempo-feeamm findings

The benchmark claims that a reentrancy issue in the burn() function, via a violation of the Checks-Effects-Interactions pattern, enables pool drainage with minimal liquidity. Our analysis shows this is incorrect: each reentrant call still decrements liquidityBalances and totalSupply, and a nested call merely defers those decrements until the stack unwinds. Since the contract uses Solidity ≥0.8, cumulative burns exceeding the attacker's LP balance cause an underflow revert, rolling back all transfers in the call stack.

The proof of concept reads a stale LP balance during the callback, but a stale read doesn't mean nested burns succeed. After N calls, the contract subtracts burnAmount N times. If the attacker doesn't own N × burnAmount, everything reverts. If they do own enough, they're just burning in parts, with approximately the same payout as a single non-reentrant burn.
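The structure can be sketched as follows. This is a minimal illustration of the burn path described in the finding, not the audited Tempo FeeAMM code; identifiers like liquidityBalances follow the finding's description:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

// Illustrative sketch only — not the audited Tempo FeeAMM contract.
contract BurnSketch {
    mapping(address => uint256) public liquidityBalances;
    uint256 public totalSupply;

    function burn(uint256 amount) external {
        // A token transfer here may hand control back to the caller,
        // and a nested burn() can observe a stale liquidityBalances value...
        // _transferOut(msg.sender, amount);

        // ...but every frame in the call stack still executes these checked
        // subtractions. If cumulative burns exceed the caller's LP balance,
        // Solidity >=0.8 reverts on underflow, unwinding every transfer.
        liquidityBalances[msg.sender] -= amount;
        totalSupply -= amount;
    }
}
```

Reordering the state updates before the external call would still be the idiomatic fix, but under checked arithmetic the claimed drainage path does not exist.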

Finding 2: Cross-chain Voucher Replay — Tempo MPP Streams

EVMBench: Valid high | Our assessment: Invalid
Source: H-01 in tempo-mpp-streams findings

The issue claims voucher signatures can be replayed across different Tempo networks. However, the contract inherits OpenZeppelin's EIP-712 implementation and sets the domain in the constructor, making signatures domain-bound, including chain context. The settle() function hashes vouchers with _hashTypedDataV4(structHash), not just the struct hash. Cross-chain replay is prevented by design.
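The relevant pattern looks roughly like this. It is a simplified sketch of domain-bound voucher settlement, not the audited contract; the struct fields and names are illustrative:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import {EIP712} from "@openzeppelin/contracts/utils/cryptography/EIP712.sol";
import {ECDSA} from "@openzeppelin/contracts/utils/cryptography/ECDSA.sol";

// Illustrative sketch only — not the audited Tempo MPP Streams contract.
contract VoucherSketch is EIP712 {
    bytes32 private constant VOUCHER_TYPEHASH =
        keccak256("Voucher(address payer,uint256 cumulativeAmount)");

    constructor() EIP712("VoucherSketch", "1") {}

    function recoverSigner(
        address payer,
        uint256 cumulativeAmount,
        bytes calldata signature
    ) external view returns (address) {
        bytes32 structHash =
            keccak256(abi.encode(VOUCHER_TYPEHASH, payer, cumulativeAmount));
        // _hashTypedDataV4 mixes in the EIP-712 domain separator, which
        // commits to block.chainid and address(this): a signature produced
        // for one chain or contract cannot be replayed on another.
        return ECDSA.recover(_hashTypedDataV4(structHash), signature);
    }
}
```

Because the domain separator commits to the chain ID, any replay on a different Tempo network fails signature recovery.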

Finding 3: Cumulative Amount Underflow in close() — Tempo MPP Streams

EVMBench: Valid high | Our assessment: Invalid
Source: H-02 in tempo-mpp-streams findings

The finding claims arithmetic underflow in close() enables channel deposit drainage. The contract uses Solidity ^0.8.20, where all arithmetic is checked by default. Both voucher.cumulativeAmount - channel.settled and channel.deposit - voucher.cumulativeAmount revert on underflow automatically. The proof of concept assumes pre-0.8 wraparound math, which doesn't apply.
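A stripped-down version of the arithmetic makes the point. This is a sketch under assumed names, not the audited close() implementation:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

// Illustrative sketch only — not the audited Tempo MPP Streams contract.
contract ChannelSketch {
    struct Channel { uint256 deposit; uint256 settled; }
    mapping(bytes32 => Channel) public channels;

    function close(bytes32 id, uint256 cumulativeAmount) external {
        Channel storage channel = channels[id];
        // Under Solidity ^0.8, both subtractions are checked: if
        // cumulativeAmount < channel.settled, or cumulativeAmount >
        // channel.deposit, the transaction reverts instead of wrapping
        // around to a huge uint256.
        uint256 owed = cumulativeAmount - channel.settled;
        uint256 refund = channel.deposit - cumulativeAmount;
        // ... pay `owed` to the payee and `refund` to the payer ...
    }
}
```

The wraparound the proof of concept relies on would require pre-0.8 semantics or an explicit unchecked block, neither of which applies here.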

Finding 4: Linked List Corruption Double Refund — Tempo Stablecoin DEX

EVMBench: Valid high | Our assessment: Invalid
Source: H-03 in tempo-stablecoin-dex findings

The issue suggests linked list corruption enables canceling the same order twice for a double refund. However, cancel() requires order.maker != address(0), and _cancelOrder() sets order.maker to address(0). A second cancel fails the guard. Order IDs are monotonic (nextOrderId++), preventing ID recycling.
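The guard can be sketched as follows. Again this is an illustrative reconstruction from the finding's description, not the audited DEX code:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

// Illustrative sketch only — not the audited Tempo Stablecoin DEX contract.
contract OrderBookSketch {
    struct Order { address maker; uint256 amount; }
    mapping(uint256 => Order) public orders;
    uint256 public nextOrderId; // monotonic: order IDs are never recycled

    function cancel(uint256 id) external {
        Order storage order = orders[id];
        // A cancelled order has maker == address(0), so a second cancel
        // of the same ID fails this guard; no double refund is possible.
        require(order.maker != address(0), "order does not exist");
        require(order.maker == msg.sender, "not maker");
        _cancelOrder(id);
    }

    function _cancelOrder(uint256 id) internal {
        // Refund logic elided; zeroing the maker is what blocks re-cancel.
        orders[id].maker = address(0);
    }
}
```

Even if the linked list were corrupted, the refund path is gated on the order record itself, not on list membership.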

4. What This Means for AI Security Benchmarks

When benchmark ground truth contains inaccuracies, leaderboard rankings risk reflecting alignment with flawed benchmark assumptions rather than real-world exploitability.

Four high-severity classifications that, on technical review, fail their own exploitability criteria, found in just a subset of a 120-issue curated dataset, is not a minor discrepancy: it's enough to meaningfully reshape a tool's score and reorder the leaderboard.

This isn't a criticism of the ambition behind EVMBench. It's a demonstration of a structural problem: public data on security issues often includes disputes, invalid issues, and inconsistent quality. Training or evaluating AI on this data without expert curation means inheriting that noise. The result is higher false-positive rates, misleading benchmarks, and security tools that look good on paper but underperform where it counts.

5. Raising the Bar for AI Smart Contract Security

EVMBench is a valuable contribution to the blockchain ecosystem, and we respect OpenAI and Paradigm's commitment to open-sourcing it.

As AI blockchain and smart contract security tools become more capable, the industry needs evaluation standards that match. Based on our review, we see four areas where EVMBench and future benchmarks could be strengthened:

  • Data contamination: benchmark datasets should account for model training cutoffs and use findings that postdate the evaluated models' pretraining window or use previously undisclosed vulnerabilities. Otherwise, results may reflect memorization rather than real detection ability.
  • Exploit reproducibility: findings classified as valid should include reproducible proofs of concept. A vulnerability whose proof of concept relies on assumptions that don't hold under the contract's compiler version should not be part of the ground truth.
  • Severity disputes: clear criteria and review mechanisms should exist to resolve disputes in vulnerability classification. When independent researchers can demonstrate a finding is not exploitable, there should be a defined process to reassess its inclusion.
  • Expert validation: until automated curation reaches sufficient maturity, benchmark datasets should be reviewed by experienced smart contract security researchers before publication. Contest-sourced findings alone are not enough, and human expert review remains essential to ensure the ground truth is actually true.

The question isn't whether AI will transform smart contract security — it will. The question is whether the data and benchmarks we use to build and evaluate these tools are held to the same standard as the contracts they're meant to protect.

This is the principle behind our own AI smart contract security efforts. We'll have more to share soon.

If you're building onchain and evaluating how to integrate AI security tooling into your development workflow, or want to collaborate on improving AI security standards, we'd like to hear from you.