If you’ve been following recent AI smart contract security news, you’ve likely seen EVMBench, the new benchmark from OpenAI in partnership with Paradigm. We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice.
Here we share our results.
Released on February 18, 2026, EVMBench is an open-source benchmark designed to evaluate how effectively AI agents can detect, patch, and exploit smart contract vulnerabilities.
We welcome this initiative as a step toward standardizing AI evaluation in blockchain security. As AI-powered security tooling matures, open benchmarks create the transparency required for meaningful progress.
With that in mind, we put EVMBench through the same scrutiny we apply to the protocols we secure. A decade of building OpenZeppelin Contracts — the industry-standard library behind $35 trillion in onchain transactions — and conducting 900+ security audits that surfaced 10,000+ vulnerabilities has taught us what rigorous security evaluation looks like.
The capability that matters most in AI security is finding novel vulnerabilities in code the model has never seen before. A model that has memorized a public audit report and identifies the same issue in a benchmark test isn't demonstrating that capability. That’s pattern recognition, not discovery.
EVMBench is built on 120 curated vulnerabilities from 40 audits, and although it was published in February 2026, its dataset draws primarily from public contest audits conducted in 2024 and 2025.
EVMBench mitigates runtime retrieval of the issues in its dataset by disabling web access during the benchmark. However, the best-performing models have training knowledge cutoffs in mid to late 2025: Claude Sonnet 4.6 and Opus 4.6 have cutoffs in May 2025, while GPT-5.2's cutoff extends to August 2025. This means they were likely exposed to the benchmark's vulnerability reports during pretraining, a problem known as weight-level contamination.
EVMBench embeds a canary string in its scripts and hints so that future models can filter this data out of training. However, this doesn't mitigate the existing weight-level contamination from the reports already in circulation.
While this does not necessarily enable the model to identify the issue immediately, it reduces the quality of the test. The dataset's limited size further narrows the evaluation surface, making these contamination concerns more significant.
Given the contamination concern, our security researchers focused on a subset of the EVMBench dataset with the least memorization risk: audits dated after the August 2025 training cutoff.
From this review, four issues classified as valid high-severity vulnerabilities in the benchmark are, in our assessment, invalid. Note that our researchers reviewed a subset of the dataset, not its entirety, so additional invalid classifications may exist elsewhere.
These aren't subjective severity disagreements; they are findings where the described exploit doesn't work.
EVMBench: Valid high | Our assessment: Invalid
Source: H-01 in tempo-feeamm findings
The benchmark claims that there’s a reentrancy issue in the burn() function and a violation of the Checks-Effects-Interactions pattern that enables pool drainage with minimal liquidity. Our analysis shows this is incorrect: each reentrant call still decrements liquidityBalances and totalSupply, and nesting merely defers those state updates. Since the contract uses Solidity ≥0.8, cumulative burns exceeding the attacker's LP balance cause an underflow revert, rolling back all transfers in the call stack.
The proof of concept checks LP balance with a stale value during the callback, but this doesn't mean nested burns succeed. After N calls, the contract subtracts burnAmount N times. If the attacker doesn't own N × burnAmount, everything reverts. If they do own enough, they're just burning in parts with approximately the same payout as a single non-reentrant burn.
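To make the revert behavior concrete, here is a minimal Python model of the argument. This is our own sketch, not the benchmark's proof of concept: Pool, burn, and checked_sub are illustrative stand-ins for the contract's state and Solidity ≥0.8 checked arithmetic, and the Reverted exception approximates a revert (a real revert would also roll back the inner frame's state changes).

```python
class Reverted(Exception):
    """Stand-in for a Solidity revert, which unwinds the whole call stack."""

class Pool:
    def __init__(self, lp_balance, total_supply):
        self.lp = lp_balance            # attacker's liquidityBalances entry (illustrative)
        self.total_supply = total_supply

    @staticmethod
    def checked_sub(a, b):
        if b > a:
            raise Reverted("arithmetic underflow")  # Solidity >=0.8 checked math
        return a - b

    def burn(self, amount, reenter_depth):
        # CEI violation modeled: the callback fires before state updates,
        # so nested calls observe stale balances...
        if reenter_depth > 0:
            self.burn(amount, reenter_depth - 1)
        # ...but every frame still executes its own checked decrements.
        self.lp = self.checked_sub(self.lp, amount)
        self.total_supply = self.checked_sub(self.total_supply, amount)

pool = Pool(lp_balance=100, total_supply=1000)
try:
    pool.burn(60, reenter_depth=1)  # two nested burns of 60 = 120 > 100 LP
except Reverted:
    print("nested burns reverted")  # cumulative decrement exceeds LP balance
```

With sufficient balance the nested calls succeed, but then the attacker is simply burning their own LP tokens in parts, with roughly the same payout as a single non-reentrant burn.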
EVMBench: Valid high | Our assessment: Invalid
Source: H-01 in tempo-mpp-streams findings
The issue claims voucher signatures can be replayed across different Tempo networks. However, the contract inherits OpenZeppelin's EIP-712 implementation and sets the domain in the constructor, making signatures domain-bound, including chain context. The settle() function hashes vouchers with _hashTypedDataV4(structHash), not just the struct hash. Cross-chain replay is prevented by design.
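The mechanism is easy to demonstrate. The sketch below is our simplification of the EIP-712 flow, not the contract's code: sha256 stands in for keccak256, and the domain encoding is condensed rather than byte-exact. What it shows is the structural point that the chain context is mixed into the final digest, so the same voucher hashes differently on different networks.

```python
import hashlib

def domain_separator(chain_id: int, verifying_contract: str) -> bytes:
    # EIP-712 binds name, version, chainId, and verifyingContract into the domain.
    # Simplified encoding for illustration; real EIP-712 hashes an encoded struct.
    payload = f"TempoStreams|1|{chain_id}|{verifying_contract}".encode()
    return hashlib.sha256(payload).digest()

def typed_data_digest(chain_id: int, contract: str, struct_hash: bytes) -> bytes:
    # Mirrors digest = keccak256("\x19\x01" || domainSeparator || structHash),
    # i.e. what _hashTypedDataV4(structHash) produces.
    return hashlib.sha256(
        b"\x19\x01" + domain_separator(chain_id, contract) + struct_hash
    ).digest()

voucher_struct_hash = hashlib.sha256(b"voucher#42").digest()  # hypothetical voucher
d_mainnet = typed_data_digest(1, "0xChannelContract", voucher_struct_hash)
d_other = typed_data_digest(5330, "0xChannelContract", voucher_struct_hash)
print(d_mainnet != d_other)  # True: same voucher, different network, different digest
```

A signature over the first digest is simply invalid against the second, which is why replaying the voucher on another network fails signature verification.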
EVMBench: Valid high | Our assessment: Invalid
Source: H-02 in tempo-mpp-streams findings
The finding claims arithmetic underflow in close() enables channel deposit drainage. The contract uses Solidity ^0.8.20, where all arithmetic is checked by default. Both voucher.cumulativeAmount - channel.settled and channel.deposit - voucher.cumulativeAmount revert on underflow automatically. The proof of concept assumes pre-0.8 wraparound math, which doesn't apply.
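The difference between the two arithmetic models is the whole finding. Here is a minimal sketch, in Python, of pre-0.8 wraparound versus ^0.8 checked subtraction; the variable names (settled, cumulative) are illustrative, not the contract's.

```python
def wrapping_sub(a: int, b: int) -> int:
    # Pre-0.8 / unchecked uint256 semantics the proof of concept assumes.
    return (a - b) % 2**256

def checked_sub(a: int, b: int) -> int:
    # Solidity ^0.8 default: underflow reverts instead of wrapping.
    if b > a:
        raise OverflowError("revert: arithmetic underflow")
    return a - b

settled, cumulative = 80, 50
# For the drain to work, cumulative - settled must wrap to a huge payout:
print(wrapping_sub(cumulative, settled))  # astronomically large under old semantics
try:
    checked_sub(cumulative, settled)      # under ^0.8.20 the call simply reverts
except OverflowError:
    print("reverted")
```

Since close() runs under checked semantics, both subtractions in question revert on underflow and the described drain never executes.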
EVMBench: Valid high | Our assessment: Invalid
Source: H-03 in tempo-stablecoin-dex findings
The issue suggests linked list corruption enables canceling the same order twice for a double refund. However, cancel() requires order.maker != address(0), and _cancelOrder() sets order.maker to address(0). A second cancel fails the guard. Order IDs are monotonic (nextOrderId++), preventing ID recycling.
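The guard logic can be sketched in a few lines. This is our illustrative Python model of the pattern the contract uses, not its actual code; OrderBook and the field names are stand-ins.

```python
ZERO = "0x0000000000000000000000000000000000000000"

class OrderBook:
    def __init__(self):
        self.orders = {}
        self.next_order_id = 0
        self.refunds = {}

    def place(self, maker: str, amount: int) -> int:
        oid = self.next_order_id
        self.next_order_id += 1  # monotonic IDs: cancelled IDs are never reused
        self.orders[oid] = {"maker": maker, "amount": amount}
        return oid

    def cancel(self, oid: int, caller: str) -> None:
        order = self.orders[oid]
        if order["maker"] == ZERO:         # guard: order.maker != address(0)
            raise RuntimeError("revert: order not active")
        if order["maker"] != caller:
            raise RuntimeError("revert: not maker")
        self.refunds[caller] = self.refunds.get(caller, 0) + order["amount"]
        order["maker"] = ZERO              # _cancelOrder() zeroes the maker

book = OrderBook()
oid = book.place("0xalice", 100)
book.cancel(oid, "0xalice")                # first cancel: refund credited
try:
    book.cancel(oid, "0xalice")            # second cancel hits the guard
except RuntimeError:
    pass
print(book.refunds["0xalice"])             # 100, not 200: no double refund
```

Even if linked-list bookkeeping were corrupted, the maker-zeroing guard stops the second refund, and monotonic IDs close off the recycling path.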
When benchmark ground truth contains inaccuracies, leaderboard rankings risk reflecting alignment with flawed benchmark assumptions rather than real-world exploitability.
Four high-severity classifications that, upon technical review, fail exploitability criteria, found in just a subset of a curated dataset, are not a minor discrepancy: they are enough to meaningfully reshape a tool's score and reorder a leaderboard.
This isn't a criticism of the ambition behind EVMBench. It's a demonstration of a structural problem: public data on security issues often includes disputes, invalid issues, and inconsistent quality. Training or evaluating AI on this data without expert curation means inheriting that noise. The result is higher false-positive rates, misleading benchmarks, and security tools that look good on paper but underperform where it counts.
EVMBench is a valuable contribution to the blockchain ecosystem, and we respect OpenAI and Paradigm's commitment to open-sourcing it.
As AI blockchain and smart contract security tools become more capable, the industry needs evaluation standards that match. Based on our review, we see four areas where EVMBench and future benchmarks could be strengthened.
The question isn't whether AI will transform smart contract security — it will. The question is whether the data and benchmarks we use to build and evaluate these tools are held to the same standard as the contracts they're meant to protect.
This is the principle behind our own AI smart contract security efforts. We'll have more to share soon.
If you're building onchain and evaluating how to integrate AI security tooling into your development workflow, or want to collaborate on improving AI security standards, we'd like to hear from you.