System 7: we taught an AI to exploit poker — and what actually won was something else

A technical field report on building a poker agent for dev.fun's Arena: the heuristic method, LLMs at the table, and the honest results — including the ones we didn't expect.

The experiment

On dev.fun there's a place where AI agents sit down at a No-Limit Texas Hold'em table and play each other. It's not a demo: real hands, real chips, real opponents (some humans behind a bot, some pure bots) and, in the tournaments, real on-chain money on the line.

The question we asked is deceptively simple: how do you build an agent that wins?

The short answer, after months of code, thousands of hands, and a dashboard that ended up looking like a quant fund's control room, is this: we built an extraordinarily sophisticated machine to exploit opponents… and the thing that actually beat the toughest adversary was an idea so simple it's almost dumb. This post is the long story of how we got there, and what it taught us about poker, about LLMs, and about measuring things honestly.

dev.fun and the Arena

The dev.fun Arena is a competition of poker agents. Your agent is a program that talks to a real-time API. The loop is always the same:

join — you take a seat.
pending-actions — the API tells you "it's your turn, here's the situation" (your cards, the board, the bets, the stacks, who did what).
action — you respond: fold, check, call, bet, raise and how much.

There's a deadline per decision: don't answer in time and you get auto-folded. It's real poker, with a real clock.
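
The join / pending-actions / action loop can be sketched roughly as below. This is a minimal sketch, not the Arena's real client: the `client` object, its method names (`join`, `pending_action`, `act`), the payload shapes and the 5-second budget are all assumptions made for illustration. The one idea it tries to show is answering with a safe default before the clock runs out instead of waiting to be auto-folded.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

FOLD = {"type": "fold"}


def decide_in_time(decide, state, budget_s, pool):
    """Run decide(state); fall back to a fold if it overruns our own budget."""
    future = pool.submit(decide, state)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        # The slow decision keeps running in its worker thread; a production
        # agent would need to cancel or isolate it.
        return dict(FOLD)


def run_agent(client, decide, budget_s=5.0):
    """join, then (pending-actions, action) until the table has nothing left.

    `client` is a hypothetical wrapper around the real-time API. In this
    sketch, pending_action() returning None means the session is over.
    """
    sent = []
    seat = client.join()
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        while True:
            state = client.pending_action(seat)
            if state is None:
                break
            action = decide_in_time(decide, state, budget_s, pool)
            client.act(seat, action)
            sent.append(action)
    finally:
        pool.shutdown(wait=False)
    return sent
```

Keeping the budget shorter than the server's deadline leaves room for network latency on the way back.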

The Arena has three "tracks," and the difference matters because each rewards different things:

Eval — your agent plays 500 hands against a fixed panel of five reference bots. It's a benchmark: it scores you in bb/100 (big blinds won per 100 hands, poker's universal metric) and ranks you. No prize, one-shot per agent. This is no field of fish: the panel is the closest thing to a "perfect" opponent you'll face.
Playground — PvP with play money, free and continuous. Here you do get real opponents, with their quirks and their leaks.
Tournament — an MTT (multi-table tournament) with on-chain entry (Monad chain, MON token): you buy in once, get a fixed stack, and compete across tables of 2 to 6 players that consolidate as players bust out, until one champion remains.

That distinction — a near-perfect opponent (Eval) versus leaky opponents (Playground/Tournament) — is the axis of everything that follows.

Two ways to play: GTO versus exploitation

Modern poker lives between two philosophies.

GTO (Game Theory Optimal). You play the Nash equilibrium: an unexploitable strategy. Whatever the opponent does, you don't lose in the long run (strictly, that guarantee holds heads-up), because you mix your plays so the opponent can never gain an edge. It's what solvers and Counterfactual Regret Minimization methods (CFR, DeepCFR) approximate. GTO doesn't try to win big; it tries to never lose.

Exploitation. You give up part of that armor to punish the opponent's specific mistakes. Folds too much to raises? You bluff more. Calls with anything? You stop bluffing and bet bigger for value. It's how human pros think: equilibrium is the floor, but the money is in the other guy's leaks.

Our starting thesis was the exploitative player's: use GTO as a defensive floor and, from there, deviate aggressively to squeeze every weakness. In the jargon, node-locking: you mentally "lock" the opponent's error at a node of the decision tree and attack right there. We built an entire engine around this idea. It's called System 7.

System 7: a hand-written exploitation engine

The first thing to understand about System 7 is what it is not: it's not an LLM, and at decision time it calls no network and no model. It's pure, deterministic Python, and it's open source on GitHub. It takes the table, runs the math, and returns a play in milliseconds. All the "intelligence" is encoded as rules, distilled from an exploitative methodology (the EducaPoker / "University Strategy" school, with the fingerprint of players like Raúl Mestre).

The engine is organized into four directives.

D1 — How much you trust what you see (the sample N). You can't exploit an opponent you know nothing about. System 7 scales its aggression by how many of the villain's hands it has observed:

N < 100: blindness. Ignore the stats (pure noise), play standard ranges, respect opponent aggression.
100 ≤ N < 500: start profiling by the gap between VPIP (how often they voluntarily put money in preflop) and PFR (how often they raise preflop). Out come the archetypes: Nit (super-tight), TAG (solid), LAG (loose-aggressive), Calling Station (calls everything) and Maniac. Each is played differently: against the station, zero tolerance for bluffs and bigger value bets; against the nit, steal their blinds relentlessly.
N ≥ 500: fine node-locking. If the opponent folds to 3-bets more than 55% of the time, we re-raise any hand with an Ace blocker. If they abandon the flop to a continuation bet more than 45% of the time, we bet 100% of our range with a small sizing.
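
The three sample tiers can be sketched as a small gate in front of the deviations. The N boundaries (100, 500) and the two node-lock triggers (55% fold-to-3-bet, 45% fold-to-c-bet) come from the text above. Everything else is an assumption for illustration: the `VillainStats` fields, the VPIP/PFR cutoffs inside `archetype`, the plan labels, and the choice to keep archetype adjustments switched on at N ≥ 500.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VillainStats:
    hands: int           # sample size N
    vpip: float          # % of hands with money voluntarily put in preflop
    pfr: float           # % of hands raised preflop
    fold_to_3bet: float  # % folded when facing a 3-bet
    fold_to_cbet: float  # % folded on the flop facing a continuation bet


def trust_tier(n: int) -> str:
    """Sample-size gate described in D1."""
    if n < 100:
        return "blind"
    if n < 500:
        return "archetype"
    return "node_lock"


def archetype(s: VillainStats) -> str:
    """Hypothetical cutoffs; the post does not publish System 7's bands."""
    gap = s.vpip - s.pfr
    if s.vpip >= 50 and s.pfr >= 30:
        return "maniac"
    if s.vpip >= 30 and gap >= 15:
        return "calling_station"
    if s.vpip < 15:
        return "nit"
    if s.vpip < 25:
        return "tag"
    return "lag"


def deviations(s: VillainStats) -> list:
    """Return the list of adjustments the sample size entitles us to."""
    tier = trust_tier(s.hands)
    if tier == "blind":
        return ["standard_ranges"]        # stats are noise: ignore them
    plan = ["play_vs_" + archetype(s)]
    if tier == "node_lock":
        if s.fold_to_3bet > 55:
            plan.append("3bet_any_ace_blocker")
        if s.fold_to_cbet > 45:
            plan.append("cbet_full_range_small")
    return plan
```

With these illustrative cutoffs, a profile of VPIP 21.7 and PFR 16.4 comes out as a TAG.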

D2 — What you have and where you are (the geometry). Instead of thinking "I have a pair of aces," System 7 classifies the hand by its relative fit with the board: MMF (monster hands: sets, flushes, straights), MF (strong hands: top pair good kicker, overpair), MM (medium), MD (weak), and Air/Draws. And it classifies the board by texture: dry, semi-coordinated, coordinated and extremely coordinated (where even a set drops a tier because the danger is huge). On top of that, the dynamics of the running cards: a scary card (an Ace, a King that completes a flush) favors the aggressor and lowers the bluffing threshold; a defensive card brakes the aggression and calls for pot control.

D3 — The relentless arithmetic (pot odds vs equity). No hunches here. It computes the pot odds (what you risk versus what you can win) and compares them to your real equity, counting adjusted outs for the texture — a flush draw is worth less if the opponent can hold a higher one; overcards are worth almost zero on a very coordinated board. The rule is a mandate: if your equity doesn't cover the pot odds, you fold — unless the HUD certifies the opponent will fold (then the bluff has value of its own).
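
The D3 comparison is plain arithmetic, and a short sketch makes it concrete. The function names are ours, not System 7's, and the equity estimate is deliberately crude: it only counts the chance of hitting one of your (already adjusted) outs, and on the flop it assumes you get to see both remaining cards with no further betting.

```python
def required_equity(pot_before_call: float, to_call: float) -> float:
    """Minimum equity needed to call. pot_before_call includes villain's bet."""
    return to_call / (pot_before_call + to_call)


def hit_probability(outs: float, street: str) -> float:
    """Chance of hitting at least one out by the river.

    `outs` may be fractional, so a discounted draw (say 9 outs counted as 7.5)
    can be passed in directly.
    """
    if street == "flop":      # two cards to come, 47 unseen cards
        miss = (47 - outs) / 47 * (46 - outs) / 46
    elif street == "turn":    # one card to come, 46 unseen cards
        miss = (46 - outs) / 46
    else:
        raise ValueError("street must be 'flop' or 'turn'")
    return 1 - miss


def d3_verdict(pot_before_call, to_call, outs, street, hud_says_fold=False):
    """Fold unless equity covers the price, or the HUD says a bluff will work."""
    if hit_probability(outs, street) >= required_equity(pot_before_call, to_call):
        return "call"
    return "bluff" if hud_says_fold else "fold"
```

For example, with 100 in the pot and a bet of 50 to call, `required_equity(150, 50)` is 0.25. A 9-out flush draw has about 0.35 on the flop (a call) but only 9/46, roughly 0.20, on the turn (a fold, unless the HUD flag is set).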

D4 — The postflop and the "Killer Parsley." When the SPR (stack-to-pot ratio) drops below 3, you commit with any strong hand: pot control vanishes, you're all in. And for the bluffs there's the protocol nicknamed "Perejil Asesino" (Killer Parsley): a conditional bluff. You only raise as a bluff if conditions are met (evident opponent weakness, a lifeline of 8–10 real outs backing the play in case you get called, adjustments for the number of opponents…). You never bluff on impulse with trash; every aggression is backed by math or a documented leak.
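
Both D4 rules reduce to small predicates. In the sketch below, the SPR threshold of 3 and the floor of 8 outs are taken from the text. The tier labels reuse the D2 names. The way the number of opponents is handled (one extra out required per additional opponent) is purely a placeholder assumption, since the post does not spell out that adjustment.

```python
STRONG_TIERS = {"MMF", "MF"}   # monster and strong hands, as named in D2


def spr(effective_stack: float, pot: float) -> float:
    """Stack-to-pot ratio: how many pots are left to play for."""
    if pot <= 0:
        raise ValueError("pot must be positive")
    return effective_stack / pot


def commits(hand_tier: str, effective_stack: float, pot: float) -> bool:
    """Below SPR 3, a strong hand stops controlling the pot and gets it in."""
    return hand_tier in STRONG_TIERS and spr(effective_stack, pot) < 3


def killer_parsley_ok(villain_shows_weakness: bool,
                      real_outs: float,
                      opponents: int) -> bool:
    """Conditional bluff-raise: weakness plus a lifeline of real outs.

    Assumption: each opponent beyond the first raises the bar by one out.
    """
    min_outs = 8 + max(0, opponents - 1)
    return villain_shows_weakness and real_outs >= min_outs
```

So a top-pair hand with 250 behind into a pot of 100 (SPR 2.5) commits, while the same hand with 300 behind (SPR exactly 3) does not.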

And above it all, the HUD: a module that reads opponents' live stats (sample size, VPIP, PFR, 3-bet, fold-to-cbet, showdown…), sorts them into archetypes, and feeds the D1 deviations. The HUD is what turns System 7 from a disciplined robot into an exploiter.

It is, in short, a serious and complete attempt to encode how an exploitative pro thinks. It ran as designed. The question was: did it win?

Putting an LLM at the table

The heuristic engine is fast and cheap, but it has a ceiling: rules, however refined, don't cover every weird spot. So we tried a hybrid engine: the heuristic handles the trivial spots (the vast majority) and, in the hard spots, hands the decision to an LLM that reasons the play out.

We used MiniMax M3, a reasoning model, with the method's own system prompt (the four directives in plain language, plus a JSON output contract). The LLM gets the reconstructed situation and returns a justified play. The cost? About 27 seconds per decision and non-trivial token usage — which is why it only steps in for the spots that truly warrant it (difficulty gating).
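
Difficulty gating can be sketched as a thin wrapper around the two engines. Nothing here is the project's actual code: the `difficulty` score, the 0.7 threshold, the `ask_llm` callable and the JSON field name `action` are assumptions. The shape it illustrates is that the heuristic always produces a baseline, the model is consulted only above the threshold, and any unusable reply falls back to the baseline.

```python
import json
from typing import Optional

VALID_ACTIONS = {"fold", "check", "call", "bet", "raise"}


def parse_llm_action(raw) -> Optional[dict]:
    """Validate the model's JSON reply; None means 'unusable, ignore it'."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict) or data.get("action") not in VALID_ACTIONS:
        return None
    return data


def hybrid_decide(state, heuristic, difficulty, ask_llm, threshold=0.7):
    """Heuristic for trivial spots, LLM for hard ones, heuristic as fallback."""
    baseline = heuristic(state)
    if difficulty(state) < threshold:
        return baseline                 # trivial spot: never touch the model
    try:
        parsed = parse_llm_action(ask_llm(state))
    except Exception:                   # network error, timeout, bad reply
        parsed = None
    return parsed if parsed is not None else baseline
```

Computing the baseline first means a slow or failed model call never leaves the agent without an answer.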

Here's a theme that's become central across the whole AI ecosystem: reasoning models. An M3 or a DeepSeek-R1 "thinks out loud" before answering, which gives them a real edge on problems with many interdependent variables — and poker, in its hard spots, is exactly that. This comes back elegantly later, in our own coach.

What happened (and what we didn't expect)

Here's the honest part, which is also the most interesting.

The Eval panel — that "near-perfect" opponent — turned out to be five identical clones of a DeepCFR bot: a balanced TAG (VPIP ~21.7, PFR ~16.4, aggression factor ~1.95), with no obvious leaks. The closest thing to GTO we'd get to measure.

We measured:

System 7 heuristic (pure node-locking): −9.75 bb/100.
Hybrid with M3 (reasoning the hard spots): −1.66 bb/100.

That is: our elaborate exploitation machine lost against the panel. The LLM improved it (from −9.75 to −1.66 — its reasoning in hard spots genuinely helped), but neither beat the opponent.

The lesson, in one sentence: you can't over-exploit someone who has no leaks. The whole node-locking apparatus (the archetypes, the Killer Parsley, the fine deviations) only pays when the opponent makes systematic errors. Against a near-GTO opponent with no leaks, that machinery spins in a vacuum and even hurts you: you deviate from equilibrium for no reward.

And then, the twist. We tried a strategy so simple it's almost dumb: very wide ranges and steal-aggression (open far more hands, steal blinds relentlessly). Against the same panel, that wide strategy scored +119 bb/100 in the qualifiers.

Why? Because the panel, "GTO" as it was, had one exploitable property: it folded around 78% of its hands preflop. Even a disciplined bot is, structurally, too tight. And that's a leak — just not the kind the HUD spots hand by hand, but a structural one, punished with the oldest play in the book: stealing.
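
The arithmetic behind a steal is worth writing down. The sketch below uses assumed sizings (a 2.5 bb open into 1.5 bb of blinds) and a deliberately pessimistic model in which you lose the whole raise every time anyone continues. The 78% figure is the panel's overall preflop fold rate from the text; applying it to a single remaining defender is our simplification, not a measured fold-to-steal number.

```python
def breakeven_fold_frequency(risk_bb: float, reward_bb: float) -> float:
    """How often everyone must fold for a pure bluff to break even."""
    return risk_bb / (risk_bb + reward_bb)


def steal_ev_floor(fold_prob: float, risk_bb: float, reward_bb: float) -> float:
    """EV in bb if every non-fold costs the full raise (a pessimistic floor)."""
    return fold_prob * reward_bb - (1 - fold_prob) * risk_bb


def all_fold_probability(per_player_fold: float, players_behind: int) -> float:
    """Chance that every remaining player folds, assuming independence."""
    return per_player_fold ** players_behind
```

Under those assumptions the steal needs folds 62.5% of the time, and against one defender folding 78% the floor is about +0.62 bb per attempt before counting any equity when called. With two defenders the all-fold chance drops to about 61%, so the real edge depends on position and on how the hand plays when called.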

There's the project's nice irony: exploitation did win — but the right exploit was the simple, structural one (steal), not the elaborate multi-street read. We built a microscope to find subtle leaks when what paid was a hammer.

A mandatory note of statistical humility: a bb/100 figure measured over 500 hands carries a confidence interval of ±20 bb/100. Both the −9.75 and the +119 are noisy in absolute terms. What was consistent across runs was the direction: against this panel, width ≫ node-locking. Treating a 500-hand number as a verdict is one of the easiest, and most expensive, mistakes to make.
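
For readers who want to attach an interval to their own runs, here is a minimal sketch. It assumes you have the per-hand results in big blinds, treats hands as independent, and uses a normal approximation for a 95% interval. It is a generic textbook calculation, not the dashboard's actual code.

```python
from math import sqrt
from statistics import mean, stdev


def bb_per_100(results_bb):
    """Win rate in big blinds per 100 hands from per-hand results in bb."""
    return 100 * mean(results_bb)


def ci95_bb_per_100(results_bb):
    """Approximate 95% confidence interval for the bb/100 win rate."""
    n = len(results_bb)
    se_per_hand = stdev(results_bb) / sqrt(n)   # standard error of the mean
    half_width = 1.96 * se_per_hand * 100       # rescale to bb/100
    center = bb_per_100(results_bb)
    return center - half_width, center + half_width
```

The half-width shrinks with the square root of the number of hands, so cutting the interval in half takes four times as many hands.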

So where does all the exploitative machinery shine? Where there are real leaks: in the Playground and the Tournament, against real opponents who do call too much, fold too little, and bluff badly. There the HUD and node-locking stop spinning in a vacuum.

From lab to production: the tooling

An agent is nothing without the factory that produces it. Half the project is a control dashboard that organizes the agent's whole lifecycle into three zones:

LAB — where a strategy is born: a 13×13 grid (poker's 169 hands) to paint ranges by hand, pick engine (heuristic / hybrid), model, and HUD, and backtest several agents at once (with concurrency capped so as not to blow up the APIs or trigger rate-limits). Results are reported as bb/100 aggregated per strategy, with its confidence interval — not as a single run.
COACH — where the agent learns. A diagnosis of your play versus optimal (VPIP, PFR, c-bet… against reference bands) and an LLM coach that critiques your hands and proposes concrete tweaks. The latest addition: a hand-by-hand coach — you take your 10 most and least profitable hands, select them, and a reasoning model (DeepSeek-R1) tells you for each one whether you played it well or badly and where the leak is. (A fun detail: the reasoning model is noticeably more critical than the standard one; where the standard model applauds an all-in, the reasoning model spots that the overbet scared off the worse hands and only got called by the better ones.)
PRODUCTION — where the agent competes: live deployment, a real equity curve versus theoretical EV, a hand replayer to review every play, and a tracker that profiles opponents.
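
The 13×13 grid in the LAB is the standard starting-hand matrix: pairs on the diagonal, suited hands above it, offsuit hands below. A short sketch of how such a grid can be generated follows. The function names are ours, and a painted range is represented simply as a set of labels.

```python
RANKS = "AKQJT98765432"
TOTAL_COMBOS = 1326   # 52 choose 2


def hand_grid():
    """Build the 13x13 matrix of the 169 starting-hand labels."""
    grid = []
    for i, row_rank in enumerate(RANKS):
        row = []
        for j, col_rank in enumerate(RANKS):
            if i == j:
                row.append(row_rank + col_rank)          # pair, e.g. "AA"
            elif i < j:
                row.append(row_rank + col_rank + "s")    # suited, e.g. "AKs"
            else:
                row.append(col_rank + row_rank + "o")    # offsuit, e.g. "AKo"
        grid.append(row)
    return grid


def combos(label: str) -> int:
    """Number of concrete two-card combinations behind a label."""
    if len(label) == 2:
        return 6                       # pairs
    return 4 if label.endswith("s") else 12


def range_pct(labels) -> float:
    """Share of all starting combinations covered by a painted range."""
    return 100 * sum(combos(h) for h in labels) / TOTAL_COMBOS
```

Weighting by combinations matters when reading a painted range: an offsuit cell counts three times as much as a suited one.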

And all of it closes a self-improvement loop: propose a version → test it → compare against the control → deploy it → the coach critiques it → repeat. An agent, and a coach for that agent, in the same place.

The jump to tournaments

Cash and tournaments don't play the same. In an MTT the blinds rise and your stack, measured in big blinds, shrinks — and the strategy has to change with depth. System 7 adapts its ranges by stack tier: deep (you play normal poker), medium, short, and very short (push/fold: shove or fold, no middle ground). At the final table, when the tournament narrows to two players, the engine detects heads-up dynamically and switches to a position-aware frame (the button acts last postflop, c-bets its range, barrels). The finest part is still pending — ICM (the fact that a chip's value isn't linear near the payouts) — but the tournament foundation is there.
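
The stack-tier switch can be pictured as below. The four tier names follow the text, but the boundaries (40, 20 and 10 big blinds) are hypothetical placeholders, since the post does not state System 7's actual cutoffs, and the frame labels are invented for the example.

```python
def stack_tier(stack_chips: float, big_blind: float) -> str:
    """Classify depth in big blinds. Cutoffs are illustrative assumptions."""
    depth_bb = stack_chips / big_blind
    if depth_bb >= 40:
        return "deep"
    if depth_bb >= 20:
        return "medium"
    if depth_bb >= 10:
        return "short"
    return "very_short"      # push/fold territory


def strategy_frame(stack_chips: float, big_blind: float, players_left: int) -> str:
    """Pick a frame: heads-up is detected from the number of players left."""
    tier = stack_tier(stack_chips, big_blind)
    prefix = "heads_up" if players_left == 2 else "ring"
    return prefix + ":" + tier
```

The same 3,000 chips are deep at a big blind of 50 (60 bb) and push/fold at a big blind of 400 (7.5 bb), which is why the tier has to be recomputed every hand.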

What we take away

Beyond poker, the project leaves a few lessons that apply to any agent making decisions under uncertainty:

GTO and exploitation aren't rivals, they're layers. Equilibrium is your floor; exploitation is your ceiling. But exploitation only pays if the other side has leaks — and the right leak might be the simplest of all.
The LLM shines as judgment in the hard spots, not as a full controller. The cheap, deterministic thing handles 95%; the expensive, slow model is reserved for the 5% that truly needs it. And reasoning models add value where many variables are tangled.
Measuring honestly is half the work. A noisy number is not a verdict. Confidence intervals, direction over absolute value, and the discipline not to fool yourself are worth more than any strategy trick.
The meta-game is building the factory. The edge wasn't in a brilliant play but in the loop: authoring → backtest → comparison → deployment → coaching → repeat. An agent is only as good as the machinery that iterates it.

That, in the end, is what QuantArmy is about: agents for deciding under uncertainty. Poker turned out to be a perfect testing ground — bounded, measurable, and merciless with self-deception. It teaches you fast that your mental model isn't reality, and that the toughest opponent isn't beaten with more sophistication, but with the right idea.

Resources & links

💻 Code (open source): github.com/quantarmyz/system7-poker-arena — the heuristic engine, the LLM hybrid, the dashboard and all the tooling.
🃏 The platform: dev.fun · the poker Arena.
🧠 Reasoning models: DeepSeek · MiniMax.

The project is still live in the Arena. If you're into the intersection of AI, game theory, and decision-making under uncertainty, this is only the beginning.