This is an introduction to Track 1 of the PokéAgent Challenge intended for readers who are interested in machine learning but new to Pokémon Showdown. We'll run through the main vocabulary you'll need to know and give some (slightly opinionated) thoughts on what matters and where new research might take us.
Competitive Pokémon turns the Pokémon franchise's turn-based combat mechanic into a standalone two-player strategy game. Players design teams of Pokémon and battle against an opponent. On each turn, they can either use a move with their active Pokémon or switch to another member of their team. Moves deal damage that eventually causes Pokémon to faint, and the last player with Pokémon left standing wins.
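The per-turn choice described above can be sketched as a small discrete action space. This is an illustrative model only (the names and slot conventions are ours, not Showdown's protocol):

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass(frozen=True)
class UseMove:
    move_slot: int   # 1-4: one of the active Pokemon's known moves

@dataclass(frozen=True)
class Switch:
    team_slot: int   # 2-6: a benched team member to bring in

Action = Union[UseMove, Switch]

def legal_actions(n_moves: int, benched_alive: int) -> List[Action]:
    """Enumerate a turn's choices: any move, or any healthy benched Pokemon."""
    moves = [UseMove(i) for i in range(1, n_moves + 1)]
    switches = [Switch(i) for i in range(2, 2 + benched_alive)]
    return moves + switches

# A full team with 4 known moves and 5 healthy teammates has 9 choices.
print(len(legal_actions(4, 5)))  # -> 9
```

The real game adds complications this sketch ignores (trapping effects, disabled moves, forced switches), but the move-or-switch core is the whole action space.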
As an AI benchmark, Pokémon is most defined by its enormous space of team compositions, imperfect information about the opponent, and the luck of stochastic outcomes.
The best way to get a feel for the problem is to play a battle yourself! It takes under a minute to get into a match against bots on the PokéAgent Ladder. It's a fun way to play low-stakes battles against opponents who won't keep you waiting or talk trash when you lose :)
```
Alakazam
EVs: 252 HP / 252 Def / 252 SpA / 252 SpD / 252 Spe
IVs: 2 Atk
- Thunder Wave
- Seismic Toss
- Psychic
- Recover

Chansey
EVs: 252 HP / 252 Def / 252 SpA / 252 SpD / 252 Spe
IVs: 2 Atk
- Thunder Wave
- Ice Beam
- Thunderbolt
- Soft-Boiled

Gengar
- Hypnosis
- Thunderbolt
- Seismic Toss
- Explosion

Snorlax
- Body Slam
- Earthquake
- Hyper Beam
- Self-Destruct

Tauros
- Body Slam
- Earthquake
- Hyper Beam
- Blizzard

Starmie
EVs: 252 HP / 252 Def / 252 SpA / 252 SpD / 252 Spe
IVs: 2 Atk
- Thunder Wave
- Blizzard
- Psychic
- Recover
```
How to start a battle on the PokéAgent ladder
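Team pastes like the one above use Showdown's plain-text export format. A simplified reader (ours, for illustration; it only handles the fields shown above, not items, abilities, or natures) might look like:

```python
from typing import Dict, List

def parse_team(text: str) -> List[Dict]:
    """Parse a (simplified) Showdown team export into a list of dicts."""
    team, mon = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):                 # a move line
            mon["moves"].append(line[2:])
        elif line.startswith(("EVs:", "IVs:")):   # a stat-spread line
            key = line[:3].lower()                # "evs" or "ivs"
            mon[key] = line.split(":", 1)[1].strip()
        else:                                     # a new Pokemon's name line
            mon = {"name": line, "moves": []}
            team.append(mon)
    return team

team = parse_team("""
Tauros
- Body Slam
- Earthquake
- Hyper Beam
- Blizzard
""")
print(team[0]["name"], len(team[0]["moves"]))  # -> Tauros 4
```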
Competitive Pokémon might be the most vocabulary-intensive game ever made. There are a lot of Named Things™ to know about (there are more than 1,000 Pokémon, just for starters). The terminology can be a bit overwhelming, but remember that you don't need to learn to play... you just need to learn enough to build a bot that can learn to play for you. The starter resources use few (if any) Pokémon-specific heuristics and are aimed at an ML audience. However, there are a few vocabulary terms you'll need to know to follow their instructions and conversations on Discord:
The PokéAgent Challenge has leaderboards for Gen1OU and Gen9OU, which cover both ends of a few important trends:
Trend: Every generation adds Pokémon, moves, and other team design choices. Therefore, the number of available team compositions dramatically increases over the generations. The number of teams that are considered competitively viable generally increases too... but the tiering system keeps this more manageable.
AI Takeaways: Agents must generalize over more diverse team choices in later generations. Expect later generations to require more data and stronger representations to reach the same performance.
Trend: The latest generation (currently: Gen 9) is by far the most popular. There are more Gen 9 battles played per day than every other generation combined.
AI Takeaways: Available replay data conveniently increases alongside the previously mentioned demand for more data.
Trend: Offensive power has increased over time: more and more Pokémon can exploit a mismatch in team design for heavy damage. The average length of a battle drops sharply over the early generations, then mostly levels out.
AI Takeaways: Planning horizons decrease over generations, but it becomes harder to recover from mistakes (or bad luck). Search may be more useful in later gens.
Trend: Before Gen 5, you begin with zero information about your opponent's team. From Gen 5 onward, Showdown reveals the opponent's Pokémon before the battle begins ("Team Preview").
AI Takeaways: Gens 1–4 emphasize opponent team prediction. Team Preview weakens the otherwise obvious trend that more team combinations lead to more imperfect information.
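In the pre-Team-Preview generations, an agent's belief about the opponent's team grows one reveal at a time. A minimal sketch of such a belief state (our illustrative structure, not part of any starter kit):

```python
from typing import Dict, Set

class OpponentBelief:
    """Track what the opponent has revealed so far in a Gen 1-4 battle."""

    def __init__(self, team_size: int = 6):
        self.team_size = team_size
        self.revealed: Dict[str, Set[str]] = {}  # species -> moves seen

    def see_pokemon(self, species: str) -> None:
        self.revealed.setdefault(species, set())

    def see_move(self, species: str, move: str) -> None:
        self.see_pokemon(species)
        self.revealed[species].add(move)

    def unknown_slots(self) -> int:
        """Team members the opponent has not shown yet."""
        return self.team_size - len(self.revealed)

belief = OpponentBelief()
belief.see_move("Tauros", "Body Slam")
belief.see_pokemon("Chansey")
print(belief.unknown_slots())  # -> 4
```

A stronger agent would go further and predict the hidden slots from usage statistics, since common sets are highly correlated with the Pokémon already revealed.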
Trend: Pokémon is a role‑playing game first and foremost. Turning it into a balanced competitive strategy game is hard and requires frequent rule changes, especially in the first few months after a new generation is released.
AI Takeaways: Manual rule changes and evolving strategies create non‑stationary datasets. In other words, if you are imitating a replay from 2015, you are imitating the decisions of a player who thought they were up against a different set of teams and strategies than you'd see on the ladder today.
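One simple way to soften this non-stationarity when imitating replays is to down-weight old games when sampling training data. The exponential decay and two-year half-life below are arbitrary illustrations, not a recommendation from the starter kits:

```python
# Weight a replay by its age so that recent games, played under the
# current rules and metagame, dominate the imitation data.

def replay_weight(replay_year: int, current_year: int = 2025,
                  half_life_years: float = 2.0) -> float:
    age = max(0, current_year - replay_year)
    return 0.5 ** (age / half_life_years)

print(round(replay_weight(2015), 3))  # a 2015 replay counts ~3% of a 2025 one
```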
Both starter kits have one-liner commands to deploy strong agents on the practice ladder. Try it out and watch some live battles!
Most research in Pokémon overlooks team design by requiring both agents to sample from an arbitrary set of teams. While this isolates decision‑making, it does not capture the full game as played by humans. A straightforward improvement would be to tune team choices using results from the practice ladder or self‑play evaluations. We've released a large set of candidate teams to try, and many more are available on forums. Just make sure your team remains legal! A more sophisticated approach could treat team design and battling as a hierarchical control problem, searching for an equilibrium in team selection or co-training a team design agent alongside a battle agent that plays with the assigned team. Methods that actively generate teams (rather than selecting from a set of candidates) would be especially interesting.
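The "tune team choices using ladder results" idea can be framed as a bandit problem over a fixed candidate pool. Here is a hedged sketch using UCB1 against a simulated ladder (the win rates are made up for the demo; a real setup would use actual match results):

```python
import math
import random
from typing import List

def ucb1_pick(wins: List[int], plays: List[int]) -> int:
    """Pick the next team to queue with by the UCB1 rule."""
    total = sum(plays)
    best, best_score = 0, float("-inf")
    for i, (w, n) in enumerate(zip(wins, plays)):
        if n == 0:
            return i  # try every candidate team at least once
        score = w / n + math.sqrt(2 * math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best

# Simulated ladder: team 2 secretly wins 70% of its games, the others 30%.
random.seed(0)
true_winrate = [0.30, 0.30, 0.70]
wins, plays = [0, 0, 0], [0, 0, 0]
for _ in range(2000):
    i = ucb1_pick(wins, plays)
    plays[i] += 1
    wins[i] += random.random() < true_winrate[i]
print(plays.index(max(plays)))  # the bandit concentrates play on team 2
```

This only selects from candidates; the more ambitious directions in the paragraph above (equilibrium search, hierarchical co-training, generative team design) replace the fixed pool entirely.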
Metamon (the RL starter kit) achieves strong performance in Gens 1–4 while sampling from generic forum teams, skipping search, and relying on iterative offline RL without a dedicated online self‑play process. There's significant room for growth in those areas alone, not to mention the rest of modern RL. The starter kits replicate the training setup of the paper, but you are encouraged to try alternative methods.
LLMs are general‑purpose solutions and might benefit from more nuanced knowledge of competitive gameplay. Prior work hasn't fully explored fine‑tuning LLMs on replay data, and this would be an exciting direction for those with the compute to try it. A more affordable alternative is to improve prompts by integrating information from Pokémon encyclopedias and competitive strategy guides based on the current battle state.
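The "integrate encyclopedia information into the prompt" idea can be as simple as looking up facts keyed off the current battle state and appending them as context. The tiny type chart below is an illustrative fragment, far from the full game data:

```python
from typing import Dict, List

# Illustrative fragment of a type chart: attacking type -> defender multipliers.
TYPE_CHART: Dict[str, Dict[str, float]] = {
    "Electric": {"Water": 2.0, "Ground": 0.0, "Electric": 0.5},
    "Ice": {"Ground": 2.0, "Water": 0.5, "Ice": 0.5},
}

def matchup_note(move_type: str, defender_types: List[str]) -> str:
    """Turn a type-chart lookup into a plain-English fact for the prompt."""
    mult = 1.0
    for t in defender_types:
        mult *= TYPE_CHART.get(move_type, {}).get(t, 1.0)
    return f"{move_type} moves hit the opposing Pokemon at {mult}x effectiveness."

def build_prompt(state: str, notes: List[str]) -> str:
    """Append retrieved facts to the battle-state description."""
    return state + "\nRelevant facts:\n" + "\n".join(f"- {n}" for n in notes)

print(matchup_note("Electric", ["Water"]))
# -> Electric moves hit the opposing Pokemon at 2.0x effectiveness.
```

The same pattern extends to retrieved movepool data, usage statistics, or strategy-guide snippets selected by the Pokémon currently on the field.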
At the top of the ladder, and especially in tournaments, you play the same opponents repeatedly. Players scout their opponent for major tournaments and adapt to their tendencies over repeat matchups. Your agents could do the same! Few-shot adaptation like this is a key strength of LLMs and may have promising applications in Pokémon. The Metamon RL paper is also written from this perspective and builds on an RL method intended for multi-episodic adaptation.
There are a lot of very technical people who are also great Pokémon players. If you're new to Pokémon, trying to out‑heuristic them in a few months is probably not the best plan. However, there's real opportunity at the intersection of fast search and ML. While the tournament has no explicit rule against domain‑knowledge‑heavy methods, remember this competition is part of an ML research conference: methods with a clear research contribution are more likely to stand out for things like Judges' Choice prizes, travel, presentations, or co‑authorship at NeurIPS.