The PokéAgent Challenge is a NeurIPS 2025 competition designed to establish Pokémon battling and gameplay as benchmarks for general decision-making in AI. It features two tracks: Competitive Battling and RPG Speedrunning, which together unify reinforcement learning and large language model research.
The competition is open to all individuals and teams, with no restrictions on team size or affiliation. Participants may enter either or both tracks (Battles, Speedrunning).
Participants will have access to:
Yes, Track 1 permits external LLM usage with full documentation and supports training on our 3.5M-battle dataset or through self-play. Track 2 allows most methodological approaches that use a neural network to produce the action. However, we do not allow the use of heavy heuristics, because we want to encourage generalizable solutions. We reserve the right to disqualify submissions that we deem to be in violation of the rules.
For Track 1 (Pokémon Battles), we will use established player rating schemes, win rates against baselines, and metrics for efficiency and reliability.
For Track 2 (RPG Speedrunning), the primary metrics are time and completion percentage, which measure progress through a standardized list of critical game milestones. Success Rate is a secondary metric.
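As a rough illustration of the completion-percentage idea, the sketch below computes the share of a milestone list that a run reached. The milestone names are hypothetical placeholders, not the official list, and the function is an assumption about how such a percentage could be computed rather than the organizers' scoring code. It does not model how time and completion are combined into a final ranking.

```python
from typing import Iterable, Sequence


def completion_percentage(reached: Iterable[str], milestones: Sequence[str]) -> float:
    """Share of the milestone list that a run reached, as a percentage."""
    if not milestones:
        raise ValueError("milestone list must not be empty")
    required = set(milestones)
    # Ignore anything the run reports that is not on the standardized list.
    hit = required & set(reached)
    return 100.0 * len(hit) / len(required)


# Hypothetical placeholder milestones, for illustration only.
MILESTONES = ["milestone_1", "milestone_2", "milestone_3", "milestone_4"]

print(completion_percentage(["milestone_1", "milestone_2", "milestone_3"], MILESTONES))  # 75.0
```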
Yes, incentives include monetary prizes (subject to sponsorship), research collaboration opportunities, presentation slots at NeurIPS 2025, computational resources, and the opportunity to co-author a subsequent NeurIPS 2026 submission for top solutions.
"PAC-BH-X" usernames are basic heuristics meant to fill in the bottom of the ladder.
X is a shorthand version of their name in Metamon (metamon/baselines).
For reference, the relative strength of these heuristics (averaged over Gen 1-4 OU):
"PAC-PC-Method-Model" usernames are PokéChamp LLM agents. "Method" denotes the prompt scheme (io/minimax), while "Model" is the LLM backend (Llama, Gemini, etc.). Currently, these agents are only active on the Gen 9 OU ladder. The minimax GPT-4o version was evaluated on the main Showdown Gen 9 OU ladder with an ELO of 1300-1500.
"PAC-MM-X" usernames are the Metamon sequence model policies.
X is a shortened version of their name in Metamon (metamon/rl).
These policies were evaluated on the main Showdown ladder and provide a point of reference against human players.
Their rankings were:
For those less familiar with Showdown ratings, the figure below reframes them in terms of an approximate percentile among active usernames:
However, on the PokéAgent ladder, the PAC-MM usernames sample from a new set of ~10k team files per format, which adds variety but decreases performance.
More policies will be added to the ladder over time. Currently, there is only one unreleased model; it plays under the username "PAC-MM-SmallRLG9" and is roughly equivalent to the original "SmallRL" policy, but adds Gen 9 OU replays. On the Gen 9 OU ladder, "PAC-MM-SmallRLG9" plays with a small set of human sample teams, while "PAC-MM-SmallRLG9Var" serves the same model but picks from ~20k teams.
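For anyone filtering ladder results by opponent type, the naming scheme above can be parsed mechanically. The following is a minimal sketch, not an official tool: the `LadderBaseline` class and `parse_baseline_username` function are hypothetical names, and the code assumes usernames follow the "PAC-BH-X", "PAC-PC-Method-Model", and "PAC-MM-X" patterns exactly as written, with any extra hyphens belonging to the last field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LadderBaseline:
    family: str            # "heuristic", "pokechamp", or "metamon"
    name: Optional[str]    # the "X" part of PAC-BH-X / PAC-MM-X
    method: Optional[str]  # prompt scheme of a PAC-PC agent (e.g. io, minimax)
    model: Optional[str]   # LLM backend of a PAC-PC agent


def parse_baseline_username(username: str) -> Optional[LadderBaseline]:
    """Return the baseline described by a PAC-* username, or None if it is not one."""
    parts = username.split("-")
    if len(parts) < 3 or parts[0] != "PAC":
        return None
    kind, rest = parts[1], parts[2:]
    if kind == "BH":
        return LadderBaseline("heuristic", "-".join(rest), None, None)
    if kind == "MM":
        return LadderBaseline("metamon", "-".join(rest), None, None)
    if kind == "PC" and len(rest) >= 2:
        # Assumption: the model name may itself contain hyphens, so keep the remainder whole.
        return LadderBaseline("pokechamp", None, rest[0], "-".join(rest[1:]))
    return None


print(parse_baseline_username("PAC-MM-SmallRLG9"))
```

Usernames that do not start with "PAC-" return `None`, so the same function can separate organizer baselines from participant entries, assuming participants do not register names with that prefix.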