Frequently Asked Questions

What is the PokéAgent Challenge?

The PokéAgent Challenge is a NeurIPS 2025 competition designed to establish Pokémon battling and gameplay as benchmarks for general decision-making in AI. It features two tracks: Competitive Battling and RPG Speedrunning, which together unify reinforcement learning and large language model research.

Who can participate in the competition?

The competition is open to all individuals and teams, with no restrictions on team size or affiliation. Participants may enter either or both tracks (Battles, Speedrunning).

What resources will be provided to participants?

Participants will have access to:

  • A dataset of over 3.5 million Pokémon battles
  • Starter code and baseline implementations
  • A comprehensive knowledge base compiled from Bulbapedia
  • Detailed documentation and tutorials
  • A dedicated Discord server for support

Can I use external resources like LLMs in my solution?

Yes. Track 1 permits external LLMs, provided their use is fully documented, and supports training on our 3.5M-battle dataset or through self-play. Track 2 allows most methodological approaches in which a neural network produces the action. However, we do not allow heavy reliance on heuristics, because we want to encourage generalizable solutions. We reserve the right to disqualify submissions that we deem to violate the rules.

How will submissions be evaluated?

For Track 1 (Pokémon Battles), we will use established player rating schemes, win rates against baselines, and metrics for efficiency and reliability.

For Track 2 (RPG Speedrunning), the primary metrics are time and completion percentage, which measure progress through a standardized list of critical game milestones; the secondary metric is success rate.
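As a rough illustration of the Track 2 metrics, the sketch below computes a completion percentage and a success rate from per-run milestone timestamps. The milestone names, the data layout (a dict mapping each reached milestone to elapsed seconds), and the definition of a successful run as one that reaches every milestone are all assumptions made for this example; the official scoring rules may differ.

```python
from typing import Dict, Sequence

# Hypothetical milestone list; the real standardized list is set by the organizers.
MILESTONES = ["starter_chosen", "first_gym", "second_gym", "third_gym"]


def completion_percentage(milestones: Sequence[str], reached: Dict[str, float]) -> float:
    """Percentage of milestones that appear in a run's `reached` record."""
    if not milestones:
        raise ValueError("milestone list must not be empty")
    done = sum(1 for m in milestones if m in reached)
    return 100.0 * done / len(milestones)


def success_rate(milestones: Sequence[str], runs: Sequence[Dict[str, float]]) -> float:
    """Fraction of runs that reached every milestone (assumed definition)."""
    if not runs:
        raise ValueError("at least one run is required")
    full = sum(1 for run in runs if all(m in run for m in milestones))
    return full / len(runs)


# Each run maps a reached milestone to the elapsed time in seconds.
run_a = {"starter_chosen": 120.0, "first_gym": 1800.0}
run_b = {"starter_chosen": 95.0, "first_gym": 1500.0,
         "second_gym": 3200.0, "third_gym": 5400.0}

pct_a = completion_percentage(MILESTONES, run_a)
pct_b = completion_percentage(MILESTONES, run_b)
rate = success_rate(MILESTONES, [run_a, run_b])
```

Here run_a reaches two of the four milestones and run_b reaches all four, so only one of the two runs counts as a success under the assumed definition.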

Are there prizes for winning teams?

Yes, incentives include monetary prizes (subject to sponsorship), research collaboration opportunities, presentation slots at NeurIPS 2025, computational resources, and the opportunity to co-author a subsequent NeurIPS 2026 submission for top solutions.

What are the organizer-hosted baselines on the Track 1 ladder?

"PAC-BH-X" usernames are basic heuristics meant to fill in the bottom of the ladder. X is a shortened version of their name in Metamon (metamon/baselines). For reference, the relative strength of these heuristics (averaged over Gen 1-4 OU) is shown below:

[Figure: Heuristic Baseline Strength Heatmap]

"PAC-PC-Method-Model" usernames are PokéChamp LLM agents. "Method" denotes the prompt scheme (io/minimax), while "Model" is the LLM backend (Llama, Gemini, etc.). Currently, these agents are only active on the Gen 9 OU ladder. The minimax GPT-4o version was evaluated on the main Showdown Gen 9 OU ladder with an Elo rating of 1300-1500.
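To give a feel for what a rating gap in that range means, the sketch below applies the standard Elo expected-score formula with the conventional 400-point scale. It assumes the ladder rating behaves like textbook Elo; it does not attempt to reproduce how Showdown actually updates ratings.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of player A against player B
    under the standard Elo model with a 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Two equally rated players are each expected to score 0.5.
even = elo_expected_score(1500, 1500)

# A 1500-rated player against a 1300-rated player: about 0.76.
favored = elo_expected_score(1500, 1300)
```

Under this model, the two ends of the 1300-1500 range are separated by roughly a 3-to-1 expected win ratio.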

"PAC-MM-X" usernames are the Metamon sequence model policies. X is a shortened version of their name in Metamon (metamon/rl). These policies were evaluated on the main Showdown ladder and provide a point of reference against human players. Their rankings were:

[Figure: Metamon RL Policy Human Ratings]

For those less familiar with Showdown ratings, the figure below reframes them as an approximate percentile among active usernames:

[Figure: Showdown Rating Percentile Translation]

However, on the PokéAgent ladder, the PAC-MM usernames are sampling from a new set of ~10k team files per format, which adds variety but decreases performance.

More policies will be added to the ladder over time. Currently, there is only one unreleased model; it plays under the username "PAC-MM-SmallRLG9" and is roughly equivalent to the original "SmallRL" policy, but adds Gen 9 OU replays. On the Gen 9 OU ladder, "PAC-MM-SmallRLG9" plays with a small set of human sample teams, while "PAC-MM-SmallRLG9Var" serves the same model but picks from ~20k teams.
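For participants who want to separate organizer baselines from other opponents when analyzing their ladder results, the naming scheme described above can be parsed with a few lines of Python. This is a minimal sketch: the function and class names are hypothetical, and the example username "PAC-PC-minimax-GPT-4o" is made up to illustrate the pattern rather than copied from the ladder.

```python
from dataclasses import dataclass
from typing import Optional

# Maps the second username segment to a baseline family.
FAMILIES = {"BH": "heuristic", "PC": "pokechamp", "MM": "metamon"}


@dataclass(frozen=True)
class BaselineName:
    family: str                   # "heuristic", "pokechamp", or "metamon"
    name: str                     # X for BH/MM usernames; the LLM backend for PC
    method: Optional[str] = None  # prompt scheme, set only for PC usernames


def parse_baseline_username(username: str) -> Optional[BaselineName]:
    """Return the parsed baseline, or None if the username does not match."""
    parts = username.split("-", 2)
    if len(parts) != 3 or parts[0] != "PAC" or parts[1] not in FAMILIES:
        return None
    family, rest = FAMILIES[parts[1]], parts[2]
    if not rest:
        return None
    if parts[1] == "PC":
        # Split only on the first hyphen so backends such as "GPT-4o" stay intact.
        method, sep, model = rest.partition("-")
        if not sep or not method or not model:
            return None
        return BaselineName(family, model, method)
    return BaselineName(family, rest)


metamon = parse_baseline_username("PAC-MM-SmallRLG9")
pokechamp = parse_baseline_username("PAC-PC-minimax-GPT-4o")  # hypothetical username
human = parse_baseline_username("SomeHumanPlayer")
```

Usernames that do not follow the pattern return None, so the same function can be used to filter organizer baselines out of a battle log.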