Track 1: Competitive Battling - PokéAgent Challenge

Pokémon Showdown is an open-source simulator that transforms Pokémon's turn-based battles into a competitive strategy game enjoyed by thousands of daily players. Competitive Pokémon battles are two-player stochastic games with imperfect information. Players build teams of Pokémon and navigate complex battles by mastering nuanced gameplay mechanics and making decisions under uncertainty. Key details about the opponent’s team remain hidden until they impact the battle, prompting players to infer missing information and anticipate future moves. Top human players excel by accurately predicting their opponent's strategies and leveraging their own team's strengths. The randomness, partial observability, and vast team diversity in Pokémon battles challenge AI's ability to plan and generalize.

Though Pokémon Showdown battle bots have existed for many years, advances in language models, large-scale reinforcement learning datasets, and accessible open-source tools have sparked renewed interest within the machine learning research community. Recent methods have achieved human-level gameplay in popular singles rulesets, prompting an exciting question: How much further can we push the capabilities of Competitive Pokémon AI? Join Track 1 of the PokéAgent Challenge and help us find out!

Timeline and Details

Free Play Period

Available Formats & Baselines

The PokéAgent Showdown server supports the following rulesets ("formats") where ML baselines and datasets are readily available: Gen1OU, Gen2OU, Gen3OU, Gen4OU, Gen9OU, and Gen9 VGC Regulation I.

See this FAQ for more information on format rules.

Organizer baselines will keep the ladders active by serving opponents at various skill levels for testing and development.

September 13th–14th

Ladder Exhibition Weekend

Introduction to Pokémon, lightning-round talks from Pokémon AI projects, and organizer office hours. The top-ranked participants on the Gen1OU and Gen9OU ladders receive $1,000 worth of GCP credits to help fund their research.

September 16th–October 12th

Practice Tournaments

Practice tournaments replicating the final bracket format. Announced on Discord, filled first-come-first-serve. Starter kits will release updates for joining tournaments.

Qualifying Window

All times listed in Central Daylight Time (GMT -5:00)

Rules & Qualification Standard

Teams earn a spot in the final tournament bracket by climbing the ranked ladder.

Top 8 teams ranked by Elo (below a threshold Glicko-1 rating deviation ≤ 40) qualify. This cutoff may expand to 16 teams depending on participation levels.

Monday, October 13th

Team Registration Deadline & Ladder Reset

Showdown server exits free-play mode. Only registered usernames can participate moving forward. Register here.

October 13th – October 19th

Gen1OU Qualifier

Begins: Monday, October 13th at 12:01 AM
Ends: Sunday, October 19th at 11:00 PM

October 20th – October 26th

Gen9OU Qualifier

Begins: Monday, October 20th at 12:01 AM
Ends: Sunday, October 26th at 11:00 PM

Tournament Stage

Tournament Format

Gen1OU and Gen9OU single-elimination brackets
- Bracket seeds determined by qualifying window ladder rankings
- Matches are best-of-99 battles; draws go to the higher seed
- These details are still subject to change as we work to ensure a fair but affordable tournament.
- Agents will be able to pick their own Pokémon teams and change teams between battles (as in a standard OU tournament)

October 29th onwards

Tournament Brackets Begin

Qualifying teams compete in bracket tournaments for cash prizes and GCP credits.

Starting Resources

Competitive Pokémon is extremely complex and creates an interesting game-playing benchmark all by itself. However, we'd add that Pokémon's popularity and Showdown's replay dataset are key features that give this domain a unique place in current AI research:

Pokémon's widespread internet presence equips LLMs with extensive knowledge of its mechanics and strategies.
Pokémon Showdown makes millions of competitive battles publicly available, enabling the creation of large, high-quality datasets.

Taken together, Pokémon creates a fun opportunity for areas like LLM-Agents and RL to compete and collaborate on a level playing field. In that spirit, the PokéAgent Challenge was organized by the teams behind PokéChamp and Metamon, which are recent papers that demonstrate strong human-level play in singles formats. Both projects are open-source and have recently been updated to create a more helpful starting point for this competition.

RL on Human Replays

LLM-Agents, Search, and Scaffolding

These projects have their own repositories and publications where you can find more detailed information. The "Resources" section below highlights datasets and baselines from these efforts that may be more broadly useful in the development of new methods. Participants looking for more of a blank slate to get started are encouraged to check out poke-env — the Python interface to Showdown used by most recent academic work.

Compute Credits

Application closed and credits awarded. All compute credits have been distributed to approved teams.

Interested in ML but new to Pokémon? Read an intro guide here

Submission Guidelines

Prize Eligibility Notice

Exact clones of organizer-hosted baselines are not eligible for prizes. Submissions must demonstrate novel approaches, meaningful modifications, or original implementations. Simple repackaging or minimal changes to existing baseline code will be disqualified from prize consideration.

How to Submit for Track 1

Evaluation for Track 1 will take place on the PokéAgent Showdown server. Test your method by searching for battles on the ladder!

Open PokéAgent's Showdown Server

1. Create Account: Create a Showdown username and password by clicking the gear icon in the top right corner. Bot usernames should begin with "PAC" (PokéAgent Challenge).

2. Deploy: Starter kits have specific instructions and quick-setups to deploy agents on the server. For more general cases, the poke-env server config details would be:

PokeAgentServerConfiguration = ServerConfiguration(
    "wss://pokeagentshowdown.com/showdown/websocket",
    "https://play.pokemonshowdown.com/action.php?",
)

3. Battle: Watch live battles and climb to the top of the ranked ladder! You'll be matched against other participants and a set of organizer-hosted baselines that keep the ladder active.

4. Team Registration (Required): On Monday, October 13th, the practice ladder will be taken offline and only registered usernames may play in the qualifiers. Each team may register exactly one Showdown username for the remainder of the competition.

Additional Resources and Support

Datasets

Showdown Replay Logs

Showdown makes battle replays accessible via a public API. The competition organizers maintain curated datasets for convenience and (a little) extra privacy, and host them on Hugging Face to spare Showdown download requests from this competition. We'd encourage you to use them unless you have a good reason not to. Collectively, they cover the entire range of supported rulesets for the competition:

	Formats	Time Period	Battles
`pokechamp`	Many (39+ (Gen 1-9 OU, Gen 9 VGC, and more))	2024-2025	2M
`metamon-raw-replays`	All PokéAgent Except VGC	2014-2025	1.8M

Replays as Agent Training Data

Showdown replays are saved from the point of view of a spectator rather than the point of view of a player, which can make it difficult to use them directly in an imperfect information game like Pokémon. We need to reconstruct (and often predict) the perspective of each player to create a more typical offline RL or imitation learning dataset.

metamon converts its replay dataset into a flexible format that allows customization of observations, actions, and rewards. More than 3.5M full battle trajectories are stored on Hugging Face at metamon-parsed-replays and can be accessed through the metamon repo. pokechamp has a large, diverse set of replays from almost every gamemode that can be accessed at pokechamp-replays. Utilities in pokechamp can recreate its LLM-Agent prompts and decisions from replays in a similar fashion.

Miscellaneous

teams: All of the Showdown rulesets in the competition require players to pick their own teams. This dataset provides sets of teams gathered from forums, predicted from replays, and/or procedurally generated from Showdown trends. This creates a starting point for anyone less familiar with Competitive Pokémon, and establishes diverse team sets for self-play.

usage-stats: A convenient way to access the Showdown team usage stats for the formats covered by the competition and the timeframe covered by the replay datasets. This dataset also includes a log of all the partially revealed teams in replays. Team prediction is an interesting subproblem that your method may want to address!

Baselines

Organizers will inflate the player pool on the PokéAgent ladder with a rotating cast of existing baselines covering a wide range of skill levels and team choices. At launch, these will include:

Simple Heuristics to simulate the low-ELO ladder and let new methods get started. These are mainly sourced from metamon's basic evaluation opponents.
The best metamon and pokechamp agents playing with competitive team choices. These baselines have already demonstrated performance comparable to strong human players. If all goes well, they will be at the bottom of the leaderboard by the end of the competition!
Mixed metamon and pokechamp agents aimed at increasing variety and forcing participants to battle teams that resemble the real Showdown ladder. For example, varied LLM backends with several prompting and search strategies. Hundreds of checkpoints from metamon policies at various stages of training, sampling from thousands of unique teams.

Launch baselines are already open-source, so you are free to skip the ladder queue by hosting them on your local server with help from their home repo. Any additional (stronger) agents in development by the organizers will be added to the ladder rotation as the competition progresses.

We hope the ladder will also be full of participants trying new ideas; your agents' competition will always be improving!

Support

Competition staff will be active on the community Discord and are committed to answering questions (and fixing any issues that may arise) related to the starter datasets, baselines, competition logistics, and Showdown/Pokémon more broadly. However, due to limited bandwidth, we caution that the technical details involved in improving upon provided methods may be deemed out-of-scope (e.g., RL training details beyond the provided documentation). This is mainly because, given the datasets and baselines, you would have many other viable options that are maintained by larger teams and better suited to a broad audience. Still, please feel free to reach out, and we will help in any way we can.

⏰ Loading...

Overview