Overcooked V2¶
Seven environments from Gessler et al., 2025 that test coordination under asymmetric information and stochasticity. Each episode samples a hidden target recipe. One agent can observe the recipe through a nearby indicator; the other must infer it through communication or partner behavior.
Uses the same actions and cooking pipeline as Overcooked V1. Key differences: partial observability, stochastic recipe selection, open pots that accept any ingredient, and reward shaping that penalizes incorrect actions.
Variants¶
| Environment ID | Category | Layout |
|---|---|---|
OvercookedV2-CrampedRoomIndicator-V0 |
Indicator Only | 5x4 |
OvercookedV2-GroundedCoordSimple-V0 |
Grounded Coordination | 8x5 |
OvercookedV2-GroundedCoordRing-V0 |
Grounded Coordination | 9x9 |
OvercookedV2-TestTimeSimple-V0 |
Test-Time Protocol | 8x5 |
OvercookedV2-TestTimeWide-V0 |
Test-Time Protocol | 6x7 |
OvercookedV2-DemoCookSimple-V0 |
Demo Cook | 11x5 |
OvercookedV2-DemoCookWide-V0 |
Demo Cook | 11x6 |
Coordination Categories¶
| Category | Button | Incorrect Penalty | How Recipe is Communicated |
|---|---|---|---|
| Grounded Coordination | Yes (-5 cost) | -20 | Button reveals recipe to partner for 10 steps |
| Test-Time Protocol | No | -20 | Agents must develop implicit signaling conventions |
| Demo Cook | No | None | One agent demonstrates the recipe through actions |
Stochastic Recipe Selection¶
At reset, a target recipe is drawn uniformly from ["onion_soup", "tomato_soup"]. After each correct delivery, the target is resampled. Agents cannot memorize a fixed strategy; they must read the recipe each episode and adapt.
Partial Observability¶
All V2 environments use local_view with local_view_radius=2, producing a 5x5 agent-centered window. Each agent sees only the grid cells within 2 steps of its position. This creates information asymmetry: an agent near the recipe indicator can read the target, while an agent elsewhere cannot.
Recipe and Button Indicators¶
The RecipeIndicator (R) is a wall tile whose state encodes the current target recipe. It is always active — any agent whose view window covers it can read the recipe.
The ButtonIndicator (L) is inactive by default. An agent can toggle it (empty-handed, costs -5 reward) to reveal the target recipe at that tile for 10 steps, then it deactivates. This is the communication channel in Grounded Coordination layouts.
OpenPot¶
The OpenPot (u) accepts any 3-ingredient combination — onion, tomato, broccoli, or mushroom in any mix. There is no validation at placement time. Correctness is checked only at delivery: correct soup = +20, incorrect = -20 (or 0 in Demo Cook). Cook time is 20 steps.
V2-Specific Objects¶
In addition to shared Overcooked objects:
| Char | Name | Description |
|---|---|---|
u |
OpenPot | Pot accepting any 3-ingredient combination (cook time 20) |
R |
RecipeIndicator | Displays the current target recipe (wall tile) |
L |
ButtonIndicator | Toggle to reveal recipe for 10 steps (-5 cost) |
X |
OpenDeliveryZone | Accepts any soup type for delivery |
B |
BroccoliStack | Distractor ingredient dispenser |
M |
MushroomStack | Distractor ingredient dispenser |
Observations¶
The local view is a (5, 5, 35) tensor (flattened to 875) with these channel groups:
| Channels | Count | Description |
|---|---|---|
| Core | 8 | Object type map, agent positions (2), directions (2), inventories (2), object state map |
| Pot state | 4 | Is cooking, is ready, fill level, cook timer (all at pot positions only) |
| Pot ingredients | 4 | Per-ingredient count in pot (onion, tomato, broccoli, mushroom), normalized by capacity |
| Decomposed inventory | 12 | [plate, cooked, onion, tomato, broccoli, mushroom] at each agent's position (self first, 6 channels per agent) |
| Recipe decomposition | 6 | [plate, cooked, onion, tomato, broccoli, mushroom] at RecipeIndicator and active ButtonIndicator positions |
| Delivery indicator | 1 | 1.0 at delivery zone positions on the step a delivery occurred |
Recipe decomposition channels are non-zero only at indicator tiles. The button's channels activate when its timer is running and return to zero when it expires.
Rewards¶
| Class | Coefficient | Common | Trigger |
|---|---|---|---|
TargetRecipeDeliveryReward |
20.0 | Yes | Correct delivery +20, incorrect -20 |
TargetRecipeIngredientInPotReward |
3.0 | Yes | Correct ingredient in pot +3, incorrect -3 |
TargetRecipeSoupInDishReward |
5.0 | Yes | Correct soup plated +5, incorrect -5 |
ButtonActivationCost |
-5.0 | Yes | Toggle the button indicator |
Shaped rewards (ingredient-in-pot and soup-in-dish) can be annealed to zero over training via env.set_reward_coefficients().
Layouts¶
Cramped Room Indicator¶
Grounded Coordination Simple¶
Grounded Coordination Ring¶
Test-Time Protocol Simple¶
Test-Time Protocol Wide¶
Demo Cook Simple¶
Demo Cook Wide¶
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
target_recipes |
list[str] |
["onion_soup", "tomato_soup"] |
Recipes the target is sampled from |
resample_on_delivery |
bool |
True |
Resample target recipe after correct delivery |
local_view_radius |
int |
2 |
Observation window radius (5x5 at radius 2) |
max_steps |
int |
400 |
Episode length |