Overcooked V2¶

Seven environments from Gessler et al., 2025 that test coordination under asymmetric information and stochasticity. Each episode samples a hidden target recipe. One agent can observe the recipe through a nearby indicator; the other must infer it through communication or partner behavior.

Overcooked V2 uses the same actions and cooking pipeline as Overcooked V1. The key differences are partial observability, stochastic recipe selection, open pots that accept any ingredient, and reward shaping that penalizes incorrect actions.

Variants¶

Environment ID	Category	Layout
`OvercookedV2-CrampedRoomIndicator-V0`	Indicator Only	5x4
`OvercookedV2-GroundedCoordSimple-V0`	Grounded Coordination	8x5
`OvercookedV2-GroundedCoordRing-V0`	Grounded Coordination	9x9
`OvercookedV2-TestTimeSimple-V0`	Test-Time Protocol	8x5
`OvercookedV2-TestTimeWide-V0`	Test-Time Protocol	6x7
`OvercookedV2-DemoCookSimple-V0`	Demo Cook	11x5
`OvercookedV2-DemoCookWide-V0`	Demo Cook	11x6

Coordination Categories¶

Category	Button	Incorrect Penalty	How Recipe is Communicated
Grounded Coordination	Yes (-5 cost)	-20	Button reveals recipe to partner for 10 steps
Test-Time Protocol	No	-20	Agents must develop implicit signaling conventions
Demo Cook	No	None	One agent demonstrates the recipe through actions

Stochastic Recipe Selection¶

At reset, a target recipe is drawn uniformly from `["onion_soup", "tomato_soup"]`. After each correct delivery, the target is resampled. Because the target can change within an episode, agents cannot memorize a fixed strategy; they must re-read the recipe after every reset and every correct delivery, then adapt.
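
The sampling rule above can be sketched in a few lines. This is a minimal illustration, not the environment's implementation: `RecipeSampler` and its methods are hypothetical names, and the only facts taken from this page are the uniform draw at reset and the redraw after a correct delivery.

```python
import random

TARGET_RECIPES = ["onion_soup", "tomato_soup"]  # default listed under Configuration


class RecipeSampler:
    """Hypothetical helper that holds the hidden target recipe."""

    def __init__(self, recipes, seed=None, resample_on_delivery=True):
        self.recipes = list(recipes)
        self.rng = random.Random(seed)
        self.resample_on_delivery = resample_on_delivery
        self.target = None

    def reset(self):
        # Uniform draw at the start of every episode.
        self.target = self.rng.choice(self.recipes)
        return self.target

    def on_delivery(self, delivered):
        # Only a correct delivery triggers a new uniform draw.
        correct = delivered == self.target
        if correct and self.resample_on_delivery:
            self.target = self.rng.choice(self.recipes)
        return correct


sampler = RecipeSampler(TARGET_RECIPES, seed=0)
first_target = sampler.reset()
sampler.on_delivery(first_target)  # correct: the target is redrawn
print(sampler.target in TARGET_RECIPES)  # True
```

Because the redraw is uniform, the new target can equal the old one, so a correct delivery is not by itself a signal that the recipe changed.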

Partial Observability¶

All V2 environments use `local_view` with `local_view_radius=2`, producing a 5x5 agent-centered window. Each agent sees only the grid cells within 2 rows and 2 columns of its position. This creates information asymmetry: an agent near the recipe indicator can read the target, while an agent elsewhere cannot.
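
To make the asymmetry concrete, the sketch below crops a 5x5 window out of the Grounded Coordination Simple grid and checks whether a `R` tile falls inside it. It rests on assumptions the page does not confirm: that `+` marks agent start cells, and that cells outside the grid are padded (here with `#`). `local_view` and `sees_recipe` are illustrative helpers, not library functions.

```python
GRID = [
    "CCBCCCCC",
    "C  C=  O",
    "R +Lu+ X",
    "C  C=  T",
    "CCBCCCCC",
]


def local_view(grid, agent_pos, radius=2, pad="#"):
    """Return the (2r+1) x (2r+1) window centered on agent_pos = (row, col)."""
    row, col = agent_pos
    height, width = len(grid), len(grid[0])
    window = []
    for r in range(row - radius, row + radius + 1):
        line = ""
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < height and 0 <= c < width
            line += grid[r][c] if inside else pad  # padding is an assumption
        window.append(line)
    return window


def sees_recipe(grid, agent_pos, radius=2):
    # The recipe is readable only if an 'R' tile lies inside the window.
    return any("R" in line for line in local_view(grid, agent_pos, radius))


left_agent = (2, 2)   # assumed start cell: the left '+'
right_agent = (2, 5)  # assumed start cell: the right '+'
print(sees_recipe(GRID, left_agent))   # True
print(sees_recipe(GRID, right_agent))  # False
```

From these positions only the left agent has the indicator in view, which is the situation the button in this layout is meant to resolve.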

Recipe and Button Indicators¶

The RecipeIndicator (R) is a wall tile whose state encodes the current target recipe. It is always active: any agent whose view window covers it can read the recipe.

The ButtonIndicator (L) is inactive by default. An empty-handed agent can toggle it at a cost of -5 reward, which reveals the target recipe at that tile for 10 steps before the button deactivates. This is the communication channel in Grounded Coordination layouts.
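
The button's timer behaviour can be sketched as a small state machine. The class below is hypothetical. The 10-step reveal, the -5 cost, and the empty-handed requirement come from this page. Ignoring a toggle while the button is already active, and counting the 10 steps from the step after the toggle, are assumptions.

```python
class ButtonIndicatorSketch:
    """Hypothetical model of the ButtonIndicator timer."""

    REVEAL_STEPS = 10
    ACTIVATION_COST = -5.0

    def __init__(self):
        self.timer = 0  # remaining steps the recipe stays visible

    @property
    def active(self):
        return self.timer > 0

    def toggle(self, agent_holding_item):
        """Return the reward for a toggle attempt."""
        # Assumption: a toggle is ignored (and free) if the agent holds
        # something or the button is already active.
        if agent_holding_item or self.active:
            return 0.0
        self.timer = self.REVEAL_STEPS
        return self.ACTIVATION_COST

    def step(self):
        # Called once per environment step to count the reveal down.
        if self.timer > 0:
            self.timer -= 1

    def visible_recipe(self, target):
        # The tile shows the target only while the timer is running.
        return target if self.active else None


button = ButtonIndicatorSketch()
print(button.toggle(agent_holding_item=False))  # -5.0
print(button.visible_recipe("onion_soup"))      # onion_soup
```

The partner only benefits if its view window covers the button tile during those 10 steps, so the cost buys an opportunity to communicate rather than a guaranteed message.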

OpenPot¶

The OpenPot (u) accepts any 3-ingredient combination of onion, tomato, broccoli, and mushroom. There is no validation at placement time; correctness is checked only at delivery, where a correct soup earns +20 and an incorrect one earns -20 (0 in Demo Cook). Cook time is 20 steps.
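
The deferred check can be illustrated with a toy pot. Everything named here is hypothetical. In particular, the page does not define what goes into each soup, so the sketch assumes three units of a single ingredient (as in Overcooked V1) and assumes cooking starts as soon as the pot is full.

```python
from collections import Counter

INGREDIENTS = {"onion", "tomato", "broccoli", "mushroom"}
POT_CAPACITY = 3
COOK_TIME = 20

# Assumption: each soup is three units of one ingredient, as in Overcooked V1.
RECIPES = {
    "onion_soup": Counter(onion=3),
    "tomato_soup": Counter(tomato=3),
}


class OpenPotSketch:
    """Hypothetical pot that never rejects a known ingredient."""

    def __init__(self):
        self.contents = []
        self.timer = None  # None until the pot is full

    def add(self, ingredient):
        # No recipe validation here: only capacity and ingredient type matter.
        if ingredient not in INGREDIENTS or len(self.contents) >= POT_CAPACITY:
            return False
        self.contents.append(ingredient)
        if len(self.contents) == POT_CAPACITY:
            self.timer = COOK_TIME  # assumption: cooking starts when full
        return True

    def step(self):
        if self.timer:  # skips both None (not full) and 0 (done)
            self.timer -= 1

    @property
    def ready(self):
        return self.timer == 0


def delivery_reward(contents, target, penalize_incorrect=True):
    """Correctness is judged only here, at delivery time."""
    if Counter(contents) == RECIPES[target]:
        return 20.0
    return -20.0 if penalize_incorrect else 0.0  # 0 models Demo Cook


print(delivery_reward(["onion", "tomato", "broccoli"], "onion_soup"))  # -20.0
```

A mixed pot therefore costs 20 steps of cooking before any penalty appears, which is why reading the recipe before filling the pot matters.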

V2-Specific Objects¶

In addition to shared Overcooked objects:

Char	Name	Description
`u`	OpenPot	Pot accepting any 3-ingredient combination (cook time 20)
`R`	RecipeIndicator	Displays the current target recipe (wall tile)
`L`	ButtonIndicator	Toggle to reveal recipe for 10 steps (-5 cost)
`X`	OpenDeliveryZone	Accepts any soup type for delivery
`B`	BroccoliStack	Distractor ingredient dispenser
`M`	MushroomStack	Distractor ingredient dispenser

Observations¶

The local view is a (5, 5, 35) tensor (flattened to 875) with these channel groups:

Channels	Count	Description
Core	8	Object type map, agent positions (2), directions (2), inventories (2), object state map
Pot state	4	Is cooking, is ready, fill level, cook timer (all at pot positions only)
Pot ingredients	4	Per-ingredient count in pot (onion, tomato, broccoli, mushroom), normalized by capacity
Decomposed inventory	12	`[plate, cooked, onion, tomato, broccoli, mushroom]` at each agent's position (self first, 6 channels per agent)
Recipe decomposition	6	`[plate, cooked, onion, tomato, broccoli, mushroom]` at RecipeIndicator and active ButtonIndicator positions
Delivery indicator	1	1.0 at delivery zone positions on the step a delivery occurred

Recipe decomposition channels are non-zero only at indicator tiles. The button's channels activate when its timer is running and return to zero when it expires.
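
The channel arithmetic can be checked directly. The sketch below adds up the group sizes from the table and derives index ranges for each group. It assumes the groups are concatenated in the order the table lists them, which the page does not state explicitly. The names are illustrative.

```python
# (group name, channel count) in table order; the ordering is an assumption.
CHANNEL_GROUPS = [
    ("core", 8),
    ("pot_state", 4),
    ("pot_ingredients", 4),
    ("decomposed_inventory", 12),
    ("recipe_decomposition", 6),
    ("delivery_indicator", 1),
]


def channel_slices(groups):
    """Map each group name to its slice along the channel axis."""
    slices, start = {}, 0
    for name, count in groups:
        slices[name] = slice(start, start + count)
        start += count
    return slices, start


SLICES, NUM_CHANNELS = channel_slices(CHANNEL_GROUPS)
VIEW_SIZE = 2 * 2 + 1  # 2 * local_view_radius + 1
FLAT_SIZE = VIEW_SIZE * VIEW_SIZE * NUM_CHANNELS

print(NUM_CHANNELS)                    # 35
print(FLAT_SIZE)                       # 875
print(SLICES["recipe_decomposition"])  # slice(28, 34, None)
```

Under that ordering assumption, a policy that wants to read the recipe would look at channels 28 to 33 of whichever cell holds an indicator.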

Rewards¶

Class	Coefficient	Common	Trigger
`TargetRecipeDeliveryReward`	20.0	Yes	Correct delivery +20, incorrect -20 (0 in Demo Cook)
`TargetRecipeIngredientInPotReward`	3.0	Yes	Correct ingredient in pot +3, incorrect -3
`TargetRecipeSoupInDishReward`	5.0	Yes	Correct soup plated +5, incorrect -5
`ButtonActivationCost`	-5.0	Yes	Toggle the button indicator

Shaped rewards (ingredient-in-pot and soup-in-dish) can be annealed to zero over training via `env.set_reward_coefficients()`.
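
One way to produce annealed coefficients is a linear schedule, sketched below. The schedule shape and the helper name are assumptions. The page does not give the argument format of `env.set_reward_coefficients()`, so the sketch only computes a dictionary keyed by reward class name and leaves the actual call to the reader.

```python
BASE_COEFFICIENTS = {
    "TargetRecipeDeliveryReward": 20.0,
    "TargetRecipeIngredientInPotReward": 3.0,
    "TargetRecipeSoupInDishReward": 5.0,
    "ButtonActivationCost": -5.0,
}

# Only the two shaped terms are annealed; delivery and button cost stay fixed.
SHAPED = ("TargetRecipeIngredientInPotReward", "TargetRecipeSoupInDishReward")


def annealed_coefficients(step, anneal_steps):
    """Linearly scale shaped coefficients from full value to zero."""
    fraction = max(0.0, 1.0 - step / anneal_steps)
    return {
        name: coef * fraction if name in SHAPED else coef
        for name, coef in BASE_COEFFICIENTS.items()
    }


halfway = annealed_coefficients(step=500_000, anneal_steps=1_000_000)
print(halfway["TargetRecipeIngredientInPotReward"])  # 1.5
print(halfway["TargetRecipeDeliveryReward"])         # 20.0
```

Past `anneal_steps` the shaped terms stay at zero, leaving only the delivery reward and the button cost.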

Layouts¶

Cramped Room Indicator¶

CCuCC
O   T
=   R
CCCXC
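
The grids in this section are plain character maps, so their size and tile positions can be read off with a few lines of Python. The helpers below are illustrative only and are not part of the environment's API. Sizes are reported as width x height to match the Variants table.

```python
CRAMPED_ROOM_INDICATOR = [
    "CCuCC",
    "O   T",
    "=   R",
    "CCCXC",
]


def layout_size(rows):
    """Return (width, height), checking that the grid is rectangular."""
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError("layout rows must all have the same width")
    return widths.pop(), len(rows)


def find_tiles(rows, char):
    """Return (row, col) positions of every tile matching char."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, tile in enumerate(row)
        if tile == char
    ]


print(layout_size(CRAMPED_ROOM_INDICATOR))      # (5, 4)
print(find_tiles(CRAMPED_ROOM_INDICATOR, "R"))  # [(2, 4)]
print(find_tiles(CRAMPED_ROOM_INDICATOR, "u"))  # [(0, 2)]
```

Interior spaces are significant: trimming them when copying a grid changes the layout's width.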

Grounded Coordination Simple¶

CCBCCCCC
C  C=  O
R +Lu+ X
C  C=  T
CCBCCCCC

Grounded Coordination Ring¶

CCCBRBCCC
C       C
C CCLCC C
B O   = B
R+X+u + R
B T   = B
C CCLCC C
C       C
CCCBRBCCC

Test-Time Protocol Simple¶

CCBCCCCC
C  C=  O
R +Cu+ X
C  C=  T
CCBCCCCC

Test-Time Protocol Wide¶

CCX=CC
O +  O
T    T
CuCuCC
M +  M
C    C
CCRCCC

Demo Cook Simple¶

CCCCCRBCoCC
O      C  =
C     +u+ X
T      C  =
CCCCCRBCtCC

Demo Cook Wide¶

CCCC=X=CCCC
CCCO + TCCC
CCCCCuCCCCC
C    +    C
O  CMRMC  O
CTCCCCCCCTC

Configuration¶

Parameter	Type	Default	Description
`target_recipes`	`list[str]`	`["onion_soup", "tomato_soup"]`	Recipes the target is sampled from
`resample_on_delivery`	`bool`	`True`	Resample target recipe after correct delivery
`local_view_radius`	`int`	`2`	Observation window radius (5x5 at radius 2)
`max_steps`	`int`	`400`	Episode length
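
For reference, the defaults above can be collected in a small container. `OvercookedV2ConfigSketch` is a hypothetical dataclass written for this page. How the environment actually accepts these parameters is not specified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OvercookedV2ConfigSketch:
    """Hypothetical container mirroring the defaults in the table above."""

    target_recipes: List[str] = field(
        default_factory=lambda: ["onion_soup", "tomato_soup"]
    )
    resample_on_delivery: bool = True
    local_view_radius: int = 2
    max_steps: int = 400

    @property
    def view_size(self):
        # Side length of the square observation window.
        return 2 * self.local_view_radius + 1


config = OvercookedV2ConfigSketch()
print(config.view_size)  # 5
```

The `view_size` property makes the relationship between radius and window explicit: a radius of 3 would give a 7x7 window.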