Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL

Mahsa Bastankhah1★, Grace Liu2★, Dilip Arumugam1, Thomas L. Griffiths1, Benjamin Eysenbach1
1 Princeton University    2 Carnegie Mellon University
★ Equal contribution
📄 Paper (PDF) 🧮 arXiv 💻 Code 📚 BibTeX

Abstract

  • We take a first step toward elucidating the mechanisms behind emergent exploration in unsupervised reinforcement learning.
  • We study Single-Goal Contrastive Reinforcement Learning (SGCRL) (Liu et al., 2025), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks without external rewards or curricula.
  • We combine theoretical analysis of the algorithm's objective function with controlled experiments to understand what drives its exploration.
  • We show that SGCRL maximizes implicit rewards shaped by its learned representations, which automatically modify the reward landscape to promote exploration before reaching the goal and exploitation thereafter (a minimal sketch of such a contrastive critic appears below the abstract).
  • Our experiments demonstrate that these exploration dynamics arise from learning low-rank representations of the state space rather than from neural network function approximation.
  • Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.
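
As a rough illustration of the objective we analyze (not the paper's exact implementation), the sketch below trains a contrastive critic of the form f(s, g) = φ(s) · ψ(g): (state, goal) pairs drawn from the same trajectory should score higher than mismatched pairs, and the learned critic value f(s, goal) then serves as the implicit reward the policy maximizes. The network sizes, the negative-sampling scheme, and the names phi_net, psi_net, and contrastive_loss are illustrative assumptions, not SGCRL's actual code.

# Minimal sketch (assumptions noted above) of a contrastive critic
# f(s, g) = phi(s) . psi(g). Positive pairs are (state, future state from the
# same trajectory); the other rows in the batch serve as negatives.
import torch
import torch.nn as nn

REPR_DIM = 16  # low-rank: the critic is an inner product of 16-dim embeddings

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

phi_net = mlp(in_dim=4, out_dim=REPR_DIM)  # hypothetical state encoder phi(s)
psi_net = mlp(in_dim=4, out_dim=REPR_DIM)  # hypothetical goal encoder psi(g)

def contrastive_loss(states, goals):
    """InfoNCE-style loss: the i-th state is paired with the i-th goal
    (positive, on the diagonal); all other goals in the batch are negatives."""
    logits = phi_net(states) @ psi_net(goals).T  # [B, B] critic values f(s, g)
    labels = torch.arange(states.shape[0])       # positives lie on the diagonal
    return nn.functional.cross_entropy(logits, labels)

# Toy usage: random data standing in for (state, future-state) pairs sampled
# from a replay buffer; f(s, goal) can then be read off as an implicit reward.
states = torch.randn(32, 4)
goals = states + 0.1 * torch.randn(32, 4)
loss = contrastive_loss(states, goals)
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")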

🚀 Try Our Interactive Demo!

See how SGCRL's representations drive exploration in the Four Rooms environment with our hands-on notebook.

Open Interactive Notebook

✨ No setup required • Runs directly in your browser

Contrastive representations are essential for exploration!

Our main finding is that contrastive representations induce an implicit curriculum: exploration before the goal is reached and exploitation afterward. This curriculum follows from how each state's similarity to the goal evolves in representation space, a quantity we call goal-similarity. Early in training, most states have high goal-similarity, so the agent optimistically explores many regions. As training continues, the representations of explored non-goal states drift away from the goal's representation, lowering their goal-similarity and making revisits less likely; once the goal is found, states along the successful path increase slightly in goal-similarity, enabling exploitation.
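
To make the goal-similarity measure concrete, the following sketch computes it for every state of a small discrete environment given learned embeddings. Whether the paper scores states with a raw inner product, cosine similarity, or a distance is an implementation detail; the inner product, the 5×5 grid, and the random embeddings below are illustrative assumptions.

# Sketch of the goal-similarity heat maps shown in the demos: score each
# state's learned representation against the goal's representation.
import numpy as np

rng = np.random.default_rng(0)

NUM_STATES, REPR_DIM = 25, 16  # e.g. a 5x5 gridworld with 16-dim embeddings
state_reprs = rng.normal(size=(NUM_STATES, REPR_DIM))  # stand-in for learned phi(s)
goal_repr = rng.normal(size=REPR_DIM)                   # stand-in for learned psi(g)

goal_similarity = state_reprs @ goal_repr  # one score per state
heatmap = goal_similarity.reshape(5, 5)    # arrange on the grid for plotting

# During training, visited non-goal states drift toward lower scores
# (discouraging revisits); once the goal is reached, states on the
# successful path gain score.
print(np.round(heatmap, 2))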

Demo: Exploration behavior in single-goal contrastive RL.
Demo: Tower of Hanoi animation showing how representations change with state visitation. Top left: each state's representational goal-similarity. Bottom left: state-visitation counts over the past 4 collected trajectories and 1 evaluation trajectory. Right: rendering of the evaluation trajectory. Note that before the goal is reached, the goal-similarity of frequently visited states decreases; after the goal is reached, it increases.

BibTeX

@misc{bastankhah2025demystifyingmechanismsemergentexploration,
      title={Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL}, 
      author={Mahsa Bastankhah and Grace Liu and Dilip Arumugam and Thomas L. Griffiths and Benjamin Eysenbach},
      year={2025},
      eprint={2510.14129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.14129}, 
}