Abstract
- We take a first step toward elucidating the mechanisms behind emergent exploration in unsupervised reinforcement learning.
- We study Single-Goal Contrastive Reinforcement Learning (SGCRL) (Liu et al., 2025), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks without external rewards or curricula.
- We combine theoretical analysis of the algorithm's objective function with controlled experiments to understand what drives its exploration.
- We show that SGCRL maximizes implicit rewards shaped by its learned representations, which automatically modify the reward landscape to promote exploration before reaching the goal and exploitation thereafter (see the sketch just below this list).
- Our experiments demonstrate that these exploration dynamics arise from learning low-rank representations of the state space rather than from neural network function approximation.
- Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.
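To make the "implicit reward" above concrete: in contrastive goal-conditioned RL, the learned critic scores a state by the similarity between its representation φ(s) and the goal representation ψ(g), and the policy climbs that score. The snippet below is our own minimal sketch of this quantity, with fixed random linear encoders and made-up dimensions standing in for the learned networks; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) encoders: in the paper these are learned neural networks;
# here we use fixed random linear maps just to show the quantity being maximized.
STATE_DIM, GOAL_DIM, REPR_DIM = 4, 4, 8
W_phi = rng.normal(size=(REPR_DIM, STATE_DIM))  # state encoder phi
W_psi = rng.normal(size=(REPR_DIM, GOAL_DIM))   # goal encoder psi

def phi(s):
    """State representation."""
    return W_phi @ s

def psi(g):
    """Goal representation."""
    return W_psi @ g

def implicit_reward(s, g):
    """Goal-similarity: inner product between the two representations.
    A policy that maximizes this score inherits whatever reward shaping
    the current representations encode."""
    return float(phi(s) @ psi(g))

s = rng.normal(size=STATE_DIM)
g = rng.normal(size=GOAL_DIM)
print("goal-similarity r(s, g) =", implicit_reward(s, g))
```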
🚀 Try Our Interactive Demo!
Explore how SGCRL representations drive exploration in the Four Rooms environment with our hands-on notebook.
Open the interactive notebook. ✨ No setup required; it runs directly in your browser.
Contrastive representations are essential for exploration!
Our main finding is that contrastive representations induce an implicit curriculum: exploration before the goal is reached and exploitation afterward. This follows from how states' similarity to the goal evolves in representation space (a measure we call goal-similarity). Early in training, states have high goal-similarity, so the agent optimistically explores many regions. As training continues, the representations of explored non-goal states drift away from the goal representation, reducing the probability of revisiting them; once the goal is found, states along the successful path gain slightly in goal-similarity, enabling exploitation.
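One minimal way to see this mechanism is a single InfoNCE-style update on toy tabular embeddings. This is our own sketch under stated assumptions (the loss form, learning rate, and dimensions are illustrative, not the SGCRL implementation): the state treated as a positive, i.e. one lying on a trajectory that reached the goal, gains goal-similarity, while the states sampled as negatives lose it, which is exactly the explore-then-exploit reshaping described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (assumed): tabular embeddings for 5 states and one goal.
N_STATES, DIM, LR = 5, 8, 0.5
phi = rng.normal(size=(N_STATES, DIM))  # state representations
psi = rng.normal(size=DIM)              # goal representation

def goal_similarity(phi, psi):
    return phi @ psi

def infonce_step(phi, psi, positive_idx):
    """One gradient step on an InfoNCE-style loss:
    -log softmax(phi @ psi)[positive_idx].
    The positive state's similarity to the goal increases;
    the other (negative) states' similarities decrease."""
    logits = phi @ psi
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad_logits = p.copy()
    grad_logits[positive_idx] -= 1.0                 # d(loss)/d(logit_i)
    return phi - LR * np.outer(grad_logits, psi)     # gradient step on phi

before = goal_similarity(phi, psi)
phi = infonce_step(phi, psi, positive_idx=3)  # state 3 lies on a successful path
after = goal_similarity(phi, psi)
print("change in goal-similarity per state:", np.round(after - before, 3))
# State 3 moves up; the visited non-goal states move down, so before the goal is
# found the agent is nudged away from already-explored regions, and afterward it
# is pulled back along the path that worked.
```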
BibTeX
@misc{bastankhah2025demystifyingmechanismsemergentexploration,
title={Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL},
author={Mahsa Bastankhah and Grace Liu and Dilip Arumugam and Thomas L. Griffiths and Benjamin Eysenbach},
year={2025},
eprint={2510.14129},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.14129},
}