Workshop on Reinforcement Learning Beyond Rewards

Reinforcement Learning Conference (RLC) 2024

August 9, 2024

@RLBRew_2024 · #RLBRew_2024


This page contains a non-exhaustive list of resources for machine learning and reinforcement learning researchers and practitioners to learn more about reward-free RL. Feel free to suggest additional resources via a pull request on GitHub.

Detailed Abstract

Reinforcement Learning has been successful in solving tasks in a wide range of fields, including robotics, finance, gameplay, and the management of control systems [1,2,3,4]. However, this success is limited to tasks that can be described via carefully crafted reward functions. For RL algorithms to succeed, it is crucial to design a reward function that aligns with the human intent for the task and is suitably dense. As demonstrated by several works [5,6,7,8], constructing such reward functions is not easy, and the problem only becomes more pronounced when the agent is expected to solve multiple tasks. Moreover, the heavy dependence of RL agents on reward-annotated environment interactions forces the agent to explore from scratch every time the reward function changes. The perils of reward design raise the unresolved question of what, and how, an agent can learn from the often substantial quantity of reward-free interactions with the environment, as well as from alternative learning signals. Prior works have investigated this question by using reward-free interactions for Intrinsic Motivation, Contrastive Learning, Skill Discovery, and Representation Learning. The difficulty of reward function design has also motivated prior art to propose alternative learning signals that can make learning easier, such as expert demonstrations, preferences, implicit human feedback, or target distributions that describe the task to an RL agent.
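To make "alternative learning signals" concrete, the sketch below shows one common way preferences can stand in for a hand-designed reward: fitting a reward model to pairwise segment comparisons with a Bradley-Terry objective, in the spirit of preference-based RL [61]. This is a generic, minimal illustration rather than a method endorsed by the workshop; the network architecture, segment length, data format, and names such as `RewardNet` are assumptions made for the example.

```python
# Minimal sketch (assumes PyTorch, fixed-length state-action segments, and
# binary preference labels) of learning a reward model from pairwise
# preferences with a Bradley-Terry objective.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_net, seg_a, seg_b, prefs):
    """Bradley-Terry loss over segment pairs.

    seg_a, seg_b: dicts with 'obs' [B, T, obs_dim] and 'act' [B, T, act_dim].
    prefs: [B] tensor, 1.0 if segment A is preferred, 0.0 if B is preferred.
    """
    # Sum predicted per-step rewards over each segment to get a segment "return".
    ret_a = reward_net(seg_a["obs"], seg_a["act"]).sum(dim=1)
    ret_b = reward_net(seg_b["obs"], seg_b["act"]).sum(dim=1)
    # Model P(A preferred over B) = sigmoid(ret_a - ret_b); fit with cross-entropy.
    logits = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)

if __name__ == "__main__":
    # One training step on random placeholder data, for illustration only.
    obs_dim, act_dim, B, T = 8, 2, 32, 25
    net = RewardNet(obs_dim, act_dim)
    opt = torch.optim.Adam(net.parameters(), lr=3e-4)
    seg_a = {"obs": torch.randn(B, T, obs_dim), "act": torch.randn(B, T, act_dim)}
    seg_b = {"obs": torch.randn(B, T, obs_dim), "act": torch.randn(B, T, act_dim)}
    prefs = torch.randint(0, 2, (B,)).float()
    loss = preference_loss(net, seg_a, seg_b, prefs)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"preference loss: {loss.item():.3f}")
```

The learned reward model can then be plugged into any standard RL algorithm in place of an environment-provided reward, which is the sense in which such signals sidestep manual reward design.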

This workshop investigates the following core premise: reward-free interactions are abundant in the real world, generated both by AI systems and by humans. Indeed, the difficulty of reward design can be sidestepped by considering alternative learning signals that are easy to collect, such as demonstrations or preferences. Given the success of language and vision models in leveraging large-scale data to create generalist agents, one of the key objectives of this workshop is to ask how far we can progress toward creating similarly generalist behavior agents using reward-free interactions. We hope to facilitate a confluence between researchers working in two often disjoint areas: those exploring how to make the best use of reward-free interactions, and those exploring scalable alternative learning signals to make agents more capable. The discussions sparked by the ongoing work in this space can inspire novel algorithms, methods, and benchmarks that take a step toward making the training of general-purpose RL agents more practical.

References

  1. Smith, L., Cao, Y., & Levine, S. (2023). Grow Your Limits: Continuous Improvement with Real-World RL for Robotic Locomotion. arXiv preprint arXiv:2310.17634.
  2. Hambly, B., Xu, R., & Yang, H. (2021). Recent Advances in Reinforcement Learning in Finance. arXiv preprint. https://doi.org/10.13140/RG.2.2.30278.40002
  3. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint arXiv:1712.01815.
  4. Tracey, B. D., Michi, A., Chervonyi, Y., Davies, I., Paduraru, C., Lazic, N., Felici, F., Ewalds, T., Donner, C., Galperti, C., Buchli, J., Neunert, M., Huber, A., Evens, J., Kurylowicz, P., Mankowitz, D. J., Riedmiller, M., & Team, T. T. (2023). Towards practical reinforcement learning for tokamak magnetic control. arXiv preprint arXiv:2307.11546.
  5. Peng, Xue Bin, et al. "ASE: Large-scale reusable adversarial skill embeddings for physically simulated characters." ACM Transactions on Graphics (TOG) 41.4 (2022): 1-17.
  6. Wüthrich, Manuel et al. “TriFinger: An Open-Source Robot for Learning Dexterity.” Conference on Robot Learning (2020).
  7. Knox, W. Bradley, et al. "Reward (mis)design for autonomous driving." Artificial Intelligence 316 (2023): 103829.
  8. Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML '99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 278–287.
  9. Durugkar, Ishan, et al. "Adversarial intrinsic motivation for reinforcement learning." Advances in Neural Information Processing Systems 34 (2021): 8622-8636.
  10. Adeniji, Ademi, Amber Xie, and Pieter Abbeel. "Skill-based reinforcement learning with intrinsic reward matching." arXiv preprint arXiv:2210.07426 (2022).
  11. Agarwal, Rishabh, et al. "Contrastive behavioral similarity embeddings for generalization in reinforcement learning." arXiv preprint arXiv:2101.05265 (2021).
  12. Schwarzer, Max, et al. "Data-efficient reinforcement learning with self-predictive representations." arXiv preprint arXiv:2007.05929 (2020).
  13. Chuck, Caleb, Supawit Chockchowwat, and Scott Niekum. "Hypothesis-driven skill discovery for hierarchical deep reinforcement learning." 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020.
  14. Chuck, Caleb, et al. "Granger-Causal Hierarchical Skill Discovery." arXiv preprint arXiv:2306.09509 (2023).
  15. Ma, Yecheng Jason, et al. "VIP: Towards universal visual reward and representation via value-implicit pre-training." arXiv preprint arXiv:2210.00030 (2022).
  16. Nair, Suraj, et al. "R3M: A universal visual representation for robot manipulation." arXiv preprint arXiv:2203.12601 (2022).
  17. Sikchi, Harshit, et al. "Dual RL: Unification and new methods for reinforcement and imitation learning." Sixteenth European Workshop on Reinforcement Learning. 2023.
  18. Ni, T., Sikchi, H., Wang, Y., Gupta, T., Lee, L., & Eysenbach, B. (2021, October). f-IRL: Inverse reinforcement learning via state marginal matching. In Conference on Robot Learning (pp. 529-551). PMLR.
  19. Hejna, Joey, and Dorsa Sadigh. "Inverse preference learning: Preference-based RL without a reward function." Advances in Neural Information Processing Systems 36 (2024).
  20. Hejna, J., Rafailov, R., Sikchi, H., Finn, C., Niekum, S., Knox, W. B., & Sadigh, D. (2023). Contrastive preference learning: Learning from human feedback without RL. arXiv preprint arXiv:2310.13639.
  21. Xie, Tengyang, et al. "Interaction-grounded learning." International Conference on Machine Learning. PMLR, 2021.
  22. Xie, Tengyang, et al. "Interaction-grounded learning with action-inclusive feedback." Advances in Neural Information Processing Systems 35 (2022): 12529-12541.
  23. Maghakian, Jessica, et al. "Personalized reward learning with interaction-grounded learning (IGL)." arXiv preprint arXiv:2211.15823 (2022).
  24. Agarwal, Siddhant, et al. "f-Policy Gradients: A General Framework for Goal-Conditioned RL using f-Divergences." Advances in Neural Information Processing Systems 36 (2024).
  25. Feng, F., Huang, B., Zhang, K., & Magliacane, S. (2022). Factored adaptation for non-stationary reinforcement learning. Advances in Neural Information Processing Systems, 35.
  26. Feng, F., & Magliacane, S. (2023). Learning dynamic attribute-factored world models for efficient multi-object reinforcement learning. Advances in Neural Information Processing Systems, 36.
  27. Gaya, J. B., Doan, T., Caccia, L., Soulier, L., Denoyer, L., & Raileanu, R. (2022). Building a subspace of policies for scalable continual learning. arXiv preprint arXiv:2211.10445.
  28. Mediratta, I., You, Q., Jiang, M., & Raileanu, R. (2023). The Generalization Gap in Offline Reinforcement Learning. arXiv preprint arXiv:2312.05742.
  29. Cui, Yuchen, et al. "The empathic framework for task learning from implicit human feedback." Conference on Robot Learning. PMLR, 2021.
  30. Cui, Yuchen, et al. "No, to the right: Online language corrections for robotic manipulation via shared autonomy." Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction. 2023.
  31. Sikchi, Harshit, et al. "Score Models for Offline Goal-Conditioned Reinforcement Learning." arXiv preprint arXiv:2311.02013 (2023).
  32. Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.
  33. Gao, Leo, John Schulman, and Jacob Hilton. "Scaling laws for reward model overoptimization." International Conference on Machine Learning. PMLR, 2023.
  34. Sikchi, Harshit, Wenxuan Zhou, and David Held. "Learning off-policy with online planning." Conference on Robot Learning. PMLR, 2022.
  35. Sikchi, Harshit, et al. "A ranking game for imitation learning." arXiv preprint arXiv:2202.03481 (2022).
  36. Agarwal, Shubhankar, et al. "Imitative planning using conditional normalizing flow." arXiv preprint arXiv:2007.16162 (2020).
  37. Leike, Jan, et al. "Scalable agent alignment via reward modeling: a research direction." arXiv preprint arXiv:1811.07871 (2018).
  38. Ziegler, Daniel M., et al. "Fine-tuning language models from human preferences." arXiv preprint arXiv:1909.08593 (2019).
  39. Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." arXiv preprint arXiv:2305.18290 (2023).
  40. Korbak, Tomasz, et al. "Pretraining language models with human preferences." International Conference on Machine Learning. PMLR, 2023.
  41. Song, Feifan, et al. "Preference ranking optimization for human alignment." arXiv preprint arXiv:2306.17492 (2023).
  42. Dai, Josef, et al. "Safe RLHF: Safe reinforcement learning from human feedback." arXiv preprint arXiv:2310.12773 (2023).
  43. Swamy, Gokul, et al. "A minimaximalist approach to reinforcement learning from human feedback." arXiv preprint arXiv:2401.04056 (2024).
  44. Korbak, Tomasz, et al. "Pretraining language models with human preferences." International Conference on Machine Learning. PMLR, 2023.
  45. Ma, Yecheng Jason, et al. "LIV: Language-Image Representations and Rewards for Robotic Control." arXiv preprint arXiv:2306.00958 (2023).
  46. Ghosh, Dibya, Chethan Anand Bhateja, and Sergey Levine. "Reinforcement learning from passive data via latent intentions." International Conference on Machine Learning. PMLR, 2023.
  47. Park, Seohong, Oleh Rybkin, and Sergey Levine. "METRA: Scalable Unsupervised RL with Metric-Aware Abstraction." arXiv preprint arXiv:2310.08887 (2023).
  48. Zheng, Chongyi, et al. "Stabilizing Contrastive RL: Techniques for Offline Goal Reaching." arXiv preprint arXiv:2306.03346 (2023).
  49. Schmidt, Dominik, and Minqi Jiang. "Learning to Act without Actions." arXiv preprint arXiv:2312.10812 (2023).
  50. Bhateja, Chethan, et al. "Robotic Offline RL from Internet Videos via Value-Function Pre-Training." arXiv preprint arXiv:2309.13041 (2023).
  51. Sun, Hao, et al. "When is Off-Policy Evaluation Useful? A Data-Centric Perspective." arXiv preprint arXiv:2311.14110 (2023).
  52. Sun, Hao, et al. "Accountability in offline reinforcement learning: Explaining decisions with a corpus of examples." arXiv preprint arXiv:2310.07747 (2023).
  53. Hüyük, Alihan, Daniel Jarrett, and Mihaela van der Schaar. "Explaining by imitating: Understanding decisions by interpretable policy learning." arXiv preprint arXiv:2310.19831 (2023).
  54. Gao, Ge, et al. "HOPE: Human-centric off-policy evaluation for e-learning and healthcare." arXiv preprint arXiv:2302.09212 (2023).
  55. Eysenbach, Benjamin, Ruslan Salakhutdinov, and Sergey Levine. "C-learning: Learning to achieve goals via recursive classification." arXiv preprint arXiv:2011.08909 (2020).
  56. Zheng, Chongyi, et al. "Stabilizing Contrastive RL: Techniques for Offline Goal Reaching." arXiv preprint arXiv:2306.03346 (2023).
  57. Yang, Mengjiao, Sergey Levine, and Ofir Nachum. "TRAIL: Near-optimal imitation learning with suboptimal data." arXiv preprint arXiv:2110.14770 (2021).
  58. Lee, Kimin, Laura Smith, and Pieter Abbeel. "PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training." International Conference on Machine Learning. PMLR, 2021.
  59. Ball, P. J., Smith, L., Kostrikov, I., & Levine, S. (2023). Efficient online reinforcement learning with offline data. arXiv preprint arXiv:2302.02948.
  60. Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., ... & Martín-Martín, R. (2021). What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298.
  61. Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in neural information processing systems 30 (2017).
  62. Swamy, G., Choudhury, S., Bagnell, J. A., & Wu, Z. S. (2023). Inverse Reinforcement Learning without Reinforcement Learning. arXiv e-prints, arXiv-2303.
  63. Zhu, Banghua, Jiantao Jiao, and Michael I. Jordan. "Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons." arXiv preprint arXiv:2301.11270 (2023).
  64. Swamy, G., Choudhury, S., Bagnell, J. A., & Wu, S. (2021, July). Of moments and matching: A game-theoretic framework for closing the imitation gap. In International Conference on Machine Learning (pp. 10022-10032). PMLR.
  65. Liu, Z., Guo, Z., Lin, H., Yao, Y., Zhu, J., Cen, Z., ... & Zhao, D. (2023). Datasets and Benchmarks for Offline Safe Reinforcement Learning. arXiv preprint arXiv:2306.09303.
  66. Cen, Z., Liu, Z., Wang, Z., Yao, Y., Lam, H., & Zhao, D. (2024). Learning from Sparse Offline Datasets via Conservative Density Estimation. International Conference on Learning Representations.
  67. Liu, Z., Guo, Z., Yao, Y., Cen, Z., Yu, W., Zhang, T., & Zhao, D. (2023). Constrained Decision Transformer for Offline Safe Reinforcement Learning. International Conference on Machine Learning. PMLR.
  68. Xu, H., Zhan, X., & Zhu, X. (2022, June). Constraints penalized q-learning for safe offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 8, pp. 8753-8760).
  69. Xu, H., Zhan, X., Yin, H., & Qin, H. (2022, June). Discriminator-weighted offline imitation learning from suboptimal demonstrations. In International Conference on Machine Learning (pp. 24725-24742). PMLR.
  70. Xu, H., Jiang, L., Jianxiong, L., & Zhan, X. (2022). A policy-guided imitation approach for offline reinforcement learning. Advances in Neural Information Processing Systems, 35, 4085-4098.
  71. Xu, H., Jiang, L., Li, J., Yang, Z., Wang, Z., Chan, V. W. K., & Zhan, X. (2023). Offline rl with no ood actions: In-sample learning via implicit value regularization. International Conference on Learning Representations.
  72. Kumar, A., Agarwal, R., Geng, X., Tucker, G., & Levine, S. (2022). Offline Q-learning on diverse multi-task data both scales and generalizes. arXiv preprint arXiv:2211.15144.
  73. Song, Y., Zhou, Y., Sekhari, A., Bagnell, J. A., Krishnamurthy, A., & Sun, W. (2022). Hybrid RL: Using both offline and online data can make RL efficient. International Conference on Learning Representations.
  74. Li, J., Hu, X., Xu, H., Liu, J., Zhan, X., & Zhang, Y. Q. (2023). PROTO: Iterative Policy Regularized Offline-to-Online Reinforcement Learning. arXiv preprint arXiv:2305.15669.
  75. Li, Q., Zhang, J., Ghosh, D., Zhang, A., & Levine, S. (2023). Accelerating exploration with unlabeled prior data. Advances in Neural Information Processing Systems.
  76. Li, J., Hu, X., Xu, H., Liu, J., Zhan, X., Jia, Q. S., & Zhang, Y. Q. (2023). Mind the gap: Offline policy optimization for imperfect rewards. International Conference on Learning Representations.
  77. Padalkar, A., Pooley, A., Jain, A., Bewley, A., Herzog, A., Irpan, A., ... & Jain, V. (2023). Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864.
  78. Chebotar, Y., Hausman, K., Lu, Y., Xiao, T., Kalashnikov, D., Varley, J., ... & Levine, S. (2021). Actionable models: Unsupervised offline reinforcement learning of robotic skills. arXiv preprint arXiv:2104.07749.
  79. Yang, S., Nachum, O., Du, Y., Wei, J., Abbeel, P., & Schuurmans, D. (2023). Foundation models for decision making: Problems, methods, and opportunities. arXiv preprint arXiv:2303.04129.
  80. Wu, J., Ma, H., Deng, C., & Long, M. (2023). Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning. arXiv preprint arXiv:2305.18499.
  81. Huang, B., Feng, F., Lu, C., Magliacane, S., and Zhang, K. "AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning." In International Conference on Learning Representations. 2022.