
設計技術シリーズ (Design Technology Series)

詳解 強化学習の発展と応用 ―ロボット制御・ゲーム開発のための実践的理論― (Advances and Applications of Reinforcement Learning in Detail: Practical Theory for Robot Control and Game Development)

Author: Taisuke Kobayashi (National Institute of Informatics / The Graduate University for Advanced Studies, SOKENDAI)
Price: 3,960 yen (3,600 yen + tax)
Format: A5
Pages: 212
ISBN: 978-4-910558-27-1
Release date: March 19, 2024
Catalog No.: 125

【Table of Contents】

Chapter 1: What Is Reinforcement Learning?

  1-1 Objectives of Reinforcement Learning
  1-2 Challenges to Be Addressed
    1-2-1 Indirect Instruction
    1-2-2 Data Collection
    1-2-3 Prediction of Returns
  References

Chapter 2: Basic Problem Settings in Reinforcement Learning

  2-1 Markov Decision Processes
  2-2 Policy Functions
    2-2-1 Policy Functions for Discrete Action Spaces
    2-2-2 Policy Functions for Continuous Action Spaces
  2-3 Returns and Value Functions
    2-3-1 Definition of the Return
    2-3-2 Introducing Value Functions
    2-3-3 On-Policy and Off-Policy Methods
  2-4 Function Approximation
    2-4-1 Linear Function Approximation
    2-4-2 Nonlinear Function Approximation
  References

Chapter 3: Fundamental Learning Algorithms

  3-1 Learning Value Functions
    3-1-1 Monte Carlo Methods
    3-1-2 Temporal-Difference (TD) Methods
    3-1-3 Advantage Functions
  3-2 Generalizing Value Functions
    3-2-1 n-Step TD Methods
    3-2-2 TD(λ)
    3-2-3 Eligibility Traces
  3-3 Learning Policy Functions
    3-3-1 Models Based on Action-Value Functions
    3-3-2 Policy Gradient Methods
    3-3-3 Actor-Critic Methods
  3-4 Techniques That Support Learning
    3-4-1 Deep Learning
    3-4-2 Experience Replay
    3-4-3 Target Networks
    3-4-4 Ensemble Learning
  References

Chapter 4: Advances in Policy Gradient Methods

  4-1 Key Techniques
    4-1-1 Divergences Between Probability Distributions
    4-1-2 Importance Sampling
    4-1-3 The Reparameterization Trick
  4-2 Constraining Policy Updates
    4-2-1 Trust Region Policy Optimization: TRPO
    4-2-2 Proximal Policy Optimization: PPO
    4-2-3 Locally Lipschitz Continuous Constraint: L2C2
  4-3 Direct Computation of Policy Gradients
    4-3-1 Deterministic Policy Gradient: DPG
    4-3-2 Twin Delayed DDPG: TD3
  4-4 Maximizing Policy Entropy
    4-4-1 Soft Q-learning: SQL
    4-4-2 Soft Actor-Critic: SAC
    4-4-3 Improvements on SAC
  References

Chapter 5: Model-Based Reinforcement Learning

  5-1 Learning World Models
    5-1-1 Learning State-Transition Probabilities and Reward Functions
    5-1-2 Representation Learning
    5-1-3 An Example World-Model Learning Algorithm: PlaNet
  5-2 Exploiting World Models
    5-2-1 Estimating Returns
    5-2-2 Generating Imagined Experience
    5-2-3 Planning
    5-2-4 Improvements on Planning
  5-3 Residual Reinforcement Learning
  References

Chapter 6: Challenges and Remedies in Reward Design

  6-1 Sparse Rewards
    6-1-1 Hindsight Experience Replay: HER
    6-1-2 Intrinsic Motivation
  6-2 Multi-Objectivity
    6-2-1 Safe Reinforcement Learning
    6-2-2 Multi-Objective Reinforcement Learning
    6-2-3 Hierarchical Reinforcement Learning
  6-3 Imitating Experts
    6-3-1 Policy Initialization by Imitation
    6-3-2 Inverse Reinforcement Learning
  6-4 Adjusting the Difficulty of Learning
    6-4-1 Curriculum Learning
    6-4-2 Self-Play
  References

Chapter 7: Future Prospects

  7-1 Multi-Agent Reinforcement Learning
  7-2 Reinforcement Learning as Probabilistic Inference
  7-3 Models of Biological Decision-Making
  References

【References】

  • Richard S Sutton, Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • 森村哲郎. 強化学習(機械学習プロフェッショナルシリーズ). 講談社, 2019.
  • 八谷大岳, 杉山将. 強くなるロボティック・ゲームプレイヤーの作り方 実践で学ぶ強化学習 プレミアムブックス版. マイナビ出版, 2016.
  • 曽我部東馬. 強化学習アルゴリズム入門 「平均」からはじめる基礎と応用. オーム社, 2019.
  • 小山田創哲, 前田新一, 小山雅典. 速習強化学習: 基礎理論とアルゴリズム. 共立出版, 2017.
  • Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G Bellemare, Joelle Pineau, et al. An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018.
  • 大塚敏之. 非線形最適制御入門. コロナ社, 2011.
  • Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1-2):1–179, 2018.
  • 本多淳也. バンディット問題の理論とアルゴリズム(機械学習プロフェッショナルシリーズ). 講談社, 2016.
  • Sergey Levine, Aviral Kumar, George Tucker, Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Jan Peters, Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine Learning, pp. 745–750, 2007.
  • 佐藤和也, 下本陽一, 熊澤典良. はじめての現代制御理論. 講談社, 2012.
  • R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
  • Richard S Sutton, Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • 森村哲郎. 強化学習(機械学習プロフェッショナルシリーズ). 講談社, 2019.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Sepp Hochreiter, Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
  • Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. arXiv preprint arXiv:1704.02532, 2017.
  • Smruti Amarjyoti. Deep reinforcement learning for robotic manipulation-the state of the art. arXiv preprint arXiv:1701.08878, 2017.
  • Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pp. 651–673. PMLR, 2018.
  • Yoshihisa Tsurumine, Yunduan Cui, Eiji Uchibe, Takamitsu Matsubara. Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation. Robotics and Autonomous Systems, 112:72–83, 2019.
  • Chi Jin, Sham Kakade, Akshay Krishnamurthy, Qinghua Liu. Sample-efficient reinforcement learning of undercomplete pomdps. Advances in Neural Information Processing Systems, 33:18530–18539, 2020.
  • Gautam Singh, Skand Peri, Junghyun Kim, Hyunseok Kim, Sungjin Ahn. Structured world belief for reinforcement learning in pomdp. In International Conference on Machine Learning, pp. 9744–9755. PMLR, 2021.
  • Qinghua Liu, Alan Chung, Csaba Szepesvári, Chi Jin. When is partially observable reinforcement learning not scary? In Conference on Learning Theory, pp. 5175–5220. PMLR, 2022.
  • Sergey Levine, Aviral Kumar, George Tucker, Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pp. 104–114. PMLR, 2020.
  • Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • Michel Tokic, Günther Palm. Value-difference based exploration: adaptive control between epsilon-greedy and softmax. In Annual conference on artificial intelligence, pp. 335–346. Springer, 2011.
  • Kavosh Asadi, Michael L Littman. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pp. 243–252. PMLR, 2017.
  • Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, Shuichi Adachi. Sigsoftmax: Reanalysis of the softmax bottleneck. Advances in Neural Information Processing Systems, 31, 2018.
  • Christian Daniel, Gerhard Neumann, Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pp. 273–281. PMLR, 2012.
  • Hikaru Sasaki, Takamitsu Matsubara. Variational policy search using sparse gaussian process priors for learning multimodal optimal actions. Neural Networks, 143:291–302, 2021.
  • Taisuke Kobayashi. Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence, 49(12):4335–4347, 2019.
  • Po-Wei Chou, Daniel Maturana, Sebastian Scherer. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In International conference on machine learning, pp. 834–843. PMLR, 2017.
  • Bogdan Mazoure, Thang Doan, Audrey Durand, Joelle Pineau, R Devon Hjelm. Leveraging exploration in off-policy algorithms via normalizing flows. In Conference on Robot Learning, pp. 430–444. PMLR, 2020.
  • Taisuke Kobayashi, Takumi Aotani. Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics, pp. 1–18, 2023.
  • 浮田浩行, 濱上知樹, 藤吉弘亘, 大町真一郎, 戸田智基, 岩崎敦, 小林泰介, 鈴木亮太, 木村雄喜, 橋本大樹, 玉垣勇樹, 水谷麻紀子, 永田毅, 木村光成, 李晃伸, 川嶋宏彰. 機械学習の可能性(計測・制御セレクションシリーズ). コロナ社, 2023.
  • Arthur Aubret, Laetitia Matignon, Salima Hassas. A survey on intrinsic motivation in reinforcement learning. arXiv preprint arXiv:1908.06976, 2019.
  • 大塚敏之. 非線形最適制御入門. コロナ社, 2011.
  • Athanasios S Polydoros, Lazaros Nalpantidis. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems, 86(2):153–173, 2017.
  • Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
  • William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, Will Dabney. Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pp. 3061–3071. PMLR, 2020.
  • Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson. Generalized off-policy actor-critic. Advances in neural information processing systems, 32, 2019.
  • Rasool Fakoor, Pratik Chaudhari, Alexander J Smola. P3O: Policy-on policy-off policy optimization. In Uncertainty in Artificial Intelligence, pp. 1017–1027. PMLR, 2020.
  • Christopher M Bishop, Nasser M Nasrabadi. Pattern recognition and machine learning. Springer, 2006.
  • 持橋大地, 大羽成征. ガウス過程と機械学習(機械学習プロフェッショナルシリーズ). 講談社, 2019.
  • Ian Goodfellow, Yoshua Bengio, Aaron Courville. Deep learning. MIT press, 2016.
  • Kunihiko Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural networks, 1(2):119–130, 1988.
  • Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • Yoh-Han Pao, Gwang-Hoon Park, Dejan J Sobajic. Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2):163–180, 1994.
  • 田中剛平, 中根了昌, 廣瀬明. リザバーコンピューティング 時系列パターン認識のための高速機械学習の理論とハードウェア. 森北出版, 2021.
  • Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • 森村哲郎. 強化学習(機械学習プロフェッショナルシリーズ). 講談社, 2019.
  • Andrew Barto, Michael Duff. Monte carlo matrix inversion and reinforcement learning. Advances in Neural Information Processing Systems, 6, 1993.
  • Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3:9–44, 1988.
  • Rajeev Motwani, Prabhakar Raghavan. Randomized algorithms. Cambridge university press, 1995.
  • Tommi Jaakkola, Satinder Singh, Michael Jordan. Reinforcement learning algorithm for partially observable markov decision problems. Advances in neural information processing systems, 7, 1994.
  • Megumi Miyashita, Shiro Yano, Toshiyuki Kondo. Evaluation of safe reinforcement learning with comirror algorithm in a non-markovian reward problem. In International Conference on Intelligent Autonomous Systems, pp. 62–72. Springer, 2022.
  • 金谷健一. これなら分かる最適化数学-基礎原理から計算手法まで. 2005.
  • Bradley Efron, Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.
  • William Feller. An introduction to probability theory and its applications, Volume 2, volume 81. John Wiley & Sons, 1991.
  • Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, Ion Stoica. Rllib: Abstractions for distributed reinforcement learning. In International Conference on Machine Learning, pp. 3053–3062. PMLR, 2018.
  • Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
  • Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, Joao G M Araújo. Cleanrl: High-quality single-file implementations of deep reinforcement learning algorithms. The Journal of Machine Learning Research, 23(1):12585–12602, 2022.
  • Taisuke Kobayashi. Intentionally-underestimated value function at terminal state for temporal-difference learning with misdesigned reward. arXiv preprint arXiv:2308.12772, 2023.
  • Richard S Sutton, Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pp. 492–518. Springer, 1992.
  • Masashi Sugiyama, Hirotaka Hachiya, Hisashi Kashima, Tetsuro Morimura. Least absolute policy iteration for robust value function approximation. In IEEE International Conference on Robotics and Automation, pp. 2904–2909. IEEE, 2009.
  • Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. PMLR, 2016.
  • Yoshihisa Tsurumine, Yunduan Cui, Eiji Uchibe, Takamitsu Matsubara. Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation. Robotics and Autonomous Systems, 112:72–83, 2019.
  • Gavin A Rummery, Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
  • Christopher JCH Watkins, Peter Dayan. Q-learning. Machine learning, 8:279–292, 1992.
  • John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • Brett Daley, Christopher Amato. Reconciling λ-returns with experience replay. Advances in Neural Information Processing Systems, 32, 2019.
  • A Harry Klopf. Brain function and adaptive systems: a heterostatic theory. Number 133. Air Force Cambridge Research Laboratories, Air Force Systems Command, United …, 1972.
  • Satinder P Singh, Richard S Sutton. Reinforcement learning with replacing eligibility traces. Machine learning, 22(1-3):123–158, 1996.
  • Harm Van Seijen, A Rupam Mahmood, Patrick M Pilarski, Marlos C Machado, Richard S Sutton. True online temporal-difference learning. The Journal of Machine Learning Research, 17(1):5057–5096, 2016.
  • Harm van Seijen. Effective multi-step temporal-difference learning for non-linear function approximation. arXiv preprint arXiv:1608.05151, 2016.
  • Taisuke Kobayashi. Adaptive and multiple time-scale eligibility traces for online deep reinforcement learning. Robotics and Autonomous Systems, 151:104019, 2022.
  • Michel Tokic, Günther Palm. Value-difference based exploration: adaptive control between epsilon-greedy and softmax. In Annual conference on artificial intelligence, pp. 335–346. Springer, 2011.
  • Kavosh Asadi, Michael L Littman. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pp. 243–252. PMLR, 2017.
  • Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, Shuichi Adachi. Sigsoftmax: Reanalysis of the softmax bottleneck. Advances in Neural Information Processing Systems, 31, 2018.
  • 曽我部東馬. 強化学習アルゴリズム入門「平均」からはじめる基礎と応用. オーム社, 2019.
  • Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
  • Vijay Konda, John Tsitsiklis. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
  • Evan Greensmith, Peter L Bartlett, Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 2004.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sham M Kakade. A natural policy gradient. Advances in neural information processing systems, 14, 2001.
  • Philip Thomas. Bias in natural actor-critic algorithms. In International conference on machine learning, pp. 441–448. PMLR, 2014.
  • Kunihiko Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural networks, 1(2):119–130, 1988.
  • Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Samuel Schoenholz, Ekin Dogus Cubuk. Jax md: a framework for differentiable physics. Advances in Neural Information Processing Systems, 33:11428–11441, 2020.
  • Scott Fujimoto, Herke Hoof, David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596. PMLR, 2018.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • Diederik P Kingma, Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Yu E Nesterov. A method for solving the convex programming problem with convergence rate o(1/k2). In Dokl. Akad. Nauk SSSR, volume 269, pp. 543–547, 1983.
  • Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
  • Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, 2018.
  • Tom Schaul, John Quan, Ioannis Antonoglou, David Silver. Prioritized experience replay. In International conference on learning representations, 2016.
  • David Isele, Akansel Cosgun. Selective experience replay for lifelong learning. In AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, Will Dabney. Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pp. 3061–3071. PMLR, 2020.
  • Samarth Sinha, Jiaming Song, Animesh Garg, Stefano Ermon. Experience replay with likelihood-free importance weights. In Learning for Dynamics and Control Conference, pp. 110–123. PMLR, 2022.
  • Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra. Continuous control with deep reinforcement learning. In International conference on learning representations, 2016.
  • Taisuke Kobayashi, Wendyam Eric Lionel Ilboudo. T-soft update of target network for deep reinforcement learning. Neural Networks, 136:63–71, 2021.
  • Taisuke Kobayashi. Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:2202.12504, 2022.
  • Ian Osband, John Aslanides, Albin Cassirer. Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
  • Oren Peer, Chen Tessler, Nadav Merlis, Ron Meir. Ensemble bootstrapping for q-learning. In International Conference on Machine Learning, pp. 8454–8463. PMLR, 2021.
  • Deepak Pathak, Dhiraj Gandhi, Abhinav Gupta. Self-supervised exploration via disagreement. In International conference on machine learning, pp. 5062–5071. PMLR, 2019.
  • Taisuke Kobayashi. Reward bonuses with gain scheduling inspired by iterative deepening search. Results in Control and Optimization, pp. 100244, 2023.
  • OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
  • Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pp. 651–673. PMLR, 2018.
  • Tuomas Haarnoja, Ben Moran, Guy Lever, Sandy H Huang, Dhruva Tirumala, Markus Wulfmeier, Jan Humplik, Saran Tunyasuvunakool, Noah Y Siegel, Roland Hafner, et al. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. arXiv preprint arXiv:2304.13653, 2023.
  • John R Hershey, Peder A Olsen. Approximating the kullback leibler divergence between gaussian mixture models. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pp. 317–320. IEEE, 2007.
  • Jonathan Ho, Stefano Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29:4565–4573, 2016.
  • Taisuke Kobayashi. Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization, 10:100192, 2023.
  • Seyed Kamyar Seyed Ghasemipour, Richard Zemel, Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pp. 1259–1277. PMLR, 2020.
  • Neng Wan, Dapeng Li, Naira Hovakimyan. F-divergence variational inference. Advances in neural information processing systems, 33:17370–17379, 2020.
  • Tom Schaul, John Quan, Ioannis Antonoglou, David Silver. Prioritized experience replay. In International conference on learning representations, 2016.
  • Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural computation, 25(5):1324–1370, 2013.
  • Taisuke Kobayashi, Takumi Aotani, Julio Rogelio Guadarrama-Olvera, Emmanuel Dean-Leon, Gordon Cheng. Reward-punishment actor-critic algorithm applying to robotic non-grasping manipulation. In Joint IEEE International Conference on Development and Learning and Epigenetic Robotics, pp. 37–42. IEEE, 2019.
  • Edward L Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008.
  • Rémi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare. Safe and efficient off-policy reinforcement learning. Advances in neural information processing systems, 29, 2016.
  • Diederik P Kingma, Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
  • Eric Jang, Shixiang Gu, Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2017.
  • Mikhail Figurnov, Shakir Mohamed, Andriy Mnih. Implicit reparameterization gradients. Advances in neural information processing systems, 31, 2018.
  • Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas. Sample efficient actor-critic with experience replay. In International conference on learning representations, 2017.
  • John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
  • John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, Will Dabney. Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pp. 3061–3071. PMLR, 2020.
  • Yuhui Wang, Hao He, Xiaoyang Tan. Truly proximal policy optimization. In Uncertainty in Artificial Intelligence, pp. 113–122. PMLR, 2020.
  • Taisuke Kobayashi. L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4032–4039, 2022.
  • Henry Gouk, Eibe Frank, Bernhard Pfahringer, Michael J Cree. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 110(2):393–416, 2021.
  • Kevin Scaman, Aladin Virmaux. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In International Conference on Neural Information Processing Systems, pp. 3839–3848, 2018.
  • Siddharth Mysore, Bassel Mabsout, Renato Mancuso, Kate Saenko. Regularizing action policies for smooth control with reinforcement learning. In IEEE International Conference on Robotics and Automation, pp. 1810–1816. IEEE, 2021.
  • Ming Xu, Matias Quiroz, Robert Kohn, Scott A Sisson. Variance reduction properties of the reparameterization trick. In International conference on artificial intelligence and statistics, pp. 2711–2720. PMLR, 2019.
  • Paavo Parmas, Masashi Sugiyama. A unified view of likelihood ratio and reparameterization gradients. In International Conference on Artificial Intelligence and Statistics, pp. 4078–4086. PMLR, 2021.
  • David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning, pp. 387–395. PMLR, 2014.
  • Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra. Continuous control with deep reinforcement learning. In International conference on learning representations, 2016.
  • George E Uhlenbeck, Leonard S Ornstein. On the theory of the brownian motion. Physical review, 36(5):823, 1930.
  • Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
  • Thomas Degris, Patrick M Pilarski, Richard S Sutton. Model-free reinforcement learning with continuous action in practice. In American Control Conference, pp. 2177–2182. IEEE, 2012.
  • Scott Fujimoto, Herke Hoof, David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596. PMLR, 2018.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, Sergey Levine. Reinforcement learning with deep energy-based policies. In International conference on machine learning, pp. 1352–1361. PMLR, 2017.
  • Dilin Wang, Qiang Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
  • Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
  • Amir Beck, Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • Benjamin Eysenbach, Sergey Levine. Maximum entropy rl (provably) solves some robust rl problems. In International Conference on Learning Representations, 2021.
  • Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, Pedro Ortega. Your policy regularizer is secretly an adversary. Transactions on Machine Learning Research, 2022.
  • Taisuke Kobayashi. Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:2303.04356, 2023.
  • Athanasios S Polydoros, Lazaros Nalpantidis. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems, 86(2):153–173, 2017.
  • Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings, pp. 216–224. Elsevier, 1990.
  • Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, Evangelos A Theodorou. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Transactions on Robotics, 34(6):1603–1622, 2018.
  • Taisuke Kobayashi, Kota Fukumoto. Real-time sampling-based model predictive control based on reverse kullback-leibler divergence and its adaptive acceleration. arXiv preprint arXiv:2212.04298, 2022.
  • Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine. When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32, 2019.
  • Ignasi Clavera, Violet Fu, Pieter Abbeel. Model-augmented actor-critic: Backpropagating through paths. In International conference on learning representations, 2020.
  • Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
  • Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. Advances in neural information processing systems, 31, 2018.
  • Seyed Kamyar Seyed Ghasemipour, Richard Zemel, Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pp. 1259–1277. PMLR, 2020.
  • Danijar Hafner, Pedro A Ortega, Jimmy Ba, Thomas Parr, Karl Friston, Nicolas Heess. Action and perception as divergence minimization. arXiv preprint arXiv:2009.01791, 2020.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Shohreh Deldari, Hao Xue, Aaqib Saeed, Jiayuan He, Daniel V Smith, Flora D Salim. Beyond just vision: A review on self-supervised representation learning on multimodal and temporal data. arXiv preprint arXiv:2206.02353, 2022.
  • Yong Yu, Xiaosheng Si, Changhua Hu, Jianxun Zhang. A review of recurrent neural networks: Lstm cells and network architectures. Neural computation, 31(7):1235–1270, 2019.
  • Sepp Hochreiter, Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
  • Ronald J Williams, David Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation, pp. 433–486. Psychology Press, 2013.
  • Corentin Tallec, Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.
  • Christopher Aicher, Nicholas J Foti, Emily B Fox. Adaptively truncating backpropagation through time to control gradient bias. In Uncertainty in Artificial Intelligence, pp. 799–808. PMLR, 2020.
  • Charles Fefferman, Sanjoy Mitter, Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  • Geoffrey E Hinton, Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
  • Diederik P Kingma, Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
  • Marie Csete, John Doyle. Bow ties, metabolism and disease. TRENDS in Biotechnology, 22(9):446–450, 2004.
  • Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
  • Hiroshi Takahashi, Tomoharu Iwata, Yuki Yamanaka, Masanori Yamada, Satoshi Yagi. Variational autoencoder with implicit optimal priors. In AAAI Conference on Artificial Intelligence, volume 33, pp. 5066–5073, 2019.
  • Riddhish Bhalodia, Iain Lee, Shireen Elhabian. dpvaes: Fixing sample generation for regularized vaes. In Asian Conference on Computer Vision, 2020.
  • Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson. Learning latent dynamics for planning from pixels. In International conference on machine learning, pp. 2555–2565. PMLR, 2019.
  • Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.
  • Masashi Okada, Tadahiro Taniguchi. Dreaming: Model-based reinforcement learning by latent imagination without reconstruction. In IEEE International Conference on Robotics and Automation, pp. 4209–4215. IEEE, 2021.
  • Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, Ken Goldberg. Daydreamer: World models for physical robot learning. In Conference on Robot Learning, pp. 2226–2240. PMLR, 2023.
  • Vincent Micheli, Eloi Alonso, François Fleuret. Transformers are sample efficient world models. In International Conference on Learning Representations, 2023.
  • Kurtland Chua, Roberto Calandra, Rowan McAllister, Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31, 2018.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • 大塚敏之. 非線形最適制御入門. コロナ社, 2011.
  • Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, Pierre L’Ecuyer. The cross-entropy method for optimization. In Handbook of statistics, volume 31, pp. 35–59. Elsevier, 2013.
  • Marc Deisenroth, Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In International Conference on machine learning, pp. 465–472, 2011.
  • Toshiyuki Ohtsuka. A continuation/gmres method for fast computation of nonlinear receding horizon control. Automatica, 40(4):563–574, 2004.
  • Jacob Sacks, Byron Boots. Learning sampling distributions for model predictive control. In Conference on Robot Learning. PMLR, 2022.
  • Masashi Okada, Tadahiro Taniguchi. Acceleration of gradient-based path integral method for efficient optimal and inverse optimal control. In IEEE International Conference on Robotics and Automation, pp. 3013–3020. IEEE, 2018.
  • Cristina Pinneri, Shambhuraj Sawant, Sebastian Blaes, Jan Achterhold, Joerg Stueckler, Michal Rolinek, Georg Martius. Sample-efficient cross-entropy method for real-time planning. In Conference on Robot Learning, pp. 1049–1065. PMLR, 2021.
  • Amir Beck, Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • Tobias Johannink, Shikhar Bahl, Ashvin Nair, Jianlan Luo, Avinash Kumar, Matthias Loskyll, Juan Aparicio Ojea, Eugen Solowjow, Sergey Levine. Residual reinforcement learning for robot control. In International Conference on Robotics and Automation, pp. 6023–6029. IEEE, 2019.
  • Andy Zeng, Shuran Song, Johnny Lee, Alberto Rodriguez, Thomas Funkhouser. Tossingbot: Learning to throw arbitrary objects with residual physics. IEEE Transactions on Robotics, 36(4):1307–1319, 2020.
  • Jonas Eschmann. Reward function design in reinforcement learning. Reinforcement Learning Algorithms: Analysis and Applications, pp. 25–33, 2021.
  • Eyal Even-Dar, Yishay Mansour. Convergence of optimistic and incremental q-learning. Advances in neural information processing systems, 14, 2001.
  • Marlos C Machado, Sriram Srinivasan, Michael Bowling. Domain-independent optimistic initialization for reinforcement learning. arXiv preprint arXiv:1410.4604, 2014.
  • Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017.
  • Charles Packer, Pieter Abbeel, Joseph E Gonzalez. Hindsight task relabelling: Experience replay for sparse reward meta-rl. Advances in Neural Information Processing Systems, 34:2466–2477, 2021.
  • Lorenzo Moro, Amarildo Likmeta, Enrico Prati, Marcello Restelli. Goal-directed planning via hindsight experience replay. In International Conference on Learning Representations, 2021.
  • Binyamin Manela, Armin Biess. Bias-reduced hindsight experience replay with virtual goal prioritization. Neurocomputing, 451:305–315, 2021.
  • Robert W White. Motivation reconsidered: the concept of competence. Psychological review, 66(5):297, 1959.
  • Nuttapong Chentanez, Andrew Barto, Satinder Singh. Intrinsically motivated reinforcement learning. Advances in neural information processing systems, 17, 2004.
  • Arthur Aubret, Laetitia Matignon, Salima Hassas. A survey on intrinsic motivation in reinforcement learning. arXiv preprint arXiv:1908.06976, 2019.
  • Deepak Pathak, Dhiraj Gandhi, Abhinav Gupta. Self-supervised exploration via disagreement. In International conference on machine learning, pp. 5062–5071. PMLR, 2019.
  • H Sebastian Seung, Manfred Opper, Haim Sompolinsky. Query by committee. In Workshop on computational learning theory, pp. 287–294, 1992.
  • Eyke Hüllermeier, Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110:457–506, 2021.
  • Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak. Planning to explore via self-supervised world models. In International Conference on Machine Learning, pp. 8583–8592. PMLR, 2020.
  • Ignasi Clavera, Violet Fu, Pieter Abbeel. Model-augmented actor-critic: Backpropagating through paths. In International conference on learning representations, 2020.
  • Javier Garcıa, Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • Brijen Thananjeyan, Ashwin Balakrishna, Suraj Nair, Michael Luo, Krishnan Srinivasan, Minho Hwang, Joseph E Gonzalez, Julian Ibarz, Chelsea Finn, Ken Goldberg. Recovery rl: Safe reinforcement learning with learned recovery zones. IEEE Robotics and Automation Letters, 6(3):4915–4922, 2021.
  • Eitan Altman. Constrained Markov decision processes. Routledge, 2021.
  • Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
  • Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel. Constrained policy optimization. In International conference on machine learning, pp. 22–31. PMLR, 2017.
  • Jiexin Wang, Stefan Elfwing, Eiji Uchibe. Modular deep reinforcement learning from reward and punishment for robot navigation. Neural Networks, 135:115–126, 2021.
  • Taisuke Kobayashi, Takumi Aotani, Julio Rogelio Guadarrama-Olvera, Emmanuel Dean-Leon, Gordon Cheng. Reward-punishment actor-critic algorithm applying to robotic non-grasping manipulation. In Joint IEEE International Conference on Development and Learning and Epigenetic Robotics, pp. 37–42. IEEE, 2019.
  • Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, Ufuk Topcu. Safe reinforcement learning via shielding. In AAAI conference on artificial intelligence, volume 32, 2018.
  • Athanasios S Polydoros, Lazaros Nalpantidis. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems, 86(2):153–173, 2017.
  • Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, Pierre L’Ecuyer. The cross-entropy method for optimization. In Handbook of statistics, volume 31, pp. 35–59. Elsevier, 2013.
  • Conor F Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M Zintgraf, Richard Dazeley, Fredrik Heintz, et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):26, 2022.
  • Kristof Van Moffaert, Madalina M Drugan, Ann Nowé. Scalarized multi-objective reinforcement learning: Novel design techniques. In IEEE symposium on adaptive dynamic programming and reinforcement learning, pp. 191–199. IEEE, 2013.
  • Ioannis Giagkiozis, Peter J Fleming. Methods for multi-objective optimization: An analysis. Information Sciences, 293:338–350, 2015.
  • Nyoman Gunantara. A review of multi-objective optimization: Methods and its applications. Cogent Engineering, 5(1):1502242, 2018.
  • 中山弘隆, 谷野哲三. 多目的計画法の理論と応用. 1994.
  • Abbas Abdolmaleki, Sandy Huang, Leonard Hasenclever, Michael Neunert, Francis Song, Martina Zambelli, Murilo Martins, Nicolas Heess, Raia Hadsell, Martin Riedmiller. A distributional view on multi-objective policy optimization. In International conference on machine learning, pp. 11–22. PMLR, 2020.
  • Shubham Pateria, Budhitama Subagdja, Ah-hwee Tan, Chai Quek. Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys, 54(5):1–35, 2021.
  • Ofir Nachum, Shixiang Shane Gu, Honglak Lee, Sergey Levine. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems, 31, 2018.
  • Taisuke Kobayashi, Toshiki Sugino. Reinforcement learning for quadrupedal locomotion with design of continual–hierarchical curriculum. Engineering Applications of Artificial Intelligence, 95:103869, 2020.
  • Matheus RF Mendonca, Artur Ziviani, André MS Barreto. Graph-based skill acquisition for reinforcement learning. ACM Computing Surveys, 52(1):1–26, 2019.
  • Tobias Johannink, Shikhar Bahl, Ashvin Nair, Jianlan Luo, Avinash Kumar, Matthias Loskyll, Juan Aparicio Ojea, Eugen Solowjow, Sergey Levine. Residual reinforcement learning for robot control. In International Conference on Robotics and Automation, pp. 6023–6029. IEEE, 2019.
  • Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1-2):1–179, 2018.
  • Sergey Levine, Vladlen Koltun. Guided policy search. In International conference on machine learning, pp. 1–9. PMLR, 2013.
  • Michael Bain, Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pp. 103–129, 1995.
  • Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
  • Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In IEEE international conference on robotics and automation, pp. 6292–6299. IEEE, 2018.
  • Yoshihisa Tsurumine, Yunduan Cui, Eiji Uchibe, Takamitsu Matsubara. Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation. Robotics and Autonomous Systems, 112:72–83, 2019.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • Seyed Kamyar Seyed Ghasemipour, Richard Zemel, Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pp. 1259–1277. PMLR, 2020.
  • Sergey Levine, Aviral Kumar, George Tucker, Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pp. 104–114. PMLR, 2020.
  • Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • Scott Fujimoto, Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145, 2021.
  • Andrew Y Ng, Stuart J Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pp. 663–670, 2000.
  • Jonathan Ho, Stefano Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29:4565–4573, 2016.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, Mark Crowley. Generative adversarial networks and adversarial autoencoders: Tutorial and survey. arXiv preprint arXiv:2111.13282, 2021.
  • Masashi Sugiyama, Taiji Suzuki, Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Eiji Uchibe, Kenji Doya. Forward and inverse reinforcement learning sharing network weights and hyperparameters. Neural Networks, 144:138–153, 2021.
  • Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  • Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. The Journal of Machine Learning Research, 21(1):7382–7431, 2020.
  • Pascal Klink, Carlo D’Eramo, Jan R Peters, Joni Pajarinen. Self-paced deep reinforcement learning. Advances in Neural Information Processing Systems, 33:9216–9227, 2020.
  • Zhipeng Ren, Daoyi Dong, Huaxiong Li, Chunlin Chen. Self-paced prioritized curriculum learning with coverage penalty in deep reinforcement learning. IEEE transactions on neural networks and learning systems, 29(6):2216–2226, 2018.
  • Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Tuomas Haarnoja, Ben Moran, Guy Lever, Sandy H Huang, Dhruva Tirumala, Markus Wulfmeier, Jan Humplik, Saran Tunyasuvunakool, Noah Y Siegel, Roland Hafner, et al. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. arXiv preprint arXiv:2304.13653, 2023.
  • Lerrel Pinto, James Davidson, Rahul Sukthankar, Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pp. 2817–2826. PMLR, 2017.
  • Kelvin Xu, Siddharth Verma, Chelsea Finn, Sergey Levine. Continual learning of control primitives: Skill discovery via reset-games. Advances in Neural Information Processing Systems, 33:4999–5010, 2020.
  • Pablo Hernandez-Leal, Bilal Kartal, Matthew E Taylor. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, 33(6):750–797, 2019.
  • Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. Advances in neural information processing systems, 29, 2016.
  • Muning Wen, Jakub Kuba, Runji Lin, Weinan Zhang, Ying Wen, Jun Wang, Yaodong Yang. Multi-agent reinforcement learning is a sequence modeling problem. Advances in Neural Information Processing Systems, 35:16509–16521, 2022.
  • Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017.
  • Takumi Aotani, Taisuke Kobayashi, Kenji Sugimoto. Bottom-up multi-agent reinforcement learning by reward shaping for cooperative-competitive tasks. Applied Intelligence, 51:4434–4452, 2021.
  • Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • Emanuel Todorov. General duality between optimal control and estimation. In IEEE Conference on Decision and Control, pp. 4286–4292. IEEE, 2008.
  • Hilbert J Kappen, Vicenç Gómez, Manfred Opper. Optimal control as a graphical model inference problem. Machine learning, 87:159–182, 2012.
  • Konrad Rawlik, Marc Toussaint, Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. 2012.
  • Pascal Klink, Carlo D’Eramo, Jan R Peters, Joni Pajarinen. Self-paced deep reinforcement learning. Advances in Neural Information Processing Systems, 33:9216–9227, 2020.
  • Alex X Lee, Anusha Nagabandi, Pieter Abbeel, Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33:741–752, 2020.
  • Masashi Okada, Tadahiro Taniguchi. Variational inference mpc for bayesian model-based reinforcement learning. In Conference on robot learning, pp. 258–272. PMLR, 2020.
  • Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018.
  • Taisuke Kobayashi. Optimistic reinforcement learning by forward kullback-leibler divergence optimization. Neural Networks, 152:169–180, 2022.
  • Taisuke Kobayashi, Kenta Yoshizawa. Optimization algorithm for feedback and feedforward policies towards robot control robust to sensing failures. ROBOMECH Journal, 9(1):1–16, 2022.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • Wolfram Schultz, Peter Dayan, P Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997.
  • Kenji Doya. Canonical cortical circuits and the duality of bayesian inference and optimal control. Current Opinion in Behavioral Sciences, 41:160–167, 2021.
  • Marc G Bellemare, Will Dabney, Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pp. 449–458. PMLR, 2017.
  • Will Dabney, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis Hassabis, Rémi Munos, Matthew Botvinick. A distributional code for value in dopamine-based reinforcement learning. Nature, 577(7792):671–675, 2020.
  • Saori C Tanaka, Katsunori Yamada, Hiroyasu Yoneda, Fumio Ohtake. Neural mechanisms of gain–loss asymmetry in temporal discounting. Journal of Neuroscience, 34(16):5595–5602, 2014.
  • Kanji Shimomura, Ayaka Kato, Kenji Morita. Reduced successor representation potentially interferes with cessation of habitual reward-seeking. Neuron, 25:515–532, 2020.
  • Ben Seymour, Nathaniel Daw, Peter Dayan, Tania Singer, Ray Dolan. Differential encoding of losses and gains in the human striatum. Journal of Neuroscience, 27(18):4826–4831, 2007.
  • Taisuke Kobayashi, Takumi Aotani, Julio Rogelio Guadarrama-Olvera, Emmanuel Dean-Leon, Gordon Cheng. Reward-punishment actor-critic algorithm applying to robotic non-grasping manipulation. In Joint IEEE International Conference on Development and Learning and Epigenetic Robotics, pp. 37–42. IEEE, 2019.
  • Jiexin Wang, Stefan Elfwing, Eiji Uchibe. Modular deep reinforcement learning from reward and punishment for robot navigation. Neural Networks, 135:115–126, 2021.
