A Dual Perspective of Reinforcement Learning for Imposing Policy Constraints

Bibliographic Details
Title: A Dual Perspective of Reinforcement Learning for Imposing Policy Constraints
Authors: De Cooman, B.; Suykens, Johan
Source: IEEE Transactions on Artificial Intelligence, pp. 1-14
Publication Status: Preprint
Publisher Information: Institute of Electrical and Electronics Engineers (IEEE), 2025.
Publication Year: 2025
Subject Terms: FOS: Computer and information sciences, 4603 Computer vision and multimedia computation, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), 4611 Machine learning, 4602 Artificial intelligence, Computer Science - Artificial Intelligence, I.2.8, FOS: Electrical engineering, electronic engineering, information engineering, Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control, STADIUS-24-45, Machine Learning (cs.LG)
Description: Model-free reinforcement learning methods lack an inherent mechanism to impose behavioural constraints on the trained policies. Although certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work we unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy, as an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal is revealed. Furthermore, using this framework, we introduce novel types of constraints that allow bounds to be imposed on the policy's action density or on costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, a practical algorithm is derived that supports various combinations of policy constraints, which are automatically handled throughout training using trainable reward modifications. The proposed $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.
Accepted for publication in IEEE Transactions on Artificial Intelligence
Document Type: Article
ISSN: 2691-4581
DOI: 10.1109/tai.2025.3564898
DOI: 10.48550/arxiv.2404.16468
Access URL: http://arxiv.org/abs/2404.16468
https://lirias.kuleuven.be/handle/20.500.12942/764218
https://doi.org/10.1109/tai.2025.3564898
Rights: CC BY
Accession Number: edsair.doi.dedup.....7e77846a7acfe506d9acbc1b0f886eda
Database: OpenAIRE
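Illustration: the abstract's central idea, that a dual constraint acts as a trainable reward modification in the primal, can be sketched with the standard Lagrangian-relaxation view of constrained reinforcement learning. The tabular Q-learning loop, the environment interface env, and all parameter names below are hypothetical and chosen for illustration only; this is not the paper's $\texttt{DualCRL}$ algorithm.

import numpy as np

# Minimal sketch of Lagrangian-relaxation constrained RL (tabular, hypothetical
# environment interface `env`): a dual variable `lam` turns a cost constraint
# E[c] <= d into a trainable reward modification r - lam * c.
def constrained_q_learning(env, d=0.1, episodes=500, alpha=0.1, gamma=0.99,
                           eps=0.1, lam_lr=0.01):
    Q = np.zeros((env.n_states, env.n_actions))
    lam = 0.0                                   # dual variable (Lagrange multiplier)
    for _ in range(episodes):
        s = env.reset()
        done = False
        ep_cost = 0.0
        while not done:
            # epsilon-greedy action selection
            a = env.sample_action() if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, c, done = env.step(a)    # reward r and constraint cost c
            ep_cost += c
            # primal update: Q-learning on the modified reward r - lam * c
            target = (r - lam * c) + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
        # dual update: increase lam when the episode cost exceeds the budget d,
        # tightening the reward modification; decrease it (down to 0) otherwise
        lam = max(0.0, lam + lam_lr * (ep_cost - d))
    return Q, lam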