Overview:

Robotic safety has been a focal point of research in both the control and learning communities over the past few decades, addressing safety constraints that range from narrow, task-specific settings to broad, general-purpose applications. Achieving provable safety guarantees at every time step of a robot’s operation (state-wise safety, or robustness) is essential for both widespread adoption and reliable performance. In real-world settings, algorithms with robust safety measures and theoretical guarantees are critical for the large-scale deployment of robots, since any failure could be irreversible or uncorrectable: autonomous vehicles cannot tolerate even a single malfunction, and collaborative robots must perform efficiently while ensuring the safety of nearby workers. Concurrently, reinforcement learning (RL) has emerged as a powerful tool for achieving superhuman performance, significantly accelerating the acquisition of complex skills. However, the exploratory, trial-and-error nature of RL makes it difficult to provide theoretical guarantees on state-wise performance. To harness the performance benefits of RL while ensuring stability and reliability, we focus on the following key questions:
  • How can we achieve state-wise performance, or even worst-case state-wise performance, with provable theoretical guarantees?
  • How can we obtain the strongest possible theoretical guarantees within the stochastic framework of Markov Decision Processes (MDP)?
  • How can we bridge traditional control methods with RL to acquire a robust and safe RL policy without any violations during training?


  • Research Topics

    Safe Reinforcement Learning Benchmark


    Safe RL, or constrained RL, has emerged as a promising approach to learning under safety constraints, but comparing existing algorithms remains challenging due to the diversity of methods and tasks. This work introduces GUARD, a Generalized Unified SAfe Reinforcement Learning Development Benchmark, which fills this gap by offering a generalized, customizable benchmark that covers a wide range of RL agents, tasks, and safety constraints. It also provides comprehensive, self-contained implementations of state-of-the-art safe RL algorithms.
    Contributors: Weiye Zhao, Yifan Sun, Feihan Li, Rui Chen, Ruixuan Liu, Tianhao Wei

    Publications:

    1. [J23] GUARD: A Safe Reinforcement Learning Benchmark
      Weiye Zhao, Rui Chen, Yifan Sun, Ruixuan Liu, Tianhao Wei and Changliu Liu
      Transactions on Machine Learning Research, 2024
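
    To make the constrained-RL setting that the benchmark standardizes concrete, the sketch below shows a minimal interaction loop in which reward and safety cost are logged as separate streams. The toy environment, its interface, and the random policy are illustrative stand-ins only and do not reflect GUARD's actual API.

      import numpy as np

      class ToyConstrainedEnv:
          """Hypothetical 1-D environment: move toward a goal while entering a
          hazard region incurs a per-step safety cost (kept separate from reward)."""

          def __init__(self, goal=1.0, hazard=(0.4, 0.6), horizon=100):
              self.goal, self.hazard, self.horizon = goal, hazard, horizon

          def reset(self):
              self.x, self.t = 0.0, 0
              return np.array([self.x])

          def step(self, action):
              self.x += float(np.clip(action, -0.05, 0.05))
              self.t += 1
              reward = -abs(self.goal - self.x)  # task objective
              cost = 1.0 if self.hazard[0] <= self.x <= self.hazard[1] else 0.0  # safety signal
              done = self.t >= self.horizon or self.x >= self.goal
              return np.array([self.x]), reward, cost, done

      # Constrained-RL rollout: a safe RL algorithm would maximize the return
      # while keeping the (separately accumulated) cost within a budget.
      env, rng = ToyConstrainedEnv(), np.random.default_rng(0)
      obs, done, ep_return, ep_cost = env.reset(), False, 0.0, 0.0
      while not done:
          action = rng.uniform(-0.05, 0.05)  # placeholder for a learned policy
          obs, reward, cost, done = env.step(action)
          ep_return += reward
          ep_cost += cost
      print(f"episode return {ep_return:.2f}, episode cost {ep_cost:.1f}")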



    High-Probability Monotonic Improvement Guarantee for Worst-Case Reward Performance


    This work tackles a key limitation of trust-region on-policy RL algorithms: they optimize expected performance but ignore worst-case outcomes. We introduce a novel objective function that guarantees monotonic improvement of a lower probability bound on performance with high confidence. Building on this objective, we present Absolute Policy Optimization (APO) and Proximal Absolute Policy Optimization (PAPO), which outperform existing methods and achieve significant gains in both worst-case and expected performance.
    Contributors: Weiye Zhao, Feihan Li, Yifan Sun, Rui Chen, Tianhao Wei

    Publications:

    1. [C77] Absolute Policy Optimization: Enhancing Lower Probability Bound of Performance with High Confidence
      Weiye Zhao, Feihan Li, Yifan Sun, Rui Chen, Tianhao Wei and Changliu Liu
      International Conference on Machine Learning, 2024
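
    As a rough, self-contained illustration of why a lower probability bound differs from the mean (this is not APO's actual surrogate objective, and it ignores estimation error in the empirical moments), one can form a distribution-free bound on the return from its mean and standard deviation via Cantelli's one-sided inequality:

      import numpy as np

      def cantelli_lower_bound(returns, confidence=0.9):
          """Cantelli's inequality gives P(R <= mu - k*sigma) <= 1 / (1 + k^2);
          choosing k = sqrt(confidence / (1 - confidence)) therefore yields
          P(R > mu - k*sigma) >= confidence."""
          mu, sigma = np.mean(returns), np.std(returns)
          k = np.sqrt(confidence / (1.0 - confidence))
          return mu - k * sigma

      # Two hypothetical policies with similar mean return but different spread:
      rng = np.random.default_rng(0)
      returns_low_var = rng.normal(100.0, 5.0, size=1000)
      returns_high_var = rng.normal(100.0, 30.0, size=1000)
      for name, r in [("low variance", returns_low_var), ("high variance", returns_high_var)]:
          print(f"{name}: mean {np.mean(r):.1f}, 90% lower bound {cantelli_lower_bound(r, 0.9):.1f}")
      # Similar means can hide very different worst-case behavior, which is what
      # optimizing the lower probability bound (rather than the mean) targets.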



    High-Probability Satisfaction of State-wise Worst-Case Safety


    To achieve state-wise worst-case safety, we build upon our State-wise Constrained MDP (SCMDP) and Maximum MDP (MMDP) framework. We introduce the State-wise Constrained Policy Optimization (SCPO) algorithm to ensure that state-wise safety constraints are satisfied in expectation. Building on these results, we further develop the Absolute State-wise Constrained Policy Optimization (ASCPO) algorithm, which satisfies state-wise worst-case safety with high probability.
    Contributors: Weiye Zhao, Feihan Li, Yifan Sun, Rui Chen, Tianhao Wei, Yujie Yang

    Publications:

    1. [J24] State-wise Constrained Policy Optimization
      Weiye Zhao, Rui Chen, Yifan Sun, Tianhao Wei and Changliu Liu
      Transactions on Machine Learning Research, 2024
    2. [U] Absolute State-wise Constrained Policy Optimization: High-Probability State-wise Constraints Satisfaction
      Weiye Zhao, Feihan Li, Yifan Sun, Yujie Wang, Rui Chen, Tianhao Wei and Changliu Liu
      arXiv:2410.01212, 2024
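
    A schematic way to see the distinction (the notation here is illustrative and may differ from that of the papers): a standard constrained MDP bounds a cost accumulated over the trajectory, whereas a state-wise constraint must hold at every step, which can equivalently be written through the running maximum that the MMDP tracks as part of its state:

      \[
      \text{CMDP:}\quad \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{H} c(s_t,a_t)\Big] \le d,
      \qquad
      \text{state-wise:}\quad c(s_t,a_t) \le w \ \ \forall t
      \;\Longleftrightarrow\; M_H \le w,
      \quad M_t \triangleq \max\!\big(M_{t-1},\, c(s_t,a_t)\big),\ \ M_{-1}=0.
      \]

    Under this schematic view, SCPO targets satisfaction of the state-wise constraint in expectation, while ASCPO targets the high-probability worst-case form \(\Pr(M_H \le w) \ge 1-\delta\).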



    Acquiring a Robust and Safe RL Policy Without Violations During Training


    This work presents Safe Set Guided State-wise Constrained Policy Optimization (S-3PO), a deep reinforcement learning algorithm that learns optimal policies with zero safety violations during training. S-3PO uses a safety-oriented monitor for safe exploration and introduces an "imaginary" cost that guides the agent to stay within its safety constraints. It outperforms existing methods on high-dimensional robotics tasks, marking a significant step toward real-world deployment of safe RL.
    Contributors: Weiye Zhao, Yifan Sun, Feihan Li, Rui Chen, Tianhao Wei

    Publications:

    1. [U] Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization
      Weiye Zhao, Yifan Sun, Feihan Li, Rui Chen, Tianhao Wei and Changliu Liu
      arXiv:2308.13140, 2023
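
    The safe-exploration pattern described above can be sketched as follows. The monitor, dynamics, and cost definition are toy stand-ins chosen only to illustrate the idea of correcting unsafe proposals while charging the learner an "imaginary" cost for needing correction; they are not S-3PO's actual components.

      import numpy as np

      def safety_monitor(proposed_action, limit=0.05):
          """Hypothetical safety monitor: project the proposed action onto a safe
          set (here, a simple magnitude limit) so the executed action is always safe."""
          return float(np.clip(proposed_action, -limit, limit))

      def rollout_step(state, proposed_action):
          safe_action = safety_monitor(proposed_action)
          # "Imaginary" cost: penalize how much correction the proposal required,
          # even though the corrected action keeps the real system violation-free.
          imaginary_cost = abs(proposed_action - safe_action)
          next_state = state + safe_action      # toy dynamics
          reward = -abs(1.0 - next_state)       # toy task reward
          return next_state, reward, imaginary_cost

      state, rng = 0.0, np.random.default_rng(0)
      for _ in range(5):
          proposed = rng.uniform(-0.2, 0.2)     # stand-in for a learned policy's action
          state, reward, imag_cost = rollout_step(state, proposed)
          print(f"reward {reward:+.3f}, imaginary cost {imag_cost:.3f}")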



    Period of Performance: 2023 – Present

    Point of Contact: Weiye Zhao