Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) stands out for its simplicity, numerical stability, and empirical performance. It optimizes policies via surrogate objectives based on importance ratios, which require nontrivial likelihood evaluation. Although the Gaussian policy assumption simplifies this likelihood evaluation, it can restrict the performance of the resulting policy. Replacing Gaussian policies with continuous normalizing flows (CNFs), represented via ordinary differential equations (ODEs), enhances expressiveness for multi-modal actions but makes importance-ratio evaluation considerably more challenging. Conventional likelihood computation with CNFs proceeds along the full flow path, which demands costly simulation and back-propagation and is prone to exploding or vanishing gradients. To resolve this issue, we propose PolicyFlow, a novel on-policy CNF-based reinforcement learning algorithm that integrates expressive policies with PPO-style objectives while avoiding likelihood evaluation along the full flow path. PolicyFlow approximates importance ratios using velocity field variations along a simple interpolation path, reducing computational overhead while preserving the stability of proximal updates. To avoid mode collapse and further encourage diverse behaviors, PolicyFlow introduces an implicit entropy regularizer, inspired by Brownian motion, that is both conceptually simple and computationally lightweight. Experiments on diverse tasks across environments such as MultiGoal, IsaacLab, and MuJoCo Playground show that PolicyFlow achieves competitive or superior performance compared to PPO with Gaussian policies and state-of-the-art flow-based methods, with MultiGoal in particular demonstrating PolicyFlow's ability to capture diverse multimodal action distributions.
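To make the high-level description above concrete, the following is a minimal sketch of a PPO-style clipped surrogate whose importance ratio is approximated from velocity-field changes along a linear interpolation path, rather than by integrating the full ODE. The function and network names (vel_net, vel_net_old) and the specific ratio surrogate (a difference of per-sample flow-matching errors) are illustrative assumptions, not the exact PolicyFlow estimator; the Brownian regularizer is omitted here.

```python
# Hypothetical sketch: clipped surrogate with a path-based importance-ratio surrogate.
import torch

def surrogate_loss(vel_net, vel_net_old, obs, actions, advantages,
                   clip_eps=0.2, n_time_samples=4):
    """vel_net(obs, x_t, t) -> predicted velocity; vel_net_old is a frozen copy."""
    B, A = actions.shape
    log_ratio = torch.zeros(B, device=actions.device)
    for _ in range(n_time_samples):
        t = torch.rand(B, 1, device=actions.device)        # t ~ U[0, 1]
        x0 = torch.randn(B, A, device=actions.device)       # base noise sample
        x_t = (1.0 - t) * x0 + t * actions                  # simple interpolation path
        target_v = actions - x0                              # velocity of that path
        # Per-sample velocity-matching errors under current and old velocity fields.
        err_new = ((vel_net(obs, x_t, t) - target_v) ** 2).sum(-1)
        with torch.no_grad():
            err_old = ((vel_net_old(obs, x_t, t) - target_v) ** 2).sum(-1)
        # Smaller error under the new field -> larger surrogate ratio.
        log_ratio = log_ratio + 0.5 * (err_old - err_new) / n_time_samples
    ratio = torch.exp(log_ratio)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because the ratio surrogate only queries the velocity field at sampled points on the interpolation path, no ODE simulation or back-propagation through the full flow is required.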
Exploration density maps on PointMaze-Medium-Diverse-GDense-v3. (a) Environment overview: the agent is initialized at the green point at the start of each episode, and the four red points indicate goal locations with equal rewards. (b) Exploration heatmap of PPO, showing limited coverage due to the simple Gaussian policy. (c) Exploration heatmap of PolicyFlow without the Brownian regularizer, which improves coverage but still leaves some regions under-explored. (d) Exploration heatmap of PolicyFlow with the Brownian regularizer, achieving near-complete coverage of all feasible locations.
MultiGoal test: 1000 trajectories sampled from the same initial point. (a) PPO with Gaussian entropy regularization (wg=0.001) covers only a limited set of goals. (b,c) DPPO and FPO collapse to a small number of modes, likely because neither method incorporates any form of entropy regularization. (d) PolicyFlow with uniform noise injection (weight 0.05) still suffers from mode collapse, concentrating on only a few modes. (e) PolicyFlow with only Gaussian entropy regularization (wg=0.001) partially alleviates mode collapse. (f) PolicyFlow with the proposed Brownian regularizer (wb=0.25) and Gaussian entropy regularization (wg=0.001) achieves the most diverse and balanced goal-reaching behaviors.
Learning curves on MuJoCo Playground benchmarks. Plots show mean episodic reward with standard error (y-axis) over environment steps (x-axis, total 30M steps), averaged over 5 random seeds.