New AIS Framework Enhances Efficiency of Reinforcement Learning in Large Language Models

Published on May 15, 2026

Reinforcement learning for large language models (LLMs) is often bottlenecked by rollout generation. Researchers typically pair low-precision rollouts, such as FP8, with BF16 training to improve throughput and reduce memory usage. However, this precision mismatch between the rollout policy and the training policy can bias gradients, undermining policy performance and risking the stability of training.

This challenge prompted the creation of Adaptive Importance Sampling (AIS), a framework designed to address the rollout-training mismatch. AIS adjusts its intervention strength based on real-time diagnostics, such as importance-weight reliability and divergence severity. This adaptation lets AIS interpolate between fully corrected and uncorrected gradients, targeting the root causes of bias in policy training rather than applying a fixed correction.
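The general idea can be sketched in code. The snippet below is a minimal illustration, not the paper's actual algorithm: the function name, the use of effective sample size (ESS) as the reliability diagnostic, and the linear blending rule are all assumptions chosen to make the interpolation between corrected and uncorrected gradients concrete.

```python
import numpy as np

def adaptive_is_weights(logp_train, logp_rollout, ess_floor=0.5, max_ratio=10.0):
    """Illustrative adaptive importance-sampling correction (hypothetical form).

    Blends fully corrected per-token weights (the ratio pi_train / pi_rollout)
    with uncorrected weights (all ones). The blend strength alpha is driven by
    a reliability diagnostic: the normalized effective sample size (ESS) of the
    importance weights. Low ESS means the weights are unreliable, so the
    correction is scaled back toward the uncorrected gradient.
    """
    # Clipped importance ratios between the training and rollout policies.
    log_ratio = np.clip(logp_train - logp_rollout,
                        -np.log(max_ratio), np.log(max_ratio))
    ratio = np.exp(log_ratio)
    # Normalized ESS in (0, 1]: 1.0 means uniform weights (no mismatch).
    w = ratio / ratio.sum()
    ess = 1.0 / (len(ratio) * np.sum(w ** 2))
    # Intervention strength: full correction when ESS is high,
    # fading toward no correction as ESS falls below the floor.
    alpha = min(1.0, ess / ess_floor)
    return alpha * ratio + (1.0 - alpha) * np.ones_like(ratio), alpha
```

When the rollout and training log-probabilities agree, the ratios are all one, ESS is maximal, and the function returns the fully corrected (here, identity) weights; as the two policies diverge, alpha shrinks and the weights move toward the uncorrected baseline.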

Initial tests demonstrate that AIS improves model quality without sacrificing speed. When implemented alongside techniques like GRPO on diffusion-based models such as LLaDA-8B-Instruct and autoregressive models like Qwen3-8B, AIS matches traditional BF16 baselines while retaining rollout speedups of 1.5 to 2.76 times.