In robotics and autonomous vehicles, closed-loop data generation is expensive, even in simulation. The obvious solution is to pre-train on offline data, then finetune online. In practice, this often fails: the policy “unlearns” its good behavior before improving.
This post explains why and what people have explored to do about it. We’ll cover:
- The core problem with offline RL: overestimation bias
- How to mitigate overestimation bias
- How the fixes cause underestimation bias and why that’s a problem for finetuning
- What solutions exist for offline-to-online training
The Core Problem: Overestimation Bias
In offline RL, we have a dataset $\mathcal{D}$ collected from some behavior policy $\pi_\beta$, and we want to learn a policy $\pi$ that performs better than $\pi_\beta$. Rather than just imitating $\pi_\beta$ through behavior cloning (supervised learning on the state-action pairs), we use the reward information to improve upon it.
Since we can’t actually execute $\pi$ in the environment, using policy gradients would require us to re-weight each trajectory by the importance ratio $\prod_t \frac{\pi(a_t \mid s_t)}{\pi_\beta(a_t \mid s_t)}$ to estimate the return of $\pi$. Over many timesteps, this product explodes or vanishes exponentially, making the gradient estimates extremely high variance and training unstable. Consequently, all offline RL methods rely on the action-value function $Q(s,a)$ (the critic) and the Bellman equation. The idea is to use dynamic programming to figure out, given the data, the optimal “path” of actions through the environment:

$$Q(s,a) \;\leftarrow\; r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[Q(s',a')\big], \qquad (s,a,r,s') \sim \mathcal{D}$$
(There’s also a version using a $\max$ operator instead of the expectation, but it has the same issues we’ll discuss.)
Here $s'$ is the next state we observed after taking action $a$ in state $s$. The reason this can improve over $\pi_\beta$ is that we optimize $\pi$ in tandem with the critic:

$$\pi \;\leftarrow\; \arg\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big]$$
Together, these updates mean that the target value doesn’t depend on what $\pi_\beta$ actually did in state $s'$, but on what appears to be optimal according to $Q$.
So what’s the issue? We’re using a neural network to estimate $Q(s,a)$, which means some values will be estimated too high and some too low.1 Since $\pi$ is trained to take actions with high Q-values, it will tend to select actions where the Q-function happens to overestimate. These overestimated values then propagate backward through the Bellman updates, systematically inflating value estimates across the board.
In online RL, this usually isn’t a problem because we actually execute the selected actions and collect data that corrects the overestimation. In offline RL we can’t do that. We’re stuck with overestimated values that lead to suboptimal (often terrible) policies, but we have no data to tell us what the real values are. The core problem: we’re making value estimates for state-action pairs that aren’t well-represented in $\mathcal{D}$.
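To see the mechanism concretely, here is a minimal numpy sketch (the true Q-values and the noise scale are made up for illustration): even though the critic’s errors are zero-mean, selecting actions by their estimated values systematically overestimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true Q-values for 10 actions in a single state; the best action is worth 1.0.
true_q = np.linspace(0.0, 1.0, 10)

# Critic estimates = truth + zero-mean noise (a stand-in for approximation error).
noise = rng.normal(scale=0.5, size=(100_000, 10))
estimated_q = true_q + noise

# Greedily picking the action with the highest *estimated* Q-value ...
greedy = estimated_q.argmax(axis=1)

# ... yields inflated value estimates and a worse-than-optimal true value.
print("mean estimated value of greedy action:", estimated_q.max(axis=1).mean())  # well above 1.0
print("mean true value of greedy action:     ", true_q[greedy].mean())           # below 1.0
```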
Fixing Overestimation with Regularization
The core problem is that Q-values are too high for state-action pairs that aren’t in the dataset, and we can’t collect more data to correct this. Most solutions add regularization that pushes down $Q(s,a)$ for out-of-distribution actions, restricting the policy to actions that have actually been observed.
There are several ways to do this:
Penalizing Out-of-Distribution Actions (CQL):

$$\mathcal{L}_{\text{CQL}}(Q) \;=\; \alpha\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big] \;-\; \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q(s,a)\big]\Big) \;+\; \mathcal{L}_{\text{TD}}(Q)$$

This regularizer pushes down Q-values for actions the learned policy might take (which could be out-of-distribution), while pushing up Q-values for actions actually in the dataset.2
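As a rough sketch of that penalty (not the exact loss from the paper; the critic, the policy sampler, and the data below are placeholders), minimized jointly with the usual TD loss:

```python
import numpy as np

def cql_penalty(q_fn, states, dataset_actions, sample_policy_actions, alpha=1.0, n_samples=10):
    """Push down Q on actions proposed by the learned policy, push up Q on dataset actions."""
    # Q-values for actions the learned policy would take (possibly out-of-distribution).
    q_pi = np.mean([q_fn(states, sample_policy_actions(states)) for _ in range(n_samples)], axis=0)
    # Q-values for the actions actually observed in the dataset.
    q_data = q_fn(states, dataset_actions)
    return alpha * np.mean(q_pi - q_data)

# Toy usage with a made-up quadratic critic and a random "policy".
rng = np.random.default_rng(0)
q_fn = lambda s, a: -(a ** 2).sum(axis=-1)
sample_policy_actions = lambda s: rng.normal(size=(len(s), 2))
states = np.zeros((32, 4))
dataset_actions = rng.normal(scale=0.1, size=(32, 2))
print(cql_penalty(q_fn, states, dataset_actions, sample_policy_actions))
```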
Constraining the Policy: Instead of regularizing the Q-function, we can constrain the policy to stay close to the behavior policy:
- Regularizing the policy update to stay near $\pi_\beta$3
- Penalizing states where the learned policy deviates from $\pi_\beta$4
- Using implicit KL regularization via advantage-weighted regression,5 fitting $\pi$ to re-weighted samples from $\mathcal{D}$ (see the sketch after this list)
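As referenced above, a minimal sketch of the advantage-weighted re-weighting step (the temperature, the clipping value, and the Q/V estimates are placeholder assumptions):

```python
import numpy as np

def awr_weights(q_values, v_values, temperature=1.0, max_weight=100.0):
    """Exponentiated advantages used to weight a behavior-cloning loss on dataset actions.

    Dataset actions with higher advantage under the current critic get larger weight,
    which implicitly keeps the policy close to a re-weighted version of pi_beta.
    """
    advantages = q_values - v_values
    return np.minimum(np.exp(advantages / temperature), max_weight)

# Each (s, a) pair in the dataset gets this weight in the supervised policy loss.
print(awr_weights(np.array([1.0, 2.0, 0.5]), np.array([1.0, 1.0, 1.0])))
```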
Clipped Double Q-learning:
Use the minimum of two independently trained critics for the target value:

$$y \;=\; r(s,a) \;+\; \gamma \min_{i=1,2} Q_{\theta_i}(s', a'), \qquad a' \sim \pi(\cdot \mid s')$$
This was originally designed for online Q-learning: taking the minimum of two Q-networks makes us less likely to select overestimated values. Several works extend this with convex combinations of min and max, or by sampling from an ensemble.1,3,4
This is a neat technique because it automatically scales with epistemic uncertainty: the less data we have, the more the values fluctuate, and the more the $\min$ operator adds regularization.
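A minimal sketch of the target computation, assuming two critic callables and next actions sampled from the current policy:

```python
import numpy as np

def clipped_double_q_target(rewards, next_states, next_actions, q1, q2, dones, gamma=0.99):
    """TD target that uses the element-wise minimum of two independently trained critics."""
    q_min = np.minimum(q1(next_states, next_actions), q2(next_states, next_actions))
    return rewards + gamma * (1.0 - dones) * q_min
```

In practice the two critics are trained on the same data but from different initializations, so their errors are only partially correlated, which is what makes the minimum useful as a mildly pessimistic estimate.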
Fixing Overestimation by Avoiding the Max (IQL)
There’s one method, IQL,6 that takes a completely different approach. Instead of regularizing Q-values for out-of-distribution actions, it avoids having to pick the best action in state $s'$ altogether. The key insight is that we can approximate the maximum Q-value over in-distribution actions by fitting a state value function $V(s)$ using expectile regression:

$$\mathcal{L}_V \;=\; \mathbb{E}_{(s,a) \sim \mathcal{D}}\Big[L_2^\tau\big(Q(s,a) - V(s)\big)\Big]$$
where $L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\, u^2$ is an asymmetric loss that, for $\tau > 0.5$, weights positive errors more heavily. This makes $V(s)$ converge to an upper expectile of $Q(s,a)$ over the actions present in the dataset. As $\tau \to 1$, the expectile approaches the maximum.
Why expectiles instead of quantiles? Quantile regression uses an asymmetric absolute loss, $|\tau - \mathbb{1}(u < 0)|\,|u|$, which is linear rather than quadratic in the error. While quantiles are perhaps more intuitive (the $\tau$-quantile is the value below which a $\tau$ fraction of the data falls), expectiles have a key advantage: they’re sensitive to the magnitude of values, not just their ranking. This matters because Q-values aren’t just ordinal; we care about how much better one action is than another. Intuitively, expectiles generalize the mean in the same way quantiles generalize the median.
The result is that we can learn an approximation to the in-distribution max without ever querying out-of-distribution actions: the Q-function is simply trained toward the target $r(s,a) + \gamma V(s')$.
This is elegant because we don’t need to arbitrarily bias values. However, there’s a drawback: the value function isn’t co-trained with a policy, so the final policy extraction step (distilling the learned Q-values into a policy) can be problematic. If $Q(s, \cdot)$ is multi-modal over $a$ but the policy is unimodal (say, a Gaussian), the policy won’t capture the learned values well. More expressive policies help here; see IDQL7 for an example using diffusion policies. Additionally, fixing $\tau$ to some heuristic value means the learned values are only approximations.
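A minimal sketch of the expectile loss above, assuming Q-targets and V-predictions for a batch of dataset state-action pairs (the default $\tau$ here is an assumed, typical value):

```python
import numpy as np

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric squared error: for tau > 0.5, cases where V under-predicts Q are
    penalized more, pushing V toward an upper expectile of Q over dataset actions."""
    diff = q_values - v_values                   # positive when V under-predicts Q
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)
```

Setting `tau = 0.5` recovers the ordinary squared error, in which case $V$ simply regresses to the mean of $Q$ over dataset actions.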
Offline-to-Online: The Catch
The idea of offline-to-online (O2O) RL is to improve performance after offline pre-training by collecting more online data. In turn, the pre-training should make online learning more data-efficient than starting from scratch.
Unfortunately, this doesn’t work out of the box. The fixes we just discussed actually interfere with online training.
The problem is that our offline Q-values are wrong. They’re either overestimated (without regularization), underestimated (with regularization), or just approximated (with IQL or double Q-learning). When we start online RL and get correct values from actual rollouts, things break:
- If our pre-trained Q-values are too low, even bad online rollouts look attractive compared to these overly pessimistic estimates. The policy unlearns its decent behavior while chasing these spuriously attractive actions.
- If our pre-trained Q-values are too high, we never learned a good policy to begin with, since the seemingly good actions were just spurious fluctuations.
Solutions for Offline-to-Online
Skip Pre-training (RLPD): One proposal is to skip pre-training entirely and just add the offline data as an additional source during online training.8 This works surprisingly well because the costliest part of training from scratch is exploration: finding a well-performing policy. Mixing in offline data from a decent policy gives a huge boost in exploration efficiency.
RLPD also uses other tricks: Clipped Double Q-learning, LayerNorm on the critic to mitigate value overestimation, and a high Update-to-Data (UTD) ratio for sample efficiency. The combination of LayerNorm and high UTD seems to be a good pairing, also picked up by Cal-QL.9
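A minimal sketch of the data-mixing idea, assuming both buffers are simple arrays of transitions (a 50/50 split per batch, as in RLPD’s symmetric sampling; the buffer interface here is hypothetical):

```python
import numpy as np

def sample_mixed_batch(offline_data, online_data, batch_size, rng):
    """Draw half of each training batch from the offline dataset and half from the
    online replay buffer, so offline experience keeps shaping every update."""
    half = batch_size // 2
    offline_idx = rng.integers(len(offline_data), size=half)
    online_idx = rng.integers(len(online_data), size=batch_size - half)
    return np.concatenate([offline_data[offline_idx], online_data[online_idx]], axis=0)
```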
Calibrated Q-Values (Cal-QL): A more recent approach adds a lower bound to the regularized values, set at the level of what the behavior policy would achieve:

$$\mathcal{L}_{\text{Cal-QL}}(Q) \;=\; \alpha\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[\max\big(Q(s,a),\, V^{\pi_\beta}(s)\big)\big] \;-\; \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q(s,a)\big]\Big) \;+\; \mathcal{L}_{\text{TD}}(Q)$$
Cal-QL9 calls this a “calibrated” Q-function: conservative, but not too conservative. The Q-values won’t drop below the value of $\pi_\beta$, so the policy can’t be tricked into thinking random exploration is better than what it already knows. Cal-QL performs on par with or better than RLPD.
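Relative to the CQL-style penalty sketched earlier, the only change is clipping the policy term from below by an estimate of the behavior policy’s value (here a hypothetical per-state `v_beta` array, e.g. Monte Carlo returns computed from the dataset; a sketch, not the paper’s exact loss):

```python
import numpy as np

def cal_ql_penalty(q_pi, q_data, v_beta, alpha=1.0):
    """CQL-style penalty, except Q-values under the learned policy are never pushed
    below v_beta(s), an estimate of the behavior policy's value."""
    return alpha * np.mean(np.maximum(q_pi, v_beta) - q_data)
```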
Using Flow-Matching Policies:
Another approach uses Flow Matching10 (FM) policies and works well when the offline data has sufficient coverage of optimal actions. The idea (Q-chunking11) is to sidestep the policy optimization problem entirely:
- Pre-train the FM policy $\pi_{\text{BC}}$ on the offline data using behavior cloning
- Train only a Q-function, but never update the policy itself
- At inference time, use “best-of-N” sampling: draw $N$ action candidates from $\pi_{\text{BC}}$ and pick the one with the highest Q-value:

$$a^\ast \;=\; \arg\max_{a_i \in \{a_1, \dots, a_N\}} Q(s, a_i), \qquad a_i \sim \pi_{\text{BC}}(\cdot \mid s)$$
This continues unchanged into the online phase: we collect data using best-of-N, update only the Q-function, and leave $\pi_{\text{BC}}$ frozen.
Why does this avoid overestimation? Because we only ever query Q-values for actions sampled from $\pi_{\text{BC}}$, and $\pi_{\text{BC}}$ was trained on the offline data. We never ask the Q-function about out-of-distribution actions, so it can’t mislead us with overestimated values. The key enabler is that FM policies are expressive enough to capture the full support of the data distribution (unlike Gaussian policies, which can only represent a single mode).
This “sample and select” approach is also practical because fine-tuning FM policies directly with rewards is difficult (though possible; see FPO12 and FQL13). The downside is that it only works if the frozen policy can already sample good actions with reasonable probability, and inference cost scales with $N$.
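A minimal sketch of the best-of-N selection step, assuming a sampler for the frozen flow-matching policy and a critic callable (both interfaces are hypothetical):

```python
import numpy as np

def best_of_n_action(state, sample_actions, q_fn, n=32):
    """Sample n candidate actions from the frozen BC policy and return the best one under Q."""
    candidates = sample_actions(state, n)            # shape: (n, action_dim)
    states = np.repeat(state[None, :], n, axis=0)    # repeat the state for each candidate
    q_values = q_fn(states, candidates)              # shape: (n,)
    return candidates[np.argmax(q_values)]
```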
Footnotes
1. Fujimoto, S., Meger, D., & Precup, D. (2019). Off-Policy Deep Reinforcement Learning without Exploration. Proceedings of the 36th International Conference on Machine Learning (ICML). arXiv:1812.02900
2. Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Q-Learning for Offline Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2006.04779
3. Kumar, A., Fu, J., Tucker, G., & Levine, S. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1906.00949
4. Wu, Y., Tucker, G., & Nachum, O. (2019). Behavior Regularized Offline Reinforcement Learning. arXiv preprint. arXiv:1911.11361
5. Nair, A., Gupta, A., Dalal, M., & Levine, S. (2021). AWAC: Accelerating Online Reinforcement Learning with Offline Datasets. arXiv preprint. arXiv:2006.09359
6. Kostrikov, I., Nair, A., & Levine, S. (2022). Offline Reinforcement Learning with Implicit Q-Learning. International Conference on Learning Representations (ICLR). arXiv:2110.06169
7. Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., & Levine, S. (2023). IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies. arXiv preprint. arXiv:2304.10573
8. Ball, P. J., Smith, L., Kostrikov, I., & Levine, S. (2023). Efficient Online Reinforcement Learning with Offline Data. Proceedings of the 40th International Conference on Machine Learning (ICML). arXiv:2302.02948
9. Nakamoto, M., Zhai, Y., Singh, A., Mark, M. S., Ma, Y., Finn, C., Kumar, A., & Levine, S. (2024). Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2303.05479
10. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. International Conference on Learning Representations (ICLR). arXiv:2210.02747
11. Li, Q., Zhou, Z., & Levine, S. (2025). Reinforcement Learning with Action Chunking. arXiv preprint. arXiv:2507.07969
12. McAllister, D., Ge, S., Yi, B., Kim, C. M., Weber, E., Choi, H., Feng, H., & Kanazawa, A. (2025). Flow Matching Policy Gradients. arXiv preprint. arXiv:2507.21053
13. Park, S., Li, Q., & Levine, S. (2025). Flow Q-Learning. arXiv preprint. arXiv:2502.02538