In robotics and autonomous vehicles, closed-loop data generation is expensive, even in simulation. The obvious solution is to pre-train on offline data, then finetune online. In practice, this often fails: the policy “unlearns” its good behavior before improving.
This post explains why and what people have explored to do about it. We’ll cover:
- The core problem with offline RL: overestimation bias
- How to mitigate overestimation bias
- How the fixes cause underestimation bias and why that’s a problem for finetuning
- What solutions exist for offline-to-online training
The Core Problem: Overestimation Bias
In offline RL, we have a dataset $\mathcal{D}$ collected from some behavior policy $\pi_\beta$, and we want to learn a policy $\pi$ that performs better than $\pi_\beta$. Rather than just imitating $\pi_\beta$ through behavior cloning (supervised learning on the state-action pairs), we use the reward information to improve upon it.
Since we can’t actually execute $\pi$ in the environment, using policy gradients would require us to re-weight each trajectory by the importance ratio $\prod_t \frac{\pi(a_t \mid s_t)}{\pi_\beta(a_t \mid s_t)}$ to estimate the return of $\pi$. Over many timesteps, this product explodes or vanishes exponentially, making the gradient estimates extremely high variance and training unstable. Consequently, all offline RL methods rely on the action-value function $Q(s,a)$ (the critic) and the Bellman equation. The idea is to use dynamic programming to figure out, given the data, the optimal “path” of actions through the environment:

$$Q(s,a) \;\leftarrow\; r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[Q(s',a')\big], \qquad (s,a,r,s') \sim \mathcal{D}$$
(There’s also a version using a $\max$ operator instead of the expectation, but it has the same issues we’ll discuss.)
Here $s'$ is the next state we observed after taking action $a$ in state $s$. The reason this can improve over $\pi_\beta$ is that we optimize $\pi$ in tandem with the critic:

$$\pi \;\leftarrow\; \arg\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big]$$
Together, these updates mean that the target value doesn’t depend on what $\pi_\beta$ actually did in state $s'$, but on what appears to be optimal according to $Q$.
So what’s the issue? We’re using a neural network to estimate $Q(s,a)$, which means some values will be estimated too high and some too low.1 Since $\pi$ is trained to take actions with high Q-values, it will tend to select actions where the Q-function happens to overestimate. These overestimated values then propagate backward through the Bellman updates, systematically inflating value estimates across the board.
In online RL, this usually isn’t a problem because we actually execute the selected actions and collect data that corrects the overestimation. In offline RL we can’t do that. We’re stuck with overestimated values that lead to suboptimal (often terrible) policies, but we have no data to tell us what the real values are. The core problem: we’re making value estimates for state-action pairs that aren’t well-represented in $\mathcal{D}$.
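To see the mechanism concretely, here is a minimal numpy sketch (the true Q-values and the noise scale are made up for illustration): even though the critic’s errors are zero-mean, selecting actions by their estimated values systematically overestimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true Q-values for 10 actions in a single state; the best action is worth 1.0.
true_q = np.linspace(0.0, 1.0, 10)

# Critic estimates = truth + zero-mean noise (a stand-in for approximation error).
noise = rng.normal(scale=0.5, size=(100_000, 10))
estimated_q = true_q + noise

# Greedily picking the action with the highest *estimated* Q-value ...
greedy = estimated_q.argmax(axis=1)

# ... yields inflated value estimates and a worse-than-optimal true value.
print("mean estimated value of greedy action:", estimated_q.max(axis=1).mean())  # well above 1.0
print("mean true value of greedy action:     ", true_q[greedy].mean())           # below 1.0
```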
Fixing Overestimation with Regularization
The core problem is that Q-values are too high for state-action pairs that aren’t in the dataset, and we can’t collect more data to correct this. Most solutions add regularization that pushes down $Q(s,a)$ for out-of-distribution actions, restricting the policy to actions that have actually been observed.
There are several ways to do this:
Penalizing Out-of-Distribution Actions (CQL):

$$\mathcal{L}_{\text{CQL}}(Q) \;=\; \alpha\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big] \;-\; \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q(s,a)\big]\Big) \;+\; \mathcal{L}_{\text{TD}}(Q)$$

This regularizer pushes down Q-values for actions the learned policy might take (which could be out-of-distribution), while pushing up Q-values for actions actually in the dataset.2
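As a rough sketch of that penalty (not the exact loss from the paper; the critic, the policy sampler, and the data below are placeholders), minimized jointly with the usual TD loss:

```python
import numpy as np

def cql_penalty(q_fn, states, dataset_actions, sample_policy_actions, alpha=1.0, n_samples=10):
    """Push down Q on actions proposed by the learned policy, push up Q on dataset actions."""
    # Q-values for actions the learned policy would take (possibly out-of-distribution).
    q_pi = np.mean([q_fn(states, sample_policy_actions(states)) for _ in range(n_samples)], axis=0)
    # Q-values for the actions actually observed in the dataset.
    q_data = q_fn(states, dataset_actions)
    return alpha * np.mean(q_pi - q_data)

# Toy usage with a made-up quadratic critic and a random "policy".
rng = np.random.default_rng(0)
q_fn = lambda s, a: -(a ** 2).sum(axis=-1)
sample_policy_actions = lambda s: rng.normal(size=(len(s), 2))
states = np.zeros((32, 4))
dataset_actions = rng.normal(scale=0.1, size=(32, 2))
print(cql_penalty(q_fn, states, dataset_actions, sample_policy_actions))
```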
Constraining the Policy: Instead of regularizing the Q-function, we can constrain the policy to stay close to the behavior policy:
- Regularizing the policy update to stay near $\pi_\beta$3
- Penalizing states where the learned policy deviates from $\pi_\beta$4
- Using implicit KL regularization via advantage-weighted regression,5 fitting $\pi$ to re-weighted samples from $\mathcal{D}$ (see the sketch after this list)
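As referenced above, a minimal sketch of the advantage-weighted re-weighting step (the temperature, the clipping value, and the Q/V estimates are placeholder assumptions):

```python
import numpy as np

def awr_weights(q_values, v_values, temperature=1.0, max_weight=100.0):
    """Exponentiated advantages used to weight a behavior-cloning loss on dataset actions.

    Dataset actions with higher advantage under the current critic get larger weight,
    which implicitly keeps the policy close to a re-weighted version of pi_beta.
    """
    advantages = q_values - v_values
    return np.minimum(np.exp(advantages / temperature), max_weight)

# Each (s, a) pair in the dataset gets this weight in the supervised policy loss.
print(awr_weights(np.array([1.0, 2.0, 0.5]), np.array([1.0, 1.0, 1.0])))
```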
Clipped Double Q-learning:
Use the minimum of two independently trained critics for the target value:

$$y \;=\; r(s,a) \;+\; \gamma \min_{i=1,2} Q_{\theta_i}(s', a'), \qquad a' \sim \pi(\cdot \mid s')$$
This was originally designed for online Q-learning: taking the minimum of two Q-networks makes us less likely to select overestimated values. Several works extend this with convex combinations of min and max, or by sampling from an ensemble.1,3,4
This is a neat technique because it automatically scales with epistemic uncertainty: the less data we have, the more the values fluctuate, and the more the $\min$ operator adds regularization.
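A minimal sketch of the target computation, assuming two critic callables and next actions sampled from the current policy:

```python
import numpy as np

def clipped_double_q_target(rewards, next_states, next_actions, q1, q2, dones, gamma=0.99):
    """TD target that uses the element-wise minimum of two independently trained critics."""
    q_min = np.minimum(q1(next_states, next_actions), q2(next_states, next_actions))
    return rewards + gamma * (1.0 - dones) * q_min
```

In practice the two critics are trained on the same data but from different initializations, so their errors are only partially correlated, which is what makes the minimum useful as a mildly pessimistic estimate.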
Fixing Overestimation by Avoiding the Max (IQL)
There’s one method, IQL,6 that takes a completely different approach. Instead of regularizing Q-values for out-of-distribution actions, it avoids having to pick the best action in state $s'$ altogether. The key insight is that we can approximate the maximum Q-value over in-distribution actions by fitting a state value function $V(s)$ using expectile regression:

$$\mathcal{L}_V \;=\; \mathbb{E}_{(s,a) \sim \mathcal{D}}\Big[L_2^\tau\big(Q(s,a) - V(s)\big)\Big]$$
where $L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\, u^2$ is an asymmetric loss that, for $\tau > 0.5$, weights positive errors more heavily. This makes $V(s)$ converge to an upper expectile of $Q(s,a)$ over the actions present in the dataset. As $\tau \to 1$, the expectile approaches the maximum.
Why expectiles instead of quantiles? Quantile regression uses an asymmetric absolute loss, $|\tau - \mathbb{1}(u < 0)|\,|u|$, which is linear rather than quadratic in the error. While quantiles are perhaps more intuitive (the $\tau$-quantile is the value below which a $\tau$ fraction of the data falls), expectiles have a key advantage: they’re sensitive to the magnitude of values, not just their ranking. This matters because Q-values aren’t just ordinal; we care about how much better one action is than another. Intuitively, expectiles generalize the mean in the same way quantiles generalize the median.
The result is that we can learn an approximation to the in-distribution max without ever querying out-of-distribution actions: the Q-function is simply trained toward the target $r(s,a) + \gamma V(s')$.
This is elegant because we don’t need to arbitrarily bias values. However, there’s a drawback: the value function isn’t co-trained with a policy, so the final policy extraction step (distilling the learned Q-values into a policy) can be problematic. If $Q(s, \cdot)$ is multi-modal over $a$ but the policy is unimodal (say, a Gaussian), the policy won’t capture the learned values well. More expressive policies help here; see IDQL7 for an example using diffusion policies. Additionally, fixing $\tau$ to some heuristic value means the learned values are only approximations.
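A minimal sketch of the expectile loss above, assuming Q-targets and V-predictions for a batch of dataset state-action pairs (the default $\tau$ here is an assumed, typical value):

```python
import numpy as np

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric squared error: for tau > 0.5, cases where V under-predicts Q are
    penalized more, pushing V toward an upper expectile of Q over dataset actions."""
    diff = q_values - v_values                   # positive when V under-predicts Q
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)
```

Setting `tau = 0.5` recovers the ordinary squared error, in which case $V$ simply regresses to the mean of $Q$ over dataset actions.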
Offline-to-Online: The Catch
The idea of offline-to-online (O2O) RL is to improve performance after offline pre-training by collecting more online data. In turn, the pre-training should make online learning more data-efficient than starting from scratch.
Unfortunately, this doesn’t work out of the box. The fixes we just discussed actually interfere with online training.
The problem is that our offline Q-values are wrong. They’re either overestimated (without regularization), underestimated (with regularization), or just approximated (with IQL or double Q-learning). When we start online RL and get correct values from actual rollouts, things break:
- If our pre-trained Q-values are too low, even bad online rollouts look attractive compared to these overly pessimistic estimates. The policy unlearns its decent behavior while chasing these spuriously attractive actions.
- If our pre-trained Q-values are too high, we never learned a good policy to begin with, since the seemingly good actions were just spurious fluctuations.
Solutions for Offline-to-Online
Skip Pre-training (RLPD): One proposal is to skip pre-training entirely and just add the offline data as an additional source during online training.8 This works surprisingly well because the costliest part of training from scratch is exploration: finding a well-performing policy. Mixing in offline data from a decent policy gives a huge boost in exploration efficiency.
RLPD also uses other tricks: Clipped Double Q-learning, LayerNorm on the critic to mitigate value overestimation, and a high Update-to-Data (UTD) ratio for sample efficiency. The combination of LayerNorm and high UTD seems to be a good pairing, also picked up by Cal-QL.9
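A minimal sketch of the data-mixing idea, assuming both buffers are simple arrays of transitions (a 50/50 split per batch, as in RLPD’s symmetric sampling; the buffer interface here is hypothetical):

```python
import numpy as np

def sample_mixed_batch(offline_data, online_data, batch_size, rng):
    """Draw half of each training batch from the offline dataset and half from the
    online replay buffer, so offline experience keeps shaping every update."""
    half = batch_size // 2
    offline_idx = rng.integers(len(offline_data), size=half)
    online_idx = rng.integers(len(online_data), size=batch_size - half)
    return np.concatenate([offline_data[offline_idx], online_data[online_idx]], axis=0)
```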
Calibrated Q-Values (Cal-QL): A more recent approach adds a lower bound to the regularized values, set at the level of what the behavior policy would achieve:

$$\mathcal{L}_{\text{Cal-QL}}(Q) \;=\; \alpha\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[\max\big(Q(s,a),\, V^{\pi_\beta}(s)\big)\big] \;-\; \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q(s,a)\big]\Big) \;+\; \mathcal{L}_{\text{TD}}(Q)$$
Cal-QL9 calls this a “calibrated” Q-function: conservative, but not too conservative. The Q-values won’t drop below the value of $\pi_\beta$, so the policy can’t be tricked into thinking random exploration is better than what it already knows. Cal-QL performs on par with or better than RLPD.
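Relative to the CQL-style penalty sketched earlier, the only change is clipping the policy term from below by an estimate of the behavior policy’s value (here a hypothetical per-state `v_beta` array, e.g. Monte Carlo returns computed from the dataset; a sketch, not the paper’s exact loss):

```python
import numpy as np

def cal_ql_penalty(q_pi, q_data, v_beta, alpha=1.0):
    """CQL-style penalty, except Q-values under the learned policy are never pushed
    below v_beta(s), an estimate of the behavior policy's value."""
    return alpha * np.mean(np.maximum(q_pi, v_beta) - q_data)
```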
Using Flow-Matching Policies:
Another approach uses Flow Matching10 (FM) policies and works well when the offline data has sufficient coverage of optimal actions. The idea (Q-chunking11) is to sidestep the policy optimization problem entirely:
- Pre-train the FM policy $\pi_{\text{BC}}$ on the offline data using behavior cloning
- Train only a Q-function, but never update the policy itself
- At inference time, use “best-of-N” sampling: draw $N$ action candidates from $\pi_{\text{BC}}$ and pick the one with the highest Q-value:

$$a^\ast \;=\; \arg\max_{a_i \in \{a_1, \dots, a_N\}} Q(s, a_i), \qquad a_i \sim \pi_{\text{BC}}(\cdot \mid s)$$
This continues unchanged into the online phase: we collect data using best-of-N, update only the Q-function, and leave $\pi_{\text{BC}}$ frozen.
Why does this avoid overestimation? Because we only ever query Q-values for actions sampled from $\pi_{\text{BC}}$, and $\pi_{\text{BC}}$ was trained on the offline data. We never ask the Q-function about out-of-distribution actions, so it can’t mislead us with overestimated values. The key enabler is that FM policies are expressive enough to capture the full support of the data distribution (unlike Gaussian policies, which can only represent a single mode).
This “sample and select” approach is also practical because fine-tuning FM policies directly with rewards is difficult (though possible; see FPO12 and FQL13). The downside is that it only works if the frozen policy can already sample good actions with reasonable probability, and inference cost scales with $N$.
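A minimal sketch of the best-of-N selection step, assuming a sampler for the frozen flow-matching policy and a critic callable (both interfaces are hypothetical):

```python
import numpy as np

def best_of_n_action(state, sample_actions, q_fn, n=32):
    """Sample n candidate actions from the frozen BC policy and return the best one under Q."""
    candidates = sample_actions(state, n)            # shape: (n, action_dim)
    states = np.repeat(state[None, :], n, axis=0)    # repeat the state for each candidate
    q_values = q_fn(states, candidates)              # shape: (n,)
    return candidates[np.argmax(q_values)]
```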
Footnotes
1. Fujimoto, S., Meger, D., & Precup, D. (2019). Off-Policy Deep Reinforcement Learning without Exploration. Proceedings of the 36th International Conference on Machine Learning (ICML). arXiv:1812.02900
2. Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Q-Learning for Offline Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2006.04779
3. Kumar, A., Fu, J., Tucker, G., & Levine, S. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1906.00949
4. Wu, Y., Tucker, G., & Nachum, O. (2019). Behavior Regularized Offline Reinforcement Learning. arXiv preprint. arXiv:1911.11361
5. Nair, A., Gupta, A., Dalal, M., & Levine, S. (2021). AWAC: Accelerating Online Reinforcement Learning with Offline Datasets. arXiv preprint. arXiv:2006.09359
6. Kostrikov, I., Nair, A., & Levine, S. (2022). Offline Reinforcement Learning with Implicit Q-Learning. International Conference on Learning Representations (ICLR). arXiv:2110.06169
7. Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., & Levine, S. (2023). IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies. arXiv preprint. arXiv:2304.10573
8. Ball, P. J., Smith, L., Kostrikov, I., & Levine, S. (2023). Efficient Online Reinforcement Learning with Offline Data. Proceedings of the 40th International Conference on Machine Learning (ICML). arXiv:2302.02948
9. Nakamoto, M., Zhai, Y., Singh, A., Mark, M. S., Ma, Y., Finn, C., Kumar, A., & Levine, S. (2024). Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2303.05479
10. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. International Conference on Learning Representations (ICLR). arXiv:2210.02747
11. Li, Q., Zhou, Z., & Levine, S. (2025). Reinforcement Learning with Action Chunking. arXiv preprint. arXiv:2507.07969
12. McAllister, D., Ge, S., Yi, B., Kim, C. M., Weber, E., Choi, H., Feng, H., & Kanazawa, A. (2025). Flow Matching Policy Gradients. arXiv preprint. arXiv:2507.21053
13. Park, S., Li, Q., & Levine, S. (2025). Flow Q-Learning. arXiv preprint. arXiv:2502.02538