The IPS estimator, which is used for off-policy evaluation in contextual bandit problems, is well explained here: Doubly Robust Policy Evaluation and Optimization. In the IPS estimator, the old policy $\mu$ (the behavior policy) is allowed to be non-stationary, whereas the new policy $\nu$ (the target policy) should be stationary. I wonder…
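
For concreteness, here is a minimal sketch of the IPS estimator with a behavior policy that drifts over time. Everything in it is an assumption made for illustration, not something taken from the paper: the two-context, two-action toy problem, the reward table `mean_reward`, the linear drift of the behavior policy, and the names `ips_estimate` and `simulate`. The point it tries to show is that the estimate only needs the probability with which each logged action was actually chosen at that step, so $\mu$ can change between rounds while $\nu$ stays fixed.

```python
import random


def ips_estimate(logs, target_prob):
    """IPS estimate of the target policy's value.

    logs: list of (context, action, reward, behavior_prob), where
    behavior_prob is the probability mu assigned to the logged action
    at the time it was taken.
    target_prob(x, a): probability nu assigns to action a in context x.
    """
    total = 0.0
    for x, a, r, p in logs:
        total += target_prob(x, a) / p * r  # importance-weighted reward
    return total / len(logs)


def simulate(n=100_000, seed=0):
    """Toy experiment (hypothetical setup) with a drifting behavior policy."""
    rng = random.Random(seed)
    # Assumed Bernoulli reward means, indexed by (context, action).
    mean_reward = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.3}

    def target_prob(x, a):
        # Stationary target policy nu: plays action 1 - x with prob. 0.9.
        return 0.9 if a == 1 - x else 0.1

    logs = []
    for t in range(n):
        x = rng.randint(0, 1)
        # Non-stationary behavior policy mu: P(action = 1) drifts 0.2 -> 0.8.
        p1 = 0.2 + 0.6 * t / (n - 1)
        a = 1 if rng.random() < p1 else 0
        p = p1 if a == 1 else 1.0 - p1
        r = 1.0 if rng.random() < mean_reward[(x, a)] else 0.0
        logs.append((x, a, r, p))

    # Exact value of nu under uniformly distributed contexts.
    true_value = sum(
        0.5 * target_prob(x, a) * mean_reward[(x, a)]
        for x in (0, 1)
        for a in (0, 1)
    )
    return ips_estimate(logs, target_prob), true_value


if __name__ == "__main__":
    estimate, truth = simulate()
    print(f"IPS estimate: {estimate:.3f}, true value: {truth:.3f}")
```

In this toy setup the IPS estimate should land close to the exact value even though $\mu$ is different at every step, provided the logged propensities are the true ones and stay bounded away from zero.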

