Identifiable Latent Bandits

Leveraging observational data for personalized decision-making

Identifiable latent bandits learn a latent variable model offline and use it online to choose actions.
Identifying the best treatment for a new patient using ILB. Offline, we learn a provably identifiable latent variable model (LVM), which previous latent bandit algorithms assume is known a priori. Online, we apply a decision-making algorithm that makes use of the LVM.

Motivation

For many chronic diseases, treatment is individualized and sequential: patients go through a long process of trying different options over time.
How can we use observational data to shorten treatment times?

Why latent bandits?

Context alone is not enough

A single observed context can be noisy and incomplete. The optimal action may depend on a stable latent state that only becomes clear across repeated observations.

Why identifiability?

Many latent models fit the same data

A learned LVM must recover the structure needed for decision-making; fitting the data well is not sufficient. Identifiability specifies the conditions under which this recovery is possible.

Why observational data?

Exploration is expensive

Historical decisions and outcomes can hasten personalization, reducing the amount of online exploration required for a new instance.

Abstract

Sequential decision-making algorithms such as multi-armed bandits can find optimal personalized decisions, but are notoriously sample-hungry. To combat this, latent bandits offer rapid exploration and personalization beyond what context variables alone can offer, provided that a latent variable model of problem instances can be learned consistently. However, existing works give no guidance as to how such a model can be found.

In this work, we propose an identifiable latent bandit (ILB) framework that leads to optimal decision-making with a shorter exploration time than classical bandits by learning from historical records of decisions and outcomes. Our method is based on nonlinear independent component analysis that provably identifies representations from observational data sufficient to infer optimal actions in new bandit instances.

Identifiable latent bandits use observational histories from previous instances to learn the hidden structure shared across individuals. The learned representation is then used online to infer a new instance's latent state from repeated contexts and choose actions with less exploration. We verify this strategy in simulated and semi-synthetic environments, showing substantial improvement over online and offline learning baselines when identifying conditions are satisfied.

Contributions

1

Learning latent bandit models

We introduce identifiable latent bandits, ILB, the first family of latent bandit algorithms that recover a continuous vector-valued latent state without requiring the latent variable model (LVM) to be known a priori.

2

Mean-contrastive representation learning

We build on nonlinear independent component analysis (ICA) for identifiable representations and introduce mean-contrastive learning, which we use to provably learn the LVM.

3

Identifiability guarantees

We prove that this framework is partially identifiable to a degree sufficient for optimal decision-making and propose three algorithms that exploit the latent variable model for personalized sequential decision-making in the regret minimization setting.

4

Sample-efficiency

Our experiments show that, when identifying conditions hold, our algorithms improve over online bandits and offline regression baselines in synthetic and semi-synthetic treatment environments.

Offline: Identifiability & Estimation

In the offline stage, we learn a feature extractor from historical records of instance data using our mean-contrastive objective: given a context observation, predict which historical instance generated it. For our identifiability guarantees, we assume that each instance is generated according to the structural equations given below.
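One way to read this objective is as multi-class classification over historical instances, scored with softmax cross-entropy. The sketch below only illustrates that reading and is not the paper's implementation: the function name, the linear classification head, and the argument shapes are assumptions, and the feature extractor is taken as given.

```python
import numpy as np

def instance_classification_loss(features, instance_ids, head_weights, head_bias):
    """Mean cross-entropy for predicting which instance produced each context.

    features:     (n, k) outputs of a feature extractor h(x), assumed given
    instance_ids: (n,) integer index of the historical instance behind each context
    head_weights: (k, n_instances) and head_bias: (n_instances,), an assumed linear head
    """
    logits = features @ head_weights + head_bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Pick out the log-probability assigned to the true instance of each context
    return -log_probs[np.arange(len(instance_ids)), instance_ids].mean()
```

With an uninformative head (all zeros), the loss equals the log of the number of instances; training the extractor and head jointly would push it below that value.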

Once this feature extractor is learned, repeated contexts from an instance can be averaged in representation space to estimate the latent state. A reward model is then fit from inferred latent states and observed rewards, giving action-value estimates for online decision-making.
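These two steps can be sketched as follows, assuming the learned features are already computed for every historical context. The per-arm ridge regression with an intercept is an illustrative choice of reward model, and all function and parameter names are hypothetical.

```python
import numpy as np

def estimate_latents(features, instance_ids, n_instances):
    """Average the learned representations within each instance.

    Assumes every instance contributes at least one context.
    """
    sums = np.zeros((n_instances, features.shape[1]))
    np.add.at(sums, instance_ids, features)  # unbuffered per-instance accumulation
    counts = np.bincount(instance_ids, minlength=n_instances)
    return sums / counts[:, None]

def fit_reward_model(latents, instance_ids, actions, rewards, n_arms, ridge=1e-3):
    """Per-arm ridge regression from [z_hat, 1] to reward (assumed linear model).

    Returns an (n_arms, dim + 1) array; the last column holds the intercepts.
    """
    design = np.hstack([latents[instance_ids], np.ones((len(instance_ids), 1))])
    dim = design.shape[1]
    weights = np.zeros((n_arms, dim))
    for arm in range(n_arms):
        mask = actions == arm
        gram = design[mask].T @ design[mask] + ridge * np.eye(dim)
        weights[arm] = np.linalg.solve(gram, design[mask].T @ rewards[mask])
    return weights
```

The action-value estimate for an arm is then the dot product of its weight row with the inferred latent state, extended by a constant 1.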

Structural causal model for identifiable latent bandits.
The assumed structural equations for an individual instance: a fixed latent state generates noisy time-varying contexts, while actions and latent state determine rewards.
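To make the caption concrete, here is a small simulator for one instance. The specific functional forms are assumptions chosen for illustration and are not taken from the paper: a standard normal latent state, contexts produced by a tanh of a linear mixing applied to a noise-perturbed latent, rewards linear in the latent state, and a uniform logging policy.

```python
import numpy as np

def simulate_instance(rng, n_steps, mixing, reward_weights,
                      context_noise=0.1, reward_noise=0.1):
    """Draw one instance: fixed latent state, noisy contexts, action-dependent rewards.

    mixing:         (context_dim, latent_dim) matrix of the assumed mixing function
    reward_weights: (n_arms, latent_dim) assumed per-arm linear reward weights
    """
    latent_dim = mixing.shape[1]
    n_arms = reward_weights.shape[0]
    z = rng.normal(size=latent_dim)                # fixed for the whole instance
    actions = rng.integers(n_arms, size=n_steps)   # hypothetical uniform logging policy
    # Contexts vary over time only through the noise added to the latent state
    noisy_latents = z + context_noise * rng.normal(size=(n_steps, latent_dim))
    contexts = np.tanh(noisy_latents @ mixing.T)
    # Rewards depend on the chosen action and the latent state
    rewards = reward_weights[actions] @ z + reward_noise * rng.normal(size=n_steps)
    return z, contexts, actions, rewards
```

Calling this repeatedly with a shared `mixing` and `reward_weights` yields a collection of historical instances that differ only in their latent state and noise draws.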

Online Decision-Making

For a new instance, we estimate the latent state by applying the learned LVM to the observed contexts. Online algorithms then use this latent-state estimate with the learned reward model to choose actions, so personalization can start from structure learned across previous instances instead of from scratch. The methods differ in how much online evidence they use. CPG is a simple context-posterior greedy method, FPG refines the latent-state estimate using observed rewards, and FPG-TS samples from the posterior to keep exploring when the learned model is uncertain or biased.

CPG

Context posterior greedy

Uses the average learned representation of observed contexts to estimate the latent state, then acts greedily under the offline reward model.

FPG

Full posterior greedy

Refines the latent-state estimate using both context history and observed rewards, making it more adaptive when the representation is biased.

FPG-TS

Exploratory posterior sampling

Samples reward means under the posterior to trade off fast personalization with recovery from uncertainty or misspecification.
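The three variants can be sketched with a single Gaussian posterior over the latent state. This is a simplified illustration under linear-Gaussian assumptions that are ours, not the paper's: each context representation is treated as the latent state plus isotropic noise, rewards are linear in the latent state with Gaussian noise, and the prior is zero-mean Gaussian. The class, its parameters, and the `(n_arms, dim + 1)` weight layout (intercept in the last column) are all hypothetical.

```python
import numpy as np

class LatentPosterior:
    """Gaussian posterior over the latent state under assumed linear-Gaussian models."""

    def __init__(self, weights, context_var=1.0, reward_var=1.0, prior_var=10.0):
        self.w, self.b = weights[:, :-1], weights[:, -1]  # per-arm slopes, intercepts
        dim = self.w.shape[1]
        self.context_var, self.reward_var = context_var, reward_var
        self.precision = np.eye(dim) / prior_var          # zero-mean Gaussian prior
        self.shift = np.zeros(dim)                        # equals precision @ mean

    def observe_context(self, feature):
        """Update with one learned context representation h(x_t)."""
        self.precision += np.eye(len(feature)) / self.context_var
        self.shift += feature / self.context_var

    def observe_reward(self, arm, reward):
        """Update with one observed reward for the chosen arm."""
        self.precision += np.outer(self.w[arm], self.w[arm]) / self.reward_var
        self.shift += self.w[arm] * (reward - self.b[arm]) / self.reward_var

    def mean(self):
        return np.linalg.solve(self.precision, self.shift)

    def sample(self, rng):
        cov = np.linalg.inv(self.precision)
        return rng.multivariate_normal(self.mean(), cov)

    def greedy_arm(self, z):
        """Arm with the highest predicted reward for latent state z."""
        return int(np.argmax(self.w @ z + self.b))

def choose_arm(posterior, method, rng=None):
    """'greedy' acts on the posterior mean; 'ts' acts on a posterior sample."""
    z = posterior.sample(rng) if method == "ts" else posterior.mean()
    return posterior.greedy_arm(z)
```

In this sketch, a CPG-style policy calls only `observe_context` and chooses greedily; with a large `prior_var` the posterior mean is then close to the plain average of the representations. An FPG-style policy also calls `observe_reward`, and an FPG-TS-style policy does both while choosing with `method="ts"`, so the sampled latent state induces sampled reward means.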

Results

We show results in synthetic and semi-synthetic Alzheimer's disease treatment environments. Identifiable latent bandits converge faster than fully online bandits and avoid much of the bias seen in direct regression baselines when the identifying assumptions hold. The experiments also map the limitations: as latent-state noise, context noise, out-of-distribution shift, or the number of arms increases, the tradeoff between fast offline transfer and unbiased online exploration becomes more visible.

Cumulative regret results for the synthetic and semi-synthetic ADCB environments. FPG and CPG algorithms approach oracle behavior and improve over online-only and regression baselines.
Cumulative regret results for an increasing number of arms. Our algorithms incur less regret than a multi-armed bandit (MAB) baseline as the number of treatments increases.
Context emission noise cumulative regret results.
Increasing context noise stresses the learned latent variable model and highlights the benefit of adaptive variants.
Out-of-distribution cumulative regret results.
Cumulative regret results for out-of-distribution instances. Our models generalize better than regression baselines.

BibTeX

@article{balcioglu2026identifiable,
  title   = {{Identifiable Latent Bandits}: Leveraging observational data for personalized decision-making},
  author  = {Balc{\i}o{\u{g}}lu, Ahmet Zahid and Mwai, Newton and Carlsson, Emil and Johansson, Fredrik D.},
  journal = {Transactions on Machine Learning Research},
  year    = {2026},
  url     = {https://openreview.net/forum?id=SvkZ76wKpu}
}