THE METHOD, IN PLAIN TERMS

How Komachi AI turns a health history into a survival-optimal alcohol recommendation.

This page is for the curious reader who wants to look under the hood. It walks through the three ideas the system is built on (survival analysis, offline reinforcement learning, and targeted causal estimation) with the actual math and a few diagrams. No clinical background required, just a willingness to follow a curve downhill.

01 · Survival analysis. Turn a life trajectory into a hazard at each wave, and a hazard into a survival curve we can compare across what-if alcohol policies.

02 · Reinforcement learning. Convert survival into a dense reward, then learn the alcohol policy that maximizes it while staying inside what the data can support.

03 · Deep LTMLE. Estimate the counterfactual survival of any policy from observational data, with a doubly-robust correction and an independent cross-check.

THE SETUP

A person is a trajectory over time

Everything starts from the same object: one subject, observed across four waves spanning two decades. At each wave we record their state and the alcohol action they took. The question is always counterfactual: what would have happened under a different drinking policy?

One subject, four waves. Baseline context W is fixed; the state L_t and alcohol action A_t update each wave. The model only ever looks at the past.

W

Baseline covariates. Fixed-at-entry context: age, sex, race, plus the prior-disease history block (prior MI, stroke, heart failure, diabetes) that drives sick-quitter confounding.

L_t

Time-varying state. What changes wave to wave: smoking, blood pressure, weight, diabetes status, self-rated health, and prior cardiovascular events.

A_t

Alcohol action. A composite discrete action: abstain / light / moderate / heavy (K = 4 by default), derived from drink-or-not and an aligned dose bin.

Y_t

Outcome (death). An absorbing all-cause mortality indicator. Once it flips to 1 the subject leaves the risk set and stays there.

H_t

History. Everything observed up to wave t: H_t = (W, L_{1:t}, A_{1:t−1}). The model only ever conditions on the past.
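The objects above can be sketched as a small data structure. This is a minimal illustration, not the project's actual schema: the class name `Trajectory`, the field names, and the example covariates (`age`, `prior_mi`, `sbp`) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# K = 4 composite actions, in the order given in the text.
ACTIONS = ("abstain", "light", "moderate", "heavy")


@dataclass
class Trajectory:
    """One subject across waves (hypothetical layout, for illustration only)."""
    W: Dict[str, float]        # baseline covariates, fixed at entry
    L: List[Dict[str, float]]  # L[t-1] is the time-varying state at wave t
    A: List[int]               # A[t-1] indexes ACTIONS at wave t
    Y: List[int]               # Y[t-1] is 1 once the subject has died (absorbing)

    def history(self, t: int) -> Tuple[dict, list, list]:
        """H_t = (W, L_{1:t}, A_{1:t-1}): states through wave t, actions before it."""
        if not 1 <= t <= len(self.L):
            raise ValueError("wave out of range")
        return self.W, self.L[:t], self.A[:t - 1]

    def at_risk(self, t: int) -> bool:
        """A subject is in the risk set at wave t if still alive after wave t-1."""
        return t == 1 or self.Y[t - 2] == 0


# Toy subject with made-up values.
subject = Trajectory(
    W={"age": 52.0, "prior_mi": 0.0},
    L=[{"sbp": 130.0}, {"sbp": 134.0}, {"sbp": 141.0}, {"sbp": 150.0}],
    A=[1, 1, 2, 2],
    Y=[0, 0, 1, 1],
)
W, L_hist, A_hist = subject.history(3)
```

Here `history(3)` returns three states but only two actions, which is the "only ever looks at the past" rule in code: the action at wave 3 is what the model is about to choose, so it is not part of H_3.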

01 · SURVIVAL ANALYSIS

From hazard to a survival curve

Survival analysis answers a deceptively simple question: given that you are alive now, how likely are you to make it to the next wave, and the one after that? The building block is the hazard, the per-wave risk of dying.

Because our cohort is observed at discrete waves, we use the discrete-time hazard. It is the probability of dying in the interval ending at wave t, conditional on having survived to its start and on the state S_t (a summary of the history H_t) and the action A_t at that wave:

$$\lambda(t \mid S_t, A_t) = \Pr\bigl( Y_t = 1 \mid Y_{t-1} = 0,\ S_t,\ A_t \bigr)$$

(1)
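As a rough illustration of equation (1), the simplest possible hazard estimate ignores state and action altogether and divides deaths at each wave by the number still at risk. This is a life-table sketch, not the fitted hazard model described later; the function name, the toy data, and the assumption of no censoring are all introduced here for the example.

```python
from typing import List


def empirical_hazard(death_wave: List[int], n_waves: int) -> List[float]:
    """Marginal discrete-time hazard per wave.

    death_wave[i] is the wave at which subject i died, or 0 if the subject
    survived every wave. Assumes no loss to follow-up (an assumption of this
    sketch, not a property of the cohort).
    """
    hazards = []
    at_risk = len(death_wave)
    for t in range(1, n_waves + 1):
        deaths = sum(1 for d in death_wave if d == t)
        # Hazard = deaths in the interval / subjects alive at its start.
        hazards.append(deaths / at_risk if at_risk else 0.0)
        at_risk -= deaths  # the dead leave the risk set for good
    return hazards


# Ten made-up subjects: one dies at wave 1, two at wave 2, one at wave 3.
death_wave = [0, 0, 1, 0, 2, 0, 0, 3, 0, 2]
hazards = empirical_hazard(death_wave, n_waves=4)
```

Note that the denominator shrinks from wave to wave: the two deaths at wave 2 are divided by the nine subjects still alive, not the original ten. That conditioning on survival is what distinguishes a hazard from a plain death rate.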

If you survive each wave given the past, then the probability of being alive all the way through wave τ is just the product of the per-wave survival probabilities. That product is the survival function:

$$S(\tau) = \prod_{t=1}^{\tau} \bigl( 1 - \lambda(t \mid S_t, A_t) \bigr)$$

(2)

The picture below is the whole intuition. Each wave shaves a little off the survivors (the hazard, in steel), and the survival curve (navy) is the running product of what is left. A policy that lowers the hazard at any wave lifts the entire downstream curve.

Each wave's hazard λ_t (steel) removes a slice of the survivors; the survival curve S(t) (navy) is the running product. Lower the hazard anywhere and the whole tail lifts.
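The running product is short enough to write out. The sketch below uses made-up hazards (not cohort estimates) and a hypothetical helper `survival_curve` to show the claim in the caption: lowering the hazard at one wave leaves earlier waves untouched and lifts every later point on the curve.

```python
from typing import List


def survival_curve(hazards: List[float]) -> List[float]:
    """Return S(1), ..., S(tau) as the running product of (1 - hazard)."""
    curve, alive = [], 1.0
    for lam in hazards:
        alive *= 1.0 - lam  # fraction still alive after this wave
        curve.append(alive)
    return curve


# Illustrative hazards only.
baseline = survival_curve([0.05, 0.08, 0.12, 0.20])
# Same hazards, except the wave-2 hazard is halved.
lowered = survival_curve([0.05, 0.04, 0.12, 0.20])

# Gain at each wave from the single wave-2 change.
lift = [l - b for l, b in zip(lowered, baseline)]
```

With these numbers the wave-1 entry of `lift` is zero and the entries for waves 2 to 4 are all positive, because every later survival probability carries the improved wave-2 factor forward.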

The recommendation problem is counterfactual. We do not just want the survival curve a person happened to follow; we want the curve they would follow under a chosen alcohol policy g that sets each action from the history. Write that as S^g(τ), the counterfactual survival under policy g:

$$S^g(\tau) = \mathbb{E}\left[ \prod_{t=1}^{\tau} \bigl( 1 - \lambda(t \mid S_t, g(H_t)) \bigr) \right]$$

(3)

WHY IT MATTERS

Comparing S^g(τ) across policies (always-abstain, observed behavior, the learned policy) is exactly the comparison a clinician cares about: which drinking strategy leaves the most people alive at the end of follow-up. Estimating it honestly from observational data is what pillar 3 is for.
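To make the policy comparison concrete, here is a naive plug-in version of equation (3): average the survival product over subjects, with each action set by the policy. Everything in it is an assumption for illustration. The hazard function `toy_hazard` and its coefficients are invented and carry no clinical meaning, each subject's state is reduced to a single fixed risk score, and the state does not respond to earlier actions. It is exactly the kind of naive plug-in that pillar 3 is meant to replace.

```python
import math
from typing import Callable, Sequence


def toy_hazard(t: int, risk: float, action: int) -> float:
    """Hypothetical hazard: a logistic function of wave, risk score, and dose.

    The coefficients are arbitrary and are not fitted to any data.
    """
    logit = -3.0 + 0.3 * t + 1.0 * risk + 0.2 * action
    return 1.0 / (1.0 + math.exp(-logit))


def policy_survival(
    risks: Sequence[float],
    policy: Callable[[int, float], int],
    n_waves: int = 4,
) -> float:
    """Plug-in S^g(tau): mean over subjects of the per-wave survival product."""
    total = 0.0
    for risk in risks:
        alive = 1.0
        for t in range(1, n_waves + 1):
            alive *= 1.0 - toy_hazard(t, risk, policy(t, risk))
        total += alive
    return total / len(risks)


def always_abstain(t: int, risk: float) -> int:
    return 0  # action index 0 = abstain


def always_heavy(t: int, risk: float) -> int:
    return 3  # action index 3 = heavy


risks = [0.0, 0.5, 1.0, 1.5]  # made-up risk scores
s_abstain = policy_survival(risks, always_abstain)
s_heavy = policy_survival(risks, always_heavy)
```

Because the toy hazard was written to increase with dose, `s_abstain` comes out above `s_heavy` by construction. The point of the sketch is the shape of the computation, not the ranking.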

02 · REINFORCEMENT LEARNING

Survival as a reward to maximize

Reinforcement learning needs a reward. The key move in this project is that survival itself is the reward. Take the log of the survival function and the product becomes a sum, one term per wave, which is exactly the shape RL likes.

Concretely, taking the log of equation (2) gives one term log(1 − λ) per wave, and that term is our dense, survival-derived reward:

$$r(S_t, A_t) = \log\bigl( 1 - \lambda(t \mid S_t, A_t) \bigr)$$

(4)

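$$\log S^g(\tau) \;\ge\; \mathbb{E}\left[ \sum_{t=1}^{\tau} r(S_t, A_t) \right]$$

(5)

The expected cumulative reward lower-bounds log counterfactual survival (5), with the expectation taken over trajectories under g as in (3). This is Jensen's inequality: S^g(τ) is an expectation of a product, and the log of an expectation is at least the expectation of the log, which by (4) is the expected sum of rewards. So a policy that earns more reward provably pushes up a lower bound on the probability of surviving to wave τ. That is the bridge from "maximize reward" to "keep people alive."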
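A quick numerical check of the two facts used here, on made-up hazards. For a single trajectory the sum of rewards equals the log of its survival product exactly. Across subjects, reading the right-hand side of (5) as an average over trajectories, the log of the mean survival is at least the mean return, which is Jensen's inequality.

```python
import math
from typing import List


def rewards(hazards: List[float]) -> List[float]:
    """Per-wave reward r = log(1 - hazard), as in equation (4)."""
    return [math.log(1.0 - lam) for lam in hazards]


# Two illustrative subjects, one low-risk and one high-risk.
trajectories = [
    [0.02, 0.03, 0.05, 0.08],
    [0.10, 0.15, 0.20, 0.30],
]

per_subject_survival = [math.prod(1.0 - lam for lam in h) for h in trajectories]
per_subject_return = [sum(rewards(h)) for h in trajectories]

n = len(trajectories)
# Left side of (5): log of the population-average survival.
log_mean_survival = math.log(sum(per_subject_survival) / n)
# Right side of (5): average cumulative reward.
mean_return = sum(per_subject_return) / n
```

The gap between `log_mean_survival` and `mean_return` grows with how different the subjects are. If every subject had the same survival probability, the bound would hold with equality.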

The system is learned in two stages. Stage 1 builds the world model that produces the reward; Stage 2 learns the policy that maximizes it.

STAGE 1 · world model

History Hₜ = (W, L, A) → Encoder Sₜ = f_θ(Hₜ) → Hazard λ (reward log(1 − λ)) and Behavior π_β (what people did)

↓ reward + π_β

STAGE 2 · policy

SAC + BC (maximize reward − α·KL) → Policy π (action Aₜ, one of K)

Stage 1 learns the state encoder, the hazard (hence the reward), and the behavior policy. Stage 2 uses both to learn the optimal alcohol policy, anchored by the KL penalty.

Stage 1, the state encoder and reward

A Transformer reads the raw history H_t and compresses it into a state representation S_t = f_θ(H_t). On top of that state we fit two heads: the hazard λ that gives the reward, and a behavior policy π_β(A_t | S_t) that captures what people actually did. The behavior policy is what keeps Stage 2 honest.

Stage 2, KL-regularized SAC+BC

Now learn the optimal policy π with Soft Actor-Critic plus Behavior Cloning. The objective has two pieces. The first maximizes expected cumulative reward (survival). The second is a KL penalty that pulls π toward the behavior policy π_β, so that the policy is discouraged from wandering into actions the data cannot speak to:

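$$J(\pi) = \mathbb{E}_{p^{\pi}}\left[ \sum_{t} r(S_t, A_t) \right] - \alpha \cdot \mathbb{E}\left[ \sum_{t} D_{\mathrm{KL}}\bigl( \pi(\cdot \mid S_t) \,\|\, \pi_\beta(\cdot \mid S_t) \bigr) \right]$$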

(6)

α, the offline-credibility knob

α → 0: Chase reward freely. The policy can recommend actions barely seen in the data, so its survival gain may be a mirage.

α → ∞: Defer to behavior. KL collapses to zero, the policy reverts to π_β, and survival reverts to the observed level.

We sweep α and report the whole tradeoff rather than cherry-picking a single point. This is the honest way to present an offline policy.
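The two ends of the sweep can be seen in a one-state simplification. For a single state with discrete actions, maximizing expected value minus α times the KL to π_β has the closed-form solution π(a) ∝ π_β(a) · exp(q(a) / α). This is not the SAC+BC training loop itself, only the single-step optimum it is regularized toward; the action values `q_values` and behavior probabilities `pi_beta` below are invented, with the last action deliberately rare in the data but rated highest by the hypothetical critic.

```python
import math
from typing import List


def kl_regularized_policy(q: List[float], pi_beta: List[float], alpha: float) -> List[float]:
    """Maximizer of sum_a pi(a) q(a) - alpha * KL(pi || pi_beta) over one state."""
    m = max(q)  # subtract the max before exponentiating, for numerical stability
    weights = [p * math.exp((v - m) / alpha) for p, v in zip(pi_beta, q)]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) for discrete distributions; zero-probability terms contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


q_values = [-0.10, -0.09, -0.12, -0.05]  # hypothetical per-action values
pi_beta = [0.30, 0.45, 0.24, 0.01]       # hypothetical behavior policy

sweep = {}
for alpha in (0.001, 0.05, 100.0):
    pi = kl_regularized_policy(q_values, pi_beta, alpha)
    sweep[alpha] = (pi, kl(pi, pi_beta))
```

At the smallest α the policy puts almost all its mass on the rare action, which is the "mirage" case: the recommendation rests on 1% of the data. At the largest α the policy is nearly identical to `pi_beta` and the KL is close to zero.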

THE HONEST FINDING

On this cohort the learned policy does not beat always-abstain or behavior on S^g(τ). That is the signature of a confounded reward: baseline abstainers carry far more prior disease than drinkers (the sick-quitter effect), so naive estimates make abstention look harmful. Pillar 3, plus the baseline disease-history adjustment, is how we diagnose and correct it.

03 · DEEP LTMLE

Estimating survival without fooling yourself

We have a policy and a target, S^g(τ). But the data is observational: who drank what was not randomized. Naively plugging a model into equation (3) inherits every bias in that model. Longitudinal Targeted Maximum Likelihood Estimation fixes this.

LTMLE is a doubly-robust, plug-in estimator. It combines two nuisance models, an outcome model Q (how state and action map to survival) and a propensity model g (how likely each action was given the past), and is consistent if either one is right. It works by sequential regression from the last wave backward, with a small targeting correction at each step.

1. Observed data: W, Lₜ, Aₜ, Yₜ
2. Initial fit Q⁰: outcome model, last wave back
3. Targeting: fluctuate by clever covariate Hₜ(g)
4. Updated Q★: solves the influence equation
5. Plug-in Sᵍ(τ): with valid 95% CI

LTMLE fits an outcome model, then applies one targeting fluctuation driven by the clever covariate (the inverse-propensity weight). The updated model plugs straight into the survival product.

The clever covariate

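The targeting step is the heart of it. After an initial outcome fit Q⁰, LTMLE runs one tiny fluctuation regression per backward step whose only covariate is the inverse-propensity weight, the clever covariate H_t(g) (a weight, not the history H_t). In it, g(H_t) is the action the policy would choose and g(A_s | H_s) is the fitted propensity of the observed action:

$$H_t(g) = \frac{\mathbb{1}\bigl[ A_t = g(H_t) \bigr]}{\prod_{s=1}^{t} g(A_s \mid H_s)}$$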

(7)

Fitting Q against this covariate nudges the estimate just enough to solve the efficient influence-curve equation. The payoff: the final survival estimate is not only doubly robust but asymptotically efficient, with valid confidence intervals. The updated outcome model Q^* is then plugged back into the survival product to read off S^g(τ).
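Reading equation (7) as an indicator divided by a product of propensities, the clever covariate for one subject can be computed as below. The function name and the toy numbers are assumptions for illustration. The indicator follows the equation as written on this page (match at wave t only); many LTMLE presentations instead require the observed actions to match the policy at every wave up to t, so treat that detail as something to confirm against the method paper.

```python
from typing import List


def clever_covariate(
    actions: List[int],
    policy_actions: List[int],
    propensities: List[float],
    t: int,
) -> float:
    """H_t(g): 1[A_t = g(H_t)] / prod_{s<=t} g(A_s | H_s) for one subject.

    actions[s-1]        observed action at wave s
    policy_actions[s-1] action the policy g would have chosen at wave s
    propensities[s-1]   fitted probability of the observed action at wave s
    """
    indicator = 1.0 if actions[t - 1] == policy_actions[t - 1] else 0.0
    denom = 1.0
    for s in range(t):
        if propensities[s] <= 0.0:
            raise ValueError("propensity must be positive")
        denom *= propensities[s]
    return indicator / denom


# Made-up subject: deviates from the policy at wave 3 only.
actions = [1, 1, 2, 2]
policy_actions = [1, 1, 1, 2]
propensities = [0.5, 0.4, 0.25, 0.5]

h2 = clever_covariate(actions, policy_actions, propensities, t=2)
h3 = clever_covariate(actions, policy_actions, propensities, t=3)
h4 = clever_covariate(actions, policy_actions, propensities, t=4)
```

The weight grows quickly as propensities multiply: by wave 4 the denominator here is 0.025. Small fitted propensities therefore produce very large covariates, which is one reason the propensity model matters as much as the outcome model.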

The Deep in Deep LTMLE is that the Q and g nuisances are neural networks (a Transformer over the history), so the estimator can absorb high-dimensional, long-range trajectory structure that a generalized linear model would miss. It also handles the K-way composite action directly, rather than collapsing to a binary drink / abstain split.

The independent cross-check

Because a home-grown estimator can hide bugs, every headline number is cross-checked against the established CRAN ltmle package with a SuperLearner ensemble (SL.lm, SL.glm, SL.xgboost). It needs binary treatment nodes, so the composite action is decomposed into two bits per wave, drink-or-not and high-or-low dose, with the impossible "abstain and high-dose" cell ruled out deterministically. When the two estimators agree, we trust the signal.

Q: Outcome model. Predicts survival from state and action. A Transformer over the history.

g: Propensity model. Predicts which action was taken, given the past. Forms the clever covariate.

→ Sᵍ: Doubly robust. Consistent if either Q or g is right. Valid confidence intervals.
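The confidence intervals come from the estimator's influence curve: the standard error is the standard deviation of the per-subject influence-curve values divided by √n, and the interval is the usual Wald form. The sketch below shows only that last arithmetic step, with an invented point estimate and invented influence-curve values; computing the real values requires the fitted Q and g and is not shown.

```python
import math
from statistics import stdev
from typing import List, Tuple


def wald_ci(estimate: float, ic_values: List[float], z: float = 1.96) -> Tuple[float, float]:
    """Wald interval: estimate +/- z * sd(IC) / sqrt(n)."""
    n = len(ic_values)
    se = stdev(ic_values) / math.sqrt(n)
    return estimate - z * se, estimate + z * se


# Hypothetical inputs: a survival estimate and mean-zero influence-curve values.
estimate = 0.62
ic_values = [0.20, -0.10, 0.05, -0.15, 0.10, -0.10]

lower, upper = wald_ci(estimate, ic_values)
```

With only six toy values the interval is very wide. On a cohort of thousands the same formula shrinks the half-width in proportion to 1/√n.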

HOW THE PIECES FIT

One loop, three ideas

Survival analysis defines the target, RL learns a policy that maximizes it, and Deep LTMLE measures whether the policy actually delivers, all on the same longitudinal cohort. The LLM layer then translates the policy action into a grounded, judged recommendation.

1. Survival: Hazard λ and counterfactual survival Sᵍ(τ) define what 'better' means.
2. Reward: log(1 − λ) turns survival into a dense per-wave reward.
3. Policy: SAC+BC learns the alcohol action that maximizes reward, anchored by KL.
4. Estimate: Deep LTMLE measures Sᵍ(τ), cross-checked against CRAN ltmle.

Going deeper

Method paper. Shirakawa et al., Survival Policy Learning as Inference. Submitted to NeurIPS 2026. The Stage-1 reward and Stage-2 SAC+BC formulation.
Deep LTMLE. Shirakawa et al., Longitudinal Targeted Minimum Loss-based Estimation with Temporal-Difference Heterogeneous Transformer (ICML 2024). The neural estimator behind the counterfactual-survival numbers.
Cohort. NHANES I Epidemiologic Follow-up Study (NHEFS): 14,407 subjects, four waves, 1971 to 1992.
Cross-check. van der Laan and Gruber on Longitudinal TMLE; the CRAN ltmle package for the independent estimate.

Research and clinical-decision-support tool, not deployed medical advice. Every output is a research artifact for evaluation, not an instruction for patient care.