How Komachi AI turns a health history into a survival-optimal alcohol recommendation.
This page is for the curious reader who wants to look under the hood. It walks through the three ideas the system is built on (survival analysis, offline reinforcement learning, and targeted causal estimation) with the actual math and a few diagrams. No clinical background required, just a willingness to follow a curve downhill.
A person is a trajectory over time
Everything starts from the same object: one subject, observed across four waves spanning two decades. At each wave we record their state and the alcohol action they took. The question is always counterfactual: what would have happened under a different drinking policy?
From hazard to a survival curve
Survival analysis answers a deceptively simple question: given that you are alive now, how likely are you to make it to the next wave, and the one after that? The building block is the hazard, the per-wave risk of dying.
Because our cohort is observed at discrete waves, we use the discrete-time hazard. It is the probability of dying in the interval ending at wave t, conditional on having survived to its start and on the state St and action At at that wave:
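Written out, the definition above takes the following form. The event-time symbol T (the wave by which death is first recorded) is an assumption introduced here for illustration; the text itself only names the hazard λ, the state St, and the action At.

```latex
\lambda_t(S_t, A_t) \;=\; P\bigl(T = t \,\bigm|\, T \ge t,\; S_t,\; A_t\bigr) \tag{1}
```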
If you survive each wave given the past, then the probability of being alive all the way through wave τ is just the product of the per-wave survival probabilities. That product is the survival function:
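In symbols, and assuming waves are indexed t = 1, ..., τ (the indexing is an assumption), the product described above is:

```latex
S(\tau) \;=\; \prod_{t=1}^{\tau} \bigl(1 - \lambda_t(S_t, A_t)\bigr) \tag{2}
```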
The picture below is the whole intuition. Each wave shaves a little off the survivors (the hazard, in steel), and the survival curve (navy) is the running product of what is left. A policy that lowers the hazard at any wave lifts the entire downstream curve.
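The running product can be sketched numerically in a few lines. The hazard values below are made up for illustration and are not cohort estimates; `survival_curve` is a hypothetical helper name.

```python
import numpy as np

def survival_curve(hazards):
    """Running product of per-wave survival probabilities (1 - hazard)."""
    hazards = np.asarray(hazards, dtype=float)
    return np.cumprod(1.0 - hazards)

# Hypothetical per-wave hazards for four waves (illustrative numbers only).
hazards = np.array([0.02, 0.05, 0.08, 0.12])
baseline = survival_curve(hazards)

# Lower the hazard at the second wave only, as a better policy might.
lowered = hazards.copy()
lowered[1] = 0.02
improved = survival_curve(lowered)

print("baseline:", baseline.round(4))
print("improved:", improved.round(4))
```

The first wave is untouched, while every point from the second wave onward sits higher, which is the "lifts the entire downstream curve" effect.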
The recommendation problem is counterfactual. We do not just want the survival curve a person happened to follow; we want the curve they would follow under a chosen alcohol policy g that sets each action from the history. Write that as Sg(τ), the counterfactual survival under policy g.
Comparing Sg(τ) across policies (always-abstain, observed behavior, the learned policy) is exactly the comparison a clinician cares about: which drinking strategy leaves the most people alive at the end of follow-up. Estimating it honestly from observational data is what pillar 3 is for.
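One common way to write this target, sketched here under the usual identification assumptions of consistency, sequential exchangeability, and positivity (none of which are spelled out in the text), is as an expectation over trajectories whose actions are set by g rather than by observed behavior. The notation T^g for the counterfactual death time and E_g for that expectation is assumed for illustration.

```latex
S^{g}(\tau) \;=\; P\bigl(T^{g} > \tau\bigr)
\;=\; \mathbb{E}_{g}\!\left[\,\prod_{t=1}^{\tau} \bigl(1 - \lambda_t(S_t, A_t)\bigr)\right] \tag{3}
```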
Survival as a reward to maximize
Reinforcement learning needs a reward. The key move in this project is that survival itself is the reward. Take the log of the survival function and the product becomes a sum, one term per wave, which is exactly the shape RL likes.
Taking the log of equation (2) turns the product into a sum. Each wave contributes a term log(1 − λ), and that term is our dense, survival-derived reward.
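As a worked equation, with r_t as an assumed symbol for the per-wave reward, the log of the survival product splits into one term per wave:

```latex
r_t \;=\; \log\bigl(1 - \lambda_t(S_t, A_t)\bigr),
\qquad
\log S(\tau) \;=\; \sum_{t=1}^{\tau} \log\bigl(1 - \lambda_t(S_t, A_t)\bigr) \;=\; \sum_{t=1}^{\tau} r_t \tag{4}
```

Because each hazard lies in [0, 1), every r_t is at most zero, so the reward penalizes risk at every wave rather than only at the end of follow-up.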
The cumulative reward lower-bounds log counterfactual survival (5). So a policy that earns more reward provably pushes up a lower bound on the probability of being alive at wave τ. That is the bridge from "maximize reward" to "keep people alive."
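One plausible reading of the bound is Jensen's inequality: log is concave, so the log of an average survival probability is at least the average of the log survival probabilities. The sketch below checks that numerically on simulated hazards. The hazard range, sample size, and variable names are made up for illustration and are unrelated to the cohort.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-wave hazards: 1000 simulated subjects, 4 waves.
hazards = rng.uniform(0.01, 0.30, size=(1000, 4))

# Per-wave reward r_t = log(1 - hazard), summed over waves per subject.
rewards = np.log1p(-hazards)
cum_reward = rewards.sum(axis=1)

# Average cumulative reward across subjects.
mean_cum_reward = cum_reward.mean()

# Log of the average probability of surviving all four waves.
log_mean_survival = np.log(np.exp(cum_reward).mean())

print(f"mean cumulative reward: {mean_cum_reward:.4f}")
print(f"log mean survival     : {log_mean_survival:.4f}")
```

The first number never exceeds the second, and the gap closes as the hazards vary less across subjects.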
The system is learned in two stages. Stage 1 builds the world model that produces the reward; Stage 2 learns the policy that maximizes it.
Stage 1, the state encoder and reward
A Transformer reads the raw history Ht and compresses it into a state representation St = fθ(Ht). On top of that state we fit two heads: the hazard λ that gives the reward, and a behavior policy πβ(At | St) that captures what people actually did. The behavior policy is what keeps Stage 2 honest.
Stage 2, KL-regularized SAC+BC
Now learn the optimal policy π with Soft Actor-Critic plus Behavior Cloning. The objective has two pieces. The first maximizes expected cumulative reward (survival). The second is a KL penalty that pulls π toward the behavior policy πβ, so the policy never wanders into actions the data cannot speak to:
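A sketch of such an objective, with several assumptions: r_t denotes the per-wave reward log(1 − λ), α is the weight on the penalty, the KL divergence is taken from π to πβ, and any discounting or SAC entropy bonus is omitted. The exact form in the method paper may differ.

```latex
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=1}^{\tau} r_t\right]
\;-\; \alpha\, \mathbb{E}_{\pi}\!\left[\,\sum_{t=1}^{\tau}
\mathrm{KL}\bigl(\pi(\cdot \mid S_t)\,\big\|\,\pi_{\beta}(\cdot \mid S_t)\bigr)\right]
```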
We sweep α and report the whole tradeoff rather than cherry-picking a single point. This is the honest way to present an offline policy.
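To see what a sweep over α trades off, consider a toy problem with a single state and three actions. There the KL-regularized objective has a known closed-form maximizer, π(a) ∝ πβ(a) · exp(q(a) / α). This is only an illustration of the tradeoff, not the SAC+BC training loop; the action values, behavior probabilities, and function names are all hypothetical.

```python
import numpy as np

def kl_regularized_policy(q_values, behavior_probs, alpha):
    """Maximizer of sum_a pi(a) q(a) - alpha * KL(pi || pi_beta) for one state."""
    logits = np.log(behavior_probs) + q_values / alpha
    logits = logits - logits.max()          # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum()

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, skipping zero-mass entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical action values (expected cumulative reward) and behavior policy.
q_values = np.array([-0.10, -0.05, -0.20])
behavior_probs = np.array([0.2, 0.5, 0.3])

alphas = [0.01, 0.1, 1.0, 10.0]
policies, kls, values = [], [], []
for alpha in alphas:
    pi = kl_regularized_policy(q_values, behavior_probs, alpha)
    policies.append(pi)
    kls.append(kl_divergence(pi, behavior_probs))
    values.append(float(pi @ q_values))
    print(f"alpha={alpha:<5} pi={pi.round(3)} value={values[-1]:.4f} KL={kls[-1]:.4f}")
```

Small α gives a nearly greedy policy far from observed behavior; large α returns something close to πβ. Reporting the whole curve shows how much of any apparent gain depends on leaving the data's support.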
On this cohort the learned policy does not beat always-abstain or observed behavior on Sg(τ). That is the signature of a confounded reward: baseline abstainers carry far more prior disease than drinkers (the sick-quitter effect), so naive estimates make abstention look harmful. Pillar 3, plus the baseline disease-history adjustment, is how we diagnose and correct it.
Estimating survival without fooling yourself
We have a policy and a target, Sg(τ). But the data is observational: who drank what was not randomized. Naively plugging a model into equation (3) inherits every bias in that model. Longitudinal Targeted Maximum Likelihood Estimation fixes this.
LTMLE is a doubly-robust, plug-in estimator. It combines two nuisance models, an outcome model Q (how state and action map to survival) and a propensity model g (how likely each action was given the past), and is consistent if either one is right. It works by sequential regression from the last wave backward, with a small targeting correction at each step.
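The backward sequential regression can be illustrated on simulated two-wave data. This minimal sketch makes loud simplifying assumptions: the outcome is continuous rather than a survival indicator, the regressions are ordinary least squares rather than neural networks, the policy is "always action 0", and the targeting correction is left out entirely. All coefficients and names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simulated data: covariate L and binary action A at two waves, then outcome Y.
# Sicker (lower L) subjects are less likely to take action 1, so A is confounded.
L0 = rng.normal(size=n)
A0 = rng.binomial(1, expit(L0))
L1 = 0.5 * L0 + 0.3 * A0 + rng.normal(size=n)
A1 = rng.binomial(1, expit(L1))
Y = 1.0 + L0 + L1 - 0.5 * A0 - 0.7 * A1 + rng.normal(size=n)

def fit_predict(X_fit, y, X_new):
    """Least-squares fit on X_fit, then predictions at X_new."""
    coef, *_ = np.linalg.lstsq(X_fit, y, rcond=None)
    return X_new @ coef

ones, zeros = np.ones(n), np.zeros(n)

# Last wave first: regress Y on the full history, predict with A1 set by the policy.
X1_obs = np.column_stack([ones, L0, A0, L1, A1])
X1_pol = np.column_stack([ones, L0, A0, L1, zeros])
Q1 = fit_predict(X1_obs, Y, X1_pol)

# Step back one wave: regress that prediction on earlier history, set A0 by the policy.
X0_obs = np.column_stack([ones, L0, A0])
X0_pol = np.column_stack([ones, L0, zeros])
Q0 = fit_predict(X0_obs, Q1, X0_pol)

estimate = Q0.mean()                       # plug-in estimate of E[Y under policy]
naive = Y[(A0 == 0) & (A1 == 0)].mean()    # unadjusted mean among policy followers
truth = 1.0                                # known only because the data are simulated

print(f"sequential regression: {estimate:.3f}, naive: {naive:.3f}, truth: {truth:.3f}")
```

In this simulation the naive mean among followers is pulled well below the truth by confounding, while the backward regression recovers it. LTMLE adds a targeting update after each regression step.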
The clever covariate
The targeting step is the heart of it. After an initial outcome fit Q0, LTMLE runs one tiny fluctuation regression whose only covariate is the inverse-propensity weight, the clever covariate:
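For a deterministic rule, one standard form of this weight is sketched below. To avoid the clash between the policy g and the propensity model g in the text, the rule is written d_s and the propensity g_s; these subscripted symbols, the name C_t for the covariate, and the use of the history H_s as the conditioning set are notational assumptions.

```latex
C_t \;=\; \prod_{s=1}^{t} \frac{\mathbf{1}\{A_s = d_s(H_s)\}}{g_s(A_s \mid H_s)}
```

The covariate is zero for anyone who departed from the rule by wave t, and grows large for followers whose actions were unlikely under observed behavior.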
Fitting Q against this covariate nudges the estimate just enough to solve the efficient influence-curve equation. The payoff: the final survival estimate is not only doubly robust but asymptotically efficient, with valid confidence intervals. The updated outcome model Q* is then plugged back into the survival product to read off Sg(τ).
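A single-wave sketch of the targeting step follows, under stated assumptions: one binary action, the rule "always a = 1", a known propensity, and a deliberately misspecified initial fit. The fluctuation parameter is found by bisection on the score equation purely for transparency; a real implementation would more likely run a logistic regression with an offset. All data and names are simulated or hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_epsilon(y, q_init, clever, lo=-10.0, hi=10.0, n_iter=200):
    """Solve sum(clever * (y - expit(logit(q_init) + eps * clever))) = 0 for eps.

    The score is monotone decreasing in eps, so bisection is enough here.
    """
    offset = logit(q_init)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        score = np.sum(clever * (y - expit(offset + mid * clever)))
        if score > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
g1 = expit(0.5 * x)                              # P(a = 1 | x), treated as known
a = rng.binomial(1, g1)
y = rng.binomial(1, expit(-1.0 + x + 0.8 * a))   # 1 = survived the wave

q_init = expit(-1.0 + 0.5 * x)                   # misspecified initial outcome fit
clever = a / g1                                  # inverse-propensity weight for the rule

eps = fit_epsilon(y, q_init, clever)
q_star_obs = expit(logit(q_init) + eps * clever)  # updated fit at observed actions
q_star_pol = expit(logit(q_init) + eps / g1)      # updated fit with a = 1 for everyone

naive = q_init.mean()
targeted = q_star_pol.mean()
truth = expit(-1.0 + x + 0.8).mean()             # available only in simulation

print(f"naive={naive:.3f} targeted={targeted:.3f} truth={truth:.3f} eps={eps:.3f}")
```

After the update, the weighted residuals average to zero, which is the score equation the fluctuation is designed to solve. In this simulation the targeted plug-in lands much closer to the truth than the initial fit does.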
The "Deep" in Deep LTMLE is that the Q and g nuisances are neural networks (a Transformer over the history), so the estimator can absorb high-dimensional, long-range trajectory structure that a generalized linear model would miss. It also handles the K-way composite action directly, rather than collapsing to a binary drink / abstain split.
The independent cross-check
Because a home-grown estimator can hide bugs, every headline number is cross-checked against the established CRAN ltmle package with a SuperLearner ensemble (SL.lm, SL.glm, SL.xgboost). The package needs binary treatment nodes, so the composite action is decomposed into two bits per wave, drink-or-not and high-or-low dose, with the impossible "abstain and high-dose" cell ruled out deterministically. When the two estimators agree, we trust the signal.
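The two-bit decomposition can be sketched as a pair of encode and decode helpers. The three action labels below are an assumption (two bits with one cell ruled out leave three valid actions); the function names are hypothetical and are not part of the ltmle package.

```python
# Assumed action labels; the source does not name the levels of the composite action.
ACTIONS = ("abstain", "low_dose", "high_dose")

def to_binary_nodes(action):
    """Map a composite action to (drink, high_dose) bits."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    drink = int(action != "abstain")
    high = int(action == "high_dose")
    return drink, high

def from_binary_nodes(drink, high):
    """Inverse mapping; the (0, 1) cell is impossible by construction."""
    if drink == 0 and high == 1:
        raise ValueError("abstain and high-dose cannot co-occur")
    if drink == 0:
        return "abstain"
    return "high_dose" if high == 1 else "low_dose"

for action in ACTIONS:
    bits = to_binary_nodes(action)
    print(action, "->", bits, "->", from_binary_nodes(*bits))
```

Encoding the second bit as 0 whenever the first is 0 is what lets the dose node be treated as deterministic among abstainers.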
- Outcome model Q. Predicts survival from state and action. A Transformer over the history.
- Propensity model g. Predicts which action was taken, given the past. Forms the clever covariate.
- Doubly robust estimate. Consistent if either Q or g is right. Valid confidence intervals.
One loop, three ideas
Survival analysis defines the target, RL learns a policy that maximizes it, and Deep LTMLE measures whether the policy actually delivers, all on the same longitudinal cohort. The LLM layer then translates the policy action into a grounded, judged recommendation.
- Method paper. Shirakawa et al., Survival Policy Learning as Inference. Submitted to NeurIPS 2026. The Stage-1 reward and Stage-2 SAC+BC formulation.
- Deep LTMLE. Shirakawa et al., Longitudinal Targeted Minimum Loss-based Estimation with Temporal-Difference Heterogeneous Transformer (ICML 2024). The neural estimator behind the counterfactual-survival numbers.
- Cohort. NHANES I Epidemiologic Follow-up Study (NHEFS): 14,407 subjects, four waves, 1971 to 1992.
- Cross-check. van der Laan and Gruber on Longitudinal TMLE; the CRAN ltmle package for the independent estimate.
Research and clinical-decision-support tool, not deployed medical advice. Every output is a research artifact for evaluation, not an instruction for patient care.