We propose a metric that measures a model’s ability to potentially augment medical decision-making by reducing uncertainty in specific clinical scenarios. Practically, we envision this metric being used during the early phases of model development (i.e., before calculating net benefit) for multiclass models in dynamic care environments like critical care, which are becoming increasingly common in healthcare [19,20,21,22,23].

To introduce our metric mathematically, we first contend that reducing uncertainty in medical decision-making might mirror the considerations of a partially observable Markov Decision Process (POMDP). In a POMDP framework, the clinician seeks to determine the “correct” diagnosis (in their belief state) and “optimal” treatment by predicting outcomes given a particular action taken. As such, there are two key probability distributions involved: one at the diagnosis phase where the clinician seeks to clarify the distribution of possible diagnoses, and a second at the treatment phase where the clinician seeks to clarify the distribution of future states given actions (i.e., treatments) chosen. Actionable ML should reduce the uncertainty of these distributions.

The degree of uncertainty reduction in these key distributions can be quantified on the basis of entropy. Entropy is a measurable concept from information theory that quantifies the level of uncertainty for a random variable’s possible outcomes [24]. We propose that clinicians may value entropy reduction, and our actionability metric is therefore predicated on the principle that actionability increases with ML’s ability to progressively decrease the entropy of probability distributions central to medical decision-making (Fig. 1).
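As a minimal worked illustration (the probabilities are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not specified here), consider four candidate diagnoses. A uniform distribution over the four has entropy

$$H = -4 \times 0.25 \log 0.25 = \log 4 \approx 1.386$$

whereas a more concentrated distribution (0.7, 0.1, 0.1, 0.1) has entropy

$$H = -(0.7 \log 0.7 + 3 \times 0.1 \log 0.1) \approx 0.940$$

so moving from the first distribution to the second reduces entropy by roughly 0.446 nats.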

Returning to the multiclass model that predicts the diagnosis in a critically unwell patient with fever (among a list of possible diagnoses such as infection, malignancy, heart failure, drug fever, etc.), an ML researcher might use the equation below. The equation is for illustration purposes, acknowledging that additional data are needed to determine the reasonable diagnoses in the differential diagnosis list and their baseline probabilities. This “clinician alone” model might be obtained by asking a sample of clinicians to evaluate scenarios in real-time or retrospectively to determine reasonable diagnostic possibilities and their probabilities based on available clinical data.

For each sample in a test dataset, the entropy of the output from the candidate model (i.e., the probability distribution of predicted diagnoses) is calculated and compared to the entropy of the output from the reference model, which by default is the clinician alone model but can also be another ML model. The per-sample differences are averaged across all samples to determine the net reduction in entropy (reference minus ML), as illustrated below using notation common to POMDPs:

(1) Clinician Alone Model:

$$H^s_c = -\sum_{s_t \in S} p_c(s_t|o_t)\log p_c(s_t|o_t)$$

(2) With ML Model 1:

$$H^s_{m1} = -\sum_{s_t \in S} p_{m1}(s_t|o_t)\log p_{m1}(s_t|o_t)$$

(3) With ML Model 2:

$$H^s_{m2} = -\sum_{s_t \in S} p_{m2}(s_t|o_t)\log p_{m2}(s_t|o_t)$$

Here, $$s_t \in S$$ is the patient’s underlying state (e.g., infection) at time t within a domain S corresponding to the set of all reasonable possible states (e.g., different causes of fever, including but not limited to infection), and $$o_t \in O$$ are the clinical observations (e.g., prior diagnoses and medical history, current physical exam, laboratory data, imaging data, etc.) at time t within a domain O corresponding to the set of all possible observations.

Therefore, the actionability of the candidate ML model at the diagnosis (i.e., current state) phase ($$\Delta^s$$) can be quantified as: $$\Delta^s = H^s_0 - H^s_m$$, where $$H^s_0$$ is the entropy corresponding to the reference distribution (typically the clinician alone model, corresponding to $$H^s_c$$).

In essence, the model learns the conditional distribution of the various possible underlying diagnoses given the observations (see example calculation in supplemental Fig. 1). The extent of a model’s actionability is the measurable reduction in entropy when one uses the ML model versus the reference model.
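The diagnosis-phase calculation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `entropy` and `diagnosis_actionability` are hypothetical, the probabilities are made-up values for two test samples over four candidate diagnoses, and the natural logarithm is assumed.

```python
import numpy as np


def entropy(p, axis=-1):
    """Shannon entropy (in nats) of each probability vector along `axis`."""
    p = np.asarray(p, dtype=float)
    # Convention: 0 * log(0) = 0, so zero-probability states contribute nothing.
    safe = np.where(p > 0, p, 1.0)
    return -np.sum(p * np.log(safe), axis=axis)


def diagnosis_actionability(p_ref, p_ml):
    """Mean per-sample entropy reduction, H_0 - H_m, over the test set.

    Both inputs have shape (n_samples, n_states); each row is a
    distribution p(s_t | o_t) over the candidate diagnoses.
    """
    return float(np.mean(entropy(p_ref) - entropy(p_ml)))


# Hypothetical distributions over four diagnoses for two test samples.
p_clinician = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
])
p_model = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.85, 0.05, 0.05, 0.05],
])

delta_s = diagnosis_actionability(p_clinician, p_model)
print(f"Delta^s = {delta_s:.3f} nats")
```

A positive value indicates that the candidate model's output is, on average, less uncertain than the reference; a negative value indicates the opposite.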

Continuing with the clinical example above, the clinician must then choose an action to perform, for example, which antibiotic regimen to prescribe among many reasonable regimens. Each state-action pair maps probabilistically to different potential future states, and the resulting distribution of future states has an entropy. Additional data are needed to define the relevant transition probabilities $$p^\ast(s_{t+1}|s_t, a_t)$$ (i.e., benefit:risk ratios) for each state-action pair; ideally, these can be estimated by clinicians or derived empirically from representative retrospective cohorts. Given these probabilities, an ML researcher might perform an actionability assessment of candidate multiclass models. The assessment hinges on comparing the entropies of the future state distributions with and without ML and is calculated in a similar fashion to the diagnosis phase: differences in distribution entropy (reference model minus candidate ML model) are calculated for each sample in the test dataset and then averaged. The following equation, or a variation of it, might be used to determine actionability during the treatment phase of care:

Future state probability distribution $$p(s_{t+1}|s_t)$$

(4) Without ML (e.g., clinician alone action/policy):

$$p_c(s_{t+1}|s_t) = \sum_{a_t \in A} p^\ast(s_{t+1}|s_t, a_t)\,\pi_c(a_t|s_t)$$

(5) With ML (e.g., the trained model recommended action/policy):

$$p_m(s_{t+1}|s_t) = \sum_{a_t \in A} p^\ast(s_{t+1}|s_t, a_t)\,\pi_m(a_t|s_t)$$

Here, $$s_{t+1}$$ is the desired future state (e.g., infection resolution), $$s_t$$ is the current state (e.g., fever) at time t, $$a_t \in A$$ is the action taken at time t within a domain A corresponding to a set of reasonable possible actions (i.e., different antibiotic regimens), $$\pi_c(a_t|s_t)$$ is the policy chosen by the clinician at time t (e.g., treat with antibiotic regimen A), and $$\pi_m(a_t|s_t)$$ is the policy recommended by ML at time t (e.g., treat with antibiotic regimen B).
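The marginalization over actions in the two equations above can be sketched as a weighted sum of the rows of a transition table. This is a minimal illustration under stated assumptions: the function name `future_state_distribution` is hypothetical, and the transition probabilities and policies are made-up values for three antibiotic regimens and two future states (infection resolved, infection not resolved) given a fixed current state.

```python
import numpy as np


def future_state_distribution(transition, policy):
    """Compute p(s_{t+1} | s_t) = sum_a p*(s_{t+1} | s_t, a) * pi(a | s_t).

    transition: shape (n_actions, n_states); row a is p*(. | s_t, a).
    policy:     shape (n_actions,); pi(a | s_t) for the current state.
    """
    transition = np.asarray(transition, dtype=float)
    policy = np.asarray(policy, dtype=float)
    # The vector-matrix product sums over actions, weighted by the policy.
    return policy @ transition


# Hypothetical transition probabilities for three regimens (rows) and
# two future states (columns: resolved, not resolved).
p_star = np.array([
    [0.6, 0.4],
    [0.8, 0.2],
    [0.5, 0.5],
])

pi_clinician = np.array([0.5, 0.3, 0.2])  # hypothetical clinician policy
pi_model = np.array([0.1, 0.8, 0.1])      # hypothetical ML-recommended policy

p_c = future_state_distribution(p_star, pi_clinician)
p_m = future_state_distribution(p_star, pi_model)
print(p_c, p_m)
```

Because each row of the transition table and each policy sums to one, the resulting future state distributions also sum to one.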

Entropy (H) of the future state probability distribution

Each future state probability distribution has an associated entropy. Writing $$p_0$$ for the reference distribution (typically the clinician alone distribution $$p_c$$), we illustrate these entropies as:

(6) Without ML:

$$H^a_0 = -\sum_{s_{t+1} \in S} p_0(s_{t+1}|s_t)\log p_0(s_{t+1}|s_t)$$

(7) With ML:

$$H^a_m = -\sum_{s_{t+1} \in S} p_m(s_{t+1}|s_t)\log p_m(s_{t+1}|s_t)$$

Therefore, the actionability of the candidate ML model at the action (i.e., future state) phase ($$\Delta^a$$) can be quantified as $$\Delta^a = H^a_0 - H^a_m$$, where $$H^a_0$$ is the entropy corresponding to the reference distribution (typically the clinician alone model).

The model essentially learns the conditional distribution of the future states given actions taken in the current state, and actionability is the measurable reduction in entropy when one uses the ML model versus the reference (typically clinician alone) model.
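The treatment-phase assessment can be sketched end to end for a small test set. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function names `entropy` and `action_actionability` are hypothetical, all transition probabilities and policies are made-up values, the natural logarithm is assumed, and each future state distribution is scored with its own Shannon entropy, as in the diagnosis phase.

```python
import numpy as np


def entropy(p, axis=-1):
    """Shannon entropy (in nats) of each probability vector along `axis`."""
    p = np.asarray(p, dtype=float)
    # Convention: 0 * log(0) = 0.
    safe = np.where(p > 0, p, 1.0)
    return -np.sum(p * np.log(safe), axis=axis)


def action_actionability(transition, policy_ref, policy_ml):
    """Mean per-sample entropy reduction, H_0 - H_m, at the action phase.

    transition: shape (n_samples, n_actions, n_states), p*(s' | s, a).
    policy_*:   shape (n_samples, n_actions), pi(a | s) for each sample.
    """
    # Marginalize over actions to get p(s' | s) for each sample.
    p_ref = np.einsum("na,nas->ns", policy_ref, transition)
    p_ml = np.einsum("na,nas->ns", policy_ml, transition)
    return float(np.mean(entropy(p_ref) - entropy(p_ml)))


# Hypothetical data: two samples, three regimens, two future states.
transition = np.array([
    [[0.6, 0.4], [0.8, 0.2], [0.5, 0.5]],
    [[0.7, 0.3], [0.9, 0.1], [0.5, 0.5]],
])
pi_clinician = np.array([[0.5, 0.3, 0.2], [0.4, 0.3, 0.3]])
pi_model = np.array([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1]])

delta_a = action_actionability(transition, pi_clinician, pi_model)
print(f"Delta^a = {delta_a:.3f} nats")
```

In this sketch a positive value means that the ML-recommended policy yields a less uncertain distribution over future states than the reference policy. Entropy alone does not indicate whether the more likely future state is a favorable one.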