A Condition Hidden in Plain Sight
Heart failure (HF) affects an estimated 1-2% of the global adult population and is linked to high mortality and substantial societal costs. Its symptoms (breathlessness, fatigue, and fluid retention) frequently mimic those of respiratory and other non-cardiac conditions, making accurate diagnosis difficult. Confirming HF typically demands echocardiography, invasive procedures, or specialist biomarker testing, resources that remain inaccessible in many healthcare settings.
The electrocardiogram (ECG), by contrast, is cheap, non-invasive, and available even in resource-scarce environments. Historically, however, ECGs have been of limited use in HF diagnosis, serving mainly to identify secondary causes such as atrial fibrillation (AF) rather than HF itself. A further obstacle is the absence of a reliable diagnostic ground truth.
Administrative International Classification of Diseases, 10th revision (ICD-10) diagnostic codes, the most readily available data source, are known to have low validity for HF, as patients are frequently miscoded or missed entirely. Training a supervised deep learning model on such noisy labels risks teaching it to replicate those same diagnostic errors.
To address this, the researchers introduced a pragmatic labelling strategy that cross-references ICD-10 codes with N-terminal pro-B-type natriuretic peptide (NT-proBNP), a well-established blood biomarker for HF. NT-proBNP levels below 125 ng/L effectively rule out HF, while levels above 1,000 ng/L provide strong confirmation. Together, the two cut-offs act as a two-sided filter that substantially purifies both the positive and negative training labels.
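The two-sided filter can be illustrated with a short Python sketch. It uses only the two cut-offs quoted above. The function name `pragmatic_label`, the rule that the ICD-10 code and the biomarker must agree, and the decision to drop records without an NT-proBNP value are assumptions made for illustration, not the authors' published procedure.

```python
from typing import Optional

# Cut-offs quoted in the article, in ng/L.
RULE_OUT_NG_L = 125.0
RULE_IN_NG_L = 1000.0


def pragmatic_label(has_hf_icd10_code: bool,
                    nt_probnp_ng_l: Optional[float]) -> Optional[int]:
    """Return 1 (HF), 0 (non-HF), or None (left out of training).

    Assumption: a record is kept only when the ICD-10 code and the
    biomarker point the same way; everything else is treated as
    too uncertain to use as a training label.
    """
    if nt_probnp_ng_l is None:
        # Assumption: no biomarker measurement means no usable label.
        return None
    if has_hf_icd10_code and nt_probnp_ng_l > RULE_IN_NG_L:
        return 1  # coded as HF and biomarker strongly supports it
    if not has_hf_icd10_code and nt_probnp_ng_l < RULE_OUT_NG_L:
        return 0  # not coded as HF and biomarker rules it out
    return None  # grey zone or contradictory evidence


# Hypothetical records: (has HF code, NT-proBNP in ng/L)
records = [(True, 1500.0), (False, 80.0), (True, 500.0), (False, 2000.0)]
labels = [pragmatic_label(code, value) for code, value in records]
```

Under these assumptions the third and fourth records are discarded rather than labelled, which is the price of cleaner labels: fewer training examples, but less noise in those that remain.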
Building the Model, Cleaning the Labels
The development dataset covered 25,300 patients whose ECG recordings were collected at Akershus University Hospital between 2016 and 2022. After applying the pragmatic labelling strategy, 47,034 electrocardiograms remained, of which 10,692 were labelled as HF-positive. Three deep learning architectures were evaluated using five-fold cross-validation.
The InceptionTime model consistently outperformed the other two architectures, and the final model was assembled as an ensemble of five independently trained InceptionTime networks, with predictions averaged across all five members to improve stability and generalisation.
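Averaging across ensemble members is simple to express in code. The sketch below is a minimal illustration in Python, not the study's implementation: the `Model` type and the constant stand-in models are hypothetical, and a real member would be a trained InceptionTime network returning an HF probability for one ECG.

```python
from typing import Callable, Sequence

# Hypothetical interface: a model maps one ECG signal to an HF probability.
Model = Callable[[Sequence[float]], float]


def ensemble_predict(models: Sequence[Model], ecg: Sequence[float]) -> float:
    """Return the unweighted mean of the members' predicted probabilities."""
    if not models:
        raise ValueError("ensemble needs at least one model")
    return sum(model(ecg) for model in models) / len(models)


# Stand-ins for five independently trained networks (constants for illustration).
members = [lambda ecg, p=p: p for p in (0.6, 0.7, 0.8, 0.9, 1.0)]
risk = ensemble_predict(members, ecg=[0.0, 0.1, 0.0])
```

Because each member starts from a different random initialisation, their individual errors are only partly correlated, so the mean tends to fluctuate less than any single prediction.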
The model was evaluated against three progressively stricter labelling strategies. The first used ICD-10 codes alone, the second incorporated age and sex-adjusted NT-proBNP thresholds to validate diagnoses, and the third applied strict cut-offs (below 125 ng/L for non-HF and above 1,000 ng/L for confirmed HF) to produce the highest-certainty labels.
On the prospective test cohort of 43,727 patients, the model achieved area under the curve (AUC) values of 0.86, 0.91, and 0.96 under the three strategies, respectively. The model trained with NT-proBNP-enhanced labels significantly outperformed the ICD-10-only model under strategies two and three, confirming that cleaner training labels translate directly into better detection.
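For readers unfamiliar with the metric, the AUC is the probability that a randomly chosen HF-positive ECG receives a higher score than a randomly chosen HF-negative one, with ties counting as half. The Python sketch below computes it directly from that definition on made-up toy data; the function name `roc_auc` is illustrative and none of the numbers come from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data: two non-HF and two HF records with model scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
```

The pairwise loop is quadratic in the number of records and is meant only to make the definition concrete; on cohorts of this size a rank-based implementation would be the practical choice. It also shows why the reported AUC depends on the labelling strategy: changing which records count as positive or negative changes the pairs being compared.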
Validated Across Populations and Phenotypes
External validation on the MIMIC-IV dataset, comprising 161,352 patients from a United States (US) hospital system, yielded AUC values of 0.87, 0.90, and 0.96, closely mirroring the prospective results and demonstrating strong generalisability across different clinical and demographic settings.
Beyond broad HF detection, the model showed meaningful sensitivity to cardiac function subtypes, which are historically the hardest to identify. Among patients with preserved ejection fraction (EF) above 50%, the model distinguished normal diastolic function from grade 2 or 3 dysfunction with an AUC of 0.800, and from all other diastolic grades with an AUC of 0.828. It also responded systematically to the H2FPEF risk score, assigning progressively higher predicted risk to patients in higher score categories.
A qualitative review by a senior cardiologist further found that among 30 high-risk patients flagged by the model despite having no formal diagnosis, 24 showed strong clinical signs of HF with preserved EF (HFpEF), suggesting the model is detecting genuinely undiagnosed cases that conventional labelling strategies completely miss.
Toward Scalable HF Screening
This study represents a meaningful step toward accessible, low-cost HF screening at scale. By demonstrating that a deep learning model trained on pragmatically labelled, routinely collected data can detect HF across its full phenotypic range, including the elusive HFpEF subtype, the work opens the door to earlier intervention in settings where specialist diagnostics are unavailable.
The authors acknowledge key limitations, including the predominance of Caucasian patients in both validation cohorts and the absence of a true gold-standard diagnosis. Future work aims to incorporate echocardiographic parameters into model training, explore additional biomarkers, and establish real-world clinical impact through prospective trials.
Journal Reference
Stenhede, E., Ravn, J., Schirmer, H., & Ranjbar, A. (2026). Heart failure detection in electrocardiograms using Artificial Intelligence and pragmatic labelling. npj Digital Medicine. DOI: 10.1038/s41746-026-02774-4, https://www.nature.com/articles/s41746-026-02774-4