Reviewed by Lexie Corner, May 13, 2025
A recent study led by researchers at Weill Cornell Medicine has introduced a novel artificial intelligence–based method that accurately groups cancer patients based on shared characteristics before treatment and similar outcomes afterward. This approach could significantly enhance patient selection for clinical trials and support more personalized treatment decisions.
The study, conducted in collaboration with Regeneron Pharmaceuticals, addresses a long-standing challenge in oncology: predicting which patients are most likely to respond well to a given therapy. The researchers found that their method outperformed all previously published approaches in forecasting treatment outcomes using electronic health record data.
“We’re hopeful that this approach ultimately will be useful for testing and targeting treatments across a wide range of diseases.”
Dr. Fei Wang, Study Senior Author and Founding Director, Institute of AI for Digital Health, Department of Population Health Sciences, Weill Cornell Medicine
Dr. Fei Wang is also a Professor of Population Health Sciences.
While machine learning has long shown promise in identifying subtle yet meaningful patterns within large datasets, including medical records, existing systems often fall short. Although they can group patients based on general similarities in health information, these classifications don't reliably predict how patients will respond to future treatments.
Dr. Ying Li, a study co-author and a scientist at Regeneron specializing in treatment response prediction, recently reached out to Dr. Wang to explore whether his research team could help develop a more effective solution to this challenge.
“Our goal was to develop a platform that sorts patients with the target disease who are receiving the same treatment into groups sharing similar baseline characteristics and treatment outcomes. We validated this method using a real-world database of advanced non-small cell lung cancer patients treated with immune checkpoint inhibitors.”
Dr. Ying Li, Study Co-Author and Scientist, Regeneron
Dr. Weishen Pan, the study's first author and a postdoctoral research associate in the Wang Laboratory, led the development of the new machine learning platform. Pan trained the system using anonymized health records from 3,225 lung cancer patients contained in a commercial database. Each record included 104 variables, covering factors such as blood test results, prescriptions, medical history, and tumor stage.
In this initial application, the platform categorized patients into three distinct groups. The group with the longest average overall survival time after beginning treatment was predominantly female (55.5%) and had relatively low rates of comorbid conditions such as diabetes and heart failure.
In contrast, the group with the shortest survival time had less than half the average survival of the first group. This group was primarily male (66.2%) and showed higher rates of tumor metastasis, abnormal blood test results indicating inflammation, and signs of liver and kidney dysfunction.
“Using a metric called the concordance index, we showed that the average performance of this new approach at predicting patient survival times was superior to that of standard statistical and machine learning methods.”
Dr. Weishen Pan, Study First Author and Postdoctoral Research Associate, Weill Cornell Medicine
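For readers unfamiliar with the concordance index mentioned above, the short Python sketch below shows one common form of it, Harrell's C. It measures the fraction of comparable patient pairs in which the patient predicted to be at higher risk did in fact die sooner: 0.5 corresponds to random ranking and 1.0 to perfect ranking. The function name, the variable names, and the five-patient dataset are illustrative assumptions, not taken from the study, and the researchers may have used a different variant of the metric.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs ranked correctly.

    times       -- observed follow-up time for each patient
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output; higher means shorter predicted survival
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter
        # time ends in an observed death rather than censoring.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher-risk patient died first
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data (months of follow-up), NOT from the study.
toy_times = [5, 8, 12, 20, 30]
toy_events = [1, 1, 0, 1, 0]
toy_risk = [0.9, 0.7, 0.8, 0.3, 0.1]
toy_c_index = concordance_index(toy_times, toy_events, toy_risk)
print(round(toy_c_index, 3))
```

In this toy example, eight pairs are comparable and seven of them are ranked correctly, so the index is 0.875. Pairs in which the shorter follow-up ended in censoring are skipped because it is unknown which patient actually survived longer.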
The research team then applied their trained machine learning model to a separate dataset containing information on 1,441 patients with non-small-cell lung cancer. They found that the system generated nearly identical patient groupings, both in terms of baseline characteristics and survival outcomes.
Dr. Wang, Dr. Li, and their collaborators now plan to further develop and validate this approach for stratifying patients in clinical trials of novel therapies, as well as for informing personalized treatment decisions. The consistent groupings and predictive accuracy demonstrated by the platform also suggest its potential value in uncovering deeper biological insights into disease mechanisms.
“We’ll probably need more than electronic health record data for this, but we do want to understand the biological mechanisms that explain these distinct patient subgroups,” said Dr. Wang.
Regeneron supported the research.
Journal Reference:
Pan, W., et al. (2025). Identification of predictive sub-phenotypes for clinical outcomes using real-world data and machine learning. Nature Communications. doi.org/10.1038/s41467-025-59092-8