The study combined the water quality index (WQI), geochemical analysis, and machine learning (ML) models, and identified random forest (RF) as the best predictor.
Groundwater Under Growing Pressure
Groundwater is a critical global resource, especially in arid and semi-arid regions, where it is the primary source of drinking and irrigation water. However, rapid urbanization, industrialization, and population growth have led to its widespread contamination, posing severe risks to human health and agriculture.
Traditionally, the WQI and irrigation WQI (IWQI) have been vital tools for simplifying complex water quality data into a comprehensible format for assessment and management. Previous research has effectively utilized these indices and begun exploring ML models like artificial neural networks (ANNs) and support vector machines (SVMs) for water quality prediction. Despite this progress, a significant research gap exists in the application and comparison of more advanced ML algorithms, specifically RF and extreme gradient boosting (XGB), for predicting WQI.
The paper addressed this gap by conducting a comprehensive physicochemical analysis of groundwater in Kasganj, India, and applying three ML models (RF, ANN, and XGB) to predict WQI with high accuracy. Integrating traditional indexing with predictive modeling provides a robust framework for identifying contamination hotspots and informing targeted remediation strategies.
Combining Geochemical Modeling with Machine Learning
The study area has a sub-humid climate with hot summers and relies on groundwater from alluvial aquifers, where water levels fluctuate seasonally. Sampling ran from August 2023 to July 2024, during which 115 groundwater samples were systematically collected from 23 pre-identified sites using tube wells, hand pumps, and submersible pumps.
The samples were stored in pre-washed high-density polypropylene (HDPP) bottles and analyzed for twelve key physicochemical parameters, including pH, total dissolved solids (TDS), fluoride, and various ions. Analytical techniques such as titration, flame photometry, and ultraviolet (UV) spectrophotometry were employed, with an estimated error of less than ±5%.
The data analysis involved multiple approaches. Geochemical modeling using PHREEQC software calculated mineral saturation indices to understand the water's interaction with aquifer materials like carbonate rocks. Furthermore, the researchers calculated both WQI and IWQI by integrating the measured parameters, classifying the water into categories from excellent to unsuitable for drinking and irrigation based on established standards.
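The index calculation can be illustrated with the weighted arithmetic form of the WQI, which is one common formulation. The article does not state which variant, weights, or permissible limits the study used, so the `Parameter` class, the weights, the limits, and the sample values below are all hypothetical placeholders (only the 1.5 mg/L fluoride limit corresponds to the WHO guideline value).

```python
from dataclasses import dataclass


@dataclass
class Parameter:
    name: str
    measured: float     # measured value, same unit as `standard`
    standard: float     # permissible limit used for scoring (placeholder)
    weight: float       # relative importance chosen by the analyst (placeholder)
    ideal: float = 0.0  # ideal value; 7.0 is commonly used for pH


def quality_rating(p: Parameter) -> float:
    """Sub-index q_i = 100 * |C_i - ideal| / (S_i - ideal)."""
    return 100.0 * abs(p.measured - p.ideal) / (p.standard - p.ideal)


def weighted_wqi(params: list[Parameter]) -> float:
    """WQI = sum(W_i * q_i), with relative weights W_i = w_i / sum(w)."""
    total_weight = sum(p.weight for p in params)
    return sum((p.weight / total_weight) * quality_rating(p) for p in params)


# Hypothetical sample, NOT a measurement from the study
sample = [
    Parameter("TDS (mg/L)", measured=1200.0, standard=500.0, weight=4),
    Parameter("Fluoride (mg/L)", measured=1.8, standard=1.5, weight=5),
    Parameter("pH", measured=7.8, standard=8.5, weight=4, ideal=7.0),
]

print(f"WQI = {weighted_wqi(sample):.1f}")
```

A higher value indicates poorer water; the cut-offs that map a WQI value to classes such as "very poor" or "unfit" depend on the classification scheme a study adopts and are not reproduced here.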
A key aspect of the methodology was the application of three ML models (RF, ANN, and XGB) to predict the WQI. RF builds an ensemble of decision trees, which makes it resistant to overfitting; ANN mimics biological neurons to model complex nonlinear relationships; and XGB builds trees sequentially, each correcting the errors of the previous ones. The authors noted the advantages and disadvantages of each model and described the implementation steps, including data preprocessing, k-fold cross-validation, and hyperparameter tuning, to ensure robust and accurate predictions.
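The tuning workflow described above (hold-out split, k-fold cross-validation, hyperparameter search) might look roughly like the following for the RF model. This is a minimal sketch, assuming scikit-learn; the article does not say which library, grid, or number of folds the authors used, and the data here are synthetic stand-ins shaped like the study's dataset (115 samples, 12 parameters), not the real measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 115 samples x 12 parameters (NOT the study's data)
X = rng.uniform(0.0, 1.0, size=(115, 12))
y = 100.0 * X[:, 0] + 50.0 * X[:, 1] ** 2 + rng.normal(0.0, 2.0, size=115)

# Keep a test set aside so the final score reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical search grid; the study's actual grid is not given in the article
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),  # 5-fold CV (assumed k)
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)  # refits the best configuration on all training data

y_pred = search.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = float(r2_score(y_test, y_pred))
print(search.best_params_, round(rmse, 2), round(r2, 3))
```

The same pattern applies to the other two models by swapping the estimator and grid, which keeps the comparison between models on an equal footing.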
Critical Contamination and Model Performance
The analysis revealed significant contamination, with TDS alarmingly high and fluoride levels exceeding the World Health Organization (WHO) limit in many samples. Hydrogeochemical analysis, using Piper and Gibbs diagrams, identified the water type predominantly as Ca-Mg-Cl, with rock-water interactions being the primary source of ions. A correlation analysis suggested fluoride mobilization is linked to local mineral weathering.
The WQI classified 60.87% of samples as "unfit" for drinking, 26.08% as "very poor", and only 13.04% as "moderately poor", according to the study's classification criteria. For irrigation, indices such as the sodium adsorption ratio (SAR) and magnesium hazard (MH) were calculated to assess suitability.
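For readers unfamiliar with the two irrigation indices, their standard definitions are short enough to sketch directly. Both take ion concentrations in milliequivalents per litre (meq/L); the function names and the sample values below are hypothetical and are not taken from the study.

```python
import math


def sodium_adsorption_ratio(na: float, ca: float, mg: float) -> float:
    """SAR = Na / sqrt((Ca + Mg) / 2), all concentrations in meq/L."""
    return na / math.sqrt((ca + mg) / 2.0)


def magnesium_hazard(ca: float, mg: float) -> float:
    """MH = 100 * Mg / (Ca + Mg), concentrations in meq/L, result in percent."""
    return 100.0 * mg / (ca + mg)


# Hypothetical sample in meq/L, NOT a measurement from the study
na, ca, mg = 4.0, 3.0, 5.0

sar = sodium_adsorption_ratio(na, ca, mg)
mh = magnesium_hazard(ca, mg)
print(f"SAR = {sar:.2f}, MH = {mh:.1f}%")  # prints "SAR = 2.00, MH = 62.5%"
```

A magnesium hazard above 50% is often treated as a warning sign for irrigation water, though the thresholds the study applied are not restated in this article.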
A major component of the results was the comparison of the three ML models for predicting the WQI. While all three (XGB, ANN, and RF) performed well, the RF model demonstrated the best predictive accuracy and generalization on unseen test data, achieving the highest R² value (0.951) and the lowest root mean square error (RMSE) of 5.97.
The discussion confirmed RF's superiority, in line with other recent studies, and positioned the integration of traditional indexing with machine learning as a reliable and advanced method for water quality monitoring and management.
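The two reported metrics have simple definitions: RMSE is the square root of the mean squared prediction error, and R² is one minus the ratio of residual to total sum of squares. The sketch below computes both from scratch; the observed and predicted values are hypothetical WQI numbers chosen for easy arithmetic, not results from the study.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean square error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed vs. predicted WQI values, NOT the study's results
observed = [50.0, 100.0, 150.0, 200.0]
predicted = [55.0, 95.0, 160.0, 190.0]

print(round(rmse(observed, predicted), 2), round(r_squared(observed, predicted), 3))
# prints "7.91 0.98"
```

RMSE is expressed in the same units as the WQI itself, so it reads as a typical prediction error in index points, while R² describes the share of variance in the observed index that the model explains.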
Data-Driven Insights for Safer Groundwater
The study revealed that groundwater in Kasganj, India, faces critical contamination, with more than 60% of samples deemed unfit for drinking due to excessive TDS and fluoride levels exceeding WHO limits.
By combining traditional water quality indices with advanced machine learning models, the researchers created a predictive framework for assessing groundwater health. Among the models tested, the RF algorithm delivered the most reliable results, highlighting its value as a practical tool for long-term water quality monitoring. These findings underscore the urgent need for sustainable groundwater management and targeted remediation efforts to safeguard this vital resource.
Journal Reference
Islam et al. (2025). Integrated groundwater quality assessment using geochemical modeling and machine learning approach in Northern India. Scientific Reports, 15(1). DOI:10.1038/s41598-025-21592-4. https://www.nature.com/articles/s41598-025-21592-4