Toxic Metal Pollution in Marine Ecosystems
Toxic metal pollution, particularly from aluminum (Al), threatens marine ecosystems and stems largely from industrial and agricultural activities. Al accumulates in saltwater, where it is toxic to organisms and poses risks to human health. Previous work has successfully applied machine learning (ML) to predict heavy metal concentrations but often relied on large, complex feature sets, limiting model interpretability.
A gap exists in determining whether accurate predictions can be achieved using a reduced subset of chemical elements rather than full datasets. This study addressed that gap by applying ML combined with feature selection to predict Al concentrations in water and sediment from the Sea of Marmara. It aimed to identify key elemental predictors, improving prediction efficiency and model interpretability for environmental monitoring.
Data Collection, Feature Selection, and Predictive Modeling
Researchers collected water and sediment samples from 17 locations in the Sea of Marmara over ten days. Sediment samples were oven-dried, and both sample types were digested with acids in a microwave device. After filtration and dilution, the concentrations of 15 elements (including Al, iron, lead, and zinc) were measured using inductively coupled plasma optical emission spectroscopy (ICP-OES).
To improve prediction accuracy and reduce complexity, two feature selection techniques were applied. Recursive feature elimination (RFE) works by repeatedly removing the least important features and retraining the model until the optimal subset remains. A genetic algorithm (GA) takes a different approach, mimicking biological evolution by testing various feature combinations and using crossover and mutation to evolve toward the best solution. GA is more computationally intensive but excels at uncovering complex, nonlinear relationships in data.
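The two selection strategies described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names (`rfe_select`, `ga_select`), the toy scoring functions, and all parameter values are hypothetical. In a real pipeline, the importance and fitness functions would retrain a model and score it on validation data.

```python
import random

def rfe_select(features, importance_fn, n_keep):
    """Recursive feature elimination: drop the weakest feature, re-score, repeat."""
    remaining = list(features)
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)  # stands in for "retrain and rank"
        weakest = min(remaining, key=lambda name: scores[name])
        remaining.remove(weakest)
    return remaining

def ga_select(n_features, fitness_fn, pop_size=20, generations=30,
              mutation_rate=0.1, seed=0):
    """Genetic algorithm over 0/1 feature masks.

    Returns (best_mask, history), where history holds the best fitness
    seen at the start of each generation.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness_fn, reverse=True)
        history.append(fitness_fn(population[0]))
        parents = population[:pop_size // 2]   # truncation selection
        children = [population[0][:]]          # elitism: keep the best mask as-is
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = mum[:cut] + dad[cut:]
            # Bit-flip mutation: each bit flips with probability mutation_rate.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness_fn), history

# Toy demonstration with made-up scoring (not the study's data).
USEFUL = {0, 3, 5}  # hypothetical indices of informative features

def toy_fitness(mask):
    # Reward useful features; penalize every selected feature slightly.
    return sum(mask[i] for i in USEFUL) - 0.1 * sum(mask)

def toy_importance(remaining):
    # Pretend higher-numbered features matter more.
    return {name: int(name[1:]) for name in remaining}

kept = rfe_select(["x0", "x1", "x2", "x3"], toy_importance, n_keep=2)
best_mask, history = ga_select(n_features=8, fitness_fn=toy_fitness)
```

Because the best mask of each generation is copied unchanged into the next, the best fitness in `history` can never decrease. The extra cost of the GA comes from evaluating a whole population of candidate subsets in every generation, whereas RFE evaluates one shrinking subset at a time.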
Six predictive models were employed. Multiple linear regression (MLR) establishes linear relationships between elements but requires strict statistical assumptions. Elastic net combines two regularization techniques to prevent overfitting while handling correlated features. Type-1 fuzzy functions model uncertainties without needing expert rules, using clustering to capture relationships.
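For reference, the two penalties that elastic net combines are the L1 (lasso) and L2 (ridge) terms. One common way to write the objective is shown below. The parameterization is an assumption borrowed from widely used software conventions, and the article does not state which form the authors used. Here y would be the measured Al concentrations, X the matrix of the other elements, λ the overall penalty strength, and α the mix between the two penalties.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)
```

Setting α = 1 recovers the lasso and α = 0 recovers ridge regression. Intermediate values let the model shrink groups of correlated predictors together while still setting some coefficients exactly to zero.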
Extreme gradient boosting (XGBoost) builds sequential decision trees, each correcting the errors of the previous ones, with built-in regularization to avoid overfitting. Random forest averages many decision trees for robust predictions. Finally, ensemble learning methods combine multiple models, using simple averaging or stacked architectures, to improve predictive performance. All models underwent hyperparameter tuning to optimize their performance for predicting Al concentrations.
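The simple-averaging variant of ensemble learning is easy to illustrate. The sketch below assumes each base model has already produced one prediction per sample. The function name and the numbers are hypothetical and do not come from the study.

```python
def average_ensemble(predictions_per_model):
    """Average the predictions of several models, sample by sample.

    predictions_per_model: list of equal-length lists, one list per model.
    """
    n_models = len(predictions_per_model)
    return [sum(sample_preds) / n_models
            for sample_preds in zip(*predictions_per_model)]

# Two hypothetical models predicting two samples each.
combined = average_ensemble([[1.0, 2.0], [3.0, 4.0]])
```

A stacked architecture replaces the fixed average with a second-level model that learns how much weight to give each base model's prediction.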
Experimental Findings and Geochemical Interpretation
The data were split into training (50%), validation (20%), and testing (30%) sets. Both feature selection methods, RFE and GA, reduced the number of features from 14 to six in both datasets.
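A 50/20/30 split like the one described can be sketched as follows. The shuffling, the seed, and the function name are assumptions for illustration only. The article does not say how samples were assigned to each set.

```python
import random

def split_indices(n_samples, train_pct=50, val_pct=20, seed=0):
    """Shuffle sample indices and cut them into train/validation/test lists.

    Whatever is left after the training and validation cuts goes to testing.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
```

The validation set is typically what hyperparameter tuning and feature selection are scored against, while the test set is held back for the final error estimate.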
For water samples, the best individual model was XGBoost combined with GA-selected features, achieving a root mean square error (RMSE) of 0.0354. This represents a 44.4% improvement over using all features and a 22.2% improvement over standard MLR. Shapley additive explanations (SHAP) analysis revealed copper and chromium as the most influential predictors.
For sediment samples, XGBoost with GA-selected features again performed best, with an RMSE of 356.57, a 20% improvement over the full feature set and a 45% improvement over MLR. SHAP analysis highlighted boron and cadmium as key features.
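For readers unfamiliar with the metric, the sketch below shows how RMSE is computed and one plausible way to express an improvement as a percentage. It assumes the reported percentages are relative reductions in RMSE, which the article does not state explicitly. The function names and input values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def percent_improvement(baseline_rmse, new_rmse):
    """Relative reduction in RMSE, as a percentage of the baseline."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
error = rmse([0.0, 0.0], [3.0, 4.0])   # sqrt((9 + 16) / 2)
gain = percent_improvement(2.0, 1.0)   # halving the error is a 50% gain
```

Because RMSE carries the units of the target variable, the water and sediment values are not directly comparable with each other. Only the relative improvements are.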
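SHAP attributions are based on Shapley values from cooperative game theory. The sketch below computes exact Shapley values for a tiny toy "model" by averaging each feature's marginal contribution over all feature orderings. It is an illustration of the underlying idea only: practical SHAP tools use faster approximations or model-specific algorithms, and the value functions and weights here are hypothetical.

```python
from itertools import permutations

def shapley_values(n_features, value_fn):
    """Exact Shapley values for a value function over feature subsets.

    value_fn takes a frozenset of feature indices and returns a number.
    Cost grows factorially, so this is only practical for a few features.
    """
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        included = set()
        previous = value_fn(frozenset(included))
        for feature in order:
            included.add(feature)
            current = value_fn(frozenset(included))
            totals[feature] += current - previous  # marginal contribution
            previous = current
    return [total / len(orderings) for total in totals]

# Additive toy model: each feature contributes a fixed, made-up weight.
WEIGHTS = [2.0, 1.0, 0.5]
additive = shapley_values(3, lambda subset: sum(WEIGHTS[i] for i in subset))

# Interaction toy model: value appears only when features 0 and 1 are both present.
paired = shapley_values(2, lambda subset: 1.0 if subset == {0, 1} else 0.0)
```

In the additive case each feature's Shapley value equals its own weight, and in the interaction case the credit is shared equally between the two features.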
Feature selection significantly improved tree-based models (XGBoost, Random Forest) but not linear models (MLR), because tree-based methods better capture nonlinear relationships and interactions. The stacking ensemble method also showed competitive performance.
Geochemically, selected features in water reflect anthropogenic inputs and hydrographic mixing, while sediment features indicate both natural lithogenic background and human influences. Al distribution in the Sea of Marmara aligns with regional studies showing Al as primarily lithogenic, with contributions from industrialization and urbanization.
Implications for Interpretable Environmental Monitoring
This study demonstrated that combining ML with feature selection can accurately predict Al concentrations in water and sediment from the Sea of Marmara using a reduced set of elements. Among the models tested, XGBoost combined with GA-based feature selection achieved the best performance, reducing the number of predictors from 14 to six while improving prediction accuracy by 44.4% for water samples and 20% for sediment samples.
Key predictors included copper, chromium, boron, and cadmium. These findings indicate that feature selection enhances model interpretability and efficiency without sacrificing performance. Future research should explore deep learning approaches and incorporate seasonal and spatial variations across broader sampling areas to further improve predictive capabilities for environmental monitoring.
Journal Reference
Ucan, A., Tak, N., Hocaoglu-Ozyigit, A., & Ozyigit, I. I. (2026). Scientific Reports. DOI:10.1038/s41598-026-48252-5, https://www.nature.com/articles/s41598-026-48252-5