Researchers recently examined how well multimodal large language models (MLLMs) can evaluate weld quality in industrial settings. Drawing on expert-annotated datasets from both real-world and online images, the study found that while these models tend to perform better on familiar, publicly sourced images, they can also handle previously unseen, real-world welds with reasonable effectiveness.
To support this evaluation, the team introduced WeldPrompt—a prompting strategy that blends chain-of-thought reasoning with in-context learning. This method improved recall in certain cases but delivered inconsistent results overall. The findings underscore both the promise and current constraints of MLLMs in technical domains, especially when applied to tasks requiring nuanced visual judgment.
Background
While generative AI has made significant strides, hallucinations—confident but incorrect outputs—remain a critical concern, particularly for models that combine language and vision. Most research to date has focused on hallucinations in text-based applications or white-collar domains. As a result, there's limited understanding of how MLLMs perform in real-world, high-stakes environments like manufacturing.
At the same time, much of the conversation around AI’s impact on the workforce has overlooked skilled production roles—despite these being strong candidates for AI integration. Welding, which requires technical skill and precise visual assessment, is a prime example.
To address this gap, the study introduced a new dataset of expert-annotated weld images, both from real-world training programs and publicly available online sources. The team also proposed WeldPrompt, designed to guide MLLMs in producing more accurate and reasoned judgments by referencing similar past examples.
Materials and Methods
The researchers tested two MLLMs—GPT-4o and LLaVA-1.6—on their ability to classify welds as acceptable or unacceptable across three industrial contexts: marine & research vessels (RV), aeronautical, and farming.
They used two datasets:
- A real-world set collected from welding training programs
- An online set assembled from publicly available weld images
Each image was annotated by a domain expert according to the technical standards of the specific context. To avoid giving the models direct hints, any annotations were removed before testing, resulting in 62 real-world and 58 online images.
The models were evaluated under two prompting strategies:
- Zero-shot prompting, where models made judgments without seeing prior examples
- WeldPrompt, which incorporated in-context examples of correctly classified welds, identified using embeddings from a vision model. These examples were paired with chain-of-thought reasoning to help guide the model’s output.
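The paper does not publish implementation code, but the retrieval step described above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity over precomputed image embeddings; the function names, prompt wording, and two-dimensional toy vectors are all illustrative, not the authors' actual pipeline.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_examples(query_emb, bank, k=2):
    """Return the k labeled welds whose embeddings are closest to the query.

    bank: list of (embedding, label, description) tuples for
    correctly classified reference welds.
    """
    ranked = sorted(bank, key=lambda e: cosine_sim(query_emb, e[0]), reverse=True)
    return ranked[:k]

def build_weldprompt(query_desc, examples):
    """Assemble an in-context prompt that pairs retrieved examples
    with a chain-of-thought instruction (wording is hypothetical)."""
    parts = ["You are a welding inspector. Reason step by step before answering."]
    for _emb, label, desc in examples:
        parts.append(f"Example weld: {desc}\nVerdict: {label}")
    parts.append(f"Now assess this weld: {query_desc}\nReasoning:")
    return "\n\n".join(parts)

# Toy 2-D embeddings stand in for real vision-model features.
bank = [
    (np.array([1.0, 0.0]), "acceptable", "smooth, uniform bead"),
    (np.array([0.0, 1.0]), "unacceptable", "porosity along the weld toe"),
    (np.array([0.9, 0.1]), "acceptable", "even ripple pattern"),
]
top = retrieve_examples(np.array([0.95, 0.05]), bank, k=2)
prompt = build_weldprompt("slightly irregular bead on RV hull plate", top)
```

In practice the embeddings would come from a pretrained vision encoder and the assembled prompt would be sent, together with the query image, to the MLLM; only the nearest-neighbor selection and prompt scaffolding are shown here.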
To assess performance, the study reported precision, recall, F1-score, and ROC-AUC, averaged across runs under a leave-one-out validation strategy, giving a consistent basis for comparison across all three contexts.
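The evaluation loop can be sketched as follows. This is a simplified illustration of leave-one-out scoring, with the metrics implemented by hand and a placeholder classifier standing in for the MLLM; none of these names come from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive="unacceptable"):
    """Compute precision, recall, and F1 for one class, from scratch."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def loo_evaluate(items, predict):
    """Leave-one-out: each weld is scored with every *other* weld
    available as reference material, so no image informs its own prediction.

    items: list of (features, label); predict(features, bank) -> label.
    """
    y_true, y_pred = [], []
    for i, (x, label) in enumerate(items):
        bank = items[:i] + items[i + 1:]  # hold out the query weld
        y_true.append(label)
        y_pred.append(predict(x, bank))
    return y_true, y_pred

# Toy 1-D features and a nearest-neighbor stand-in for the model.
items = [
    (0.10, "acceptable"),
    (0.20, "acceptable"),
    (0.90, "unacceptable"),
    (0.95, "unacceptable"),
]
nearest = lambda x, bank: min(bank, key=lambda e: abs(e[0] - x))[1]
y_true, y_pred = loo_evaluate(items, nearest)
p, r, f1 = precision_recall_f1(y_true, y_pred)
```

In the study itself, `predict` would be a call to GPT-4o or LLaVA-1.6 under the chosen prompting strategy, and metrics would additionally include ROC-AUC over the model's scores.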
Findings and Analysis
Both GPT-4o and LLaVA-1.6 were tested across all three domains using both prompting strategies. In the zero-shot setting, performance was noticeably weaker in the farming context, where both models struggled with precision and recall. GPT-4o tended to be overly conservative in the aeronautical domain, rejecting many welds that experts had approved, while showing more leniency in farming. LLaVA-1.6 showed a similar pattern but generally underperformed due to its smaller model size.
Notably, both models fared better on the online dataset than on real-world images, suggesting a reliance on memorized training examples rather than genuine reasoning.
Introducing WeldPrompt helped in some areas. GPT-4o saw improvements in recall and precision for marine & RV welds, though this came at the cost of increased strictness in the aeronautical domain. LLaVA-1.6 continued to struggle with class imbalance in the real-world aeronautical dataset, often defaulting to blanket rejection of all welds. In contrast, GPT-4o showed a more balanced response.
Overall, WeldPrompt led to modest but consistent F1 score improvements, indicating better alignment with expert assessments. However, the ongoing performance gap between online and real-world data highlighted a deeper issue: the models may still be relying more on pattern recognition than on true visual reasoning.
Conclusion
This study sheds light on both the capabilities and limitations of MLLMs in industrial quality control. While models like GPT-4o and LLaVA-1.6 show promise, especially when supported by prompting strategies like WeldPrompt, their performance is still closely tied to familiar datasets and limited by current reasoning capabilities.
For MLLMs to be reliably deployed in high-stakes industrial tasks, more targeted fine-tuning, stronger prompting frameworks, and improvements in reasoning over unfamiliar visual inputs will be essential.
Journal Reference
Khvatskii, G., Lee, Y. S., Angst, C., Gibbs, M., Landers, R., & Chawla, N. V. (2025). Do multimodal large language models understand welding? Information Fusion, 120, 103121. DOI:10.1016/j.inffus.2025.103121. https://www.sciencedirect.com/science/article/pii/S1566253525001940?via%3Dihub