New Research Puts AI to the Test in High-Stakes Welding Assessments

Researchers recently examined how well multimodal large language models (MLLMs) can evaluate weld quality in industrial settings. Drawing on expert-annotated datasets of both real-world and online weld images, the study found that while these models perform better on familiar, publicly sourced images, they can also handle previously unseen, real-world welds with reasonable effectiveness.

Study: Do multimodal large language models understand welding? Image Credit: Anggalih Prasetya/Shutterstock.com


To support this evaluation, the team introduced WeldPrompt—a prompting strategy that blends chain-of-thought reasoning with in-context learning. This method improved recall in certain cases but delivered inconsistent results overall. The findings underscore both the promise and current constraints of MLLMs in technical domains, especially when applied to tasks requiring nuanced visual judgment.

Background

While generative AI has made significant strides, hallucinations—confident but incorrect outputs—remain a critical concern, particularly for models that combine language and vision. Most research to date has focused on hallucinations in text-based applications or white-collar domains. As a result, there's limited understanding of how MLLMs perform in real-world, high-stakes environments like manufacturing.

At the same time, much of the conversation around AI’s impact on the workforce has overlooked skilled production roles—despite these being strong candidates for AI integration. Welding, which requires technical skill and precise visual assessment, is a prime example.

To address this gap, the study introduced a new dataset of expert-annotated weld images, both from real-world training programs and publicly available online sources. The team also proposed WeldPrompt, designed to guide MLLMs in producing more accurate and reasoned judgments by referencing similar past examples.

Materials and Methods

The researchers tested two MLLMs—GPT-4o and LLaVA-1.6—on their ability to classify welds as acceptable or unacceptable across three industrial contexts: marine & research vessels (RV), aeronautical, and farming.

They used two datasets:

  • A real-world set collected from welding training programs
  • An online set assembled from publicly available weld images

Each image was annotated by a domain expert according to the technical standards of the specific context. To avoid giving the models direct hints, all annotations were removed from the images before testing. The final datasets comprised 62 real-world and 58 online images.

The models were evaluated under two prompting strategies:

  • Zero-shot prompting, where models made judgments without seeing prior examples
  • WeldPrompt, which incorporated in-context examples of correctly classified welds, identified using embeddings from a vision model. These examples were paired with chain-of-thought reasoning to help guide the model’s output.
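
The retrieval step behind this kind of prompting can be illustrated with a minimal sketch: rank a pool of expert-labeled welds by embedding similarity to the query image, then fold the nearest examples into the prompt. The function and field names below are hypothetical, and the vision-model embeddings are assumed to be precomputed; the paper's actual pipeline may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def select_examples(query_emb, labeled_pool, k=2):
    """Return the k labeled welds whose embeddings are closest to the query."""
    ranked = sorted(labeled_pool,
                    key=lambda ex: cosine(query_emb, ex["embedding"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(query_desc, examples):
    """Assemble an in-context prompt with a chain-of-thought instruction."""
    lines = ["You are an expert welding inspector."]
    for ex in examples:
        lines.append(f"Example weld: {ex['description']} -> {ex['label']}")
    lines.append("Think step by step about defects, penetration, and bead "
                 "uniformity, then classify the following weld as acceptable "
                 "or unacceptable.")
    lines.append(f"Weld to assess: {query_desc}")
    return "\n".join(lines)
```

Under a leave-one-out setup, the held-out image rotates through the pool, so each weld is scored using examples drawn only from the remaining images.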

To assess performance, the study used precision, recall, F1-score, and ROC-AUC metrics, averaged across runs under a leave-one-out validation strategy, which reduces the chance that results hinge on any single image and supports a fairer comparison across contexts.
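
For reference, the per-class metrics reported here can be computed directly from the binary labels. The sketch below treats "unacceptable" as the positive class, which is an assumption on our part rather than a detail stated in the article.

```python
def precision_recall_f1(y_true, y_pred, positive="unacceptable"):
    """Precision, recall, and F1 for one class of a binary weld classifier."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

In a leave-one-out evaluation, these scores would be computed once per held-out image and averaged, as the study describes.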

Findings and Analysis

Both GPT-4o and LLaVA-1.6 were tested across all three domains using both prompting strategies. In the zero-shot setting, performance was noticeably weaker in the farming context, where both models struggled with precision and recall. GPT-4o tended to be overly conservative in the aeronautical domain, rejecting many welds that experts had approved, while showing more leniency in farming. LLaVA-1.6 showed a similar pattern but generally underperformed due to its smaller model size.

Notably, both models fared better on the online dataset than on real-world images, suggesting a reliance on memorized training examples rather than genuine reasoning.

Introducing WeldPrompt helped in some areas. GPT-4o saw improvements in recall and precision for RV & marine welds, though this came at the cost of increased strictness in the aeronautical domain. LLaVA-1.6 continued to struggle with class imbalance in the real-world aeronautical dataset, often defaulting to blanket rejection of all welds. In contrast, GPT-4o showed a more balanced response.

Overall, WeldPrompt led to modest but consistent F1 score improvements, indicating better alignment with expert assessments. However, the ongoing performance gap between online and real-world data highlighted a deeper issue: the models may still be relying more on pattern recognition than on true visual reasoning.

Conclusion

This study sheds light on both the capabilities and limitations of MLLMs in industrial quality control. While models like GPT-4o and LLaVA-1.6 show promise, especially when supported by prompting strategies like WeldPrompt, their performance is still closely tied to familiar datasets and limited by current reasoning capabilities.

For MLLMs to be reliably deployed in high-stakes industrial tasks, more targeted fine-tuning, stronger prompting frameworks, and improvements in reasoning over unfamiliar visual inputs will be essential.

Journal Reference

Khvatskii, G., Lee, Y. S., Angst, C., Gibbs, M., Landers, R., & Chawla, N. V. (2025). Do multimodal large language models understand welding? Information Fusion, 120, 103121. DOI: 10.1016/j.inffus.2025.103121. https://www.sciencedirect.com/science/article/pii/S1566253525001940?via%3Dihub

Disclaimer: The views expressed here are those of the author expressed in their private capacity and do not necessarily represent the views of AZoM.com Limited T/A AZoNetwork the owner and operator of this website. This disclaimer forms part of the Terms and conditions of use of this website.

Citations

Please cite this article as:

  • APA

    Nandi, Soham. (2025, May 16). New Research Puts AI to the Test in High-Stakes Welding Assessments. AZoRobotics. Retrieved on May 16, 2025 from https://www.azorobotics.com/News.aspx?newsID=15958.
