When a robot is deployed to locate workers trapped in a collapsed mine, it must act quickly - mapping unfamiliar terrain and pinpointing its own position in real time. But today’s best machine-learning models for navigation can only process a handful of images at once, which doesn’t cut it in disaster zones where seconds matter and thousands of images may need to be analyzed.
To address this, MIT researchers have created a new system that dramatically speeds up the process.
Drawing from both modern AI vision models and classic computer vision techniques, their method can generate accurate 3D maps of complex environments - like a cluttered office corridor - in just seconds, using only images from a robot’s onboard camera.
The system works by incrementally building small submaps as the robot moves. These submaps are then aligned and stitched together into a full 3D reconstruction, all while estimating the robot’s location in real time. Unlike many current approaches, this method doesn’t require calibrated cameras or expert fine-tuning. Its simplicity, combined with fast and high-quality results, makes it more practical for real-world deployment.
In addition to aiding search-and-rescue missions, the technique could power applications in extended reality (XR) for devices like VR headsets, or help warehouse robots quickly locate and move items.
For robots to accomplish increasingly complex tasks, they need much more complex map representations of the world around them. But at the same time, we don’t want to make it harder to implement these maps in practice. We’ve shown that it is possible to generate an accurate 3D reconstruction in a matter of seconds with a tool that works out of the box.
Dominic Maggio, Study Lead Author and Graduate Student, Massachusetts Institute of Technology
Maggio collaborated with postdoc Hyungtae Lim and senior author Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics, principal investigator at the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Lab. The research will be presented at the Conference on Neural Information Processing Systems.
Revisiting SLAM with a Fresh Perspective
The core challenge the team tackled is a well-known robotics problem: simultaneous localization and mapping, or SLAM. SLAM allows a robot to map an unknown environment while keeping track of its own position within it.
Traditional optimization-based SLAM methods often struggle in visually complex scenes or require pre-calibrated cameras. In contrast, machine-learning-based methods, while easier to implement, are limited by memory and can only process about 60 images at a time - far too few to map a large, complex environment quickly.
The MIT team’s solution sidesteps this limitation by having the robot create many small submaps instead of one large map. These smaller chunks can be processed quickly, then combined into a full 3D reconstruction.
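To make the idea concrete, the sketch below outlines the general submap-and-stitch pattern in Python. It is not the team’s system: the reconstruction step is a stand-in that simply fabricates points so the script runs, the window sizes are assumed values, and consecutive submaps are chained together with a plain rigid (rotation-plus-translation) alignment - the naive approach that, as described below, turned out not to be enough on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
POINTS_PER_FRAME = 100   # assumed density of the dummy reconstruction

def reconstruct_submap(frames):
    """Stand-in for a learned multi-view reconstruction model. Here it just
    fabricates a local point cloud so the pipeline runs end to end; in a real
    system this is where a vision model would turn a short burst of images
    into 3D points and camera poses."""
    return rng.normal(size=(len(frames) * POINTS_PER_FRAME, 3))

def rigid_align(src, dst):
    """Kabsch/Procrustes: the R, t minimizing ||R @ src_i + t - dst_i||."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

def stitch(frames, chunk=30, overlap=5):
    """Build a small submap from each window of frames, register each new
    submap to its predecessor through the frames they share, and accumulate
    everything into one global point cloud."""
    step = chunk - overlap
    n_shared = overlap * POINTS_PER_FRAME
    global_cloud, prev_tail = [], None
    R_acc, t_acc = np.eye(3), np.zeros(3)           # current submap's pose in the global frame
    for start in range(0, len(frames), step):
        pts = reconstruct_submap(frames[start:start + chunk])
        if prev_tail is not None:
            # Points reconstructed from the shared frames should coincide, so
            # use them to register the new submap to the previous one. (With
            # the dummy reconstruction above the correspondence is meaningless;
            # only the control flow matters here.)
            R, t = rigid_align(pts[:n_shared], prev_tail)
            R_acc, t_acc = R_acc @ R, R_acc @ t + t_acc
        global_cloud.append(pts @ R_acc.T + t_acc)
        prev_tail = pts[-n_shared:]
    return np.vstack(global_cloud)

# Example: 300 placeholder "frames" stitched into one cloud.
print(stitch(list(range(300))).shape)
```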
“This seemed like a very simple solution, but when I first tried it, I was surprised that it didn’t work that well,” says Maggio.
While searching for a solution, Maggio revisited computer vision research from the 1980s and ’90s. That deep dive revealed a key issue: machine-learning models often introduce subtle distortions when processing images, which makes aligning submaps much trickier than expected.
Traditional approaches rely on basic geometric operations like rotating and translating submaps until they line up. But with modern models, that’s often not enough. A submap might show one side of a room with slightly warped or stretched walls, meaning standard alignment techniques fall short. These small deformations introduce ambiguity that simple transformations can't resolve.
We need to make sure all the submaps are deformed in a consistent way so we can align them well with each other.
Luca Carlone, Associate Professor, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology
A More Flexible Alignment Strategy
To solve this, the researchers developed a more adaptable mathematical method that models how submaps might be warped. By applying these transformations, the system can reliably align even slightly distorted submaps.
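As one concrete illustration of relaxing rigid alignment, the sketch below estimates a similarity transform - rotation, translation, plus a per-submap scale - between corresponding points, using the classic Umeyama closed-form solution. This is only a simple example of a more flexible transform family; the researchers’ actual deformation model is not spelled out here and is presumably more general than a single uniform scale.

```python
import numpy as np

def similarity_align(src, dst):
    """Umeyama (1991) closed form for the s, R, t minimizing
    sum_i || s * R @ src_i + t - dst_i ||^2."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)                          # cross-covariance of the two clouds
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])   # guard against reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()    # optimal uniform scale
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic check: a submap that came out 7% too small, rotated, and shifted
# is still snapped exactly into place, which rotation + translation alone
# could not achieve.
rng = np.random.default_rng(1)
src = rng.normal(size=(200, 3))                               # points in the new submap's frame
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
dst = 0.93 * src @ R_true.T + np.array([0.4, -0.2, 1.0])      # where those points belong globally
s, R, t = similarity_align(src, dst)
print(round(s, 3), np.abs(s * src @ R.T + t - dst).max())     # ~0.93 and ~1e-15
```

The broader point of the example is that moving beyond rotation-plus-translation is a small, well-understood mathematical step drawn from classical geometry, which is what lets the system absorb the subtle distortions the learned models introduce.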
Using a stream of input images, the system produces a 3D reconstruction of the environment along with estimates of the camera’s positions - crucial data that enables the robot to localize itself within the scene as it navigates.
“Once Dominic had the intuition to bridge these two worlds – learning-based approaches and traditional optimization methods – the implementation was fairly straightforward. Coming up with something this effective and simple has potential for a lot of applications,” Carlone added.
Their system outperformed other methods in both speed and accuracy, all without the need for specialized cameras or extra processing tools. In tests, the researchers were able to generate near-real-time 3D reconstructions of intricate environments, such as the interior of the MIT Chapel, using nothing more than short cell phone videos.
The results were impressively precise, with an average reconstruction error of less than 5 centimeters.
Looking ahead, the team aims to further improve the system’s reliability in especially complex or cluttered environments and eventually deploy it on real robots operating in demanding, real-world conditions.
Knowing about traditional geometry pays off. If you understand deeply what is going on in the model, you can get much better results and make things much more scalable.
Luca Carlone, Associate Professor, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology