Visual Object Localisation in Humanoid Robots

Mr. Jürgen Leitner at the Dalle Molle Institute for Artificial Intelligence (IDSIA) in Lugano, Switzerland, talks to AZoRobotics about visual object localisation in humanoid robots. Mr. Leitner’s study was published in the peer-reviewed International Journal of Advanced Robotic Systems by the open-access publisher InTech.

Can you tell us about your ongoing research at IDSIA?

At the IDSIA Robotics Lab, we are proud owners of the European-developed iCub humanoid robot.

The platform provides an anthropomorphically designed system with many degrees of freedom (DOF), which allows for interesting research into topics such as artificial intelligence, robot learning and, more practically, object manipulation.

This research is part of the European Commission-funded IM-CLeVeR FP7 project and is placed at the intersection of neuroscience, machine learning, computer vision and robot control.

Toward Intelligent Humanoids | iCub 2012

Video courtesy of Tomás Donoso, Dalle Molle Institute for Artificial Intelligence (IDSIA) in Lugano, Switzerland.

What makes the iCub humanoid robot a good platform for object manipulation research?

The design of the iCub comprises a package that tightly integrates sensors and actuators. For example, the arm and hand follow a very detailed human-like design with almost as many DOFs as their human counterparts.

The tight integration of artificial intelligence (AI) with a “body” (known as embodiment) allows us to draw more parallels with humans. For example, AI researchers have made significant advances in chess playing (e.g., IBM’s Deep Blue computer).

These computers have been able to match (and even eclipse) human performance. If you contrast this with the physical movement of chess pieces across a board, you notice that there is plenty of room to improve robot abilities.

Currently, virtually no robots are able to perceive the chessboard, plan and then execute motions to move one chess piece to another position. But every child can perform this action without a problem.

Why is perception a difficult issue to solve in robotic systems?

To allow for manipulation, the object first needs to be detected and localised. Usually this localisation is performed by using stereo vision approaches, which will give a position estimate with respect to the eyes.

In a humanoid robot, the eyes are usually not at the same position as the starting point of the arm. Therefore, a transformation based on the kinematics of the robot is necessary to calculate the position relative to a common reference frame. In the iCub, the origin of this global frame is in the hip.
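The frame change described above can be sketched with homogeneous transforms. This is a minimal illustration, not the iCub's actual kinematic chain: the two-link hip-to-neck-to-eye chain, the link offsets and the helper names (`rot_z`, `translate`, `eye_to_hip`) are all assumptions made for the example.

```python
import math

def rot_z(theta):
    """4x4 homogeneous rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def translate(x, y, z):
    """4x4 homogeneous translation."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def apply(transform, point):
    """Apply a 4x4 transform to a 3D point and return a 3D point."""
    v = [point[0], point[1], point[2], 1.0]
    return [sum(transform[i][k] * v[k] for k in range(4)) for i in range(3)]

def eye_to_hip(neck_yaw):
    """Hypothetical chain: hip -> neck (0.4 m up, yaw joint) -> eye (offset)."""
    hip_T_neck = matmul(translate(0.0, 0.0, 0.4), rot_z(neck_yaw))
    neck_T_eye = translate(0.0, 0.03, 0.1)
    return matmul(hip_T_neck, neck_T_eye)

# An object seen 0.5 m in front of the eye, with the neck turned by 90 degrees.
p_eye = [0.5, 0.0, 0.0]
p_hip = apply(eye_to_hip(math.pi / 2), p_eye)
print([round(c, 3) for c in p_hip])
```

Every joint between the eye and the hip adds one such matrix to the chain, so any error in a measured link length or encoder offset propagates into the final position estimate.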

Why is spatial understanding critical for perception and reasoning in robots?

Perception generally is very important in creating autonomous robots. A robot needs to sense its environment in order to make decisions and perform the right actions.

Spatial perception is of importance for obstacle avoidance, navigation and planning. This has been mostly of interest for mobile robots, but it is also important for a humanoid robot.

Take, for example, a humanoid robot that should clean a table in your house. This robot needs to have some understanding of how far away things are (Do I have to move to pick something up?) and how the objects are placed relative to each other (Can I pick up the tea box or do I have to first remove the cup which is blocking my reach?).

Can you discuss the approach to your latest research on learning object localisation from vision on a humanoid robot?

The novelty of our approach is to use machine learning techniques to “learn” a direct correlation between the pixels in the two cameras and the object’s position in a useful, global reference frame.

Can you discuss the two main biologically-inspired approaches for position estimates?

In our recent research, we used two different methods to “learn” the direct correlation. One is called Genetic Programming (GP), which is based on the principles of evolution in biological systems.
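To make the idea of evolving a formula concrete, here is a toy genetic-programming sketch. It is not the system used in the study: it fits a single-variable target function, and the operator set, population size, depth cap and all function names are assumptions chosen for illustration.

```python
import operator
import random

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
MAX_DEPTH = 6  # assumed cap that keeps evolved formulas small

def random_tree(rng, depth):
    """Grow a random expression over the variable "x" and small constants."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2.0, 2.0), 2)
    return (rng.choice(sorted(OPS)),
            random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree

def error(tree, samples):
    """Mean squared error of the formula on the samples (lower is better)."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in samples) / len(samples)

def tree_depth(tree):
    if isinstance(tree, tuple):
        return 1 + max(tree_depth(tree[1]), tree_depth(tree[2]))
    return 0

def paths(tree, prefix=()):
    """Yield the index path of every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        yield from paths(tree[1], prefix + (1,))
        yield from paths(tree[2], prefix + (2,))

def get(tree, path):
    for index in path:
        tree = tree[index]
    return tree

def replace(tree, path, new):
    """Return a copy of tree with the node at path swapped for new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)

def crossover(rng, a, b):
    """Replace a random subtree of a with a random subtree of b."""
    return replace(a, rng.choice(list(paths(a))), get(b, rng.choice(list(paths(b)))))

def mutate(rng, tree):
    """Replace a random subtree with a freshly grown one."""
    return replace(tree, rng.choice(list(paths(tree))), random_tree(rng, 2))

def evolve(samples, generations=20, pop_size=40, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda t: error(t, samples))
        history.append(error(pop[0], samples))
        parents = pop[: pop_size // 4]
        children = [pop[0]]  # elitism: the best formula always survives
        while len(children) < pop_size:
            child = crossover(rng, rng.choice(parents), rng.choice(parents))
            if rng.random() < 0.2:
                child = mutate(rng, child)
            # Oversized offspring are discarded in favour of a parent.
            children.append(child if tree_depth(child) <= MAX_DEPTH
                            else rng.choice(parents))
        pop = children
    pop.sort(key=lambda t: error(t, samples))
    history.append(error(pop[0], samples))
    return pop[0], history

# Toy target: y = x^2 + x, sampled on [-1, 1].
samples = [(x / 5.0, (x / 5.0) ** 2 + x / 5.0) for x in range(-5, 6)]
best, history = evolve(samples)
print(best, history[-1])
```

The result of such a run is an explicit formula (here a nested tuple), which can be inspected and evaluated cheaply once evolution has finished.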

The other method is based on Artificial Neural Networks (ANNs), which are a simplified model of how the neurons in our brain might work. Both of these methods are generally able to represent arbitrary functions that correlate the system inputs with its outputs.

By providing these systems a given dataset with known inputs and outputs, the systems are able to train and therefore “learn” the correlation.
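As a rough illustration of training on known inputs and outputs, the sketch below fits a tiny one-hidden-layer network by gradient descent. It is a toy under stated assumptions: the dataset is synthetic (two inputs, one output, an invented target function), and the network size, learning rate and function names are illustrative choices, not the configuration used in the study.

```python
import math
import random

def init_params(n_in, n_hidden, rng):
    """Small random weights for a one-hidden-layer network with one output."""
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    """Return the tanh hidden activations and the linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(p["w1"], p["b1"])]
    output = sum(w * h for w, h in zip(p["w2"], hidden)) + p["b2"]
    return hidden, output

def mse(p, data):
    return sum((forward(p, x)[1] - t) ** 2 for x, t in data) / len(data)

def train(p, data, lr=0.05, epochs=500):
    """Full-batch gradient descent on the mean squared error."""
    n = len(data)
    for _ in range(epochs):
        g_w1 = [[0.0] * len(row) for row in p["w1"]]
        g_b1 = [0.0] * len(p["b1"])
        g_w2 = [0.0] * len(p["w2"])
        g_b2 = 0.0
        for x, t in data:
            hidden, y = forward(p, x)
            d_out = 2.0 * (y - t) / n           # dLoss/dOutput
            g_b2 += d_out
            for j, h in enumerate(hidden):
                g_w2[j] += d_out * h
                d_hid = d_out * p["w2"][j] * (1.0 - h * h)  # back through tanh
                g_b1[j] += d_hid
                for i, xi in enumerate(x):
                    g_w1[j][i] += d_hid * xi
        p["b2"] -= lr * g_b2
        for j in range(len(g_w2)):
            p["w2"][j] -= lr * g_w2[j]
            p["b1"][j] -= lr * g_b1[j]
            for i in range(len(g_w1[j])):
                p["w1"][j][i] -= lr * g_w1[j][i]
    return p

# Synthetic stand-in for "known inputs and outputs".
rng = random.Random(0)
data = []
for _ in range(40):
    u, v = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    data.append(((u, v), 0.5 * u - 0.3 * v + 0.2))

params = init_params(2, 4, rng)
loss_before = mse(params, data)
train(params, data)
loss_after = mse(params, data)
print(loss_before, loss_after)
```

The network never sees the formula that generated the data; it only adjusts its weights until its outputs match the recorded ones, which is the sense in which the correlation is “learned”.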

What are the advantages of applying ANN and GP for position estimates?

By using these machine-learning methods, we are able to find the correlation of pixels to 3D Cartesian coordinates without any knowledge of the camera parameters (which are usually determined by calibration) and without the need for a kinematic model.

Though a rough kinematic model for the iCub is known, its human-like design allows for quite a lot of variance, and standard methods for localisation and control tend to have errors of a few centimeters, even with thorough camera and kinematic model calibration. Our approach, on the other hand, requires a training set.

Such a set takes quite a long time for a human to collect; therefore, we used another robot, a very precise industrial robot arm, to provide the ground truth to our humanoid. In the future, we hope that the humanoid can learn completely autonomously by using either haptic feedback or landmarks.

Can you discuss the ‘stereo vision’ problem and how you have used this in your current approach to learning spatial object localisation from vision?

The stereo vision problem is to estimate the position of an object relative to the camera, given that you have multiple images from different angles.

These images can come from multiple cameras or from one camera that is moved between shots. However, the usual approach has some issues with the camera setup on the iCub, which allows the vergence of the eyes to change.
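For reference, the textbook case of two fixed, parallel, calibrated cameras reduces to a single relation between depth and disparity. The sketch below shows that relation only; the focal length, baseline and pixel values are invented for the example, and the function name is an assumption. The formula no longer holds as written once the cameras verge, because the two optical axes are then not parallel.

```python
def depth_from_disparity(focal_px, baseline_m, x_left_px, x_right_px):
    """Depth (in metres) of a point seen by two parallel, rectified cameras.

    focal_px   -- focal length in pixels (assumed equal for both cameras)
    baseline_m -- distance between the two camera centres in metres
    x_*_px     -- horizontal image coordinate of the point in each camera
    """
    disparity = x_left_px - x_right_px
    if disparity <= 0:
        raise ValueError("a point in front of both cameras has positive disparity")
    return focal_px * baseline_m / disparity

# Invented numbers: 400 px focal length, 7 cm baseline, 20 px disparity.
z = depth_from_disparity(400.0, 0.07, 220.0, 200.0)
print(round(z, 3))  # about 1.4 m
```

Note that the estimate depends directly on the focal length and the baseline, which is why this approach normally relies on camera calibration.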

What makes the ‘stereo vision’ module the most accurate localisation tool for the iCub?

The stereo vision module provided by the Italian Institute of Technology in Genoa uses various interconnected modules to provide ‘accurate’ estimates.

First, it uses calibrated models of both the eyes and the kinematics of the robot. Second, it uses a feature-tracking approach to improve motion detection.

Finally, it uses a stereo disparity map to further improve the correlation and to estimate the motion of the head. All of these things together are a nice feat of engineering but do require a lot of computational resources.

What calibration tasks have you used to investigate spatial perception in your current study?

As mentioned before, we have not performed a precise kinematic model or camera parameter calibration. In our approach the two learning systems (GP and ANN) do not have any prior knowledge about these.

In fact, through training, they learn to estimate a function that contains both of these models. Of course, to collect the ground truth we have to first measure the distance of the industrial arm to the robot.

How does the robot collect data about object localisation and how is it interpreted?

The iCub tries to locate a small and bright red object. This is done to reduce the computer vision burden during the training.

The object is placed in a shared workspace by a Katana industrial robot arm with millimeter accuracy. (This introduces a few more challenges, like preventing the two robots from colliding.)

While the iCub moves about, it records the encoder readings (joint positions) of the 9 relevant joints (hip, neck, eyes, etc.) and the position of the red object in both the left and right camera images.

It combines these with the location ground truth provided by the Katana and in this way builds a training set.
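One way to picture a single entry of such a training set is as a pair of input and target vectors. The sketch below follows the description above (9 encoder values plus the object's pixel position in the left and right images as inputs, a 3D position as target); the function name, the units and every number in the example are assumptions.

```python
def make_sample(encoders, left_px, right_px, target_xyz):
    """Pack one observation into (inputs, target) lists.

    encoders   -- 9 joint positions read from the humanoid
    left_px    -- (u, v) of the object in the left camera image
    right_px   -- (u, v) of the object in the right camera image
    target_xyz -- ground-truth position supplied by the industrial arm
    """
    if len(encoders) != 9:
        raise ValueError("expected 9 encoder values")
    if len(left_px) != 2 or len(right_px) != 2:
        raise ValueError("expected (u, v) pixel coordinates")
    if len(target_xyz) != 3:
        raise ValueError("expected an (x, y, z) target")
    inputs = list(encoders) + list(left_px) + list(right_px)
    return inputs, list(target_xyz)

training_set = []
# Invented values, purely to show the shape of one entry.
training_set.append(make_sample(
    encoders=[0.0, 5.0, -3.0, 10.0, 0.0, 2.0, -1.0, 4.0, 0.5],
    left_px=(162.0, 118.0),
    right_px=(149.0, 119.0),
    target_xyz=(-0.30, 0.05, 0.12),
))

inputs, target = training_set[0]
print(len(inputs), len(target))
```

Each entry therefore has 13 input values and 3 target values, and the learner's task is to map the former to the latter.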

Once the learning is finished, the iCub uses the learned ANN- or GP-generated formula to localise any object in its visual space.

The iCub and the Katana industrial robot arm. Image Courtesy of Mr. Jürgen Leitner, Dalle Molle Institute for Artificial Intelligence (IDSIA) in Lugano, Switzerland.

What were the main findings from your research?

We have a method that allows the robot to estimate its internal models (kinematics and camera) in a form that can be used for object localisation.

The main point was to show that these models can be learned without the need for external calibration, and that our lightweight approach can achieve very similar results.

How do you plan on improving this approach for more accurate localisation of objects in a 3D space by using humanoid robots?

There are a few ideas we have to improve the system’s accuracy. One obvious way is to create better datasets to learn from, e.g., more data, fewer outliers, more positions. Another way would be to improve the learning (e.g., by knowing where the errors are large and where they are small and taking this knowledge into account).

In the end, we would like our iCub to learn online while it is controlled by a human or by any other controller, or even just by itself, as a human does.

Another idea is to change the frame to localise in. Currently this is the operational space of the robot (Cartesian 3D space). This comes from the very well-implemented operational space controller that runs on the iCub and is the de facto standard of operation.

On the other hand, if you look at humans, we do not have a very precise Cartesian world model. We have trouble estimating distances (even the position of a glass on a table), yet we don’t have trouble picking it up.

So one thing we are looking into is combining the sensory side with the motor side and predicting the position, e.g., in an ego-sphere around the robot, or even in the joint space of the robot.

How do you see this technological advancement extending the use of robotic systems into areas of application that will allow humanoids to co-exist and work with humans?

Very interesting question. It’s hard to predict the future, but I strongly believe that for robots to enter domains where they need to co-exist or co-work with humans, better sensorimotor coordination is required. This will allow them to adjust to the environment in a more natural and more predictable way (for the human).

I hope that my research on spatial perception, together with other research at IDSIA and the whole iCub community, will improve some of the issues pertaining to object manipulation and interaction.

About Jürgen Leitner

Jürgen Leitner is a researcher in autonomous robots at the Dalle Molle Institute for Artificial Intelligence (IDSIA). As a member of the IDSIA Robotics Lab, he is involved in research on the iCub humanoid robot aiming to develop autonomous object manipulation.

Prior to joining the IDSIA Robotics Lab, Jürgen was a member of the Advanced Concepts Team (ACT) of the European Space Agency (ESA) in the Netherlands. There, he researched the application of artificial neural networks for space systems, e.g., as controllers for robots and spacecraft. He is a graduate of the Joint European Master in Space Science and Technology (SpaceMaster) programme.

IDSIA is a small but visible research lab associated with the two universities in Lugano, Switzerland. IDSIA has recently won various awards in international competitions and is currently involved in a multitude of research projects funded by the European Commission and the Swiss National Science Foundation.


Citation: Kaur, Kalwinder. (2019, June 24). Visual Object Localisation in Humanoid Robots. AZoRobotics.
