
Real-time understanding of dexterous deformable object manipulation with bio-inspired hybrid hardware architectures

Final Report Summary - REAL-TIME ASOC (Real-time understanding of dexterous deformable object manipulation with bio-inspired hybrid hardware architectures)

Closing the action-perception loop is one of the main challenges for the next generation of robotics. Closing these loops requires active perception, which in turn demands real-time performance. Although this applies to other sensory modalities as well, this project focuses on vision. Mimicking biological systems, robotic systems require a way of adaptively selecting the most relevant information in the scene. In the case of vision, this selection mechanism is saliency, which has been extensively studied: it determines the areas of the visual input array, and the features within them, that deserve more attention and should therefore be selected and processed first in such a robotic system.
Cognition plays a very relevant role in robotics, specifically in determining which parts of the input sensory array to process, in the same way biological systems do. A simple example illustrates our point of view: a robot is asked to “go and bring the shoes”. To begin with, the robot should have prior common-sense knowledge about where to find the target; in this case, “the shoes” should not be on the shelves, on the wall, or on top of the table, but on the floor. The robot should also have an idea of what “the shoes” look like: size, shape, color, or texture. In other words, the robot needs a model of the current target under consideration. This model should let the robot find candidates quickly and efficiently enough; following our example, a few possible locations for “the shoes”.
Attention has usually been explained in the literature as the integration of two information streams. The first stream is bottom-up and depends on the features of the scene. The second stream is top-down and voluntary, determined by the action that is going to be performed. Going back to the example, once the system determines the action to be performed, e.g. inspecting a specific area, a model of the target of the action is required: size, color, texture, etc. The novelty of this project lies precisely in the use of approaches that combine both streams of information.
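The two-stream combination described above can be sketched in a few lines. This is a minimal, illustrative example only: the function name, the fixed blending weight, and the toy feature maps are assumptions for the sketch, not the project's actual implementation.

```python
import numpy as np

def combine_attention(bottom_up, top_down, w_td=0.6):
    """Blend a bottom-up saliency map with a top-down target-match map.
    Each map is normalized to [0, 1] first; the argmax of the blend is
    the next location to attend."""
    def norm(m):
        m = m - m.min()
        return m / m.max() if m.max() > 0 else m
    return (1 - w_td) * norm(bottom_up) + w_td * norm(top_down)

# Toy 3x3 maps: bottom-up conspicuity (e.g. local contrast) and a
# top-down map scoring how well each location matches the target model.
bu = np.array([[0.1, 0.9, 0.2],
               [0.0, 0.3, 0.1],
               [0.2, 0.1, 0.0]])
td = np.array([[0.0, 0.1, 0.0],
               [0.0, 0.2, 0.9],   # target model matches best here
               [0.1, 0.0, 0.0]])

att = combine_attention(bu, td)
focus = np.unravel_index(np.argmax(att), att.shape)  # -> (1, 2)
```

With the top-down weight above 0.5, the target-driven peak wins over the purely bottom-up one, which is exactly the behavior the “go and bring the shoes” example calls for.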
The project has focused on the development of new mechanisms for visual attention and the integration of novel cues, such as image motion or 3D pose and self-motion. Current techniques that compute optical flow and egomotion are computationally very expensive, and therefore either cannot run in real time or must sacrifice estimation accuracy. In this project, we decided to use a novel neuromorphic device: the DVS, or Dynamic Vision Sensor. It is an asynchronous, event-driven sensor that captures, with very high temporal resolution (on the order of microseconds), the positions whose luminance changes. The high temporal resolution (equivalent to recording about 600 thousand frames per second) and the reduction in the amount of information (static areas of the scene are removed) make this device very suitable for robotic applications that demand short latencies to operate in real time. Moreover, the device has a logarithmic response to luminance, which gives it a high dynamic range, and each pixel of the sensor matrix processes its input luminance independently, unaffected by the luminance collected by its neighbors. Put simply, if the sensor points at a window in an indoor environment, there are none of the artifacts found in standard cameras: changes in the indoor environment and outdoors are both still perceived.
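To make the event-driven data format concrete, the sketch below represents each DVS event as (x, y, timestamp in microseconds, polarity) and accumulates a stream into a “time surface”, a common intermediate representation in event-based vision where each pixel holds an exponentially decayed trace of its most recent event. The function name, decay constant, and array layout are illustrative assumptions, not the sensor's actual API.

```python
import numpy as np

def events_to_time_surface(events, width, height, tau_us=50_000.0):
    """Accumulate DVS-style events into a time surface.
    events: iterable of (x, y, t_us, polarity). Pixels that never fired
    stay at 0; recently active pixels are close to 1."""
    last_t = np.full((height, width), -np.inf)
    for x, y, t, p in events:
        last_t[y, x] = t                     # keep the latest timestamp
    t_ref = max(e[2] for e in events)        # reference: newest event
    return np.exp((last_t - t_ref) / tau_us) # exp(-inf) -> 0 for idle pixels

# Toy stream: three events on a 4x4 sensor, microsecond timestamps.
events = [(0, 0, 0, +1), (1, 1, 25_000, -1), (2, 3, 50_000, +1)]
ts = events_to_time_surface(events, width=4, height=4)
```

Because only changing pixels emit events, most of the surface stays at zero, which reflects the data reduction mentioned above: static areas of the scene simply do not appear in the stream.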
Creating the model of the target of the action is one of the first objectives of this project. The model is the first step we require for a proto-segmentation of the scene (or rather, of the objects that attract visual attention, instead of the whole scene). Detecting object contours and border ownership is the starting point. Boundaries of objects are defined by depth discontinuities, and border ownership determines which side of a boundary belongs to the object (foreground) and which belongs to the background. The project has studied two mechanisms for proto-segmentation from DVS input. The first, covered in our proposal, uses the visual torque operator developed at UMD: an operator that captures the concept of closed contours applied to object edges (including both contours and texture edges). This operator is applied at different spatial resolutions and sizes to obtain pixel-wise closure probabilities that are summed to obtain the final proto-segmentation. The second mechanism involves motion cues: the integration of the motion pathway allows the robotic platform to estimate motion in the scene as well as its own pose and self-motion. Moreover, the structure of the scene can also be estimated (up to a scale factor). Then, by jointly estimating self-motion and scene structure from image motion cues, object contours can be extracted by capturing the 3D layout of the scene without explicitly computing depth. From the contours, border ownership can also be extracted, allowing proto-segmentation of the objects in the scene.
Although the first method has also been studied, most of the effort has been dedicated to the second mechanism, since it is the more novel one. The technique uses a classifier trained with diverse event-based features that characterize spatio-temporal surfaces, such as local patches of temporal information, spatio-temporal texture, spatio-temporal orientation of contours, and event-based image motion. The classifier predicts the locations of object boundaries and their border ownership assignments, which helps decide which side of a boundary belongs to the object and which to the background. Inference takes approximately 0.05 s. This information lets us perform the aforementioned proto-segmentation, which will lead to a full segmentation after refinement with additional cues. The results of this work were published at the International Conference on Computer Vision in December 2015 (Santiago de Chile, Chile). ICCV is considered among the top three most influential venues in computer vision. The code, datasets, and publication have been released and are available on the project website.
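The border-ownership idea can be sketched as follows: extract a small spatio-temporal patch around a candidate boundary location and let a classifier decide which side shows the foreground. Everything here is a toy stand-in for the published method; the feature (a flattened time-surface patch), the nearest-centroid classifier, and all names are assumptions made so the sketch stays dependency-free.

```python
import numpy as np

def patch_features(time_surface, x, y, r=2):
    """Flatten a (2r+1)x(2r+1) patch around (x, y); left/right asymmetry
    in recent event activity is a cue for border ownership."""
    return time_surface[y - r:y + r + 1, x - r:x + r + 1].ravel()

rng = np.random.default_rng(0)

def make_surface(fg_right):
    """Synthetic 9x9 time surface with a vertical boundary at the center:
    the foreground (more recent activity) is on the right if fg_right."""
    s = rng.uniform(0.0, 0.2, size=(9, 9))   # background noise
    if fg_right:
        s[:, 5:] += 0.8
    else:
        s[:, :4] += 0.8
    return s

# Training set: 50 examples per ownership label (0 = left, 1 = right).
X, y = [], []
for label in (0, 1):
    for _ in range(50):
        X.append(patch_features(make_surface(label == 1), 4, 4))
        y.append(label)
X, y = np.array(X), np.array(y)

# Nearest-centroid classifier: assign a patch to the closer class mean.
c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict(f):
    return int(np.linalg.norm(f - c1) < np.linalg.norm(f - c0))
```

The real system replaces the synthetic patches with the event-based features listed above (temporal patches, spatio-temporal texture and orientation, image motion) and a properly trained classifier, but the input/output contract is the same: a local feature vector in, an ownership label out.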
Moreover, a work on robotics, action manipulation, and active perception was submitted at the beginning of 2016 to the International Journal of Computer Vision and is under review.
As mentioned before, segmenting the objects in the scene requires the integration of motion cues (among others). Event-based image motion estimation was the first focus of this project, and the very first results of the project correspond to this topic. Two different methods were presented. The first uses the concept of motion parallax, exploiting the width of the bands traced by events collected over short time intervals; it was published in the special issue on new methods for neuromorphic devices of the Proceedings of the IEEE. The second method is specialized for the detection of high-frequency spatio-temporal textures and was published at the International Work-Conference on Artificial Neural Networks; a real-time demonstration of the sensor performing this motion estimation was also presented at the same conference.
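As an illustration of event-based motion estimation in general (this is a standard local plane-fitting construction from the event-based vision literature, not the project's published motion-parallax method), one can fit a plane t = a·x + b·y + c to the timestamps of events in a small neighborhood: a moving edge sweeps across pixels, so its timestamps form a plane whose slopes encode the inverse of the edge velocity. All names below are illustrative.

```python
import numpy as np

def fit_flow(events):
    """Estimate the normal velocity (vx, vy) of a moving edge from events,
    where each event is (x, y, t). Fits t = a*x + b*y + c by least squares;
    the timestamp gradient (a, b) = (dt/dx, dt/dy) gives the velocity as
    grad / |grad|^2 (speed is the reciprocal of the slope along the edge
    normal)."""
    A = np.array([[x, y, 1.0] for x, y, t in events])
    t = np.array([e[2] for e in events])
    gx, gy, _ = np.linalg.lstsq(A, t, rcond=None)[0]
    g2 = gx * gx + gy * gy
    return gx / g2, gy / g2

# Toy case: an edge moving at 2 px per unit time in +x crosses pixel x
# at time t = x / 2, regardless of y.
events = [(x, y, x / 2.0) for x in range(5) for y in range(5)]
vx, vy = fit_flow(events)   # expect roughly (2.0, 0.0)
```

Because the fit uses only a handful of events in a local neighborhood, it can be evaluated as events arrive, which is what makes this family of methods attractive for the short-latency requirements discussed above.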
With respect to attention and bottom-up attention mechanisms and their implementation, a work was published in the IEEE Transactions on Industrial Informatics.
With respect to the implementation of the motion pathway, several self-motion estimation methods have already been implemented. These methods are suitable not only for asynchronous event-based sensors but also for conventional sensors, addressing a problem that remains a challenge for real-time robotic applications. A work on this topic was published in 2016 in the journal Frontiers in Neuroscience. The first dataset that considers not only image motion but also the self-motion and 3D motion of an asynchronous sensor has been released and is accessible through the project website. A paper presenting different methods for self-motion estimation for event-based and frame-based sensors is under preparation.
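A textbook building block of self-motion estimation from image motion, shown here only as a hedged illustration of the kind of computation involved (not the project's specific method): under pure camera translation, flow vectors radiate from the focus of expansion (FOE), and each flow measurement (u, v) at pixel (x, y) gives one linear constraint v·(x − x0) = u·(y − y0) on the FOE (x0, y0), solvable by least squares.

```python
import numpy as np

def estimate_foe(points, flows):
    """Least-squares focus of expansion from pixel positions (N, 2) and
    flow vectors (N, 2), assuming pure camera translation. Each flow
    vector must be parallel to (point - FOE), i.e. its cross product
    with that direction is zero: v*x0 - u*y0 = v*x - u*y."""
    x, y = points[:, 0], points[:, 1]
    u, v = flows[:, 0], flows[:, 1]
    A = np.stack([v, -u], axis=1)   # one row per measurement
    b = v * x - u * y
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe

# Synthetic forward motion: noiseless flow radiating from FOE (3, -1).
pts = np.array([[0.0, 0.0], [10.0, 5.0], [-4.0, 2.0], [7.0, -8.0]])
flows = 0.1 * (pts - np.array([3.0, -1.0]))
foe = estimate_foe(pts, flows)   # expect approximately [3, -1]
```

The direction of translation follows directly from the FOE; handling rotation and noise requires the more complete formulations used in the published work, but the least-squares structure is the same.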
The project has also led to funding from the regional BioTic Genil Program (Projects for Young Researchers) for a project on motion estimation for autonomous flying vehicles, and, since 2014, to two projects on drone navigation funded by MADOC (Spanish Ministry of Defense). The fellow is now the PI of two EU H2020 project proposals: the ECSEL PROMIP (already in the FPP phase) and the RIA ULTRALEV.
The fellow has also actively collaborated with Dr. Guangling Sun from Shanghai University and with several master's and PhD students at UMD on vision for autonomous flying vehicles. The fellow also supervises a student at UGR.
A demo session was held at the DARPA day at UMD in May 2015 to show algorithms for motion estimation with the DVS.
Moreover, the fellow is now one of the organizers of the Telluride Workshop on Cognitive Neuroscience, held annually in Colorado, USA. This workshop represents a great opportunity to strengthen the ties between UGR, UMD, and other institutions all over the world. The fellow is also a post-doc affiliate of NACS (Neuroscience and Cognitive Science at UMD).
The details of the project and its results will be released on its website: