Periodic Reporting for period 1 - DynAI (Omni-Supervised Learning for Dynamic Scene Understanding)
Reporting period: 2023-01-01 to 2025-06-30
Robots operating in the real world need to perceive the static geometry as well as the dynamic objects in the scene. Understanding moving objects is especially
challenging, as robots need to parse appearance and motion at once in order to detect an object, track
it over time, and potentially predict its future trajectory. In other words, computer vision algorithms
need to perform dynamic scene understanding (DSU).
Computer vision methods have long relied on the same winning strategy to produce state-of-the-art results for
several vision tasks: (i) using a convolutional neural network (CNN) based model to process images,
and (ii) training the model for a specific task in a supervised fashion on very large-scale datasets.
The dependency on large-scale manually annotated datasets is a particular problem, since we cannot expect to annotate all possible object classes in the world, especially those observed very rarely. Our efforts should be spent not only on improving how methods learn from annotated data, but also on how this knowledge can be transferred to unlabeled data, so that we can safely take our autonomous
vehicles from the artificial closed-world of current benchmarks to the real open-world.
In DynAI we propose an approach based on
Omni-supervised learning for dynamic scene understanding in the open-world.
To make DynAI a reality, we focus on three fundamental pillars of research:
– Models. We will design a hierarchical (from pixels to objects) image-dependent representation that
will allow us to capture spatio-temporal dependencies at all levels of the hierarchy.
– Data. We will create a new large-scale DSU synthetic dataset, and propose
novel methods to mitigate the annotation costs for video data.
– Open-World. Our models will be able to detect, segment, retrieve, and track dynamic
objects coming from classes never previously observed during the training of our neural networks.
[1] Aljosa Osep, Tim Meinhardt, Francesco Ferroni, Neehar Peri, Deva Ramanan, Laura Leal-Taixe. “Better Call SAL: Towards Learning to Segment Anything in Lidar”. European Conference on Computer Vision (ECCV).
We propose the SAL (Segment Anything in Lidar) method, consisting of a text-promptable zero-shot model for segmenting and classifying any object in Lidar, and a pseudo-labeling engine that facilitates model training without manual supervision. While the established paradigm for Lidar Panoptic Segmentation (LPS) relies on manual supervision for a handful of object classes defined a priori, we utilize 2D vision foundation models to generate 3D supervision “for free”. Our pseudo-labels consist of instance masks and corresponding CLIP tokens, which we lift to Lidar using calibrated multi-modal data. Even without manual labels, our model achieves 91% of the performance of the fully supervised state-of-the-art in terms of class-agnostic segmentation and 54% in terms of zero-shot LPS.
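To illustrate the lifting step, the sketch below shows how 2D instance masks and their CLIP tokens can be attached to a Lidar point cloud with calibrated camera data. It assumes a pinhole camera model, a 4x4 Lidar-to-camera transform, and a dictionary of per-instance CLIP tokens; all names are illustrative and this is not the released SAL code.

```python
# Illustrative sketch (not the released SAL code): lift 2D instance masks and
# per-mask CLIP tokens onto a Lidar point cloud using camera calibration.
import numpy as np

def lift_masks_to_lidar(points, T_lidar_to_cam, K, instance_mask, clip_tokens):
    """Assign each Lidar point the instance ID and CLIP token of the mask pixel it projects into."""
    n = points.shape[0]
    # Move points from the Lidar frame to the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points, np.ones((n, 1))])
    pts_cam = (T_lidar_to_cam @ pts_h.T).T[:, :3]
    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 0.1
    # Pinhole projection into pixel coordinates (guard against division by zero).
    z = np.maximum(pts_cam[:, 2], 1e-6)
    uv = (K @ pts_cam.T).T
    u = (uv[:, 0] / z).astype(int)
    v = (uv[:, 1] / z).astype(int)
    h, w = instance_mask.shape
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Copy the instance ID of the mask pixel each point falls into (-1 = unlabeled).
    point_ids = np.full(n, -1, dtype=int)
    point_ids[valid] = instance_mask[v[valid], u[valid]]
    # Attach the per-instance CLIP token so each labeled point also carries semantics.
    point_tokens = [clip_tokens.get(int(i)) if i >= 0 else None for i in point_ids]
    return point_ids, point_tokens
```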
[2] Orcun Cetintas, Tim Meinhardt, Guillem Braso, Laura Leal-Taixe. “SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow”. European Conference on Computer Vision (ECCV).
In this work, we introduce SPAM, a video label engine that provides high-quality labels with minimal human intervention. SPAM is built around two key insights: i) most tracking scenarios can be easily resolved. To take advantage of this, we utilize a pre-trained model to generate high-quality pseudo-labels, reserving human involvement for a smaller subset of more difficult instances; ii) handling the spatio-temporal dependencies of track annotations across time can be elegantly and efficiently formulated through graphs. Therefore, we use a unified graph formulation to address the annotation of both detections and identity association for tracks across time. Based on these insights, SPAM produces high-quality annotations at a fraction of the ground-truth labeling cost.
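The first insight amounts to a simple routing rule: accept confident pseudo-labels automatically and spend human effort only on the ambiguous cases. The sketch below conveys this idea; the tracker interface, the ask_human callback, and the confidence threshold are placeholders, not the actual SPAM implementation.

```python
# Illustrative sketch of confidence-based label routing (placeholder API, not SPAM itself):
# confident pseudo-labels are accepted automatically, ambiguous ones go to a human annotator.
def annotate_video(frames, tracker, ask_human, conf_threshold=0.8):
    labels = []
    for frame in frames:
        for track in tracker.predict(frame):            # pseudo-labels with confidences
            if track.confidence >= conf_threshold:
                labels.append(track)                    # easy case: accepted "for free"
            else:
                labels.append(ask_human(frame, track))  # hard case: resolved by a human
    return labels
```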
[3] Jenny Seidenschwarz, Aljosa Osep, Francesco Ferroni, Simon Lucey, Laura Leal-Taixe. “SeMoLi: What Moves Together Belongs Together”. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
We tackle semi-supervised object detection based on motion cues. Recent results suggest that heuristic-based clustering methods in conjunction with object trackers can be used to pseudo-label instances of moving objects and use these as supervisory signals to train 3D object detectors in Lidar data without manual supervision. We re-think this approach and suggest that both object detection and motion-inspired pseudo-labeling can be tackled in a data-driven manner. We leverage recent advances in scene flow estimation to obtain point trajectories, from which we extract long-term class-agnostic motion patterns. Revisiting correlation clustering in the context of message passing networks, we learn to group these motion patterns and thereby cluster points into object instances.
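The underlying intuition, that points which move together belong together, can be pictured with the toy grouping below. SeMoLi learns this grouping with a message passing network; here a hand-set affinity on position and average velocity is used purely for illustration, so the thresholds and names are assumptions.

```python
# Toy illustration of "what moves together belongs together" (hand-set thresholds;
# the actual method learns the grouping with a message passing network).
import numpy as np

def group_by_motion(trajectories, pos_thresh=2.0, vel_thresh=0.5):
    """trajectories: (N, T, 3) per-point positions over T frames -> array of cluster ids."""
    n = trajectories.shape[0]
    start = trajectories[:, 0]                           # (N, 3) initial positions
    steps = trajectories[:, 1:] - trajectories[:, :-1]   # (N, T-1, 3) per-frame motion
    mean_vel = steps.mean(axis=1)                        # (N, 3) average velocity per point

    parent = list(range(n))                              # union-find over points
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(start[i] - start[j]) < pos_thresh
            same_motion = np.linalg.norm(mean_vel[i] - mean_vel[j]) < vel_thresh
            if close and same_motion:                    # "moves together" ...
                parent[find(i)] = find(j)                # ... "belongs together"
    return np.array([find(i) for i in range(n)])
```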
[4] Orcun Cetintas, Guillem Braso, Laura Leal-Taixe. “Unifying Short and Long-Term Tracking with Graph Hierarchies”. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
We propose a method that processes videos hierarchically: lower levels of our hierarchy focus on short-term association, and higher levels focus on increasingly long-term scenarios. The key difference to existing hybrid multi-level solutions is that we use the same learnable model for all time scales, i.e., hierarchy levels. Instead of handcrafting different models for different scales, we show that our model can learn to exploit the cues that are best suited for each time scale in a data-driven manner.
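Conceptually, the hierarchy can be pictured as repeatedly merging tracklets into longer tracks while reusing a single scoring model at every level. The sketch below conveys this with a greedy pairwise merge; the actual method in [4] formulates each level as a graph and learns the scorer, so score, the number of levels, and the threshold are placeholders.

```python
# Conceptual sketch of hierarchical association with one shared scorer
# (greedy simplification; `score` stands in for the learned model of [4]).
def hierarchical_tracking(detections, score, levels=3, threshold=0.5):
    tracks = [[d] for d in detections]        # level 0: every detection is its own tracklet
    for level in range(levels):
        merged, used = [], set()
        for i, t_i in enumerate(tracks):
            if i in used:
                continue
            best_j, best_s = None, threshold
            for j in range(i + 1, len(tracks)):
                if j in used:
                    continue
                s = score(t_i, tracks[j], level)   # same model at every level; the level
                if s > best_s:                     # only changes the temporal horizon
                    best_j, best_s = j, s
            if best_j is None:
                merged.append(t_i)                 # nothing to merge at this level
            else:
                merged.append(t_i + tracks[best_j])  # fuse the two tracklets
                used.add(best_j)
            used.add(i)
        tracks = merged                            # higher levels see longer tracklets
    return tracks
```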
[5] Jenny Seidenschwarz, Guillem Braso, Victor Castro Serrano, Ismail Elezi, Laura Leal-Taixe. “Simple Cues Lead to a Strong Multi-Object Tracker”. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
For a long time, the most common paradigm in MOT was tracking-by-detection (TbD), where objects are first detected and then associated over video frames. For association, most models resorted to motion and appearance cues, e.g. re-identification networks. In this paper, we ask ourselves whether simple good old TbD methods are also capable of achieving the performance of end-to-end models. To this end, we propose two key ingredients that allow a standard re-identification network to excel at appearance-based tracking. Code: https://github.com/dvl-tum/GHOST.
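The basic association step of such a TbD pipeline can be sketched as follows: detections are matched to existing tracks by the cosine similarity of their re-identification embeddings. This only shows the generic matching step; the specific ingredients proposed in [5] are not reproduced here, and the similarity threshold is an assumption.

```python
# Minimal sketch of appearance-based association in a tracking-by-detection pipeline
# (generic re-ID matching only; not the full method of [5]).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs, det_embs, sim_threshold=0.6):
    """track_embs: (M, D), det_embs: (N, D) L2-normalized re-ID embeddings."""
    sim = track_embs @ det_embs.T                 # (M, N) cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)      # Hungarian matching, maximizing similarity
    matches = [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_threshold]
    unmatched = set(range(det_embs.shape[0])) - {c for _, c in matches}
    return matches, sorted(unmatched)             # unmatched detections start new tracks
```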
The SAL results [1] open the door for our WP3 to train a model entirely on pseudo-labels, using the perception algorithms that we are developing in WP1 and WP2. This was somewhat unplanned: the goal was to study how far we could get with pseudo-labels, but we did not expect to reach such a high segmentation accuracy.
The re-identification-only tracking method [5] was also a significant breakthrough, since it showed the tracking community that many of the methods being developed with modern networks such as Transformers were actually not better than classic tracking methods when those are well understood and adapted to the new tracking data. This helped the community focus on the tracking problems that truly need more complex methods to be solved.