
Video Understanding for Autonomous Driving

Periodic Reporting for period 1 - VUAD (Video Understanding for Autonomous Driving)

Reporting period: 2020-04-01 to 2022-03-31

In this project, we address scene understanding from video sequences for autonomous vehicles. With the development of robust deep learning techniques, autonomous vehicles have moved from a distant dream to a realistic prospect. However, the perception module needs to reach a level comparable to human perception before these vehicles can be safely used on the roads. The goal of this project is to increase the robustness and reliability of perception algorithms in autonomous vehicles by modeling temporal cues in video, such as continuity, in contrast to most state-of-the-art methods, which use only a single image.

Self-driving cars are poised to become a trillion-pound market in the next few decades, driven by the needs of commuters and logistics chains worldwide. More importantly, they would address two pressing problems of our society. First, 1.25 million people die in car accidents each year due to human error, and about 35 million are severely injured, a toll that rivals the worst diseases. Second, and often ignored, the average car commuter spends 52 minutes per day driving to or from work, which amounts to about 5.4% of a 16-hour waking day lost to a menial task. Enabling an Artificial Intelligence (AI) system to understand and drive in complex urban environments now seems largely solved for most common scenarios: companies such as Waymo (US), Tesla (US), and Wayve (UK) routinely test on public roads. This achievement was made possible largely by advances in deep neural networks.

Computer vision methods achieve impressive results on single images for various tasks such as object detection. For instance, pedestrian detectors now boast over 98% accuracy on the widely acknowledged KITTI benchmark. However, this success has not yet been fully extended to sequences. It is commonly acknowledged that video understanding lags years behind single-image understanding. This is mainly due to two reasons: the processing power required for reasoning across multiple frames, and the difficulty of obtaining ground truth for every frame in a sequence, especially for pixel-level tasks. Based on these observations, there are two likely directions for improving performance on video understanding tasks: unsupervised learning and object-level reasoning. We work on both in this project. We present deep learning solutions for dynamic scene understanding by detecting and tracking multiple people in street scenes, i.e. multi-object tracking (MOT), as well as by modeling the apparent movement of the static parts of the scene, which arises from camera motion.
In our first two objectives, we learn a graph representation, first for tracking multiple objects and then for jointly locating and tracking objects in video sequences. When there are multiple objects around a self-driving vehicle, planning requires a history of each agent in order to predict the future locations of objects and plan accordingly. Most current prediction and planning systems assume ground-truth trajectories for agents. At test time, however, these trajectories need to be estimated as well. Our proposed algorithms enable better segmentation of moving objects together with tracking. When there are multiple objects around the vehicle, it is important to model the interactions between them because they affect each other. For example, people walk together and take care not to bump into each other. This kind of reasoning requires learning relations between objects, which are represented as nodes of a graph. Graph-structured data enables passing messages between objects so that reasoning about one object can be informed by the location and motion of the others. As we show in the second objective, where we perform joint segmentation and tracking, this also helps segmentation because pixels that belong to the same object tend to move together. In our third objective, we jointly segment independently moving objects and estimate their motion separately from the motion of the autonomous vehicle. This enables a better and more reliable perception system in dynamic environments such as cities with frequently crowded areas.
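
To make the idea of message passing between objects concrete, the following is a minimal sketch and not the architecture developed in the project. It assumes each tracked object is described by a hypothetical state [x, y, vx, vy], connects objects that are closer than an assumed radius, and replaces the learned update of a graph neural network with a plain mean over the relative states of the neighbours. The function names build_edges and message_passing_step are illustrative only.

```python
import math


def build_edges(positions, radius):
    """Connect every pair of distinct objects closer than `radius`."""
    edges = {i: [] for i in range(len(positions))}
    for i, (xi, yi) in enumerate(positions):
        for j, (xj, yj) in enumerate(positions):
            if i != j and math.hypot(xi - xj, yi - yj) < radius:
                edges[i].append(j)
    return edges


def message_passing_step(states, edges):
    """One round of mean aggregation over the graph.

    states: list of [x, y, vx, vy], one entry per object (assumed layout).
    Returns, for each object, its own state followed by the mean state
    of its neighbours relative to itself (zeros if it has no neighbours).
    """
    updated = []
    for i, own in enumerate(states):
        neighbours = edges[i]
        if neighbours:
            mean_relative = [
                sum(states[j][k] - own[k] for j in neighbours) / len(neighbours)
                for k in range(len(own))
            ]
        else:
            mean_relative = [0.0] * len(own)
        updated.append(list(own) + mean_relative)
    return updated


# Two pedestrians walking side by side and one far away (toy values).
states = [
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [10.0, 10.0, 0.0, -1.0],
]
edges = build_edges([s[:2] for s in states], radius=2.0)
features = message_passing_step(states, edges)
```

In this toy setup the two nearby pedestrians exchange information, so each one's feature vector encodes that its neighbour is one unit away and moving with the same velocity, while the distant object receives no messages. A learned model would replace the mean with trainable functions, but the flow of information along the edges is the same.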

For dissemination, we attended at least two conferences in the field every year. During these conferences, we also participated in workshops and tutorials related to the project. We presented our work on the first objective at a graph learning workshop, and the results of the third objective at a conference as well as at another workshop. We also presented our work on the project in several invited talks, both internationally and nationally, addressing the scientific community as well as university and high school students. We were an active part of the AI community at the host university, organizing weekly AI talks and teaching a graduate-level course on self-driving vehicles that covers topics related to the project. We additionally carried out high school visits and a training program open to university students from all over the country, with the help of a non-profit organization. We believe that these activities helped foster the research and teaching environment at the host university and, to some extent, at a national level.
Our algorithms for tracking, segmentation, and motion estimation enhance the capability of agents in dynamic environments. We address the problem of understanding dynamic scenes from videos by treating independently moving objects and the motion of the self-driving vehicle separately. This helps enable safe navigation, especially in city centers. While highway driving is considered mostly solved, the challenging cases mostly occur in crowded scenes in city centers. The first step of reasoning in crowded scenes is understanding the motion in the scene, and our project directly addresses this problem. Our algorithm in the first objective can generate trajectories for agents, which the self-driving vehicle can use for planning and prediction. Anticipating the motion of dynamic agents is crucial for autonomous navigation.
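
The separation of object motion from vehicle motion can be illustrated with a small sketch. This is not the project's method, which learns the segmentation and the motion jointly. The sketch assumes that two hypothetical inputs are already available for each pixel: the observed image motion between two frames, and the motion that the camera's own movement alone would induce. Pixels where the two disagree by more than an assumed threshold are marked as independently moving.

```python
import math


def segment_moving_pixels(observed_flow, ego_flow, threshold=1.0):
    """Mark pixels whose motion is not explained by camera motion.

    observed_flow, ego_flow: 2D grids of (u, v) displacements in pixels.
    Returns a grid of booleans, True where the residual motion exceeds
    `threshold` (an assumed value, in pixels).
    """
    mask = []
    for observed_row, ego_row in zip(observed_flow, ego_flow):
        mask_row = []
        for (ou, ov), (eu, ev) in zip(observed_row, ego_row):
            residual = math.hypot(ou - eu, ov - ev)
            mask_row.append(residual > threshold)
        mask.append(mask_row)
    return mask


# Toy 2x3 image: the camera motion shifts every pixel by (2, 0),
# and one pixel additionally belongs to a moving object.
ego_flow = [[(2.0, 0.0)] * 3 for _ in range(2)]
observed_flow = [
    [(2.0, 0.0), (2.1, 0.0), (2.0, 0.0)],
    [(2.0, 0.0), (5.0, 1.0), (2.0, 0.0)],
]
moving_mask = segment_moving_pixels(observed_flow, ego_flow)
```

Only the pixel whose observed motion differs clearly from the camera-induced motion is flagged, while the small deviation of 0.1 pixels in the first row is treated as noise.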

In the big picture, self-driving is expected to reduce the number of deaths due to traffic accidents. Most causes of traffic accidents are related to the driver, such as fatigue, substance use, or medical conditions, so removing the driver from the loop can create a safer traffic environment. Self-driving is also expected to increase the use of shared vehicle systems and to mitigate the negative impact of the climate crisis by creating new modes of transportation. These changes will save time for everyone and will also provide mobility for disabled and elderly people. In terms of wider societal implications, self-driving is expected to bring economic gains and create new jobs.