CORDIS - EU research results

Forecasting and Preventing Human Errors

Periodic Reporting for period 1 - FORHUE (Forecasting and Preventing Human Errors)

Reporting period: 2022-10-01 to 2025-03-31

Human errors remain the main source of incidents. They can lead to fatalities, traffic accidents, or product defects, and cause high economic and social costs. While some errors can still be corrected if they are detected in time, many human errors cause high costs as soon as they occur or are even irreversible. In these cases, it is very important to recognize human errors before they occur. The goal of this project is therefore to develop methods based on artificial intelligence that forecast human actions, human motion, and potential human errors from video data.
In this project, we have developed several approaches that forecast human motion and human actions and recognize errors induced by humans. An example application is a driving assistant system that forecasts the behaviour of other traffic participants to warn the driver before an accident happens. We thus developed an approach that takes as input videos from multiple cameras that are mounted on the vehicle and monitor its surroundings. It then detects other traffic participants in the videos and simultaneously forecasts their motions for the next two to eight seconds (Li et al., IJCAI 2023).

Another line of work focuses on forecasting the motion of multiple socially interacting persons. To this end, we developed the first approach able to forecast the motion of multiple persons even over long time horizons of 40 seconds while modelling the uncertainty of the forecast motion (Tanke et al., ICCV 2023). We furthermore developed a metric that measures whether the forecast motion is socially plausible. In order to advance the state of the art further, we introduced a benchmark for evaluating approaches that forecast the motion of multiple persons in a natural working environment. The so-called “Humans in Kitchens” dataset (Tanke et al., NeurIPS 2023) is a large-scale multi-person 3D human motion dataset with annotated 3D human poses, scene geometry, and per-person activities. Overall, it consists of more than 4M annotated human poses of 90 individuals.
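To make the trajectory-forecasting task concrete, the sketch below shows a constant-velocity baseline for multiple traffic participants. It is only an illustrative baseline under assumed array shapes, not the learned model of Li et al. (IJCAI 2023); the function name and parameters are hypothetical.

```python
import numpy as np

def forecast_trajectories(tracks, horizon_s, fps=10):
    """Constant-velocity baseline for multi-agent trajectory forecasting.

    tracks: (N, T, 2) observed xy positions for N traffic participants
    over T past frames. Returns (N, H, 2) forecast positions, where
    H = horizon_s * fps. Illustrative only -- a learned model would
    replace the velocity estimate with a network prediction.
    """
    n_future = int(horizon_s * fps)
    # Estimate each agent's velocity from its last two observed frames.
    velocity = tracks[:, -1] - tracks[:, -2]           # (N, 2)
    steps = np.arange(1, n_future + 1)[None, :, None]  # (1, H, 1)
    return tracks[:, -1][:, None, :] + steps * velocity[:, None, :]

# Two agents observed for three frames; forecast two seconds ahead.
obs = np.array([[[0., 0.], [1., 0.], [2., 0.]],
                [[0., 0.], [0., 1.], [0., 2.]]])
future = forecast_trajectories(obs, horizon_s=2.0, fps=10)
# future has shape (2, 20, 2): 20 future frames per agent.
```

A real driving assistant would condition the forecast on the detected agents' interactions and the scene layout rather than extrapolating each track independently.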

In order to improve the efficiency of forecasting approaches, we developed an approach, called TaylorSwiftNet, that forecasts frames in continuous time (Saber et al., BMVC 2022). TaylorSwiftNet is a novel approach that takes full advantage of a continuous representation of motion. In contrast to RNNs, which forecast the future frame-by-frame, or PDE-based approaches, which discretize PDEs to solve them numerically, we infer a continuous function over time from the observations. This avoids discretization artifacts and provides an analytical function that can be swiftly evaluated at any future continuous point in time and allows forecasting future frames at a higher sampling rate than the observed frames, which are very useful properties for practical applications.
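The core idea of a continuous-time representation can be illustrated with a truncated Taylor series: given the derivatives of the motion at the last observed time, the forecast can be evaluated analytically at any continuous time offset, including between-frame instants. This is a minimal sketch of the mathematical idea; TaylorSwiftNet infers the expansion terms with a network rather than receiving them directly.

```python
def taylor_forecast(derivatives, t):
    """Evaluate a truncated Taylor series at continuous time offset t.

    derivatives: [f(t0), f'(t0), f''(t0), ...] -- here supplied directly
    for illustration; in TaylorSwiftNet such terms are inferred from the
    observed frames. Returns the forecast value at time t0 + t.
    """
    result = 0.0
    factorial = 1.0
    for k, d in enumerate(derivatives):
        if k > 0:
            factorial *= k
        result += d * t**k / factorial
    return result

# Derivatives of f(t) = 2 + 3t + 0.5t^2 at t0 = 0: f=2, f'=3, f''=1.
# The same analytic function can be sampled at any rate, e.g. t = 0.5
# even if the observations were only available at integer times.
value_at_2 = taylor_forecast([2.0, 3.0, 1.0], 2.0)    # -> 10.0
value_at_half = taylor_forecast([2.0, 3.0, 1.0], 0.5)  # -> 3.625
```

Because the forecast is a closed-form function of `t`, evaluating it at a finer temporal grid costs no more than evaluating it at the observed frame rate, which is the property highlighted above.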

In order to detect human errors in the context of human-robot collaborations, we introduced a dataset and an approach to recognize failures or errors that are caused by humans in human-robot collaborations (Thoduka et al., ICRA 2024). The proposed “Handover Failure Detection” dataset contains failures that are caused by the human participant in both robot-to-human and human-to-robot handovers. The dataset includes multimodal data such as video, robot joint states, and readings from a force-torque sensor. We also presented a temporal action segmentation approach for jointly classifying the actions of the human participant and the robot, as well as recognizing failures. Since some human errors stem from an incorrect number of executions of certain actions, i.e., some actions are repeated too often and others not often enough, we also developed an approach for counting repetitive actions in videos (Luo et al., ICIP 2024). The approach is action-agnostic and can be used to detect human errors due to a wrong number of repetitions.
Social Diffusion (Tanke et al., ICCV 2023) has been the first approach able to forecast the motion of multiple interacting persons over a longer time period. In contrast to previous multi-person forecasting approaches, it does not suffer from freezing motion after a short period. It has also been the first multi-person motion forecasting model that models forecast uncertainty.

Since the latency of forecast models is an important aspect, we improved the efficiency of transformer architectures, which convert a video into a set of tokens and process them at various stages. The approach is motivated by the observation that the amount of relevant information varies depending on the content of a video. While some videos are easy to understand, other videos are more complex and contain many important details. Instead of using the same amount of computation for each video, as previous approaches do, we developed an innovative approach that automatically selects an adequate number of tokens at each stage based on the video content, i.e., the number of selected tokens at each stage of the transformer architecture varies for different videos. The proposed approach achieves a substantial reduction of up to 50% of the computational cost for various transformer architectures for image classification and action recognition. Since the approach can be directly applied to pre-trained transformers, it is a versatile tool that is not limited to video data; it can improve the efficiency of transformer architectures in many practical applications.
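The content-dependent token selection described above can be sketched as keeping, at each stage, only the tokens whose relevance score exceeds a threshold, so that easy inputs retain fewer tokens than complex ones. The scoring and the function below are simplified assumptions, not the paper's exact mechanism, which learns the selection end to end.

```python
import numpy as np

def select_tokens(tokens, scores, keep_threshold=0.5):
    """Keep tokens whose relevance score exceeds a threshold.

    tokens: (T, D) token embeddings; scores: (T,) relevance in [0, 1].
    The number of kept tokens depends on the content, so different
    inputs lead to different computational cost at later stages.
    Illustrative sketch only.
    """
    keep = scores > keep_threshold
    keep[0] = True  # always keep at least the first (e.g. class) token
    return tokens[keep]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
scores = np.array([0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.6, 0.05])
kept = select_tokens(tokens, scores)
# kept.shape -> (4, 4): a "simple" input retains half of its tokens.
```

Since subsequent self-attention cost grows quadratically with the number of tokens, halving the kept tokens at early stages is what enables the computational savings reported above.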
Social Diffusion forecasts the motion of multiple interacting persons.
TaylorSwiftNet forecasts the future for any continuous point in time.
Samples from the Humans in Kitchens dataset.