Periodic Reporting for period 1 - FORHUE (Forecasting and Preventing Human Errors)
Reporting period: 2022-10-01 to 2025-03-31
In order to improve the efficiency of forecasting approaches, we developed an approach, called TaylorSwiftNet, that forecasts frames continuously in time (Saber et al., BMVC 2022). TaylorSwiftNet is a novel approach that takes full advantage of a continuous representation of motion. In contrast to RNNs, which forecast the future frame by frame, or PDE-based approaches, which discretize PDEs in order to solve them numerically, we infer a continuous function over time from the observations. This avoids discretization artifacts and yields an analytical function that can be swiftly evaluated at any future point in continuous time, which also allows future frames to be forecast at a higher sampling rate than the observed frames; both are very useful properties for practical applications.
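As a rough illustration of the continuous-in-time idea only (not the published TaylorSwiftNet architecture), the following PyTorch sketch predicts the coefficients of a truncated Taylor expansion of a latent trajectory from the observed frames and evaluates it at arbitrary real-valued future times; all class names, dimensions, and the toy decoder are hypothetical.

```python
import torch
import torch.nn as nn

class ContinuousForecaster(nn.Module):
    """Illustrative sketch: predict Taylor coefficients of a latent
    trajectory from observed frames, then decode the latent state at any
    real-valued future time t (not the published TaylorSwiftNet model)."""

    def __init__(self, latent_dim=128, order=3):
        super().__init__()
        self.order = order
        # encoder maps the observed clip to a latent state (placeholder)
        self.encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(latent_dim))
        # one head per derivative order: h(0), h'(0), h''(0), ...
        self.heads = nn.ModuleList(
            nn.Linear(latent_dim, latent_dim) for _ in range(order + 1)
        )
        self.decoder = nn.Linear(latent_dim, 64 * 64)  # toy frame decoder

    def forward(self, clip, t):
        # clip: (B, T, C, H, W) observed frames; t: (B,) continuous times
        h = self.encoder(clip)
        coeffs = [head(h) for head in self.heads]
        # truncated Taylor expansion around t = 0
        z = torch.zeros_like(coeffs[0])
        factorial = 1.0
        for k, c in enumerate(coeffs):
            if k > 0:
                factorial *= k
            z = z + c * (t.view(-1, 1) ** k) / factorial
        return self.decoder(z).view(-1, 1, 64, 64)

model = ContinuousForecaster()
clip = torch.randn(2, 4, 1, 64, 64)                  # four observed frames
frame_half = model(clip, torch.tensor([0.5, 0.5]))   # forecast between observed frames
frame_two = model(clip, torch.tensor([2.0, 2.0]))    # forecast further ahead
```

Because the expansion is an analytical function of t, the same predicted coefficients can be evaluated at any sampling rate, including rates higher than that of the observed frames.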
In order to detect human errors in the context of human-robot collaboration, we introduced a dataset and an approach to recognize failures or errors that are caused by humans in human-robot collaborations (Thoduka et al., ICRA 2024). The proposed "Handover Failure Detection" dataset contains failures that are caused by the human participant in both robot-to-human and human-to-robot handovers. The dataset includes multimodal data such as video, robot joint states, and readings from a force-torque sensor. We also presented a temporal action segmentation approach for jointly classifying the actions of the human participant and the robot, as well as recognizing failures; a sketch of this joint prediction scheme is given below. Since some human errors stem from an incorrect number of executions of certain actions, i.e. some actions are repeated too often and others not often enough, we also developed an approach for counting repetitive actions in videos (Luo et al., ICIP 2024). The approach is action-agnostic and can be used to detect human errors due to a wrong number of repetitions.
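The sketch below is a minimal, assumed fusion scheme, not the published model: per-frame features from the video, joint-state, and force-torque streams are concatenated, passed through a temporal layer, and fed to three heads that jointly predict the human action, the robot action, and a failure label for every time step. All dimensions and class counts are hypothetical.

```python
import torch
import torch.nn as nn

class MultimodalSegmenter(nn.Module):
    """Illustrative sketch of joint per-frame prediction from video,
    joint-state, and force-torque streams (assumed architecture)."""

    def __init__(self, video_dim=512, joint_dim=7, ft_dim=6, hidden=256,
                 n_human=6, n_robot=6):
        super().__init__()
        self.fuse = nn.Linear(video_dim + joint_dim + ft_dim, hidden)
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.human_head = nn.Linear(hidden, n_human)   # human action per frame
        self.robot_head = nn.Linear(hidden, n_robot)   # robot action per frame
        self.failure_head = nn.Linear(hidden, 2)       # failure vs. success

    def forward(self, video_feat, joints, force_torque):
        # all inputs: (B, T, dim); concatenate modalities per time step
        x = torch.cat([video_feat, joints, force_torque], dim=-1)
        x, _ = self.temporal(torch.relu(self.fuse(x)))
        return self.human_head(x), self.robot_head(x), self.failure_head(x)

model = MultimodalSegmenter()
human, robot, failure = model(torch.randn(1, 100, 512),   # video features
                              torch.randn(1, 100, 7),     # joint states
                              torch.randn(1, 100, 6))     # force-torque readings
```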
Since the latency of forecast models is an important aspect, we also improved the efficiency of transformer architectures, which convert a video into a set of tokens and process them in several stages. The approach is motivated by the observation that the amount of relevant information varies depending on the content of a video. While some videos are easy to understand, others are more complex and contain many important details. Instead of using the same amount of computation for each video, as previous approaches do, we developed an approach that automatically selects an adequate number of tokens at each stage based on the video content, i.e. the number of selected tokens at each stage of the transformer architecture varies from video to video. The proposed approach achieves a substantial reduction of up to 50% of the computational cost for various transformer architectures for image classification and action recognition. Since the approach can be applied directly to pre-trained transformers, it is a versatile tool that is not limited to video data but can improve the efficiency of transformer architectures in many practical applications.
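To make the idea of content-dependent token selection concrete, the following sketch uses an assumed scoring rule (not the reported method): a lightweight scorer rates each token between two stages and only tokens above a threshold are kept, so easy inputs retain fewer tokens than complex ones. The class name, scorer, and threshold are hypothetical.

```python
import torch
import torch.nn as nn

class AdaptiveTokenSelect(nn.Module):
    """Illustrative sketch of content-dependent token selection between
    transformer stages (assumed scoring rule)."""

    def __init__(self, dim=384, threshold=0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # per-token relevance score
        self.threshold = threshold

    def forward(self, tokens):
        # tokens: (N, dim) for a single clip; the number kept varies per clip
        scores = torch.sigmoid(self.scorer(tokens)).squeeze(-1)
        keep = scores > self.threshold
        if keep.sum() == 0:               # always keep at least the best token
            keep[scores.argmax()] = True
        return tokens[keep], keep

stage_in = torch.randn(196, 384)          # tokens leaving one stage
selector = AdaptiveTokenSelect()
stage_out, kept = selector(stage_in)
print(f"kept {stage_out.shape[0]} of {stage_in.shape[0]} tokens")
```

Because the selection operates only on the token set passed between stages, such a module can in principle be inserted into a pre-trained transformer without retraining it from scratch.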