S-GEAR: Semantically Guided Representation Learning for Action Anticipation (ECCV 2024)

Abstract

Action anticipation is the task of forecasting future activity from a partially observed sequence of events. The task is inherently difficult due to the intrinsic uncertainty of the future and the challenge of reasoning over interconnected actions. Unlike previous works, which focus on extrapolating better visual and temporal information, we concentrate on learning action representations that are aware of their semantic interconnectivity based on prototypical action patterns and contextual co-occurrences. To this end, we propose the novel Semantically Guided Representation Learning (S-GEAR) framework. S-GEAR learns visual action prototypes and leverages language models to structure their relationships, inducing semanticity. To assess S-GEAR’s effectiveness, we test it on four action anticipation benchmarks, obtaining improved results compared to previous works: +3.5, +2.7, and +3.5 absolute points of Top-1 Accuracy on EPIC-Kitchens 55, EGTEA Gaze+, and 50 Salads, respectively, and +0.8 of Top-5 Recall on EPIC-Kitchens 100. We further observe that S-GEAR effectively transfers the geometric associations between actions from language to visual prototypes. Finally, S-GEAR opens new research frontiers in anticipation tasks by demonstrating the intricate impact of action semantic interconnectivity. We will release our code online upon acceptance.
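The abstract describes transferring the geometric associations between actions from language embeddings to visual action prototypes. As a rough illustration of that idea only, the sketch below (a minimal assumption on my part, not the authors’ released implementation; all names, shapes, and the MSE loss form are hypothetical) aligns the pairwise cosine-similarity structure of learnable visual prototypes with the similarity structure of frozen text embeddings of the action labels.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch (not the S-GEAR code): encourage learnable visual
# action prototypes to reproduce the pairwise similarity geometry of
# fixed language-model embeddings of the action labels.

def pairwise_cosine(x: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity matrix between all rows of x."""
    x = F.normalize(x, dim=-1)
    return x @ x.t()

def semantic_guidance_loss(visual_prototypes: torch.Tensor,
                           text_embeddings: torch.Tensor) -> torch.Tensor:
    """Penalize discrepancy between visual and language geometry.

    visual_prototypes: (num_actions, d_v), learnable parameters.
    text_embeddings:   (num_actions, d_t), frozen language embeddings.
    """
    sim_visual = pairwise_cosine(visual_prototypes)
    sim_text = pairwise_cosine(text_embeddings)  # fixed target structure
    return F.mse_loss(sim_visual, sim_text)

# Usage sketch: 10 action classes, frozen text features, learnable prototypes.
num_actions, d_v, d_t = 10, 256, 512
prototypes = torch.nn.Parameter(torch.randn(num_actions, d_v))
text_feats = torch.randn(num_actions, d_t)  # e.g., from a frozen text encoder

loss = semantic_guidance_loss(prototypes, text_feats)
loss.backward()  # gradients flow only into the visual prototypes
```

Under this reading, the language model supplies only the relational structure (which actions are semantically close), while the visual prototypes remain free to live in their own embedding space.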

Anxhelo Diko
PhD Student in Computer Science

A highly motivated and results-oriented Computer Vision PhD student with a deep passion for advancing the field of artificial intelligence. My research focuses on building multimodal representations and understanding human activities from ego/exocentric perspectives, addressing key challenges for autonomous agents and AI in general. I have extensive experience with multimodal large language models for video captioning and question answering and a keen interest in view-invariant video representation learning. I’m particularly committed to exploring how to effectively bridge the gap between representations of different modalities while preserving their unique characteristics. In addition to my research expertise, I possess a strong engineering foundation honed through academic and industry experience. Proficient in Python, C++, and CUDA, I excel at rapidly prototyping and implementing innovative ideas. I’m eager to leverage my skills and knowledge to contribute to cutting-edge research and development in this dynamic field.