What and Where: Semantic Grasping and Contextual Scanning for Moment Retrieval and Highlight Detection

刘婧, Zhuo He, Weizhi Nie, Zongbing Zhang and Yuting Su

February 2025


Abstract

The current surge in video content highlights the tasks of moment retrieval (MR) and highlight detection (HD), which involve localizing video segments of events and predicting clip-wise saliency scores based on text queries. Recent methods, while effective, may overlook two aspects: 1) Multimodal features from frozen encoders are often weakly aligned, which hinders thorough semantic exploration of video clips through fine-grained cross-modal interaction. 2) Because adjacent video clips lack significant distinction, it is challenging for clip-level context modeling to accurately locate query-relevant content. To mitigate these gaps, and inspired by the human routine in understanding visual events, we propose a progressive framework dubbed “what and where” that first grasps the aligned semantics of each video clip and then scans moment-level contextual features temporally to identify events matching the query. In the “what” stage, to enable explicit alignment of modal features and achieve a thorough semantic understanding, we first devise the Initial Semantic Projection (ISP) loss to bring different modal features with similar semantics closer together. Additionally, we develop a Clip Semantic Mining module to deeply mine the relevance of these identified semantics to the specific query at both the word and sentence levels. In the “where” stage, to enhance feature distinctiveness, we design a Multi-Context Perception module that models moment-level context. It includes an Event Context (EC) branch and a Chronological Context (CC) branch, which focus, respectively, on possible query-relevant event moments and on temporal moments of various lengths. Finally, extensive experiments validate the state-of-the-art performance of our W2W model on three benchmark datasets without additional pre-training. Code is available at https://github.com/TJUMMG/W2W.

Type

Journal article

Publication

IEEE Trans. Circuits Syst. Video Technol.


刘婧

Associate Professor, Doctoral Supervisor