Emotion recognition in conversations (ERC) has garnered significant attention for its critical role in human-computer interaction systems.
Video moment retrieval (VMR) aims to localize a video segment in an untrimmed video that is semantically relevant to a language query.
Spatio-temporal video grounding (STVG) aims to localize the spatiotemporal object tube in a video according to a given text query.