Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware Routing
| Field | Value |
|---|---|
| Paper/Presentation Title | Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware Routing |
| Presentation Type | Paper |
| Authors | Zhao, Zecheng; Chen, Zhi; Huang, Zi; Sadiq, Shazia; Chen, Tong |
| Journal or Proceedings Title | Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25) |
| Journal Citation | pp. 1011-1021 |
| Number of Pages | 11 |
| Year | 2025 |
| Publisher | Association for Computing Machinery (ACM) |
| Place of Publication | United States |
| ISBN | 9798400715921 |
| Digital Object Identifier (DOI) | https://doi.org/10.1145/3726302.3729936 |
| Web Address (URL) of Paper | https://dl.acm.org/doi/10.1145/3726302.3729936 |
| Web Address (URL) of Conference Proceedings | https://dl.acm.org/doi/proceedings/10.1145/3726302 |
| Conference/Event | 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25) |
| Event Details | Parent event: ACM International Conference on Research and Development in Information Retrieval; Delivery: in person; Dates: 13 to 18 Jul 2025; Location: Padua, Italy |
| Abstract | Text-to-Video Retrieval (TVR) aims to retrieve relevant videos based on textual queries. However, as video content evolves continuously, adapting TVR systems to new data remains a critical yet under-explored challenge. In this paper, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to address the limitations of existing approaches. Current Pre-Trained Model (PTM)-based TVR methods struggle with maintaining model plasticity when adapting to new tasks, while existing Continual Learning (CL) methods suffer from catastrophic forgetting, leading to semantic misalignment between historical queries and stored video features. To address these two challenges, we propose FrameFusionMoE, a novel CTVR framework that comprises two key components: (1) the Frame Fusion Adapter (FFA), which captures temporal video dynamics while preserving model plasticity, and (2) the Task-Aware Mixture-of-Experts (TAME), which ensures consistent semantic alignment between queries across tasks and the stored video features. Thus, FrameFusionMoE enables effective adaptation to new video content while preserving historical text-video relevance to mitigate catastrophic forgetting. We comprehensively evaluate FrameFusionMoE on two benchmark datasets under various task settings. Results demonstrate that FrameFusionMoE outperforms existing CL and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks when handling continuous video streams. Our code is available at: https://github.com/JasonCodeMaker/CTVR |
| Keywords | Continual Text-to-Video Retrieval; Continual Learning; Video Representation Learning |
| Contains Sensitive Content | Does not contain sensitive content |
| ANZSRC Field of Research 2020 | 4602 Artificial intelligence |
| Byline Affiliations | University of Queensland |
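The abstract describes a Task-Aware Mixture-of-Experts (TAME) that routes queries through experts conditioned on the task, so that queries from earlier tasks stay aligned with stored video features. The paper's actual TAME design is not reproduced here; the following is only a minimal, hypothetical NumPy sketch of the general task-aware routing idea, with made-up parameterisation (linear experts, one gating vector per task):

```python
import numpy as np

rng = np.random.default_rng(0)


class TaskAwareMoE:
    """Toy sketch of task-aware mixture-of-experts routing.

    Illustrative only: the experts are plain linear maps and each task
    gets its own gating vector (a hypothetical parameterisation, not the
    paper's TAME module).
    """

    def __init__(self, dim, n_experts):
        # each expert is a simple linear map on the feature vector
        self.experts = [
            rng.standard_normal((dim, dim)) / np.sqrt(dim)
            for _ in range(n_experts)
        ]
        # per-task routing parameters, added as new tasks arrive
        self.task_gates = {}

    def add_task(self, task_id):
        # a fresh gating vector over the shared experts for this task
        self.task_gates[task_id] = rng.standard_normal(len(self.experts))

    def forward(self, x, task_id):
        # softmax over expert logits, conditioned on the task id
        logits = self.task_gates[task_id]
        w = np.exp(logits - logits.max())
        w /= w.sum()
        # task-weighted combination of expert outputs
        return sum(wi * (x @ e) for wi, e in zip(w, self.experts))


moe = TaskAwareMoE(dim=8, n_experts=4)
moe.add_task("task0")
y = moe.forward(np.ones(8), "task0")
print(y.shape)  # (8,)
```

Because the experts are shared while only the small per-task gates grow, a scheme like this keeps the routing for old tasks fixed when a new task is added, which is one common way continual-learning methods limit interference between tasks.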
https://research.usq.edu.au/item/zyx4z/continual-text-to-video-retrieval-with-frame-fusion-and-task-aware-routing