From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

NeurIPS 2025

Towards Automatic AI Highlights
Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, Mubarak Shah
¹University of Central Florida  ²Adobe

TF-CoVR: Temporally Fine-grained Composed Video Retrieval

Can video models capture subtle action differences across videos? We introduce TF-CoVR, a large-scale benchmark of 180K triplets focused on fine-grained action changes in gymnastics and diving. Unlike earlier benchmarks, it pairs each query with multiple valid targets and emphasizes temporal changes such as twist counts or apparatus switches. We also propose TF-CoVR-Base, a two-stage model that first learns video embeddings via fine-grained action classification and then aligns them with text using contrastive learning. It outperforms prior methods by a large margin, improving mAP@50 from 19.83 to 27.22 in the fine-tuned setting.

TF-CoVR teaser

Figure: TF-CoVR visualization showing a query video, a modification instruction ("show with 2.5 turn"), and the top-5 retrieved videos. The query shows a vault routine with a 1.5 turn off, and the retrieval system successfully surfaces multiple visually similar routines with the requested higher turn count.



Abstract

Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from the FineGym and FineDiving datasets. Previous CoVR benchmarks that target temporal aspects link each query to a single target segment taken from the same video, limiting their practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state of the art from 19.83 to 27.22.
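At inference time, composed retrieval reduces to nearest-neighbor search: the fused embedding of (query video, modification text) is scored against precomputed embeddings of every gallery video. A minimal PyTorch sketch, assuming the embeddings have already been produced by the two-stage model (function and variable names here are ours, not the paper's):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(fused_query, gallery_embeds, gallery_ids, k=5):
    """Rank gallery videos against one composed query embedding.
    fused_query: (d,) tensor; gallery_embeds: (N, d) tensor of video embeddings."""
    q = F.normalize(fused_query, dim=-1)
    g = F.normalize(gallery_embeds, dim=-1)
    scores = g @ q                        # cosine similarity to every gallery video
    top = torch.topk(scores, k=k).indices
    return [gallery_ids[i] for i in top.tolist()]
```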

TF-CoVR Interactive Demo

Explore temporally fine-grained composed video retrieval. Swipe through queries, inspect modification instructions, and compare retrieved videos side-by-side with the query.

💡 Tip: swipe horizontally on cards, tap the pill to expand, and tap a retrieved video for picture-in-picture comparison.


Future Direction: AI Highlights

TF-CoVR is designed for temporally fine-grained composed video retrieval, but the same ability to find subtle skills across large collections makes it a natural building block for AI-generated highlight reels.

TF-CoVR Dataset Generation Pipeline

Dataset generation

Figure: Overview of our automatic triplet generation pipeline for TF-CoVR. We start with temporally labeled clips from the FineGym and FineDiving datasets. Using CLIP-based text embeddings, we compute similarity between temporal labels and form pairs with high semantic similarity. These label pairs are passed to GPT-4o along with in-context examples to generate natural-language modifications describing the temporal differences between them. Each generated triplet consists of a query video, a target video, and a modification text capturing fine-grained temporal action changes.
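The label-pairing step can be sketched with off-the-shelf CLIP text embeddings. Below is a minimal illustration using Hugging Face transformers; the two labels and the similarity threshold are made-up placeholders (the real pipeline iterates over all FineGym/FineDiving labels, and the paper's exact threshold may differ):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Two illustrative fine-grained labels in the style of FineGym annotations.
labels = [
    "round-off, flic-flac on, stretched salto backward with 1.5 turn off",
    "round-off, flic-flac on, stretched salto backward with 2.5 turn off",
]

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = tokenizer(labels, padding=True, return_tensors="pt")
    emb = model(**inputs).text_embeds              # (N, d) label embeddings
    emb = emb / emb.norm(dim=-1, keepdim=True)
    sim = emb @ emb.t()                            # pairwise cosine similarity

# Keep label pairs above a similarity threshold (value is illustrative); each
# surviving pair is then sent to GPT-4o to phrase the modification text.
threshold = 0.9
pairs = [(i, j) for i in range(len(labels)) for j in range(len(labels))
         if i != j and sim[i, j] > threshold]
```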

TF-CoVR-Base Architecture

TF-CoVR-Base

Figure: Overview of the TF-CoVR-Base framework. Stage 1 learns temporal video representations via supervised classification using the AIM encoder. In Stage 2, the pretrained AIM and BLIP encoders are frozen, and a projection layer and MLP are trained to align the query-modification pair with the target video using a contrastive loss. During inference, the model retrieves relevant videos from the TF-CoVR gallery given a user-provided query video and textual modification.
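The Stage 2 objective can be sketched as a small fusion head over the frozen embeddings plus a symmetric contrastive (InfoNCE) loss. This is a minimal sketch: the dimensions, layer sizes, and temperature below are illustrative placeholders rather than the paper's exact hyperparameters, and the paper also reports a cross-attention fusion variant not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage2Fusion(nn.Module):
    """Fuse a frozen query-video embedding with a frozen modification-text
    embedding; L2-normalize so the output can be scored by dot product.
    Dimensions and layer sizes are illustrative, not the paper's values."""
    def __init__(self, vid_dim=768, txt_dim=256, joint_dim=512):
        super().__init__()
        self.proj = nn.Linear(txt_dim, vid_dim)   # project text into video space
        self.mlp = nn.Sequential(                 # MLP fusion head
            nn.Linear(2 * vid_dim, joint_dim),
            nn.GELU(),
            nn.Linear(joint_dim, vid_dim),
        )

    def forward(self, q_vid, mod_txt):
        fused = torch.cat([q_vid, self.proj(mod_txt)], dim=-1)
        return F.normalize(self.mlp(fused), dim=-1)

def contrastive_loss(fused_q, tgt_vid, temperature=0.07):
    """Symmetric InfoNCE over a batch: each composed query should score its
    own target video higher than every other target in the batch."""
    tgt = F.normalize(tgt_vid, dim=-1)
    logits = fused_q @ tgt.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```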

TF-CoVR-Base Results

Results

Table: Evaluation of models fine-tuned on TF-CoVR using mAP@K for K ∈ {5, 10, 25, 50}. We report the performance of various fusion strategies and model architectures trained on TF-CoVR. Fusion methods include MLP and cross-attention (CA). Each model is evaluated using a fixed number of frames sampled from both query and target videos. Fine-tuning on TF-CoVR leads to significant improvements across all models.
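Because each composed query in TF-CoVR has several valid targets (3.9 on average), evaluation uses mAP@K rather than Recall@K. A reference implementation for a single query, under one common normalization convention (dividing by min(#relevant, K); the official evaluation script may normalize differently):

```python
def map_at_k(ranked_ids, relevant_ids, k=50):
    """Average precision at K for one composed query.
    ranked_ids: gallery video ids sorted by predicted score.
    relevant_ids: set of all valid target video ids for this query."""
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked_ids[:k], start=1):
        if vid in relevant_ids:
            hits += 1
            precision_sum += hits / rank   # precision at each hit rank
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0

# Example: one relevant video found at rank 3 out of two valid targets.
print(map_at_k(["v3", "v8", "v1"], {"v1", "v9"}, k=5))  # (1/3) / 2 ≈ 0.167
```

The dataset-level mAP@K is the mean of this quantity over all composed queries.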

Qualitative Comparison of TF-CoVR-Base

Qualitative results

Figure: Qualitative results for the composed video retrieval task using our two-stage approach. Each column presents a query video (top), a corresponding modification instruction (middle), and the top-3 retrieved target videos (ranks 1–3) based on the model's predictions. The modification instructions capture fine-grained action or event-level changes. This visualization demonstrates the effectiveness of the retrieval model in identifying subtle temporal variations, highlighting the practical utility of TF-CoVR for fine-grained sports understanding and highlight generation.

BibTeX

@misc{gupta2025playreplaycomposedvideo,
  title={From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos},
  author={Animesh Gupta and Jay Parmar and Ishan Rajendrakumar Dave and Mubarak Shah},
  year={2025},
  eprint={2506.05274},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.05274},
}