From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

¹University of Central Florida   ²Adobe

TF-CoVR: Temporally Fine-grained Composed Video Retrieval

Can video models capture subtle action differences across videos? We introduce TF-CoVR, a large-scale benchmark of 180K triplets focused on fine-grained action changes in gymnastics and diving. Unlike earlier datasets, it includes multiple valid targets per query and emphasizes temporal changes such as twist counts or apparatus switches. We also propose TF-CoVR-Base, a two-stage model that first learns temporally discriminative video embeddings via fine-grained action classification and then aligns the composed query (video plus modification text) with target videos using contrastive learning. It outperforms prior methods by a large margin, improving mAP@50 from 19.83 to 25.82 in the fine-tuned setting.

Figure: TF-CoVR visualization showing a query video and modification instruction ("show with 2.5 turn"), along with the top 5 retrieved videos. The query action involves a vault routine with a 1.5 turn off, and the retrieval system successfully surfaces multiple visually similar routines exhibiting increased turn count.



Abstract

Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on temporal aspects link each query to a single target segment taken from the same video, limiting their practical usefulness. In TF-CoVR, we instead construct each pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.

Try TF-CoVR Demo

Each video is shown with its action description and a modification instruction used for retrieval.

Hover to play, click to explore retrieval results.

TF-CoVR Dataset Generation Pipeline

Figure: Overview of our automatic triplet generation pipeline for TF-CoVR. We start with temporally labeled clips from the FineGym and FineDiving datasets. Using CLIP-based text embeddings, we compute similarity between temporal labels and form pairs with high semantic similarity. These label pairs are passed to GPT-4o along with in-context examples to generate natural language modifications describing the temporal differences between them. Each generated triplet consists of a query video, a target video, and a modification text capturing fine-grained temporal action changes.
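To make the label-pairing step concrete, below is a minimal sketch of how temporal action labels can be embedded with a CLIP text encoder and paired by cosine similarity. The model checkpoint, example labels, and similarity threshold are illustrative assumptions, not the exact configuration used to build TF-CoVR.

```python
# Hypothetical sketch of the label-pairing step: embed temporal action labels
# with a CLIP text encoder and keep label pairs whose cosine similarity
# exceeds a threshold. Checkpoint, labels, and threshold are illustrative.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = [
    "vault, 1.5 turn off",
    "vault, 2.5 turn off",
    "balance beam, split leap forward",
]

inputs = processor(text=labels, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)           # (N, D)
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)

sim = text_emb @ text_emb.T                                # cosine similarity matrix
threshold = 0.85                                           # assumed cutoff, not from the paper
pairs = [
    (i, j)
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j and sim[i, j] > threshold
]
# Each surviving (query_label, target_label) pair is then passed to GPT-4o
# together with in-context examples to draft the modification text.
print(pairs)
```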

TF-CoVR-Base Architecture

Figure: Overview of the TF-CoVR-Base framework. Stage 1 learns temporal video representations via supervised classification using the AIM encoder. In Stage 2, the pretrained AIM and BLIP encoders are frozen, and a projection layer and MLP are trained to align the query-modification pair with the target video using contrastive loss. During inference, the model retrieves relevant videos from TF-CoVR based on a user-provided query video and textual modification.
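The sketch below illustrates the Stage 2 alignment step under simplifying assumptions: query-video features from the frozen AIM encoder and modification-text features from the frozen BLIP encoder are assumed to be pre-extracted, and the projection/MLP dimensions are placeholders rather than the paper's exact configuration.

```python
# Minimal sketch of Stage 2: fuse frozen query-video (AIM) and modification-text
# (BLIP) features with a trainable projection + MLP, then align the composed
# query with target-video embeddings via a symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryFusion(nn.Module):
    def __init__(self, vid_dim=768, txt_dim=768, embed_dim=256):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, embed_dim)   # projection over frozen AIM features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # projection over frozen BLIP features
        self.fusion = nn.Sequential(                    # MLP fusing query video + modification text
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.tgt_proj = nn.Linear(vid_dim, embed_dim)   # projection for candidate/target videos

    def forward(self, q_vid, q_txt, t_vid):
        q = self.fusion(torch.cat([self.vid_proj(q_vid), self.txt_proj(q_txt)], dim=-1))
        t = self.tgt_proj(t_vid)
        return F.normalize(q, dim=-1), F.normalize(t, dim=-1)

def contrastive_loss(q, t, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives."""
    logits = q @ t.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Toy batch of 4 composed queries and their matched target videos.
q_vid, q_txt, t_vid = torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768)
model = ComposedQueryFusion()
q, t = model(q_vid, q_txt, t_vid)
loss = contrastive_loss(q, t)
```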

TF-CoVR-Base Results

Table: Evaluation of models fine-tuned on TF-CoVR using mAP@K for K ∈ {5, 10, 25, 50}. We report the performance of various fusion strategies and model architectures trained on TF-CoVR. Fusion methods include MLP and cross-attention (CA). Each model is evaluated using a fixed number of sampled frames from both query and target videos. Fine-tuning on TF-CoVR leads to significant improvements across all models.
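Because each TF-CoVR query can have several valid targets, mAP@K is the natural metric. Below is a small, self-contained sketch of one common way to compute it; the toy ranking is illustrative, and the exact evaluation protocol follows the paper's code release.

```python
# Sketch of mAP@K for retrieval with multiple valid targets per query.
import numpy as np

def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@K for one query: average of precision@i at each relevant hit in the top-K."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for i, vid in enumerate(ranked_ids[:k], start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / i)
    denom = min(len(relevant), k)
    return float(np.sum(precisions) / denom) if denom > 0 else 0.0

def mean_average_precision_at_k(rankings, ground_truth, k):
    """mAP@K over all queries; rankings[q] is the retrieved list for query q."""
    aps = [average_precision_at_k(rankings[q], ground_truth[q], k) for q in rankings]
    return float(np.mean(aps))

# Toy example: query "q1" has two valid targets among the retrieved videos.
rankings = {"q1": ["v3", "v7", "v1", "v9", "v2"]}
ground_truth = {"q1": ["v7", "v2"]}
print(mean_average_precision_at_k(rankings, ground_truth, k=5))  # 0.45
```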

Qualitative Comparison of TF-CoVR-Base

Figure: Qualitative results for the composed video retrieval task using our two-stage approach. Each column presents a query video (top), a corresponding modification instruction (middle), and the top-3 retrieved target videos (ranks 1–3) based on the model's predictions. The modification instructions capture fine-grained action or event-level changes. This visualization demonstrates the effectiveness of the retrieval model in identifying subtle temporal variations, highlighting the practical utility of TF-CoVR for fine-grained sports understanding and highlight generation.

BibTeX

@misc{gupta2025playreplaycomposedvideo,
      title={From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos}, 
      author={Animesh Gupta and Jay Parmar and Ishan Rajendrakumar Dave and Mubarak Shah},
      year={2025},
      eprint={2506.05274},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.05274}, 
}