Can video models capture subtle action differences across videos? We introduce TF-CoVR, a large-scale benchmark with 180K triplets focused on fine-grained action changes in gymnastics and diving. Unlike earlier datasets, it includes multiple valid targets per query and emphasizes temporal changes like twist counts or apparatus switches. We also propose TF-CoVR-Base, a two-stage model that learns video embeddings via action classification, then aligns them with text using contrastive learning. It outperforms prior methods by a large margin, improving mAP@50 from 19.83 to 25.82 in fine-tuned settings.
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that do address temporal change link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each triplet by pairing semantically similar temporal labels across different videos, so a single query-modification pair can correspond to multiple valid target videos.
Each query video is shown with its action description and the modification text that specifies the change to retrieve.
Figure: Overview of our automatic triplet-generation pipeline for TF-CoVR. We start with temporally labeled clips from the FineGym and FineDiving datasets. Using CLIP-based text embeddings, we compute similarity between temporal labels and form pairs with high semantic similarity. These label pairs are passed to GPT-4o along with in-context examples to generate natural-language modifications describing the temporal differences between them. Each generated triplet consists of a query video, a target video, and a modification text capturing fine-grained temporal action changes.
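To make the pipeline concrete, here is a minimal sketch of the two automated steps, assuming the temporal labels are available as plain strings. The CLIP checkpoint, similarity threshold, and GPT-4o prompt below are illustrative placeholders, not the exact settings used to build TF-CoVR.

```python
# Sketch of label pairing + modification generation (illustrative settings).
import torch
from transformers import CLIPModel, CLIPTokenizer
from openai import OpenAI

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed_labels(labels):
    """Encode temporal action labels (plain strings) with CLIP's text tower."""
    inputs = tok(labels, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        feats = clip.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def candidate_pairs(labels, threshold=0.85):
    """Keep label pairs that are semantically close but not identical."""
    emb = embed_labels(labels)
    sim = emb @ emb.T  # cosine similarity, since embeddings are normalized
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] != labels[j] and sim[i, j] > threshold:
                pairs.append((labels[i], labels[j]))
    return pairs

def describe_modification(client, query_label, target_label):
    """Ask GPT-4o for a short instruction describing the temporal difference
    between the two labels (in-context examples omitted for brevity)."""
    prompt = (
        f"Query action: {query_label}\nTarget action: {target_label}\n"
        "Write one short instruction that turns the query action into the "
        "target action, focusing only on the temporal difference "
        "(e.g., number of twists, apparatus, dive position)."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```

Each label pair from `candidate_pairs`, together with its generated instruction, yields a triplet once clips carrying those labels are attached as query and target videos.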
Figure: Overview of TF-CoVR-Base framework. Stage 1 learns temporal video representations via supervised classification using the AIM encoder. In Stage 2, the pretrained AIM and BLIP encoders are frozen, and a projection layer and MLP are trained to align the query-modification pair with the target video using contrastive loss. During inference, the model retrieves relevant videos from TF-CoVR based on a user-provided query and textual modification.
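The PyTorch sketch below illustrates the Stage-2 alignment step, assuming the frozen AIM and BLIP encoders are available as black boxes that return fixed-size feature vectors. The dimensions, MLP shape, and temperature are assumptions for illustration, not the exact TF-CoVR-Base configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryModFusion(nn.Module):
    """Trainable projection + MLP that fuses a query-video embedding (from the
    frozen AIM encoder) with a modification-text embedding (from frozen BLIP)."""

    def __init__(self, video_dim=768, text_dim=768, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, query_video_feat, mod_text_feat):
        q = self.video_proj(query_video_feat)
        t = self.text_proj(mod_text_feat)
        fused = self.fusion(torch.cat([q, t], dim=-1))
        return F.normalize(fused, dim=-1)

    def encode_target(self, target_video_feat):
        """Targets share the video projection so they live in the same space."""
        return F.normalize(self.video_proj(target_video_feat), dim=-1)


def contrastive_loss(fused, target_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: each fused query-modification embedding
    should match its own target-video embedding and no other."""
    logits = fused @ target_emb.T / temperature
    labels = torch.arange(fused.size(0), device=fused.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```

At inference time, every gallery video is embedded once with `encode_target`, and retrieval reduces to ranking gallery embeddings by cosine similarity against the fused query-modification embedding.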
Table: Evaluation of models fine-tuned on TF-CoVR using mAP@K for K in {5, 10, 25, 50}. We report the performance of various fusion strategies and model architectures trained on TF-CoVR. Fusion methods include MLP and cross-attention (CA). Each model is evaluated using a fixed number of sampled frames from both query and target videos. Fine-tuning on TF-CoVR leads to significant improvements across all models.
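For reference, here is a minimal sketch of how mAP@K can be computed when a query has several valid targets. Normalizing AP@K by min(|relevant|, K) is one common convention and an assumption here, not necessarily the paper's exact protocol.

```python
import numpy as np

def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@K for one query: ranked_ids is the retrieved gallery order,
    relevant_ids the set of all valid target videos for this query."""
    hits, score = 0, 0.0
    for rank, vid in enumerate(ranked_ids[:k], start=1):
        if vid in relevant_ids:
            hits += 1
            score += hits / rank          # precision at this hit position
    denom = min(len(relevant_ids), k)     # best achievable hit count in the top-K
    return score / denom if denom else 0.0

def mean_average_precision_at_k(all_ranked, all_relevant, ks=(5, 10, 25, 50)):
    """mAP@K averaged over queries, for each K reported in the table."""
    return {
        k: float(np.mean([average_precision_at_k(r, rel, k)
                          for r, rel in zip(all_ranked, all_relevant)]))
        for k in ks
    }
```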
Figure: Qualitative results for the composed video retrieval task using our two-stage approach. Each column presents a query video (top), a corresponding modification instruction (middle), and the top-3 retrieved target videos (ranks 1–3) based on the model's predictions. The modification instructions capture fine-grained action or event-level changes. This visualization demonstrates the effectiveness of the retrieval model in identifying subtle temporal variations, highlighting the practical utility of TF-CoVR for fine-grained sports understanding and highlight generation.
@misc{gupta2025playreplaycomposedvideo,
title={From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos},
author={Animesh Gupta and Jay Parmar and Ishan Rajendrakumar Dave and Mubarak Shah},
year={2025},
eprint={2506.05274},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.05274},
}