“What I cannot create, I do not understand.”
— Richard P. Feynman, on his blackboard at the time of his death, 1988
Object tracking is a fundamental task in computer vision that requires localizing objects of interest across video frames. Diffusion models have shown remarkable capabilities in visual generation, making them well-suited to several requirements of the tracking problem. This work proposes a novel diffusion-based methodology for formulating the tracking task. First, the conditional process of diffusion models allows indications of the target object to be injected into the generation process. Second, diffusion mechanics can be developed to inherently model temporal correspondences, enabling the reconstruction of actual frames in a video. However, existing diffusion models rely on an extensive and unnecessary mapping to a Gaussian noise domain, which can be replaced by a more efficient and stable interpolation process. Our proposed interpolation mechanism draws inspiration from classic image-processing techniques, offering a more interpretable, stable, and faster approach tailored specifically to the object tracking task. By leveraging the strengths of diffusion models while circumventing their limitations, our Diffusion-based INterpolation TrackeR (DINTR) presents a promising new paradigm and demonstrates superior versatility, covering seven benchmarks across five indicator representations.
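To make this contrast concrete, the sketch below juxtaposes the standard noise-based forward process of denoising diffusion with a generic linear interpolation between two consecutive frames. It is only an illustrative sketch: the symbols I_k and I_{k+1} for consecutive frames and the linear interpolant are assumed here for exposition and are not DINTR's exact formulation.

% Standard diffusion forward process: every sample is mapped toward an isotropic Gaussian noise domain.
\begin{align*}
  q(x_t \mid x_0) &= \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big),
  \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0,\mathbf{I}).
\end{align*}
% Interpolation-style alternative (illustrative sketch): intermediate states remain convex blends
% of two consecutive frames I_k and I_{k+1}, so the process never leaves the image domain.
\begin{align*}
  x_t &= (1-t)\,I_k + t\,I_{k+1}, \qquad t \in [0,1].
\end{align*}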
Comparison of paradigms and mechanisms of state-of-the-art tracking methods. Indication Types denotes the representation used to indicate targets, together with the corresponding datasets: TAP-Vid, PoseTrack, MOT, VOS, VIS, MOTS, KITTI, LaSOT, GroOT. Methods shown in a color gradient support both single- and multi-target benchmarks.
@article{nguyen2024dintr,
title = {DINTR: Tracking via Diffusion-based Interpolation},
author = {Nguyen, Pha and Le, Ngan and Cothren, Jackson and Yilmaz, Alper and Luu, Khoa},
journal = {Advances in Neural Information Processing Systems},
year = 2024
}