OVTrack: Open-Vocabulary Multiple Object Tracking

Abstract

The ability to recognize, localize, and track dynamic objects in a scene is fundamental to many real-world applications, such as self-driving and robotic systems. Yet, traditional multiple object tracking (MOT) benchmarks rely on only a few object categories that hardly represent the multitude of possible objects encountered in the real world. This leaves contemporary MOT methods limited to a small set of pre-defined object categories. In this paper, we address this limitation by tackling a novel task, open-vocabulary MOT, which aims to evaluate tracking beyond pre-defined training categories. We further develop OVTrack, an open-vocabulary tracker capable of tracking arbitrary object classes. Its design is based on two key ingredients: first, leveraging vision-language models for both classification and association via knowledge distillation; second, a data hallucination strategy for robust appearance feature learning from denoising diffusion probabilistic models. The result is an extremely data-efficient open-vocabulary tracker that sets a new state of the art on the large-scale, large-vocabulary TAO benchmark, while being trained solely on static images.
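The open-vocabulary classification idea described above can be illustrated with a minimal sketch: a detection's appearance embedding is compared against text embeddings of class prompts in a shared space, and the best-matching prompt gives the label. The embeddings below are synthetic stand-ins (in the actual method they would come from a vision-language model's encoders); the class names, dimensions, and helper function are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
dim = 8  # toy embedding dimension

# Hypothetical text embeddings for an open vocabulary of class prompts
# (in practice, produced by a vision-language model's text encoder).
class_names = ["cat", "bicycle", "umbrella"]
text_embs = rng.normal(size=(len(class_names), dim))

# A detection's appearance embedding, distilled to live in the same
# space; here simulated as lying close to the "bicycle" prompt.
det_emb = text_embs[1] + 0.05 * rng.normal(size=dim)

scores = cosine_sim(det_emb[None, :], text_embs)[0]
pred = class_names[int(np.argmax(scores))]
print(pred)
```

Because classification reduces to nearest-prompt lookup, new categories can be added at test time simply by encoding new text prompts, with no retraining of the tracker.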

Publication
In Conference on Computer Vision and Pattern Recognition, CVPR 2023
Siyuan Li
PhD Student, ETH Zurich
Lei Ke
PhD Student, ETH Zurich & HKUST
Martin Danelljan
Researcher in Computer Vision and Machine Learning at Apple