This is the review version of the webpage for the TAP-Vid dataset.
TAP-Vid is a dataset of videos along with point tracks, either manually annotated or obtained from a simulator. The aim is to evaluate tracking of any trackable point on any solid physical surface. Algorithms receive a single query point on some frame, and must produce the rest of the track, i.e., where that point has moved to (if visible), and whether it is visible, on every other frame. This requires point-level precision (unlike prior work on box and segment tracking), potentially on deformable surfaces (unlike structure from motion), over the long term (unlike optical flow), on potentially any object (i.e. class-agnostic, unlike prior class-specific keypoint tracking on humans). Here's an example of what's annotated on one video of the DAVIS dataset:
The annotations of TAP-Vid, as well as the RGB-Stacking videos, are released under a Creative Commons BY license. The original source videos of DAVIS come from the val set and are also licensed under Creative Commons licenses per their creators; see the DAVIS dataset for details. Kinetics videos are publicly available on YouTube, but subject to their own individual licenses. See the Kinetics dataset webpage for details.
For DAVIS and RGB-Stacking, the videos are contained in a simple pickle file. For DAVIS, this contains a simple dict, where each key is a DAVIS video title, and the contents are the video (a 4D `uint8` tensor), the points (a `float32` tensor with 3 axes; the first is point id, the second is time, and the third is x/y), and the occlusions (a `bool` tensor with 2 axes; the first is point id, the second is time). RGB-Stacking is the same, except there are no video titles, so it's a simple list of these structures rather than a dict. The downloads are given above.
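As a rough illustration, here is a minimal sketch of loading and inspecting one of these pickles in Python. The filename and the per-video key names (`'video'`, `'points'`, `'occluded'`) are assumptions for illustration and should be checked against the downloaded files:

```python
import pickle

# Placeholder path; substitute whichever pickle you downloaded.
with open('tapvid_davis.pkl', 'rb') as f:
    davis = pickle.load(f)  # dict: DAVIS video title -> annotations

for title, example in davis.items():
    video = example['video']        # uint8, [num_frames, height, width, 3]
    points = example['points']      # float32, [num_points, num_frames, 2] (x/y)
    occluded = example['occluded']  # bool, [num_points, num_frames]
    print(title, video.shape, points.shape, occluded.shape)
    break  # just inspect the first video

# The RGB-Stacking pickle holds a list of the same structures instead of a dict.
```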
For Kinetics, we cannot distribute the raw videos, so instructions for assembling the above data structures are given below.
We expect the raw clips from the Kinetics700-2020 validation set to be downloaded and stored in a local folder `<video_root_path>`. The clips should be stored as MP4, following the name pattern `f'{youtube_id}_{start_time_sec:06}_{end_time_sec:06}'`, e.g. `'abcdefghijk_000010_000020.mp4'`. Clips can be stored in any subfolder within `<video_root_path>`. The most common pattern is to store them as `<video_root_path>/<label_name>/<clip_name>`.
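For clarity, the zero-padded name pattern expands as follows (the id and times below are made up):

```python
# Illustration of the clip naming pattern; the id and times are made up.
youtube_id = 'abcdefghijk'
start_time_sec = 10
end_time_sec = 20
clip_name = f'{youtube_id}_{start_time_sec:06}_{end_time_sec:06}' + '.mp4'
print(clip_name)  # abcdefghijk_000010_000020.mp4
```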
Once the validation clips have been downloaded, a pickle file containing all the information can be generated using the provided script:
```bash
python3 -m pip install -r requirements.txt
python3 generate_tapvid.py \
  --csv_path=<path_to_tapvid_kinetics.csv> \
  --output_base_path=<path_to_output_pickle_folder> \
  --video_root_path=<path_to_raw_videos_root_folder> \
  --alsologtostderr
```
We also provide a script that generates an MP4 with the points painted on top of the frames. The script works with any of the pickle files (Kinetics, DAVIS, or robotics). A random clip is chosen from those available, and all of its point tracks are painted.
```bash
python3 -m pip install -r requirements.txt
python3 visualize.py \
  --input_path=<path_to_the_pickle_file.pkl> \
  --output_path=<path_to_the_output_video.mp4> \
  --alsologtostderr
```
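For readers who prefer not to run the provided script, the painting step amounts to drawing each visible point on each frame and re-encoding the video. A minimal sketch with OpenCV is below; it is not the provided visualize.py, the path and key names are placeholders, and the coordinate convention (pixels vs. normalized) should be checked against the data:

```python
import pickle

import cv2  # assumes opencv-python is installed

# Placeholder path and key names; see the structure described above.
with open('tapvid_davis.pkl', 'rb') as f:
    example = next(iter(pickle.load(f).values()))  # take one clip

video = example['video']        # [num_frames, height, width, 3], uint8 RGB
points = example['points']      # [num_points, num_frames, 2]; rescale if normalized
occluded = example['occluded']  # [num_points, num_frames]

num_frames, height, width, _ = video.shape
writer = cv2.VideoWriter('tracks.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 25,
                         (width, height))
for t in range(num_frames):
    frame = cv2.cvtColor(video[t], cv2.COLOR_RGB2BGR)
    for p in range(points.shape[0]):
        if not occluded[p, t]:  # only paint points visible in this frame
            x, y = points[p, t]
            cv2.circle(frame, (int(round(x)), int(round(y))), 3, (0, 0, 255), -1)
    writer.write(frame)
writer.release()
```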
To demonstrate the datasets we have created, we include the full TAP-Vid-DAVIS (30 videos) (old version), as well as 10 examples each from the synthetic TAP-Vid-Kubric and TAP-Vid-RGB-Stacking datasets (old version). Unfortunately we cannot directly include Kinetics visualizations due to the licensing of these YouTube videos, but they may be visualized using the scripts above.
When annotating videos, we interpolate between the sparse points that the annotators choose by finding tracks which minimize the discrepancy with the optical flow while still connecting the chosen points. To validate that this indeed improves results, we annotated several DAVIS videos twice and compared them side by side: once using the flow-based interpolation, and again using a naive linear interpolation, which simply moves the point at a constant velocity between the chosen points.
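For concreteness, the naive baseline is nothing more than piecewise-linear (constant-velocity) interpolation between consecutive annotated frames. A minimal sketch of that baseline (not the flow-based method we actually use, and the function name is ours) is:

```python
import numpy as np

def linear_interpolate(annotated_frames, annotated_xy):
    """Naive baseline: constant-velocity motion between annotated frames.

    annotated_frames: sorted 1D array of frame indices the annotator labeled.
    annotated_xy: [num_annotated, 2] array of the corresponding (x, y) positions.
    Returns an [n, 2] array of positions for every frame between the first and
    last annotated frame, inclusive.
    """
    all_frames = np.arange(annotated_frames[0], annotated_frames[-1] + 1)
    x = np.interp(all_frames, annotated_frames, annotated_xy[:, 0])
    y = np.interp(all_frames, annotated_frames, annotated_xy[:, 1])
    return np.stack([x, y], axis=-1)
```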
The majority of the code, as well as standalone annotations, will be hosted on DeepMind's GitHub once we are prepared for a public announcement, which is currently scheduled for late July. For our datasets which include videos (DAVIS and robotics), the default files are too large for GitHub, so these will be hosted in a Google Cloud bucket. DeepMind plans to ensure the availability of these repositories in the long term, and we expect maintenance to be minimal, as simple Python readers are sufficient to use the data.
This submission partially overlaps with a paper that was rejected from ECCV, which focused on the model and featured a preliminary, substantially smaller version of the dataset presented here. Our dataset is titled Tracking Any Point, and according to reviewers, we did not make it clear enough that we do not handle liquids, gases, or transparent objects (our name is inspired by Tracking Any Object, though that dataset does not annotate objects that break apart or cannot be picked out with a bounding box). We have altered the writing to better discuss this issue. Reviewers also criticized the complexity of the model, so we have simplified it and reformulated the paper to focus on the dataset. Reviewers were also not confident in the accuracy, and therefore the usefulness, of our annotations, so we have substantially improved our description of the annotation procedure and increased the size of the dataset. We have also added experiments on JHMDB, demonstrating that our annotations can improve performance on this popular dataset when used to train our TAP-Net architecture, but identical experiments in the other direction don't show any improvement.