This is the review version of the webpage for the TAP-Vid dataset.
TAP-Vid is a dataset of videos along with point tracks, either manually annotated or obtained from a simulator. The aim is to evaluate tracking of any trackable point on any solid physical surface. Algorithms receive a single query point on some frame, and must produce the rest of the track, i.e., where that point has moved to (if visible), and whether it is visible, on every other frame. This requires point-level precision (unlike prior work on box and segment tracking), potentially on deformable surfaces (unlike structure from motion), over the long term (unlike optical flow), on potentially any object (i.e. class-agnostic, unlike prior class-specific keypoint tracking on humans). Here's an example of what's annotated on one video of the DAVIS dataset:
The annotations of TAP-Vid, as well as the RGB-Stacking videos, are released under a Creative Commons BY license. The original source videos of DAVIS come from the val set and are also licensed under Creative Commons licenses per their creators; see the DAVIS dataset for details. Kinetics videos are publicly available on YouTube, but subject to their own individual licenses. See the Kinetics dataset webpage for details.
For DAVIS and RGB-Stacking, the videos are contained in a simple pickle file. For DAVIS, this contains a simple dict, where each key is a DAVIS video title, and the contents are the video (a 4D `uint8` tensor), the points (a `float32` tensor with 3 axes; the first is point id, the second is time, and the third is x/y), and the occlusions (a `bool` tensor with 2 axes; the first is point id, the second is time). RGB-Stacking is the same, except there are no video titles, so it's a simple list of these structures rather than a dict. The downloads are given above.
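As a rough illustration, here is a minimal sketch of loading and inspecting one of these pickles in Python. The filename and the per-video key names (`'video'`, `'points'`, `'occluded'`) are assumptions for illustration and should be checked against the downloaded files:

```python
import pickle

# Placeholder path; substitute whichever pickle you downloaded.
with open('tapvid_davis.pkl', 'rb') as f:
    davis = pickle.load(f)  # dict: DAVIS video title -> annotations

for title, example in davis.items():
    video = example['video']        # uint8, [num_frames, height, width, 3]
    points = example['points']      # float32, [num_points, num_frames, 2] (x/y)
    occluded = example['occluded']  # bool, [num_points, num_frames]
    print(title, video.shape, points.shape, occluded.shape)
    break  # just inspect the first video

# The RGB-Stacking pickle holds a list of the same structures instead of a dict.
```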
For Kinetics, we cannot distribute the raw videos, so instructions for assembling the above data structures are given below.
We expect the raw clips from the Kinetics700-2020 validation set to be downloaded and stored in a local folder `<video_root_path>`. The clips should be stored as MP4, following the name pattern `f'{youtube_id}_{start_time_sec:06}_{end_time_sec:06}'`, e.g. `'abcdefghijk_000010_000020.mp4'`. Clips can be stored in any subfolder within `<video_root_path>`. The most common pattern is to store them as `<video_root_path>/<label_name>/<clip_name>`.
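For clarity, the zero-padded name pattern expands as follows (the id and times below are made up):

```python
# Illustration of the clip naming pattern; the id and times are made up.
youtube_id = 'abcdefghijk'
start_time_sec = 10
end_time_sec = 20
clip_name = f'{youtube_id}_{start_time_sec:06}_{end_time_sec:06}' + '.mp4'
print(clip_name)  # abcdefghijk_000010_000020.mp4
```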
Once the validation clips have been downloaded, a pickle file containing all the information can be generated using the provided script:
```bash
python3 -m pip install -r requirements.txt
python3 generate_tapvid.py \
  --csv_path=<path_to_tapvid_kinetics.csv> \
  --output_base_path=<path_to_output_pickle_folder> \
  --video_root_path=<path_to_raw_videos_root_folder> \
  --alsologtostderr
```
We also provide a script that generates an MP4 with the points painted on top of the frames. The script works with any of the pickle files (Kinetics, DAVIS, or robotics). A random clip is chosen from those available, and all of its point tracks are painted.
```bash
python3 -m pip install -r requirements.txt
python3 visualize.py \
  --input_path=<path_to_the_pickle_file.pkl> \
  --output_path=<path_to_the_output_video.mp4> \
  --alsologtostderr
```
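For readers who prefer not to run the provided script, the painting step amounts to drawing each visible point on each frame and re-encoding the video. A minimal sketch with OpenCV is below; it is not the provided visualize.py, the path and key names are placeholders, and the coordinate convention (pixels vs. normalized) should be checked against the data:

```python
import pickle

import cv2  # assumes opencv-python is installed

# Placeholder path and key names; see the structure described above.
with open('tapvid_davis.pkl', 'rb') as f:
    example = next(iter(pickle.load(f).values()))  # take one clip

video = example['video']        # [num_frames, height, width, 3], uint8 RGB
points = example['points']      # [num_points, num_frames, 2]; rescale if normalized
occluded = example['occluded']  # [num_points, num_frames]

num_frames, height, width, _ = video.shape
writer = cv2.VideoWriter('tracks.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 25,
                         (width, height))
for t in range(num_frames):
    frame = cv2.cvtColor(video[t], cv2.COLOR_RGB2BGR)
    for p in range(points.shape[0]):
        if not occluded[p, t]:  # only paint points visible in this frame
            x, y = points[p, t]
            cv2.circle(frame, (int(round(x)), int(round(y))), 3, (0, 0, 255), -1)
    writer.write(frame)
writer.release()
```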
To demonstrate the datasets we have created, we include the full TAP-Vid-DAVIS (30 videos) (old version), as well as 10 examples each from the synthetic TAP-Vid-Kubric and TAP-Vid-RGB-Stacking datasets (old version). Unfortunately we cannot directly include Kinetics visualizations due to the licensing of these YouTube videos, but they may be visualized using the scripts above.
When annotating videos, we interpolate between the sparse points that the annotators choose by finding tracks which minimize the discrepancy with the optical flow while still connecting the chosen points. To validate that this indeed improves results, we annotated several DAVIS videos twice and compared them side by side: once using the flow-based interpolation, and again using a naive linear interpolation, which simply moves the point at a constant velocity between the chosen points.
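For concreteness, the naive baseline is nothing more than piecewise-linear (constant-velocity) interpolation between consecutive annotated frames. A minimal sketch of that baseline (not the flow-based method we actually use, and the function name is ours) is:

```python
import numpy as np

def linear_interpolate(annotated_frames, annotated_xy):
    """Naive baseline: constant-velocity motion between annotated frames.

    annotated_frames: sorted 1D array of frame indices the annotator labeled.
    annotated_xy: [num_annotated, 2] array of the corresponding (x, y) positions.
    Returns an [n, 2] array of positions for every frame between the first and
    last annotated frame, inclusive.
    """
    all_frames = np.arange(annotated_frames[0], annotated_frames[-1] + 1)
    x = np.interp(all_frames, annotated_frames, annotated_xy[:, 0])
    y = np.interp(all_frames, annotated_frames, annotated_xy[:, 1])
    return np.stack([x, y], axis=-1)
```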
The majority of the code, as well as standalone annotations, will be hosted on DeepMind's GitHub once we are prepared for a public announcement, which is currently scheduled for late July. For our datasets which include videos (DAVIS and robotics), the default files are too large for GitHub, so these will be hosted in a Google Cloud bucket. DeepMind plans to ensure the availability of these repositories in the long term, and we expect maintenance to be minimal, as simple Python readers are sufficient to use the data.
This submission partially overlaps with a paper that was rejected from ECCV, which focused on the model and featured a preliminary, substantially smaller version of the dataset presented here. Our dataset is titled Tracking Any Point, and according to reviewers, we did not make it clear enough that we do not handle liquids, gases, or transparent objects (our name is inspired by Tracking Any Object, though that dataset does not annotate objects that break apart or cannot be picked out with a bounding box). We have altered the writing to better discuss this issue. Reviewers also criticized the complexity of the model, so we have simplified it and reformulated the paper to focus on the dataset. Reviewers were also not confident in the accuracy, and therefore the usefulness, of our annotations, so we have substantially improved our description of the annotation procedure and increased the size of the dataset. We have also added experiments on JHMDB, demonstrating that our annotations can improve performance on this popular dataset when used to train our TAP-Net architecture, but identical experiments in the other direction don't show any improvement.