Overlapping images between Open Images, Flickr30k, and COCO
Open Images contains ~9M images crawled from Flickr. Other datasets, such as COCO and Flickr30k, also drew their images from Flickr. We ran a duplicate image detector and found that some images in Open Images also appear in COCO or Flickr30k.
These overlaps can be seen as data leakage, since a model may have seen some evaluation images during training. The table below quantifies the leakage in several train-eval scenarios, as a percentage of both the training set and the evaluation set.
Train on | Eval on | Size of the training set | Size of the evaluation set | Eval images seen in train | Overlap as % of training set | Overlap as % of evaluation set |
---|---|---|---|---|---|---|
OID train with bounding box annotations | COCO val2017 | 1,743,042 | 5,000 | 67 | 0.0038% | 1.3% |
OID train with bounding box annotations | COCO test2017 | 1,743,042 | 40,670 | 617 | 0.0354% | 1.5% |
OID train with bounding box annotations | Flickr30k val | 1,743,042 | 1,000 | 11 | 0.0006% | 1.1% |
OID train with bounding box annotations | Flickr30k test | 1,743,042 | 1,000 | 6 | 0.0003% | 0.6% |
OID train | COCO val2017 | 9,011,219 | 5,000 | 304 | 0.0034% | 6.1% |
OID train | COCO test2017 | 9,011,219 | 40,670 | 2,314 | 0.0257% | 5.7% |
OID train | Flickr30k val | 9,011,219 | 1,000 | 36 | 0.0004% | 3.6% |
OID train | Flickr30k test | 9,011,219 | 1,000 | 47 | 0.0005% | 4.7% |
COCO train2017 | OID Val+Test | 118,287 | 167,056 | 208 | 0.1758% | 0.12% |
Flickr30k train | OID Val+Test | 29,783 | 167,056 | 47 | 0.1578% | 0.03% |
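The two percentage columns are the overlap count divided by the size of the training set and of the evaluation set, respectively. A minimal sketch of that arithmetic, using the first row of the table as input (the function name `leakage_percentages` is ours, chosen for illustration):

```python
def leakage_percentages(train_size, eval_size, overlap):
    """Return the overlap as a percentage of the train set and of the eval set."""
    train_pct = 100.0 * overlap / train_size
    eval_pct = 100.0 * overlap / eval_size
    return train_pct, eval_pct


# First table row: OID train with bounding box annotations vs. COCO val2017.
train_pct, eval_pct = leakage_percentages(
    train_size=1_743_042, eval_size=5_000, overlap=67
)
print(f"{train_pct:.4f}% of train, {eval_pct:.1f}% of eval")
# prints "0.0038% of train, 1.3% of eval"
```

The same call reproduces any other row when given that row's sizes and overlap count.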
Our recommended fix for this data leakage is to remove the overlapping images from the training set (not the evaluation set), so that evaluation results remain comparable to other works. The overlapping images make up less than 0.2% of the training set in every scenario above, so the trained model should be essentially the same.
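The recommended removal can be sketched as a simple filter over image IDs. This is an illustrative sketch only: it assumes the overlapping images are available as a list of training-set image IDs, and the IDs and the helper name `remove_overlap` below are hypothetical.

```python
def remove_overlap(train_ids, overlapping_ids):
    """Drop overlapping images from the training set, preserving order."""
    overlap = set(overlapping_ids)  # set gives O(1) membership checks
    return [image_id for image_id in train_ids if image_id not in overlap]


# Hypothetical IDs for illustration; real IDs come from the dataset files.
train_ids = ["img_a", "img_b", "img_c", "img_d"]
overlapping_ids = ["img_b", "img_d"]

clean_train_ids = remove_overlap(train_ids, overlapping_ids)
print(clean_train_ids)  # prints "['img_a', 'img_c']"
```

The evaluation set is left untouched, which keeps reported metrics comparable to results from other works.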
Published 21st June 2022