Open Images Dataset V7

Overlapping images between Open Images, Flickr30k, and COCO

Open Images contains ~9M images crawled from Flickr. Other datasets, such as COCO and Flickr30k, also drew their images from Flickr. We ran a duplicate image detector and found that some images in Open Images also appear in COCO or Flickr30k.

The exact IDs that identify the pairs of duplicated images are in the following two files:

These overlaps can be seen as data leakage if a model sees some evaluation images during training. The table below quantifies the percentage of data leakage in several train-eval scenarios.

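| Train on | Eval on | Size of the training set | Size of the evaluation set | Eval images seen in train | Train set % | Evaluation set % |
|---|---|---|---|---|---|---|
| OID train with bounding box annotations | COCO val2017 | 1,743,042 | 5,000 | 67 | 0.0038% | 1.3% |
| OID train with bounding box annotations | COCO test2017 | 1,743,042 | 40,670 | 617 | 0.0354% | 1.5% |
| OID train with bounding box annotations | Flickr30k val | 1,743,042 | 1,000 | 11 | 0.0006% | 1.1% |
| OID train with bounding box annotations | Flickr30k test | 1,743,042 | 1,000 | 6 | 0.0003% | 0.6% |
| OID train | COCO val2017 | 9,011,219 | 5,000 | 304 | 0.0034% | 6.1% |
| OID train | COCO test2017 | 9,011,219 | 40,670 | 2,314 | 0.0257% | 5.7% |
| OID train | Flickr30k val | 9,011,219 | 1,000 | 36 | 0.0004% | 3.6% |
| OID train | Flickr30k test | 9,011,219 | 1,000 | 47 | 0.0005% | 4.7% |
| COCO train2017 | OID Val+Test | 118,287 | 167,056 | 208 | 0.1758% | 0.12% |
| Flickr30k train | OID Val+Test | 29,783 | 167,056 | 47 | 0.1578% | 0.03% |
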
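
The two percentage columns follow directly from the set sizes and the overlap count. As a minimal sketch of that arithmetic (the function name `leakage_percentages` is ours, introduced only for illustration), the first row of the table can be reproduced like this:

```python
def leakage_percentages(train_size, eval_size, overlap):
    """Return the overlap as a percentage of the train set and of the eval set."""
    train_pct = 100.0 * overlap / train_size
    eval_pct = 100.0 * overlap / eval_size
    return train_pct, eval_pct


# Figures from the table: OID train with bounding box annotations vs. COCO val2017.
train_pct, eval_pct = leakage_percentages(
    train_size=1_743_042, eval_size=5_000, overlap=67
)
print(f"Train set: {train_pct:.4f}%  Evaluation set: {eval_pct:.1f}%")
```

The same call with the sizes and overlap count from any other row should reproduce that row's percentages, up to rounding.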
Our recommended course of action to fix this data leakage is to remove the overlapping images from the training set (not the evaluation set), so that evaluation results remain comparable to other works. The overlapping images are a negligible fraction of the training set (under 0.2% in every scenario in the table above), so the trained model should be essentially the same.
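
As a rough sketch of this filtering step, the snippet below drops the overlapping IDs from a list of training image IDs. The layout of the duplicate-pairs files is not described here, so the CSV format, the column names (`oid_id`, `coco_id`), and all image IDs in the example are hypothetical placeholders; adapt them to the actual files.

```python
import csv
import io


def load_overlapping_train_ids(csv_file, train_id_column):
    """Collect the training-set image IDs listed in a duplicate-pairs CSV.

    Assumes (hypothetically) one duplicated pair per row, with a header row
    naming the column that holds the training-set image ID.
    """
    reader = csv.DictReader(csv_file)
    return {row[train_id_column] for row in reader}


def remove_overlaps(train_ids, overlapping_ids):
    """Return the training IDs that are not in the overlap set, order preserved."""
    return [image_id for image_id in train_ids if image_id not in overlapping_ids]


# Toy data standing in for a duplicate-pairs file (all IDs are made up).
pairs_csv = io.StringIO(
    "oid_id,coco_id\n"
    "abc123,000000000139\n"
    "def456,000000000285\n"
)
overlapping = load_overlapping_train_ids(pairs_csv, train_id_column="oid_id")

train_ids = ["abc123", "zzz999", "def456", "yyy888"]
filtered_train_ids = remove_overlaps(train_ids, overlapping)
print(filtered_train_ids)  # ['zzz999', 'yyy888']
```

Only the training list is modified; the evaluation set is left untouched, in line with the recommendation above.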


Published 21st June 2022