Open Images

Overlapping images between Open Images, Flickr30k, and COCO

Open Images contains ~9M images crawled from Flickr. Other datasets, such as COCO and Flickr30k, also drew their images from Flickr. We ran a duplicate-image detector and found that some images in Open Images also appear in COCO or Flickr30k.

The exact IDs that identify the pairs of duplicated images are in the following two files:

These overlaps constitute data leakage if a model sees evaluation images during training. The table below quantifies the leakage in several train-eval scenarios, both as a percentage of the training set and as a percentage of the evaluation set.

Train on	Eval on	Size of the training set	Size of the evaluation set	Eval images seen in train	Overlap as % of training set	Overlap as % of evaluation set
OID train with bounding box annotations	COCO val2017	1,743,042	5,000	67	0.0038%	1.3%
	COCO test2017	1,743,042	40,670	617	0.0354%	1.5%
	Flickr30k val	1,743,042	1,000	11	0.0006%	1.1%
	Flickr30k test	1,743,042	1,000	6	0.0003%	0.6%
OID train	COCO val2017	9,011,219	5,000	304	0.0034%	6.1%
	COCO test2017	9,011,219	40,670	2,314	0.0257%	5.7%
	Flickr30k val	9,011,219	1,000	36	0.0004%	3.6%
	Flickr30k test	9,011,219	1,000	47	0.0005%	4.7%
COCO train2017	OID Val+Test	118,287	167,056	208	0.1758%	0.12%
Flickr30k train	OID Val+Test	29,783	167,056	47	0.1578%	0.03%
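
The two percentage columns follow directly from the overlap count and the set sizes. As a minimal sketch (the function name `leakage_percentages` is ours, chosen for illustration, and is not part of any Open Images tooling), the first row of the table can be reproduced like this:

```python
def leakage_percentages(overlap, train_size, eval_size):
    """Return the overlap as a percentage of the train set and of the eval set."""
    train_pct = 100.0 * overlap / train_size
    eval_pct = 100.0 * overlap / eval_size
    return train_pct, eval_pct


# First row of the table: OID train with bounding box annotations vs. COCO val2017.
train_pct, eval_pct = leakage_percentages(67, 1_743_042, 5_000)
print(f"{train_pct:.4f}% of train, {eval_pct:.1f}% of eval")
# prints "0.0038% of train, 1.3% of eval"
```

The same overlap count is tiny relative to the training set but noticeable relative to the evaluation set, which is why the two columns differ by orders of magnitude.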

Our recommended fix for this data leakage is to remove the overlapping images from the training set (not the evaluation set), so that evaluation results remain comparable to other works. The overlapping images make up a negligible fraction of the training set (under 0.2% in every scenario above), so the trained model should be essentially the same.
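
The removal step might look like the sketch below. The layout of the duplicate-ID files is not described on this page, so the sketch assumes the pairs have already been loaded as `(train_id, eval_id)` tuples; the function name `remove_overlaps` and the IDs in the example are hypothetical.

```python
def remove_overlaps(train_ids, duplicate_pairs):
    """Drop every training image that has a duplicate in the evaluation set.

    train_ids: iterable of training image IDs.
    duplicate_pairs: iterable of (train_id, eval_id) tuples (assumed layout).
    """
    # Collect the training-side IDs of all duplicate pairs.
    leaked = {train_id for train_id, _eval_id in duplicate_pairs}
    # Keep the original order of the training set; the eval set is untouched.
    return [image_id for image_id in train_ids if image_id not in leaked]


# Made-up IDs, for illustration only.
train_ids = ["a1", "b2", "c3", "d4"]
duplicate_pairs = [("b2", "eval_17"), ("d4", "eval_42")]
clean_train_ids = remove_overlaps(train_ids, duplicate_pairs)
print(clean_train_ids)  # prints ['a1', 'c3']
```

Filtering only the training side leaves the evaluation set identical to the one used by other works, which is what keeps the reported numbers comparable.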

Published 21st June 2022