You are viewing the description for version V4 of Open Images. The latest version of Open Images is V7 (released Sep 2021).

Overview of Open Images V4

Open Images is a dataset of ~9M images that have been annotated with image-level labels, object bounding boxes and visual relationships. The training set of V4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). This also encourages structural image annotations, such as visual relationships. Moreover, the dataset is annotated with image-level labels spanning thousands of classes.

Open Images Extended

Open Images Extended is a collection of sets that complement the core Open Images Dataset with additional images and/or annotations. You can read more about this in the Extended section. The rest of this page describes the core Open Images Dataset, without Extensions.


The following paper describes Open Images V4 in depth: from the data collection and annotation to detailed statistics about the data and evaluation of models trained on it.

A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari.
The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale.
arXiv:1811.00982, 2018.
@article{OpenImages,
  author = {Alina Kuznetsova and Hassan Rom and Neil Alldrin and Jasper Uijlings and Ivan Krasin and Jordi Pont-Tuset and Shahab Kamali and Stefan Popov and Matteo Malloci and Tom Duerig and Vittorio Ferrari},
  title = {The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale},
  year = {2018},
  journal = {arXiv:1811.00982}
}

If you use the Open Images dataset in your work, please cite this article, as well as the following reference:

Krasin I., Duerig T., Alldrin N., Ferrari V., Abu-El-Haija S., Kuznetsova A., Rom H., Uijlings J., Popov S., Kamali S., Malloci M., Pont-Tuset J., Veit A., Belongie S., Gomes V., Gupta A., Sun C., Chechik G., Cai D., Feng Z., Narayanan D., Murphy K.
OpenImages: A public dataset for large-scale multi-label and multi-class image classification, 2017.
Available from
@article{OpenImages2,
  title={OpenImages: A public dataset for large-scale multi-label and multi-class image classification.},
  author={Krasin, Ivan and Duerig, Tom and Alldrin, Neil and Ferrari, Vittorio and Abu-El-Haija, Sami and Kuznetsova, Alina and Rom, Hassan and Uijlings, Jasper and Popov, Stefan and Kamali, Shahab and Malloci, Matteo and Pont-Tuset, Jordi and Veit, Andreas and Belongie, Serge and Gomes, Victor and Gupta, Abhinav and Sun, Chen and Chechik, Gal and Cai, David and Feng, Zheyun and Narayanan, Dhyanesh and Murphy, Kevin},
  journal={Dataset available from},
  year={2017}
}

Data organization

The dataset is split into a training set (9,011,219 images), a validation set (41,620 images), and a test set (125,436 images). The images are annotated with image-level labels, bounding boxes and visual relationships as described below.

Image-level labels

Table 1 shows an overview of the image-level labels in all splits of the dataset. All images have image-level labels automatically generated by a computer vision model similar to the Google Cloud Vision API. These machine-generated labels have a substantial false positive rate.

Table 1: Image-level labels.

                           Train        Validation   Test        # Classes   # Trainable Classes
Images                     9,011,219    41,620       125,436     -           -
Machine-Generated Labels   78,977,695   512,093      1,545,835   7,870       4,764
Human-Verified Labels      27,894,289   551,390      1,667,399   19,794      7,186
  pos                      13,444,569   365,772      1,105,052
  neg                      14,449,720   185,618      562,347

Moreover, the validation and test sets, as well as part of the training set, have human-verified image-level labels. Most verifications were done by in-house annotators at Google. A smaller part was done by crowd-sourcing through the Image Labeler in the Crowdsource app. This verification process practically eliminates false positives (but not false negatives: some labels might be missing from an image). The resulting labels are largely correct and we recommend using them for training computer vision models. Multiple computer vision models were used to generate the samples (not just the one used for the machine-generated labels), which is why the vocabulary is significantly expanded (# Classes column in Table 1).

Overall, there are 19,995 distinct classes with image-level labels. Note that this number is slightly higher than the number of classes with human-verified labels in the above table (19,794). The reason is that a small number of labels in the machine-generated set do not appear in the human-verified set. Trainable classes are those with at least 100 positive human verifications in the V4 training set. Based on this definition, 7,186 classes are considered trainable.
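The trainable-class rule above can be sketched in a few lines. This is only an illustration: the row layout (image ID, class MID, confidence with 1 for a positive verification and 0 for a negative one) and the MIDs are assumptions made for the example, not the format of the released files.

```python
from collections import Counter

def trainable_classes(rows, min_positives=100):
    """Return the classes with at least `min_positives` positive verifications.

    `rows` is an iterable of (image_id, class_mid, confidence) tuples.
    This layout is assumed for the sketch: confidence 1 marks a positive
    human verification and 0 a negative one.
    """
    positives = Counter(mid for _, mid, conf in rows if conf == 1)
    return {mid for mid, count in positives.items() if count >= min_positives}

# Toy data with made-up MIDs: one class has exactly 100 positives,
# the other has 99 positives plus 50 negatives.
rows = [(f"img{i}", "/m/example_a", 1) for i in range(100)]
rows += [(f"img{i}", "/m/example_b", 1) for i in range(99)]
rows += [(f"img{i}", "/m/example_b", 0) for i in range(50)]

print(sorted(trainable_classes(rows)))
```

Only positive verifications count toward the threshold, so negatives do not make a class trainable however many there are.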

Bounding boxes

Table 2 shows an overview of the bounding box annotations in all splits of the dataset, which span 600 object classes. These offer a broader range than those in the ILSVRC and COCO detection challenges, including new objects such as "fedora" and "snowman".

Table 2: Boxes.

         Train        Validation   Test      # Classes
Images   1,743,042    41,620       125,436   -
Boxes    14,610,229   204,621      625,282   600
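The per-image box averages quoted on this page follow directly from the counts in Table 2. The short sketch below reproduces that arithmetic; the dictionary and function names are chosen for the example only.

```python
# Counts copied from Table 2: (images, boxes) per split.
TABLE_2 = {
    "train": (1_743_042, 14_610_229),
    "validation": (41_620, 204_621),
    "test": (125_436, 625_282),
}

def boxes_per_image(images, boxes):
    """Average number of boxes per annotated image."""
    return boxes / images

averages = {split: boxes_per_image(*counts) for split, counts in TABLE_2.items()}
for split, value in averages.items():
    print(f"{split}: {value:.1f} boxes per image")
```

Rounded to one decimal, the training split gives 8.4, and the validation and test splits give 4.9 and 5.0.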

For the training set, we annotated boxes in 1.74M images, for the available positive human-verified image-level labels. We focused on the most specific labels. For example, if an image has labels {car, limousine, screwdriver}, we annotated boxes for limousine and screwdriver. For each label in an image, we exhaustively annotated every instance of that object class in the image (but see below for group cases). We provide 14.6M bounding boxes. On average there are 8.4 boxed objects per image. 90% of the boxes were manually drawn by professional annotators at Google using the efficient extreme clicking interface [1]. We produced the remaining 10% semi-automatically using an enhanced version of the method in [2]. These boxes have been human verified to have IoU>0.7 with a perfect box on the object, and in practice they are accurate (mean IoU ~0.82). We have drawn bounding boxes for human body parts and the class "Mammal" only for 95,335 images, due to the overwhelming number of instances (1,327,596 on the 95,335 images). We drew a single box around groups of objects (e.g., a bed of flowers or a crowd of people) if they had more than 5 instances which were heavily occluding each other and were physically touching (we marked these boxes with the attribute "group-of").
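For readers unfamiliar with the IoU (intersection over union) criterion mentioned above, here is a minimal sketch of how it is computed for two axis-aligned boxes. The (x_min, y_min, x_max, y_max) box layout and the function names are assumptions for the example. In the dataset the check was performed by human verifiers, so the helper below only illustrates what the 0.7 threshold means.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an
    assumption of the sketch.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0

def exceeds_threshold(candidate, reference, threshold=0.7):
    """True if the candidate box has IoU above `threshold` with the reference."""
    return iou(candidate, reference) > threshold

# A box that misses one tenth of the reference height: IoU = 90 / 100.
print(iou((0, 0, 10, 9), (0, 0, 10, 10)))
```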

For the validation and test sets, we provide exhaustive box annotation for all object instances, for all available positive image-level labels (again, except for "groups of"). All boxes were manually drawn. We deliberately tried to annotate boxes at the most specific level possible in our semantic hierarchy. On average, there are ~5 boxes per image in the validation and test sets.
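Selecting the most specific labels can be sketched as dropping every label that is an ancestor of another label on the same image. The child-to-parent map below is a toy stand-in invented for the example (only the car/limousine relation is implied by the text) and is not the dataset's actual hierarchy.

```python
# Hypothetical child -> parent map, for illustration only.
PARENT = {"limousine": "car", "car": "vehicle"}

def ancestors(label, parent=PARENT):
    """Return all ancestors of `label` by following parent links."""
    found = set()
    while label in parent:
        label = parent[label]
        if label in found:  # guard against accidental cycles
            break
        found.add(label)
    return found

def most_specific(labels, parent=PARENT):
    """Keep only labels that are not an ancestor of another given label."""
    labels = set(labels)
    covered = set()
    for label in labels:
        covered |= ancestors(label, parent)
    return labels - covered

print(sorted(most_specific({"car", "limousine", "screwdriver"})))
```

With the example from the training-set description, {car, limousine, screwdriver}, the label car is dropped because limousine is more specific, leaving limousine and screwdriver.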

In all splits (train, val, test), annotators also marked a set of attributes for each box, e.g. indicating whether that object is occluded (see the full description in the download section).

Visual relationships

Table 3 shows an overview of the visual relationship annotations in the train split of the dataset.

Table 3: Relationships.

                        Train     Validation   Test     # Distinct relationship triplets   # Classes   # Attributes
Relationship triplets   374,768   3,983        12,248   329 (obj-obj: 287, attr: 42)       57          5

For the training set, we annotated all images that already contained bounding box annotations with visual relationships between objects, and for some objects we also annotated visual attributes (encoded as the "is" relationship).

In our notation, a pair of objects connected by a relationship forms a triplet (e.g. "beer on table"). Visual attributes are in fact also triplets, where an object is connected with an attribute using the relationship "is" (e.g. "table is wooden", "handbag is made of leather" or "bench is wooden"). We initially selected 467 possible triplets based on existing bounding box annotations. The 329 of them that have at least one instance in the training set form the final set of visual relationship/attribute triplets. In total, we annotated 375K instances of these triplets on the training set, involving 57 different object classes and 5 attributes. These include both human-object relationships (e.g. "woman playing guitar", "man holding microphone") and object-object relationships (e.g. "beer on table", "dog inside car").

Annotations are exhaustive, meaning that for each image that can potentially contain a relationship triplet (i.e. contains the objects involved in that triplet), we provide annotations exhaustively listing all positive triplet instances in that image. For example, for "woman playing guitar" in an image, we list all pairs of ("woman", "guitar") that are in the relationship "playing" in that image. All other pairs of ("woman", "guitar") in that image are negative examples for the "playing" relationship.
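Because the annotations are exhaustive, negative examples can be derived rather than stored. The sketch below shows the idea for one image and one relationship; the box identifiers and function name are hypothetical and chosen only for the example.

```python
from itertools import product

def split_pairs(subject_boxes, object_boxes, positive_pairs):
    """Split all candidate (subject, object) pairs into positives and negatives.

    Every pair of boxes of the two classes is a candidate for the
    relationship; pairs not listed as positive are negatives.
    """
    candidates = set(product(subject_boxes, object_boxes))
    positives = set(positive_pairs) & candidates
    negatives = candidates - positives
    return positives, negatives

# Hypothetical box IDs for one image with two women and two guitars,
# where only woman_1 is annotated as playing guitar_1.
women = ["woman_1", "woman_2"]
guitars = ["guitar_1", "guitar_2"]
positives, negatives = split_pairs(women, guitars, [("woman_1", "guitar_1")])
print(sorted(negatives))
```

With two boxes of each class there are four candidate pairs, so one positive annotation implies three negatives for the "playing" relationship.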

For the 57 object classes with relationship annotations, the OID training set contains 3,290,070 bounding boxes and 2,077,154 image-level labels.

NOTE: we will provide visual relationships annotations on the test and validation sets soon - stay tuned!

Class definitions

Classes are identified by MIDs (Machine-generated Ids) as can be found in Freebase or Google Knowledge Graph API. A short description of each class is available in class-descriptions.csv.
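A class-description file of this kind can be loaded into a lookup table from MID to description. The sketch below assumes a header-less, two-column CSV layout (MID, description); this layout is an assumption, so check the actual file before relying on it. The sample rows use made-up MIDs.

```python
import csv
import io

# Made-up sample rows in the assumed "MID,description" layout.
SAMPLE = '/m/example1,Tortoise\n/m/example2,"Bed, four-poster"\n'

def load_class_descriptions(text):
    """Parse CSV text into a {mid: description} dictionary."""
    reader = csv.reader(io.StringIO(text))
    # Skip blank or malformed rows that lack both columns.
    return {row[0]: row[1] for row in reader if len(row) >= 2}

descriptions = load_class_descriptions(SAMPLE)
print(descriptions["/m/example1"])
```

Using the csv module rather than splitting on commas keeps quoted descriptions that contain commas intact.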

Statistics and data analysis

Hierarchy for the 600 boxable classes

View the set of boxable classes as a hierarchy here or download it as a JSON file:

Hierarchy Visualizer
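Once downloaded, a hierarchy JSON file can be traversed recursively. The sketch below assumes each node is an object with a "LabelName" key and an optional "Subcategory" list of child nodes; treat that schema as an assumption and confirm it against the file. The sample tree and its MIDs are invented for the example.

```python
import json

# Invented sample tree in the assumed schema.
SAMPLE = json.loads("""
{"LabelName": "/m/root", "Subcategory": [
  {"LabelName": "/m/vehicle", "Subcategory": [
    {"LabelName": "/m/car", "Subcategory": [{"LabelName": "/m/limousine"}]}
  ]},
  {"LabelName": "/m/tool"}
]}
""")

def walk(node, depth=0):
    """Yield (label, depth) for every node, parents before children."""
    yield node["LabelName"], depth
    for child in node.get("Subcategory", []):
        yield from walk(child, depth + 1)

def leaf_labels(node):
    """Return the labels of nodes that have no children."""
    children = node.get("Subcategory", [])
    if not children:
        return [node["LabelName"]]
    return [label for child in children for label in leaf_labels(child)]

for label, depth in walk(SAMPLE):
    print("  " * depth + label)
```

Leaf nodes correspond to the most specific classes, which is the level at which boxes were preferentially annotated.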

Label distributions

The following figures show the distribution of annotations across the dataset. Notice that the label distribution is heavily skewed (note: the y-axis is on a log-scale). Classes are ordered by number of positive samples. Green indicates positive samples while red indicates negatives.

[Figures: label frequencies for the training, validation and test sets; bounding box frequencies for the training, validation and test sets.]


The annotations are licensed by Google LLC under the CC BY 4.0 license. The images are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.


  1. "Extreme clicking for efficient object annotation", Papadopoulos et al., ICCV 2017.

  2. "We don't need no bounding-boxes: Training object class detectors using only human verification", Papadopoulos et al., CVPR 2016.