Open Images Challenge 2018 Visual Relationships Detection evaluation
For the Visual Relationships Detection track, we use two tasks: relationship detection and phrase detection.
In the relationship detection task, the expected output is two object detections with their correct class labels, and the label of the relationship that connects them (for the object-is-attribute case, the two boxes are identical).
In the phrase detection task, the expected output is a single detection enclosing both objects, two object labels and one relationship label.
The Intersection-over-Union (IoU) threshold is set to 0.5. Participants will be evaluated on a weighted sum of the following metrics:
- Mean Average Precision of relationships detection at IoU > threshold (mAPrel).
- Recall@N of relationships detection at IoU > threshold (Recall@Nrel).
- Mean Average Precision of phrase detection at IoU > threshold (mAPphrase).
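As a minimal sketch of how the three metrics combine into one score: the function below computes a plain weighted sum. The function name and the `weights` argument are illustrative assumptions. The actual weights are not stated in this section, so none are hard-coded here.

```python
def challenge_score(map_rel, recall_at_n_rel, map_phrase, weights):
    """Combine the three metrics into a single score as a weighted sum.

    `weights` is a (w_map_rel, w_recall, w_map_phrase) tuple supplied by the
    caller. This section does not state the real values, so none are assumed.
    """
    w_map_rel, w_recall, w_map_phrase = weights
    return (w_map_rel * map_rel
            + w_recall * recall_at_n_rel
            + w_map_phrase * map_phrase)


# Usage with placeholder weights (illustrative only, not the challenge values).
example = challenge_score(0.5, 0.5, 0.5, weights=(0.25, 0.5, 0.25))
```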
All three metrics were used for the evaluation of Visual Relationships Detection in previous works. However, the performance of state-of-the-art algorithms on the relationships detection task is still very low (due to the difficulty of the task), so we decided to introduce an additional retrieval metric, Recall@N, for relationships detection. Note that phrase detection is more tractable, so using mAP is sufficient.
Note that group-of boxes and hierarchy effects are not taken into account during evaluation.
mAPrel in relationships detection
For each relationship type (e.g. 'at', 'on'), Average Precision (AP) is computed by extending the PASCAL VOC 2010 definition to relationship triplets. The main modification is that the matching criterion must apply to the two object boxes and three class labels (two object labels and a relationship label). We consider a detected triplet to be a True Positive (TP) if and only if both object boxes have IoU > threshold with a previously undetected ground-truth annotation, and all three labels match their corresponding ground-truth labels. Any other detection is considered a False Positive (FP) in two cases: (1) both object class labels are annotated in that image (whether as positive or negative); or (2) one or both labels are annotated as negative. Finally, if either of the labels is unannotated, the detection is not evaluated (ignored). mAPrel is computed as the average of the per-relationship APs.
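The True Positive criterion above can be sketched as follows. This is a simplified illustration, not the official implementation: the dictionary keys (`subject_box`, `object_box`, `labels`, `score`), the `(xmin, ymin, xmax, ymax)` box layout, and the choice to pick the unmatched ground truth with the highest minimum IoU are all assumptions. The sketch only separates TPs from non-TPs. Deciding whether a non-TP is a False Positive or ignored needs the image-level label annotations and is left out.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_triplets(detections, ground_truth, threshold=0.5):
    """Return one bool per detection (True = TP), in descending score order."""
    matched = [False] * len(ground_truth)
    results = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        best_index, best_iou = None, threshold
        for i, gt in enumerate(ground_truth):
            # Skip ground truth that is already detected or whose labels differ.
            if matched[i] or det["labels"] != gt["labels"]:
                continue
            # Both boxes must pass, so score the pair by its weaker overlap.
            pair_iou = min(iou(det["subject_box"], gt["subject_box"]),
                           iou(det["object_box"], gt["object_box"]))
            if pair_iou > best_iou:
                best_index, best_iou = i, pair_iou
        if best_index is not None:
            matched[best_index] = True
        results.append(best_index is not None)
    return results
```

Here `labels` is assumed to be a `(subject_label, relationship_label, object_label)` tuple, so a single equality test covers all three labels.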
Recall@Nrel in relationships detection
The triplet detections are sorted by score and then the top N predictions are evaluated as TP, FP or ignored (see above). A recall point is scored if at least one True Positive is found among these top N detections.
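A literal reading of this rule can be sketched as below. The function name and the `(score, outcome)` input format are assumptions made for illustration, and the TP/FP/ignored outcomes are taken as already computed. The official implementation may aggregate recall differently.

```python
def recall_point_scored(scored_outcomes, n):
    """Return True if any of the top-n detections by score is a TP.

    `scored_outcomes` is a list of (score, outcome) pairs, where outcome is
    one of "TP", "FP" or "ignored".
    """
    top_n = sorted(scored_outcomes, key=lambda pair: pair[0], reverse=True)[:n]
    return any(outcome == "TP" for _, outcome in top_n)


# Usage: the only TP is ranked third, so it counts for n=3 but not for n=2.
predictions = [(0.9, "FP"), (0.7, "TP"), (0.8, "ignored")]
hit_at_2 = recall_point_scored(predictions, 2)
hit_at_3 = recall_point_scored(predictions, 3)
```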
mAPphrase in phrase detection
Each relationship detection triplet is transformed so that a single enclosing bounding box is formed from the two object detections. This bounding box has three labels attached (two object labels and one relationship label). The enclosing box is considered to be a TP if IoU > threshold with a previously undetected ground-truth annotation and all three labels match their corresponding ground-truth labels. The AP for each relationship type is computed according to the PASCAL VOC 2010 definition. mAPphrase is computed as the average of per-relationship APs.
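The transformation from a relationship triplet to a phrase detection can be sketched as follows. The dictionary keys and the `(xmin, ymin, xmax, ymax)` box layout are assumptions for illustration. The enclosing box is simply the smallest box that contains both object boxes.

```python
def enclosing_box(box_a, box_b):
    """Smallest (xmin, ymin, xmax, ymax) box containing both input boxes."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))


def to_phrase_detection(triplet):
    """Replace the two object boxes of a triplet by one enclosing box."""
    return {
        "box": enclosing_box(triplet["subject_box"], triplet["object_box"]),
        "labels": triplet["labels"],  # two object labels and one relationship label
        "score": triplet["score"],
    }


# Usage with a hypothetical detection.
phrase = to_phrase_detection({
    "subject_box": (0, 0, 10, 10),
    "object_box": (5, 5, 20, 20),
    "labels": ("person", "on", "horse"),
    "score": 0.9,
})
```

The resulting single box is then matched against the ground-truth enclosing boxes using the same IoU > threshold test.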
The implementation of these metrics is publicly available as part of the TensorFlow Object Detection API under the name 'OID Challenge Visual Relationship Detection Metric 2018'. The software provides several diagnostic metrics (such as per-class AP); however, those metrics will not be used for the final ranking.
To obtain the evaluation results, use the oid_vrd_challenge_evaluation.py utility.
Please see this tutorial on how to run the metric.
The evaluation server is hosted by Kaggle.
Note: you need to be registered for the competition on the Kaggle website to be able to submit results. The registration deadline is August 23, 2018.