^{1}

^{*}

^{2}

^{1}

^{2}

^{1}

The authors have declared that no competing interests exist.

Conceived and designed the experiments: JNI DBC. Performed the experiments: JNI RK TP. Analyzed the data: JNI RK. Wrote the paper: JNI RK JS DBC. Helped with writing: TP JS DBC.

We aim to improve segmentation through the use of machine learning tools during region agglomeration. We propose an active learning approach for performing hierarchical agglomerative segmentation from superpixels. Our method combines multiple features at all scales of the agglomerative process, works for data with an arbitrary number of dimensions, and scales to very large datasets. We advocate the use of variation of information to measure segmentation accuracy, particularly in 3D electron microscopy (EM) images of neural tissue, and using this metric demonstrate an improvement over competing algorithms in EM and natural images.

Image segmentation, a fundamental problem in computer vision, concerns the division of an image into meaningful constituent regions, or segments.

In addition to having applications in computer vision and object recognition (

Additionally, automated segmentation of EM images presents significant challenges compared to that of natural images (

Note that the data is isotropic, meaning it has the same resolution along every axis. The goal of segmentation here is to partition the volume into individual neurons, two of which are shown in orange and blue. The volume is densely packed by these thin neuronal processes taking long, tortuous paths.

A common approach in the field is to perform oversegmentation into small segments called

We hypothesized that agglomeration could be improved by using more information than just the boundary mean, despite the latter’s desirable characteristics. A priority function could draw from many additional features, such as boundary variance and region texture. Using training data in which pairs of superpixels have been labeled as “merge” or “don’t merge”, we could then apply machine learning techniques to predict from those features whether two superpixels should be merged. With that simple approach, however, we found that the guaranteed effectiveness of the mean could easily disappear. Similarly to the case with the boundary mean, the region sizes progressively increase and so does the amount of evidence for or against a merge. However, we could encounter a combination of features for which we had no training data.

To get around this problem, we developed an active learning paradigm that generates training examples across every level of the agglomeration hierarchy and thus across very different segment scales. In active learning, the algorithm determines what example it wants to learn from next, based on the previous training data. For agglomerative segmentation, we ask the classifier which two regions it believes should be merged, and compare those against the ground truth to obtain the next training example. By doing this at all levels of the agglomeration hierarchy, we ensure that we have samples from all parts of the feature space that the classifier is likely to encounter.

Past learning methods either used a manual combination of a small number of features

We describe below our method for collecting training data for agglomerative segmentation. Throughout a training agglomeration, we consult a human-generated gold standard segmentation to determine whether each merge is correct. This allows us to learn a merge function at the many scales of agglomeration. We show that our learned agglomeration outperforms state of the art agglomeration algorithms in natural image segmentation (

To evaluate segmentations, we advocate the use of variation of information (VI) as a metric and show that it can be used to improve the interpretability of segmentation results and aid in their analysis.

The ideas in this work are implemented in an open-source Python library called Gala that performs agglomeration learning and segmentation in arbitrary dimensions.

The method described below is illustrated and summarized in

Let

There are many methods to obtain

The RAG is defined as follows. Each node

We then define the merge priority function (MPF) or policy

The mean probability of boundary along the edge is but one example of a merge priority function. In this work, we propose finding an optimal

We first define the optimal agglomeration

From this, we can work out a label between two regions:

Now, given an initial policy

After epoch

There remains the issue of choosing a suitable initial policy. We found that the mean boundary probability or even random numbers work well, but, to obtain the fastest convergence, we generate the training set consisting of every labeled edge in the initial graph (with no agglomeration),

In this section, we describe the feature maps used in our work. We call primitive features “cues”, from which we compute the actual features used in the learning. We did not focus on these maps extensively, and expect that these are not the last word with respect to useful features for agglomerative segmentation learning.

For natural images, we use the gPb oriented boundary map

For EM data, we use four separate cues: a probability map of cell boundaries, cytoplasm, mitochondria, and glia. Mitochondria were labeled by hand using the active contours function in the ITK-SNAP software package

Let

For

For natural image segmentation, we added several mid-level features based on region orientation and convex hulls. For orientation features, the orientation of each region is estimated from the region’s second moment matrix. We use the angle between the two regions, as well and the angles between each region and a line segment connecting their centroids, as features. For convex hull features, we calculated the volume of the convex hull of each region, as well as for their union, and used the ratios between these convex hulls volumes and the volumes of the regions themselves as a measure of the convexity of regions.

Before we describe the main results of our paper, a discussion of evaluation methods is warranted, since even the question of the “correct” evaluation method is the subject of active research.

The most commonly used method is boundary precision-recall

The use of boundary precision-recall has deficiencies as a segmentation metric, since small changes in boundary detection can result in large topological differences between segmentations. This is particularly problematic in neuronal EM images, where the goal of segmentation is to elucidate the connectivity of extremely long, thin segments that have tiny (and error-prone) branch points. For such images, the number of mislabeled boundary pixels is irrelevant compared to the precise location and topological impact of the errors

The region evaluation measure of choice in the segmentation literature has been the Rand index (RI)

However, RI has several disadvantages, such as being sensitive to rescaling and having a limited useful range

VI overcomes all of the disadvantages of the Rand index and has several other advantages, such as being a formal metric

Like the Rand index, VI is sensitive to topological changes but not to small variations in boundary changes, which is critical in EM segmentation. Unlike RI, however, errors in VI scale linearly in the size of the error whereas the RI scales quadratically. This makes VI more directly comparable between volumes. In addition, because RI is based on point pairs, and because the vast majority of pairs are in disjoint regions, RI has a limited useful range very near 1, and that range is different for each dataset. In contrast, VI ranges between 0 and

VI is by its definition (

Shaded areas correspond to mean

Left: split-VI plot. Stars represent optimal VI (minimum sum of x and y axis), circles represent VI at threshold

In addition, each of the under- and oversegmentation terms can be further broken down into its constituent errors. The oversegmentation term of a VI distance is defined as

In light of the utility of VI, our evaluation is based on VI, particularly for EM data. For natural images, we also present boundary precision-recall and other measures, to facilitate comparison to past work. In addition to boundary PR values, RI, and VI, we show values for the covering, a measure of overlap between segments

We present in this paper the segmentation performance of several agglomerative algorithms, defined below. As a baseline we show results from agglomeration using only the mean boundary probability between segments (“mean”).

For natural images, we also show the results when oriented boundary maps are used (“mean-orient”), which is the algorithm presented by Arbeláez

Our proposed method, using an actively-trained classifier and agglomeration, is denoted as “agglo”. For details, see Section 1 of the Methods and

A similar method, described by Jain

In order to show the effect of our agglomerative learning, we also compare using a classifier trained on only the initial graph before agglomeration (“flat”).

Our starting dataset was a

To generate a gold standard segmentation, an initial segmentation based on pixel intensity alone was manually proofread using software specifically designed for this purpose (called Raveler)

Compared with mean agglomeration or with a flat learning strategy, our active agglomerative learning algorithm improved segmentation performance modestly but significantly (

In addition, the agglomerative training appears to dramatically improve the probability estimates from the classifier. If the probability estimates from a classifier are accurate, then, under reasonable assumptions, we expect the minimum VI to occur at or near

(Flat learning is equivalent to 0 agglomerative training epochs.) (a) VI as a function of threshold for mean, flat learning, and agglomerative learning (5 epochs). Stars indicate minimum VI, circles indicate VI at

This suggests that agglomerative learning improves the classifier probability estimates. Indeed, the minimum VI and the VI at

Although we implemented our algorithm to work specifically on isotropic data, we attempted to segment the publicly available SNEMI3D challenge dataset (available at

We also show the results of our algorithm on the Berkeley Segmentation Dataset (BSDS500)

Our algorithm improves segmentation as measured by all the above evaluation metrics (

Algorithm | Covering | RI | VI | F-measure | ||||||

ODS | OIS | Best | ODS | OIS | ODS | OIS | ODS | OIS | AP | |

human | 0.72 | 0.72 | – | 0.88 | 0.88 | 1.17 | 1.17 | 0.80 | 0.80 | – |

agglo | ||||||||||

flat | 0.608 | 0.658 | 0.753 | 0.830 | 0.859 | 1.63 | 1.42 | 0.726 | ||

oriented mean |
0.584 | 0.643 | 0.741 | 0.824 | 0.854 | 1.71 | 1.49 | 0.725 | 0.759 | 0.758 |

mean | 0.540 | 0.597 | 0.694 | 0.791 | 0.834 | 1.80 | 1.63 | 0.643 | 0.666 | 0.689 |

In

as measured by VI at the optimal dataset scale (ODS). Each point represents one image. Numbered and colored points correspond to the example images in

Several example segmentations are shown in

We have presented a method for learning agglomerative segmentation. By performing agglomeration while comparing with a ground truth, we learn to merge segments at all scales of agglomeration. And, by guiding the agglomeration with the previous best policy, we guarantee that the examples we learn match those that will be encountered during a test agglomeration. Indeed, the difference in behavior between agglomerative learning and flat learning is immediately apparent and striking when watching the agglomerations occur side by side (see Video S4).

LASH

Another difference is that Jain

Recent work also attempts to use machine learning to classify on a merge hierarchy starting from watershed superpixels

Bjoern Andres, Fred Hamprecht and colleagues have devoted much effort to the use of graphical models to perform a one-shot agglomeration of supervoxels

First, the theoretical scalability of a global optimization is limited, which could become a problem as volumes exceed the teravoxel range. In contrast, GALA and other hierarchical methods could theoretically be implemented in a Pregel-like massively parallel graph framework

Second, despite the significant progress of the last decade, the accuracy of all currently available segmentation methods is orders of magnitude too small for their output to be used directly without human proofreading

A lot of the effort in connectomics focuses on the segmentation of anisotropic serial-section EM volumes

The feature space for agglomeration is also worthy of additional exploration. For EM data, we included pixel probabilities of boundary, cytoplasm, mitochondria, and glia. Classifier predictions for synapses and vesicles might give further improvements

A weakness of our method is its requirement for a full gold standard segmentation for training. This data might not be easily obtained, and indeed this has been a bottleneck in moving the method “from benchside to bedside”, so to speak. We are therefore in the process of modifying the method to a semi-supervised approach that would require far less training data to achieve similar performance.

Finally, the field of neuronal reconstruction will depend on segmentation algorithms that not only segment well, but point to the probable location of errors. Although it requires further improvements in speed, scalability, and usability, our method is a first step in that direction.

The source code for the Gala Python library can be found at:

(PDF)

(TIF)

(TIF)

(MP4)

We thank Bill Katz for critical reading of the manuscript, C. Shan Xu and Harald Hess for the generation of the image data, Mat Saunders for generation of the ground truth data, Shaul Druckmann for help editing figures, and Viren Jain, Louis Scheffer, Steve Plaza, Phil Winston, Don Olbris and Nathan Clack for useful discussions.