The authors have declared that no competing interests exist.

Conceived and designed the experiments: RCP PME. Performed the experiments: RCP. Analyzed the data: RCP PME. Wrote the paper: RCP PME.

This work builds upon previous efforts in online incremental learning, namely the Incremental Gaussian Mixture Network (IGMN). The IGMN is capable of learning from data streams in a single pass, improving its model after analyzing each data point and discarding it thereafter. Nevertheless, it suffers from a scalability standpoint, due to its asymptotic time complexity of O(NKD³) for N data points, K Gaussian components, and D dimensions. In this work we improve this complexity to O(NKD²) by deriving formulas for working directly with precision matrices instead of covariance matrices. The final result is a much faster and more scalable algorithm, which can be applied to high-dimensional tasks. This is confirmed by applying the modified algorithm to high-dimensional classification datasets.

The Incremental Gaussian Mixture Network (IGMN) [] is an online, incremental learning algorithm based on Gaussian mixture models, capable of learning from data streams in a single pass.

IGMN adopts a Gaussian mixture model of distribution components that can be expanded to accommodate new information from an input data point, or reduced if spurious components are identified along the learning process. Each data point assimilated by the model contributes to the sequential update of the model parameters based on the maximization of the likelihood of the data. The parameters are updated through the accumulation of relevant information extracted from each data point. New points are added directly to existing Gaussian components or new components are created when necessary, avoiding merge and split operations, much like what is seen in the Adaptive Resonance Theory (ART) algorithms [].

The IGMN is capable of supervised learning, simply by assigning any of its input vector elements as outputs. In other words, any element can be used to predict any other element, like auto-associative neural networks [].

Previous successful applications of the IGMN algorithm include time-series prediction [], among other tasks.

However, the IGMN suffers from cubic time complexity due to matrix inversion operations and determinant computations. Its time complexity is O(D³), where D is the dimensionality of the problem. In this work we show how to reduce it to O(D²) for learning while keeping the quality of a full covariance matrix solution.

For the specific case of the IGMN algorithm, to the best of our knowledge, this has not been tried before, although similar efforts can be found for related algorithms [].

The next Section describes the algorithm in more detail with the latest improvements to date. Section 3 describes our improvements to the algorithm. Section 4 shows the experiments and results obtained from both versions of the IGMN for comparison, and Section 5 finishes this work with concluding remarks.

In the next subsections we describe the current version of the IGMN algorithm, a slightly improved version of the one described in [].

The algorithm starts with no components, which are created as necessary (see subsection 2.2). Given input x (a single instantaneous data point), the IGMN processing step starts by computing the squared Mahalanobis distance d²_M(x, j) to each component j:

d²_M(x, j) = (x − μ_j)ᵀ C_j⁻¹ (x − μ_j)

where μ_j is the j-th component mean and C_j its full covariance matrix. If any d²_M(x, j) is smaller than χ²_{D,1−β} (the 1−β percentile of a chi-squared distribution with D degrees of freedom, where β is a user-defined meta-parameter), an update will occur, and posterior probabilities are calculated for each component as follows:

p(x | j) = (2π)^{−D/2} |C_j|^{−1/2} exp(−½ d²_M(x, j))

p(j | x) = p(x | j) p(j) / Σ_{k=1}^{K} p(x | k) p(k)

where K is the number of components. The component parameters are then updated as follows:

v_j(t) = v_j(t−1) + 1

sp_j(t) = sp_j(t−1) + p(j | x)

e_j = x − μ_j(t−1)

ω_j = p(j | x) / sp_j

Δμ_j = ω_j e_j

μ_j(t) = μ_j(t−1) + Δμ_j

e*_j = x − μ_j(t)

C_j(t) = C_j(t−1) − Δμ_j Δμ_jᵀ + ω_j (e*_j e*_jᵀ − C_j(t−1))

p(j) = sp_j / Σ_{q=1}^{K} sp_q

where sp_j and v_j are the accumulator of posterior probabilities and the age of component j, respectively, and p(j) is its prior probability.
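For concreteness, the learning step for a single component can be sketched as follows. This is a minimal NumPy sketch of the covariance-form update described above; the function and variable names are ours, not from a reference implementation.

```python
import numpy as np

def igmn_update(x, mu, C, sp, v, post):
    """One IGMN learning step for a single component (covariance form).
    post is the posterior probability p(j|x) of this component."""
    v = v + 1.0                      # age
    sp = sp + post                   # accumulator of posteriors
    e = x - mu                       # error w.r.t. the old mean
    w = post / sp                    # learning rate omega_j
    dmu = w * e
    mu = mu + dmu                    # updated mean
    e_star = x - mu                  # error w.r.t. the new mean
    # C(t) = C(t-1) - dmu dmu^T + w (e* e*^T - C(t-1))
    C = (1.0 - w) * C + w * np.outer(e_star, e_star) - np.outer(dmu, dmu)
    return mu, C, sp, v
```

Note that the covariance update is a rank-two modification of C, a fact that will be exploited later to avoid matrix inversions.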

If the update condition in the previous subsection is not met, then a new component j is created and initialized as follows:

μ_j = x;  sp_j = 1;  v_j = 1;  p(j) = 1 / Σ_{q=1}^{K} sp_q;  C_j = σ²_ini I

where K already includes the new component and σ_ini can be obtained by:

σ_ini = δ std(x)

where δ is a manually chosen scaling factor (e.g., 0.01) and std is the standard deviation of the dataset. Note that, in an online setting, the entire dataset may not be available beforehand, in which case the standard deviation can be estimated incrementally or simply guessed from the expected data range.

Optionally, a component j can be removed whenever v_j > v_min and sp_j < sp_min, where v_min and sp_min are manually chosen (e.g., 5.0 and 3.0, respectively). In that case, p(k) must also be readjusted for the remaining components. In other words, each component is given some time v_min to show its importance to the model in the form of an accumulation of its posterior probabilities sp_j. Those components are entirely removed from the model instead of merged with other components, because we assume they represent outliers. Since the removed components have small accumulated activations, their removal has almost no negative impact on the model quality, often producing a positive impact on generalization performance due to model simplification (a more thorough analysis of parameter sensitivity for the IGMN algorithm can be found in []).
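The creation and pruning rules can be sketched as below. This is illustrative NumPy code under our own naming; the dictionary layout is an assumption, not the authors' data structure.

```python
import numpy as np

def create_component(x, sigma_ini):
    """Initialize a new IGMN component centered on input x (covariance form).
    sigma_ini = delta * std(dataset) in the original formulation."""
    D = x.shape[0]
    return {"mu": x.copy(), "C": sigma_ini ** 2 * np.eye(D),
            "sp": 1.0, "v": 1.0}

def prune(components, v_min, sp_min):
    """Remove components that lived at least v_min steps but accumulated
    less than sp_min posterior probability (assumed to be outliers)."""
    return [c for c in components
            if not (c["v"] > v_min and c["sp"] < sp_min)]
```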

In the IGMN, any element can be predicted by any other element. In other words, inputs and targets are presented together as inputs during training. Thus, inference is done by reconstructing data from the target elements (x_t, a slice of the entire input vector x) by estimating them from the given elements (x_i, also a slice of the entire input vector x). This is done by computing the posterior probabilities considering only the given elements x_i, with the target elements x_t removed from the calculations. After that, x_t can be reconstructed using the conditional mean equation:

x̂_t = Σ_{j=1}^{K} p(j | x_i) (μ_{j,t} + C_{j,ti} C_{j,i}⁻¹ (x_i − μ_{j,i}))

where C_{j,ti} is the sub-matrix of the j-th component covariance matrix associating the unknown and known parts of the data, C_{j,i} is the sub-matrix corresponding to the known part only, and μ_{j,i} is the j-th component mean without the elements corresponding to the unknown part.
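The per-component conditional mean can be sketched as follows (a NumPy sketch under our own naming; `idx_i` and `idx_t` are the index sets of the known and target elements, and the full IGMN output is the posterior-weighted sum of this quantity over all components):

```python
import numpy as np

def conditional_mean(x_i, mu, C, idx_i, idx_t):
    """Reconstruct the target slice x_t from the known slice x_i for one
    Gaussian component, via x_t = mu_t + C_ti @ inv(C_i) @ (x_i - mu_i)."""
    mu_i, mu_t = mu[idx_i], mu[idx_t]
    C_i = C[np.ix_(idx_i, idx_i)]    # known-part covariance block
    C_ti = C[np.ix_(idx_t, idx_i)]   # cross-covariance (targets x knowns)
    # solve instead of explicit inversion, for numerical stability
    return mu_t + C_ti @ np.linalg.solve(C_i, x_i - mu_i)
```

When the known slice equals the component mean, the reconstruction reduces to the target slice of the mean, as expected.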

One of the contributions of this work lies in the fact that matrix inversion is an expensive operation: its time complexity is O(D³) for conventional algorithms, O(D^{log₂7+o(1)}) ≈ O(D^{2.807}) for the Strassen algorithm, or at best O(D^{2.3728639}) with the most recent algorithms to date []. Since the IGMN must invert the covariance matrix of each component in order to compute Mahalanobis distances and likelihoods, these inversions dominate its runtime in high dimensions; we therefore avoid them entirely by maintaining the precision matrix instead.

Firstly, let us denote the precision matrix C⁻¹ = Λ (also known as the concentration matrix), which we will maintain and update instead of C.

We now proceed to adapt the covariance matrix update equation, which can be rearranged into a scaling followed by two rank-one updates:

C_j(t) = (1 − ω_j) C_j(t−1) + (√ω_j e*_j)(√ω_j e*_j)ᵀ − Δμ_j Δμ_jᵀ

This allows us to apply the Sherman-Morrison formula []:

(A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹ u vᵀ A⁻¹) / (1 + vᵀ A⁻¹ u)

This formula shows how to update the inverse of a matrix after a rank-one update. For the second update, which subtracts, the formula becomes:

(A − uvᵀ)⁻¹ = A⁻¹ + (A⁻¹ u vᵀ A⁻¹) / (1 − vᵀ A⁻¹ u)
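Both identities are easy to verify numerically. The following NumPy check uses our own helper names and is only a sanity-check sketch, not part of the algorithm itself:

```python
import numpy as np

def sm_add(A_inv, u, v):
    """(A + u v^T)^-1 from A^-1 via Sherman-Morrison (rank-one addition)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

def sm_sub(A_inv, u, v):
    """(A - u v^T)^-1 from A^-1 (rank-one subtraction)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv + np.outer(Au, vA) / (1.0 - v @ Au)
```

Each call costs only a few matrix-vector products and one outer product, i.e. O(D²), instead of the O(D³) of a fresh inversion.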

In the context of the IGMN, we have A = (1 − ω_j) C_j(t−1), with u = v = √ω_j e*_j for the first (additive) update and u = v = Δμ_j for the second (subtractive) one. Rewriting Eqs () and () accordingly, and writing Λ_A = Λ_j(t−1) / (1 − ω_j) for the precision of the scaled matrix, we obtain:

Λ′ = Λ_A − (Λ_A u)(Λ_A u)ᵀ / (1 + uᵀ Λ_A u)

Λ_j(t) = Λ′ + (Λ′ Δμ_j)(Λ′ Δμ_j)ᵀ / (1 − Δμ_jᵀ Λ′ Δμ_j)

These two equations allow us to update the precision matrix directly, eliminating the need for the covariance matrix C_j entirely. Both updates have O(D²) complexity, involving only matrix-vector and outer products.

Following on the adaptation of the IGMN equations, the squared Mahalanobis distance can now be computed directly from the precision matrix:

d²_M(x, j) = (x − μ_j)ᵀ Λ_j (x − μ_j)

which now has O(D²) complexity, since there is no matrix inversion as in the original equation. Note that the Sherman-Morrison identity is exact; thus the Mahalanobis computation yields exactly the same result, as will be shown in the experiments. After removing the cubic complexity from this step, the determinant computation will be dealt with next.
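A minimal check that the precision-based distance matches the covariance-based one (a sketch; the function name is ours):

```python
import numpy as np

def mahalanobis_sq(x, mu, precision):
    """Squared Mahalanobis distance computed from the precision matrix
    Lambda = C^-1: (x - mu)^T Lambda (x - mu). Only O(D^2) products,
    no inversion and no linear solve."""
    d = x - mu
    return d @ precision @ d
```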

Since the determinant of the inverse of a matrix is simply the inverse of the determinant, it is sufficient to invert the result. But computing the determinant itself is also an O(D³) operation, so we will instead perform rank-one updates using the Matrix Determinant Lemma []:

det(A + uvᵀ) = det(A)(1 + vᵀ A⁻¹ u)

det(A − uvᵀ) = det(A)(1 − vᵀ A⁻¹ u)

Since the IGMN covariance matrix update involves a rank-two update, adding one term and then subtracting another, both rules must be applied in sequence, similar to what has been done with the precision matrix updates.
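The lemma can be sketched and verified numerically as follows (our own helper; note the chaining caveat in the docstring):

```python
import numpy as np

def det_after_rank_one(det_A, A_inv, u, v, sign=1.0):
    """Matrix determinant lemma: det(A +/- u v^T) = det(A)(1 +/- v^T A^-1 u).
    When chaining two updates (as in the IGMN rank-two case), the A_inv
    passed to the second call must already reflect the first update."""
    return det_A * (1.0 + sign * (v @ A_inv @ u))
```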

This was the last source of cubic complexity, which is now quadratic.

Finishing the adaptation in the learning part of the algorithm, we just need to define the initialization of Λ and |C| whenever a new component is created. The complete learning step for each updated component j then becomes:

v_j(t) = v_j(t−1) + 1

sp_j(t) = sp_j(t−1) + p(j | x)

e_j = x − μ_j(t−1)

ω_j = p(j | x) / sp_j

Δμ_j = ω_j e_j

μ_j(t) = μ_j(t−1) + Δμ_j

followed by the precision matrix and determinant updates derived above. A new component K is initialized with μ_K = x, Λ_K = σ_ini⁻² I, |C_K| = |Λ_K|⁻¹, sp_K = 1, v_K = 1; since the initial matrix is diagonal, no actual inversion is ever performed.
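Putting the pieces together, one learning step on Λ_j and |C_j| can be sketched as follows (NumPy, with our own naming; a sketch of the derivation above, not a reference implementation):

```python
import numpy as np

def figmn_precision_update(Lam, detC, w, e_star, dmu):
    """Rank-two update of the precision matrix Lambda = C^-1 and of det(C),
    mirroring C(t) = (1-w) C + w e* e*^T - dmu dmu^T without forming C."""
    D = Lam.shape[0]
    # step 0: scaling C by (1 - w) divides Lambda and scales det(C)
    Lam = Lam / (1.0 - w)
    detC = detC * (1.0 - w) ** D
    # step 1: additive rank-one term + w e* e*^T (Sherman-Morrison)
    u = np.sqrt(w) * e_star
    Lu = Lam @ u
    denom = 1.0 + u @ Lu
    Lam = Lam - np.outer(Lu, Lu) / denom
    detC = detC * denom              # matrix determinant lemma
    # step 2: subtractive rank-one term - dmu dmu^T
    Ld = Lam @ dmu
    denom = 1.0 - dmu @ Ld
    Lam = Lam + np.outer(Ld, Ld) / denom
    detC = detC * denom
    return Lam, detC
```

The whole step costs O(D²), and the resulting Λ and |C| are mathematically identical to recomputing them from the covariance form.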

Finally, the inference equation must also be adapted, since the covariance sub-matrices it requires are no longer available.

Here, the conditional mean equation requires the blocks C_{j,ti} and C_{j,i}⁻¹. But since the terms that constitute these sub-matrices are relative to the original covariance matrix (which we do not have), they must be extracted from the precision matrix directly. Considering the block decomposition of Λ_j into Λ_{j,i}, Λ_{j,ti}, Λ_{j,it} and Λ_{j,t}, it is clear that C_{j,ti} C_{j,i}⁻¹ = −Λ_{j,t}⁻¹ Λ_{j,ti} (the terms between parentheses in the block inversion identity cancel out, and Λ_{j,it} = Λ_{j,ti}ᵀ due to symmetry). So the reconstruction becomes:

x̂_t = Σ_{j=1}^{K} p(j | x_i) (μ_{j,t} − Λ_{j,t}⁻¹ Λ_{j,ti} (x_i − μ_{j,i}))

where Λ_{j,t} must be inverted, resulting in O(D²) complexity for learning and O(D³) for inference. The reason for us not to worry about that is that the number of target elements is usually much smaller than D, so the dominant cost of inference is the Λ_{j,ti}(x_i − μ_{j,i}) product. In fact, Weka (the data mining platform used in this work []) uses a single class attribute as the output for classification tasks, reducing the inversion to a scalar division.
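The precision-based reconstruction for one component can be sketched as below (our own naming; only the small target block of Λ is ever inverted, here via a linear solve):

```python
import numpy as np

def reconstruct_from_precision(x_i, mu, Lam, idx_i, idx_t):
    """Conditional mean using only the precision matrix:
    x_t = mu_t - inv(Lam_t) @ Lam_ti @ (x_i - mu_i).
    Lam_t is |t| x |t|, typically tiny compared with D."""
    Lam_t = Lam[np.ix_(idx_t, idx_t)]    # target block
    Lam_ti = Lam[np.ix_(idx_t, idx_i)]   # target-by-known block
    return mu[idx_t] - np.linalg.solve(Lam_t, Lam_ti @ (x_i - mu[idx_i]))
```

The result agrees with the covariance-based conditional mean, as guaranteed by the block inversion identity.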

The first experiment was meant to verify that both IGMN implementations produce exactly the same results. They were both applied to 7 standard datasets distributed with the Weka software (summarized in the table below).

Dataset | Instances (N) | Attributes (D) | Classes |
---|---|---|---|
breast-cancer | 286 | 9 | 2 |
pima-diabetes | 768 | 8 | 2 |
Glass | 214 | 9 | 7 |
ionosphere | 351 | 34 | 2 |
iris | 150 | 4 | 3 |
labor-neg-data | 57 | 16 | 2 |
soybean | 683 | 35 | 19 |
MNIST [] | 70000 | 784 | 10 |
CIFAR-10 [] | 60000 | 3072 | 10 |

Dataset | RF | NN | Lin. SVM | RBF SVM | IGMN | FIGMN |
---|---|---|---|---|---|---|
breast-cancer | 69.6 ± 9.1 | 75.2 ± 6.5 | 69.3 ± 7.5 | 70.6 ± 1.5 | 71.4 ± 7.4 | 71.4 ± 7.4 |
pima-diabetes | 75.8 ± 3.5 | 74.2 ± 4.9 | 77.5 ± 4.4 | 65.1 ± 0.4 | 73.0 ± 4.5 | 73.0 ± 4.5 |
Glass | 79.9 ± 5.0 | 53.8 ± 7.4 | 62.7 ± 7.8 | 68.8 ± 8.7 | 65.4 ± 4.9 | 65.4 ± 4.9 |
ionosphere | 92.9 ± 3.6 | 92.6 ± 2.4 | 88.0 ± 3.5 | 93.5 ± 3.0 | 92.6 ± 3.8 | 92.6 ± 3.8 |
iris | 95.3 ± 4.5 | 95.3 ± 5.5 | 96.7 ± 4.7 | 96.7 ± 3.5 | 97.3 ± 3.4 | 97.3 ± 3.4 |
labor-neg-data | 89.7 ± 14.3 | 89.7 ± 14.3 | 93.3 ± 11.7 | 93.3 ± 8.6 | 94.7 ± 8.6 | 94.7 ± 8.6 |
soybean | 93.0 ± 3.1 | 93.0 ± 2.4 | 94.0 ± 2.2 | 88.7 ± 3.0 | 91.5 ± 5.4 | 91.5 ± 5.4 |
Average | 85.2 | 82.0 | 83.1 | 82.4 | 83.7 | 83.7 |

• statistically significant degradation

Dataset | # of Components |
---|---|
breast-cancer | 14.2 ± 1.9 |
pima-diabetes | 19.4 ± 1.3 |
Glass | 15.9 ± 1.1 |
ionosphere | 74.4 ± 1.4 |
iris | 2.7 ± 0.7 |
labor-neg-data | 12.0 ± 1.2 |
soybean | 42.6 ± 2.2 |

Besides the confirmation we wanted, we could also compare the IGMN/FIGMN classification accuracy on the referred datasets against 4 other algorithms: Random Forest (RF), Neural Network (NN), Linear SVM and RBF SVM. The neural network is a parallel implementation of a state-of-the-art Dropout Neural Network [].

A second experiment was performed in order to evaluate the speed performance of the proposed algorithm: both the original and improved IGMN algorithms, using the same parameters, were applied to the high-dimensional MNIST and CIFAR-10 datasets.

Results can be seen in the table below.

Dataset | IGMN Training | FIGMN Training | IGMN Testing | FIGMN Testing |
---|---|---|---|---|
MNIST | 32,544.69 | 1,629.81 | 3,836.06 | 230.92 |
CIFAR-10 | 2,758,252* | 15,545.05 | - | 795.98 |

* estimated time projected from 100 data points

Finally, both versions of the IGMN algorithm were compared as the data dimensionality grows, confirming the expected scaling behavior of each.

We have shown how to work directly with precision matrices in the IGMN algorithm, avoiding costly matrix inversions by performing rank-one updates. The determinant computations were also avoided using a similar method, effectively eliminating any source of cubic complexity from the learning algorithm. This resulted in substantial speedups for high-dimensional datasets, turning the IGMN into a good option for this kind of task. The inference operation still has cubic complexity, but we argue that it has a much smaller impact on the total runtime of the algorithm, since the number of outputs is usually much smaller than the number of inputs. This was confirmed in the experiments.

In general, we could see that the fast IGMN is a good option for supervised learning, with low runtimes and good accuracy. It should be noted that this is achieved with a single pass through the data, making it also a valid option for data streams.