
UQLAB user manual
score (see Eq. (1.3)). This choice, however, tends to favor distributions with a larger number
of parameters, which more easily adapt to the data but possibly lead to overfitting. To avoid
overfitting, a penalty term on the number of model parameters can be introduced.
The Akaike Information Criterion (AIC; Akaike, 1974) selects the distribution which minimizes the quantity
AIC = 2k − 2 log(L), (1.10)
where k is the number of model parameters.
The Bayesian Information Criterion (BIC; Schwarz, 1978) penalizes the number of distribution parameters even more strongly, selecting the distribution which minimizes
BIC = log(n)k − 2 log(L), (1.11)
where n is the number of data points used to fit the distribution.
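The two criteria above can be sketched in a few lines. The snippet below fits several candidate families to synthetic data and evaluates Eqs. (1.10) and (1.11) for each; note that scipy.stats is used here as an illustrative stand-in, not the actual UQ[PY]LAB inference module, and the candidate set and data are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic data from a known family (Gamma), purely for illustration.
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=1.5, size=1000)

# Candidate parametric families (scipy.stats stands in for the
# UQ[PY]LAB inference module -- an assumption, not its actual API).
candidates = {
    "gaussian": stats.norm,
    "gamma": stats.gamma,
    "lognormal": stats.lognorm,
}

n = len(data)
scores = {}
for name, family in candidates.items():
    params = family.fit(data)            # maximum-likelihood fit
    log_l = np.sum(family.logpdf(data, *params))
    k = len(params)                      # number of fitted parameters
    aic = 2 * k - 2 * log_l              # Eq. (1.10)
    bic = np.log(n) * k - 2 * log_l      # Eq. (1.11)
    scores[name] = (aic, bic)

best_aic = min(scores, key=lambda name: scores[name][0])
```

Since log(n) > 2 for any n > 7, BIC always applies a heavier penalty per parameter than AIC on realistic sample sizes, which is why it tends to favor more parsimonious families.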
Sometimes, maximizing the total likelihood (with or without penalization) produces a probability distribution with a substantial peak centered where most data points accumulate, and
too little probability mass elsewhere. An example of this behavior is shown in Figure 1. A set
of 1 000 data points is drawn from a Gaussian distribution truncated to the left (µ = 0, σ = 1,
support [1, +∞); true distribution and histogram of the data shown in the left panel). Five
different families are then fitted to the data: Gaussian, truncated Gaussian, Weibull, Gamma,
and Lognormal distributions. For the latter four, the truncation interval is specified as the
true one, [1, +∞). The Gamma distribution yields the lowest AIC, despite visually exhibiting
the largest deviation from the histogram of the data. In this case, one would intuitively want
to select the distribution which most closely follows the data histogram.
This intuition is formalized by the Kolmogorov-Smirnov distance (K-S) criterion. The criterion
selects the family whose cumulative distribution has the lowest maximum distance from the
empirical CDF of the data (1.6), that is, which minimizes
d_KS = max_{x ∈ R^M} |F_X(x) − H(x)|. (1.12)
In the example shown in Figure 1, right panel, the K-S criterion would select the Lognormal
distribution (cyan) among those illustrated in the figure.
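A minimal sketch of the K-S criterion of Eq. (1.12) for univariate data follows: each fitted CDF is compared to the empirical CDF at the data points, and the family with the smallest maximum deviation is kept. As before, scipy.stats stands in for the UQ[PY]LAB inference module, and the two candidate families and the synthetic data are assumptions for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

x = np.sort(data)
n = len(x)
# The empirical CDF H jumps at each data point; for a continuous fitted
# CDF the maximum deviation occurs at one of these jumps, so it suffices
# to evaluate H just below and at every sorted data point.
ecdf_hi = np.arange(1, n + 1) / n
ecdf_lo = np.arange(0, n) / n

d_ks = {}
for name, family in {"gaussian": stats.norm,
                     "lognormal": stats.lognorm}.items():
    params = family.fit(data)          # maximum-likelihood fit
    cdf = family.cdf(x, *params)       # fitted CDF F_X at the data points
    d_ks[name] = max(np.max(np.abs(cdf - ecdf_hi)),
                     np.max(np.abs(cdf - ecdf_lo)))

best = min(d_ks, key=d_ks.get)         # family minimizing Eq. (1.12)
```

Evaluating both one-sided gaps (`ecdf_lo`, `ecdf_hi`) is what makes this the exact maximum distance rather than an approximation on a grid.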
Currently, the K-S criterion is supported in UQ[PY]LAB for univariate distributions only.
1.3.3 Pair-copula selection for vine copulas
Selecting the vine copula that best fits a data set X, or its counterpart U defined in (1.5),
among all existing vines of dimension M, involves the following steps:
1. Selecting a vine structure (for C- and D-vines: selecting the order of the nodes);
2. Selecting the parametric family of each pair copula;
3. Fitting the pair copula parameters to U.
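Step 3 can be illustrated for a single Gaussian pair copula, whose parameter ρ admits a closed-form estimator by inverting Kendall's tau: ρ = sin(πτ/2). This is one standard moment-based estimator, sketched here under the assumption that it stands in for whatever fitting scheme UQ[PY]LAB actually applies; the simulated pseudo-observations are likewise illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Correlated standard-normal sample with true rho = 0.6, mapped to the
# unit hypercube to mimic the pseudo-observations U (cf. Eq. (1.5)).
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])
z = rng.multivariate_normal(np.zeros(2), cov, size=2000)
u = stats.norm.cdf(z)

# Kendall's tau is rank-based, so it can be computed directly on U.
tau, _ = stats.kendalltau(u[:, 0], u[:, 1])

# Moment estimator for the Gaussian pair-copula parameter:
# rho = sin(pi * tau / 2).
rho_hat = np.sin(np.pi * tau / 2.0)
```

In practice, steps 1-3 are interleaved: the structure and the family of each pair copula are typically chosen greedily tree by tree, refitting parameters as pairs are added.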
UQ[PY]LAB-V1.0-114 - 6 -