PLOS ONE (Public Library of Science, San Francisco, CA, USA), ISSN 1932-6203
Research Article. Manuscript PONE-D-16-48856. DOI: 10.1371/journal.pone.0188012

Accelerating cross-validation with total variation and its application to super-resolution imaging

Tomoyuki Obuchi^{1}* (http://orcid.org/0000-0003-1216-489X), Shiro Ikeda^{2}, Kazunori Akiyama^{3,4,5}, Yoshiyuki Kabashima^{1}

1 Department of Mathematical and Computing Science, Tokyo Institute of Technology, Yokohama 226-8502, Japan
2 The Institute of Statistical Mathematics, Tachikawa, Tokyo, 190-8562, Japan
3 Haystack Observatory, Massachusetts Institute of Technology, Westford, MA, 01886, United States of America
4 National Astronomy Observatory of Japan, Osawa 2-21-1, Mitaka, Tokyo 181-8588, Japan
5 Black Hole Initiative, Harvard University, Cambridge, MA, 02138, United States of America

Editor: Yuanquan Wang, Beijing University of Technology, China

Author contributions. Tomoyuki Obuchi: formal analysis, funding acquisition, investigation, methodology, validation, visualization, writing (original draft), writing (review and editing). Shiro Ikeda: data curation, funding acquisition, project administration, software. Kazunori Akiyama: data curation, funding acquisition, visualization. Yoshiyuki Kabashima: conceptualization, formal analysis, funding acquisition, methodology, project administration.
The authors have declared that no competing interests exist.
* E-mail: obuchi@c.titech.ac.jp
Received: 10 December 2016; Accepted: 30 October 2017; Published: 7 December 2017. PLoS ONE 12(12): e0188012.
Copyright: 2017 Obuchi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
We develop an approximation formula for the cross-validation error (CVE) of a sparse linear regression penalized by ℓ_{1}-norm and total variation terms, which is based on a perturbative expansion utilizing the largeness of both the data dimensionality and the model. The developed formula allows us to reduce the necessary computational cost of the CVE evaluation significantly. The practicality of the formula is tested through application to simulated black-hole image reconstruction on the event-horizon scale with super resolution. The results demonstrate that our approximation reproduces the CVE values obtained via literally conducted cross-validation with reasonably good precision.
Funding: This work was supported by Japan Society for the Promotion of Science 25120008 to Prof. Shiro Ikeda; 26870185 to Dr. Tomoyuki Obuchi; 25120013 to Prof. Yoshiyuki Kabashima; 17H00764 to Dr. Tomoyuki Obuchi and Prof. Yoshiyuki Kabashima; Research Abroad Program to Dr. Kazunori Akiyama; and National Science Foundation AST-1614868 to Dr. Kazunori Akiyama.
Data Availability: All image data used in the paper are available from http://vlbiimaging.csail.mit.edu/imagingchallenge.
1 Introduction
At present, in many practical situations of science and technology, large high-dimensional observational datasets are created and accumulated on a continuous basis. An essential difficulty concerning the treatment of such high-dimensional data is the extraction of meaningful information. Sparse modeling [1, 2] is a promising framework for overcoming this difficulty, which has recently been utilized in many disciplines [3, 4]. In this framework, a statistical or machine-learning model with a large number of parameters (explanatory variables) is fitted to the data, in conjunction with a certain sparsity-inducing penalty. This penalty should be appropriately chosen with consideration of the processed data. One representative penalty is the ℓ_{1} regularization, which retains certain preferred properties, such as the statistical model convexity [5, 6]. A similar penalty that has received more recent focus is the so-called “total variation (TV)” [7–9], which can be regarded as the ℓ_{1} regularization imposed on the difference between neighboring explanatory variables. The TV yields “continuity” of the neighboring variables, which is suitable for the processing of certain datasets expected to have such continuity, such as natural images [4, 7–9].
Another common difficulty associated with the use of statistical models is model selection. In the context of image processing using the ℓ_{1} and TV regularizations, this difficulty appears during the selection of appropriate regularization weights. A practical framework to select these weights, which is applicable to general situations, is cross-validation (CV). CV provides an estimator of the statistical-model generalization error, i.e., the CV error (CVE), using the data under control, and the minimum CVE obtained when sweeping the weights yields the optimal weight values. This versatile framework is, however, computationally demanding for large datasets/models, and this problem frequently becomes a bottleneck affecting model selection. Thus, reducing the CVE computational cost could have a significant impact on a broad range of sparse modeling applications in various disciplines.
Considering these circumstances, in this paper, we provide a CVE approximation formula for a statistical model of linear regression penalized by the ℓ_{1} and TV terms, to efficiently reduce the computational cost. Note that the formula for the case penalized by the ℓ_{1} term alone has already been proposed in [10], and the formula presented herein is a generalization of it. Below, we show the formula derivation and perform a demonstration in the context of super-resolution imaging. The processed images employed in this study are reconstructed from simulated observations of black holes on the event-horizon scale for the Event Horizon Telescope (EHT, see [11–13]) full array. Note that our formula will be applied to actual EHT observations to be conducted after April 2017.
2 Problem setting
Let us suppose that our measurement is a linear process, and denote the measurement result as y ∈ ℝ^{M} and the measurement matrix as A = {A_{μi}}_{μ=1,⋯,M; i=1,⋯,N} ∈ ℝ^{M×N}. The explanatory variables, corresponding to the images that will be examined in the later demonstration, are denoted by x ∈ ℝ^{N}. The quality of the fit to the data is described by the residual sum of squares (RSS), i.e., $E(x|y,A)=\frac{1}{2}\|y-Ax\|_2^2$. In addition, we consider the following penalty consisting of ℓ_{1} and TV terms:
$$R(x;\lambda_{\ell_1},\lambda_T)=\lambda_{\ell_1}\|x\|_1+\lambda_T T(x), \tag{1}$$
where the T(x) term corresponds to the TV and is expressed as
$$T(x)=\sum_i\sqrt{\sum_{j\in\partial i}(x_j-x_i)^2}\equiv\sum_i t_i, \tag{2}$$
and ∂i denotes the neighboring variables of the ith variable. There is some variation in the definition of “neighbors”; here, we follow the standard approach [7–9]. That is, x is assumed to be a two-dimensional image and the neighbors of the ith pixel correspond to the right and down pixels. However, the bottom row (the rightmost column) of the image is exceptional, as the neighbor of each pixel in that row (column) corresponds to the right (down) pixel only. Note that the developed approximation formula presented below is independent of this specific choice of neighbors and can be applied to general cases.
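The neighbor convention above can be made concrete with a short sketch. It evaluates the per-pixel TV terms of Eq (2) for a two-dimensional image with right and down neighbors, treating the last column and the bottom row as described. The function names `tv_terms` and `total_variation` and the optional `delta` argument (the softening introduced in Section 3) are illustrative choices, not taken from the paper.

```python
import numpy as np

def tv_terms(img, delta=0.0):
    """Per-pixel isotropic TV terms t_i, with right and down neighbours.

    Pixels in the last column have no right neighbour and pixels in the
    bottom row have no down neighbour, so those differences stay zero.
    delta > 0 gives the softened terms sqrt(sum of squares + delta^2).
    """
    img = np.asarray(img, dtype=float)
    dx = np.zeros_like(img)                   # difference to the right neighbour
    dy = np.zeros_like(img)                   # difference to the down neighbour
    dx[:, :-1] = img[:, 1:] - img[:, :-1]
    dy[:-1, :] = img[1:, :] - img[:-1, :]
    return np.sqrt(dx**2 + dy**2 + delta**2)

def total_variation(img, delta=0.0):
    """Sum of the per-pixel TV terms, i.e. T(x) of Eq (2)."""
    return tv_terms(img, delta).sum()

# 2 x 2 example: the terms are 5 (top-left), 3 (top-right), 4 (bottom-left), 0.
example = np.array([[0.0, 3.0],
                    [4.0, 0.0]])
example_tv = total_variation(example)         # 5 + 3 + 4 + 0 = 12
```

Keeping the per-pixel terms available, rather than only their sum, is convenient because the cluster enumeration in Section 5 thresholds each term separately.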
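For this setup, we consider the following linear regression problem with the penalty given in Eq (1):
$$\hat{x}(\lambda_{\ell_1},\lambda_T)=\operatorname*{arg\,min}_{x}\left\{E(x|y,A)+R(x;\lambda_{\ell_1},\lambda_T)\right\}, \tag{3}$$
where $\operatorname*{arg\,min}_{u}\{f(u)\}$ generally represents the argument that minimizes an arbitrary function f(u). Further, we consider the leave-one-out (LOO) CV of Eq (3) in the form
$$\hat{x}^{\backslash\mu}(\lambda_{\ell_1},\lambda_T)=\operatorname*{arg\,min}_{x}\left\{\frac{1}{2}\sum_{\nu(\neq\mu)}\Big(y_\nu-\sum_i A_{\nu i}x_i\Big)^2+R(x;\lambda_{\ell_1},\lambda_T)\right\}\equiv\operatorname*{arg\,min}_{x}\left\{E(x|y^{\backslash\mu},A^{\backslash\mu})+R(x;\lambda_{\ell_1},\lambda_T)\right\}. \tag{4}$$
Note that the system without the μth row of y and A is referred to as the “μth LOO system” hereafter. In this procedure, the CVE, i.e., the generalization error estimator, is
$$E_{\mathrm{LOO}}(\lambda_{\ell_1},\lambda_T)=\frac{1}{2}\sum_{\mu=1}^{M}\left(y_\mu-a_\mu^{\top}\hat{x}^{\backslash\mu}(\lambda_{\ell_1},\lambda_T)\right)^2, \tag{5}$$
where $a_\mu^{\top}=(A_{\mu1},\cdots,A_{\mu N})$ is the μth row vector of A. We term this simply the “LOO error (LOOE).”
Computing the LOOE requires solving Eq (4) M times, by definition, which is computationally expensive. Therefore, the purpose of this paper is to avoid this computational expense by deriving an approximation formula of Eq (5).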
3 Approximation formula for softened system
When M is sufficiently large, i.e., the number of observations is large enough, the difference between the LOO solution $\hat{x}^{\backslash\mu}$ and the full solution $\hat{x}$ is expected to be small. This intuition naturally motivates us to conduct a perturbation connecting these two solutions. To conduct this perturbation, we “soften” the penalty by introducing a small cutoff δ (> 0) in the TV, having the form
$$R\to R^{\delta}(x;\lambda_{\ell_1},\lambda_T)=\lambda_{\ell_1}\|x\|_1+\lambda_T T^{\delta}(x), \tag{6}$$
where
$$T^{\delta}(x)=\sum_i\sqrt{\sum_{j\in\partial i}(x_j-x_i)^2+\delta^2}\equiv\sum_i t_i^{\delta}. \tag{7}$$
An approximation formula in the presence of ℓ_{1} regularization with smooth cost functions has already been proposed in [10]. We employ that formula here and take the limit δ → 0.
To state the approximation formula, we begin by defining “active” and “killed” variables. Owing to the ℓ_{1} term, some variables are set to zero in x^; we refer to these variables as “killed variables.” The remaining finite variables are termed “active variables.” We denote the index sets of the active and killed variables by S_{A} and S_{K}, respectively. The active (killed) components of a vector x are formally expressed as x_{SA}(x_{SK}). For any matrix X, we use double subscripts in the same manner. For example, for an N × N matrix, a submatrix having row and column components of S_{A} and S_{K}, respectively, is denoted by X_{SA SK}.
The approximation formula can be derived through the following two steps. Note that, in this derivation, a crucial assumption is that the sets of active and killed variables are common among the full and LOO systems. This assumption may not hold exactly in practice, but the resultant formula is asymptotically exact in the large-N limit [10].
The first step is to compute the values of the active variables and their response to small perturbation. The active variables are determined by the extremization condition of the softened cost function with respect to the active variables, such that
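$$\frac{\partial\left(E(x|y^{\backslash\mu},A^{\backslash\mu})+R^{\delta}(x;\lambda_{\ell_1},\lambda_T)\right)}{\partial x_{S_A}}=0\ \Rightarrow\ (\hat{x}^{\delta\backslash\mu})_{S_A}. \tag{8}$$
The focus here is the response of this solution when a small perturbation −h · x is incorporated into the cost function. A simple computation demonstrates that the active–active components of the response function, $(\chi^{\delta\backslash\mu})_{S_AS_A}=\left.\frac{\partial}{\partial h_{S_A}}(\hat{x}^{\delta\backslash\mu})_{S_A}\right|_{h=0}$, are equivalent to the inverse of the cost-function Hessian
$$(\chi^{\delta\backslash\mu})_{S_AS_A}=\left(H^{\delta\backslash\mu}_{S_AS_A}\right)^{-1}, \tag{9}$$
$$H^{\delta\backslash\mu}=\partial_x^2\left(E(x|y^{\backslash\mu},A^{\backslash\mu})+R^{\delta}(x;\lambda_{\ell_1},\lambda_T)\right)=G^{\backslash\mu}+\partial_x^2R^{\delta}(x;\lambda_{\ell_1},\lambda_T), \tag{10}$$
where $\partial_x^2$ denotes the Hessian operator $\partial_x^2\equiv\left(\frac{\partial^2}{\partial x_i\partial x_j}\right)$ and $G^{\backslash\mu}$ is the Gram matrix of $A^{\backslash\mu}$, i.e., $G^{\backslash\mu}\equiv(A^{\backslash\mu})^{\top}A^{\backslash\mu}$. The other components of the response function are identically zero, from the stability assumption of S_{K} and because the killed variables are zero, with $\hat{x}_{S_K}=\hat{x}^{\backslash\mu}_{S_K}=0$.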
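In the second step, we connect the full solution to the LOO solution, through the above perturbation with an appropriate h. To specify the perturbation, we assume that the difference $d^{\delta}=\hat{x}^{\delta}-\hat{x}^{\delta\backslash\mu}$ is small and expand the RSS of the full system with respect to d^{δ} as follows:
$$E(\hat{x}^{\delta\backslash\mu}|y,A)\approx E(\hat{x}^{\delta}|y,A)-\sum_{\mu=1}^{M}\left(y_\mu-a_\mu^{\top}\hat{x}^{\delta}\right)a_\mu^{\top}d^{\delta}. \tag{11}$$
This equation implies that the perturbation between the full and LOO systems can be expressed as $h_\mu=(y_\mu-a_\mu^{\top}\hat{x}^{\delta})a_\mu$. Hence, we obtain
$$\hat{x}^{\delta}\approx\hat{x}^{\delta\backslash\mu}+\chi^{\delta\backslash\mu}h_\mu=\hat{x}^{\delta\backslash\mu}+\left(y_\mu-a_\mu^{\top}\hat{x}^{\delta}\right)\chi^{\delta\backslash\mu}a_\mu. \tag{12}$$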
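The Hessian of the full system has a simple relationship with the LOO Hessian, such that
$$H^{\delta}\equiv G^{\backslash\mu}+a_\mu a_\mu^{\top}+\partial_x^2R^{\delta}(\hat{x}^{\delta})\approx H^{\delta\backslash\mu}+a_\mu a_\mu^{\top}, \tag{13}$$
where the approximation on the right-hand side comes from replacing $\hat{x}^{\delta}$ with $\hat{x}^{\delta\backslash\mu}$ in the argument of R^{δ}(x). Inserting Eqs (12 and 13) in conjunction with $\chi^{\delta}_{S_AS_A}=(H^{\delta}_{S_AS_A})^{-1}$ into Eq (5) and using the Sherman-Morrison formula for matrix inversion, we find
$$E_{\mathrm{LOO}}(\lambda_{\ell_1},\lambda_T)\approx\frac{1}{2}\sum_{\mu=1}^{M}\frac{\left(y_\mu-a_\mu^{\top}\hat{x}^{\delta}\right)^2}{\left(1-a_{\mu S_A}^{\top}(\chi^{\delta})_{S_AS_A}a_{\mu S_A}\right)^2}. \tag{14}$$
According to Eq (14), we can compute the LOOE only from the full solution $\hat{x}^{\delta}$, without actually performing CV, which facilitates considerable reduction of the computational cost.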
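The step from Eqs (12 and 13) to Eq (14) is compressed in the text; one way to fill it in is sketched below. The shorthands $r_\mu$ and $c_\mu$ are introduced here for convenience and are not the paper's notation; all vectors and matrices are understood to be restricted to the active set S_{A}.

```latex
% Shorthands (assumptions of this sketch):
%   r_\mu = y_\mu - a_\mu^{\top}\hat{x}^{\delta}, \qquad
%   c_\mu = a_\mu^{\top}\chi^{\delta\backslash\mu}a_\mu .
\begin{align*}
% Multiply Eq (12) by a_\mu^{\top} and subtract from y_\mu:
y_\mu - a_\mu^{\top}\hat{x}^{\delta\backslash\mu}
  &\approx r_\mu + r_\mu\, a_\mu^{\top}\chi^{\delta\backslash\mu}a_\mu
   = r_\mu\,(1 + c_\mu), \\
% Sherman-Morrison applied to Eq (13), H^{\delta} \approx H^{\delta\backslash\mu} + a_\mu a_\mu^{\top}:
\chi^{\delta}
  &\approx \chi^{\delta\backslash\mu}
   - \frac{\chi^{\delta\backslash\mu}a_\mu a_\mu^{\top}\chi^{\delta\backslash\mu}}{1 + c_\mu}
  \quad\Longrightarrow\quad
  a_\mu^{\top}\chi^{\delta}a_\mu \approx \frac{c_\mu}{1 + c_\mu}, \\
% Hence 1 + c_\mu can be written with the full-system response only:
1 + c_\mu &\approx \frac{1}{1 - a_\mu^{\top}\chi^{\delta}a_\mu}
  \quad\Longrightarrow\quad
  y_\mu - a_\mu^{\top}\hat{x}^{\delta\backslash\mu}
  \approx \frac{r_\mu}{1 - a_\mu^{\top}\chi^{\delta}a_\mu}.
\end{align*}
```

Squaring the last expression, summing over μ, and multiplying by 1/2 reproduces Eq (14).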
4 Handling a singularity
Let us generalize Eq (14) to the limit δ → 0, where the penalty contains another singular term in addition to the ℓ_{1} term. This TV singularity tends to “lock” some of the neighboring variables, i.e., x_{j} = x_{i} (∀j ∈ ∂i), which corresponds to t_{i} = 0 in Eq (2). If two different vanishing TV terms, t_{i} and t_{j}, share a common variable x_{r}, all the variables in those TV terms take the same value x_{k} = x_{r} (∀k ∈ ({i} ∪ ∂i ∪ {j} ∪ ∂j)). In this manner, the active variables are separated into several “locked” clusters, with all the variables inside a cluster having an identical value. This implies that the variable response to a perturbation, χ = lim_{δ→0}χ^{δ}, should have the same value for all variables in a cluster and may, therefore, be merged. Below, we demonstrate this behavior for the δ → 0 limit. For the derivation, we assume that the clusters are common to both the full and LOO systems, similar to the assumption for S_{A} and S_{K}. For convenience, we index the clusters by α, β ∈ C and denote the number of clusters by |C|; the index set of variables in a cluster α is represented by S_{α} and the total set of indices in all clusters is denoted by S_{C} ≡ ∪_{α}S_{α}. Hereafter, we concentrate on the active variable space only and omit the killed variable space. The complement set of S_{C}, i.e., the set of isolated variables that do not belong to any cluster, is denoted by S_{I} and, thus, S_{A} = S_{I} ∪ S_{C}.
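Two crucial observations for the derivation are the “scale separation” and the presence of the “zero mode.” For vanishing TV terms, a natural scaling to satisfy $\lim_{\delta\to0}t_i^{\delta}=t_i=0$ is $|\hat{x}_j^{\delta}-\hat{x}_i^{\delta}|\propto\delta\ (\forall j\in\partial i)$. Once this scaling is assumed, we realize that the components of the Hessian that are directly related to the clusters diverge. Let us denote by $\hat{S}_\alpha$ the set of TV terms corresponding to cluster α, i.e., $\hat{S}_\alpha=\{i\,|\,(\{i\}\cup\partial i)\subset S_\alpha\}$. Hence, by construction and for all α ∈ C, all components of $D_\alpha^{\delta}\equiv\lambda_T\left(\partial_x^2\sum_{i\in\hat{S}_\alpha}t_i^{\delta}\right)_{S_\alpha S_\alpha}$ are scaled as 1/δ and, thus, diverge as δ → 0. The remaining terms are retained as O(1). According to this “scale separation,” we decompose the Hessian as H^{δ} = D^{δ} + F^{δ}, where D^{δ} is the direct sum of the diverging components in the naively extended space, $D^{\delta}=\bigoplus_\alpha D_\alpha^{\delta}$, and F^{δ} consists of the remaining O(1) terms. This decomposition can be schematically expressed as
$$H^{\delta}=D^{\delta}+F^{\delta}=\begin{pmatrix}D_1^{\delta}&0&0&0\\0&\ddots&0&0\\0&0&D_{|C|}^{\delta}&0\\0&0&0&0\end{pmatrix}+\begin{pmatrix}F^{\delta}_{S_CS_C}&F^{\delta}_{S_CS_I}\\F^{\delta}_{S_IS_C}&F^{\delta}_{S_IS_I}\end{pmatrix}. \tag{15}$$
We denote the basis of the current expression by {e_{i}}_{i∈SA}, with (e_{i})_{j} = δ_{ij}, and move to another basis that diagonalizes $D^{\delta}_{S_CS_C}$. Each $D_\alpha^{\delta}$ has a “zero mode,” and its normalized eigenvector is given by z_{α} = (z_{iα}), where $z_{i\alpha}=1/\sqrt{|S_\alpha|}$ for i ∈ S_{α} and 0 otherwise, in the full space. This behavior originates from the symmetry, such that the $\{t_i^{\delta}\}_{i\in\hat{S}_\alpha}$ are invariant under a uniform shift in the S_{α} sub-space, i.e., x_{j} → x_{j} + Δ (∀j ∈ S_{α}) for ∀Δ ∈ ℝ. This invariance can also be directly seen from a property of the Hessian, i.e., $\frac{\partial^2}{\partial x_i^2}t_i^{\delta}+\sum_{j\in\partial i}\frac{\partial^2}{\partial x_i\partial x_j}t_i^{\delta}=0$.
In addition, we represent the set of normalized eigenvectors of all the other modes of $D_\alpha^{\delta}$, which have eigenvalues λ_{αa} that are proportional to 1/δ and positively divergent, as $\{u_{\alpha a}\}_{a=1}^{|S_\alpha|-1}$. Then, {{u_{αa}}_{a}, z_{α}}_{α} diagonalizes $D^{\delta}_{S_CS_C}$ and {{{u_{αa}}_{a}, z_{α}}_{α}, {e_{i}}_{i∈SI}} constitutes an orthonormal basis of the full space. Corresponding to this variable change, we denote by $\tilde{S}_Z$, $\tilde{S}_{I+Z}$, and $\tilde{S}_{C-Z}$ the index sets of variables in the spaces spanned by {z_{α}}_{α}, {{z_{α}}_{α}, {e_{i}}_{i∈SI}}, and {u_{αa}}_{α,a}, respectively. In the new expression, we can rewrite H^{δ} = D^{δ} + F^{δ} as
$$H^{\delta}=\begin{pmatrix}D^{\delta}_{\tilde{S}_{C-Z}\tilde{S}_{C-Z}}&0\\0&0\end{pmatrix}+\begin{pmatrix}F^{\delta}_{\tilde{S}_{C-Z}\tilde{S}_{C-Z}}&F^{\delta}_{\tilde{S}_{C-Z}\tilde{S}_{I+Z}}\\F^{\delta}_{\tilde{S}_{I+Z}\tilde{S}_{C-Z}}&F^{\delta}_{\tilde{S}_{I+Z}\tilde{S}_{I+Z}}\end{pmatrix}, \tag{16}$$
where $D^{\delta}_{\tilde{S}_{C-Z}\tilde{S}_{C-Z}}=\mathrm{diag}(\{\lambda_{\alpha a}\}_{\alpha,a})$. Because of the divergence of $D^{\delta}_{\tilde{S}_{C-Z}\tilde{S}_{C-Z}}$, only $F^{\delta}_{\tilde{S}_{I+Z}\tilde{S}_{I+Z}}$ is relevant for the evaluation of (H^{δ})^{−1}. These considerations yield the explicit formula of χ as
$$(H^{\delta})^{-1}=\begin{pmatrix}\left(D^{\delta}_{\tilde{S}_{C-Z}\tilde{S}_{C-Z}}\right)^{-1}&0\\0&\left(F^{\delta}_{\tilde{S}_{I+Z}\tilde{S}_{I+Z}}\right)^{-1}\end{pmatrix}+O(\delta)\ \to\ \begin{pmatrix}0&0\\0&\left(F_{\tilde{S}_{I+Z}\tilde{S}_{I+Z}}\right)^{-1}\end{pmatrix}=\chi, \tag{17}$$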
where F = lim_{δ→0}F^{δ}.
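By construction, in the reduced space spanned by ({z_{α}}_{α}, {e_{i}}_{i∈SI}), $F_{\tilde{S}_{I+Z}\tilde{S}_{I+Z}}$ can be expressed as
$$F_{\tilde{S}_{I+Z}\tilde{S}_{I+Z}}=\sum_{\alpha,\beta}\left(F_{\alpha\beta}z_\alpha z_\beta^{\top}+F_{\beta\alpha}z_\beta z_\alpha^{\top}\right)+\sum_{\alpha}\sum_{i\in S_I}\left(F_{\alpha i}z_\alpha e_i^{\top}+F_{i\alpha}e_i z_\alpha^{\top}\right)+\sum_{i,j\in S_I}F_{ij}e_ie_j^{\top}. \tag{18}$$
As the non-zero components of the zero mode z_{α} are identically given as $1/\sqrt{|S_\alpha|}$, all these coefficients can be easily expressed by the original coefficients F_{ij}, as
$$F_{\alpha\beta}=z_\alpha^{\top}Fz_\beta=\frac{1}{\sqrt{|S_\alpha||S_\beta|}}\sum_{i\in S_\alpha,\,j\in S_\beta}F_{ij},\qquad F_{\alpha i}=z_\alpha^{\top}Fe_i=\frac{1}{\sqrt{|S_\alpha|}}\sum_{j\in S_\alpha}F_{ji}, \tag{19}$$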
and F_{iα} = F_{αi} by the symmetry. Now, all the components are explicitly specified. The form of χ in the original basis {e_{i}}_{i∈SA} can be accordingly assessed by moving back from the basis {{{u_{αa}}_{a}, z_{α}}_{α}, {e_{i}}_{i∈SI}}, which completes the computation.
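Some additional consideration of the above computation demonstrates that we can shorten some steps and obtain a more interpretable result. We introduce a $|\tilde{S}_{I+Z}|\times|\tilde{S}_{I+Z}|$ matrix $\bar{F}$ as
$$\bar{F}_{\alpha\beta}=\sqrt{|S_\alpha||S_\beta|}\,F_{\alpha\beta},\qquad \bar{F}_{\alpha i}=\sqrt{|S_\alpha|}\,F_{\alpha i}, \tag{20}$$
with the remaining components being identical to those of $F_{\tilde{S}_{I+Z}\tilde{S}_{I+Z}}$, i.e., $\bar{F}_{S_IS_I}=F_{S_IS_I}$. Eqs (19) and (20) indicate that $\bar{F}$ is simply the matrix summing the rows and columns in each cluster to a row and a column. It is natural that $\bar{F}$ has a direct connection to χ, because the locked variables in a cluster should exhibit the same response against perturbation. In fact, the response function χ in the original basis is expressed using $\bar{F}$ as
$$\chi=\sum_{i,j\in S_I}\bar{F}^{-1}_{ij}\left(e_ie_j^{\top}+e_je_i^{\top}\right)+\sum_{\alpha,\beta}\bar{F}^{-1}_{\alpha\beta}\sum_{i\in S_\alpha}\sum_{j\in S_\beta}e_ie_j^{\top}+\sum_{\alpha}\left(\sum_{i\in S_\alpha}\sum_{j\in S_I}\bar{F}^{-1}_{\alpha j}e_ie_j^{\top}+\sum_{i\in S_I}\sum_{j\in S_\alpha}\bar{F}^{-1}_{i\alpha}e_ie_j^{\top}\right). \tag{21}$$
This can be directly shown from Eqs (17 and 19), using the relation $F_{\tilde{S}_{I+Z}\tilde{S}_{I+Z}}=P\bar{F}_{\tilde{S}_{I+Z}\tilde{S}_{I+Z}}P$ with $P=\mathrm{diag}(\{1\}_{i\in S_I},\{|S_\alpha|^{-1/2}\}_\alpha)$, and the blockwise matrix inversion formula. Eqs (14) and (21) constitute the main result of this paper.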
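The merging idea can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' code: the diverging part D^{δ} is modeled by a complete-graph Laplacian inside each cluster divided by δ (a stand-in that has the same zero mode as the TV Hessian), F is a random positive-definite matrix, and `labels` is a hypothetical array mapping each variable to its cluster or isolated index. It compares the inverse of D^{δ} + F at small δ with the matrix whose (i, j) entry is the entry of the inverse merged matrix for the groups containing i and j.

```python
import numpy as np

def chi_from_merged(F, labels):
    """Response matrix obtained by summing rows/columns of F within each group.

    labels[i] is the reduced index (cluster or isolated variable) of variable i.
    """
    labels = np.asarray(labels)
    n = F.shape[0]
    B = np.zeros((n, labels.max() + 1))
    B[np.arange(n), labels] = 1.0            # group-membership indicator
    F_bar = B.T @ F @ B                      # merged matrix (rows and columns summed)
    return B @ np.linalg.solve(F_bar, B.T)   # chi_ij = (F_bar^{-1})[labels[i], labels[j]]

def chi_from_softened(F, labels, delta):
    """Inverse of D/delta + F, with a stand-in D: one zero mode per group."""
    labels = np.asarray(labels)
    D = np.zeros_like(F)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        k = len(idx)
        # Complete-graph Laplacian: eigenvalue 0 on the uniform vector, k otherwise.
        D[np.ix_(idx, idx)] = k * np.eye(k) - np.ones((k, k))
    return np.linalg.inv(D / delta + F)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6))
F = X.T @ X + np.eye(6)                      # regular O(1) part, positive definite
labels = [0, 0, 0, 1, 1, 2]                  # two clusters and one isolated variable

chi_merged = chi_from_merged(F, labels)
chi_soft = chi_from_softened(F, labels, delta=1e-6)
max_gap = np.abs(chi_merged - chi_soft).max()   # expected to shrink with delta
```

In this toy setting all variables of a cluster share one row and one column of the response matrix, which is the "same response inside a cluster" statement made above.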
5 Algorithmic implementation
5.1 Numerical stability and the softening constant δ
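For handling the singularity of the cost-function Hessian, we have introduced the softening constant δ in the TV and finally taken the δ → 0 limit. In practical implementations, however, we should keep δ small but finite. To see the reason, it is sufficient to consider a simple example with just three variables $\{x_i\}_{i=1}^{3}$. The softened TV is defined as
$$T^{\delta}(x)=\sqrt{(x_2-x_1)^2+(x_3-x_1)^2+\delta^2}=\sqrt{p^2+q^2+\delta^2}, \tag{22}$$
where p = x_{2} − x_{1} and q = x_{3} − x_{1} are introduced. The corresponding gradient and Hessian are
$$\frac{\partial T^{\delta}}{\partial x}=\frac{1}{(p^2+q^2+\delta^2)^{1/2}}\begin{pmatrix}-p-q\\p\\q\end{pmatrix},\qquad \partial_x^2T^{\delta}=\frac{1}{(p^2+q^2+\delta^2)^{3/2}}\begin{pmatrix}(p-q)^2+2\delta^2&pq-q^2-\delta^2&pq-p^2-\delta^2\\pq-q^2-\delta^2&q^2+\delta^2&-pq\\pq-p^2-\delta^2&-pq&p^2+\delta^2\end{pmatrix}. \tag{23}$$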
The zero point of the gradient is given by p = q = 0 irrespective of the δ value. Inserting this into the Hessian, we get one zero mode proportional to (1, 1, 1)^{⊤} and two finite modes whose eigenvalues, (3/δ, 1/δ), diverge in the δ → 0 limit. This exactly matches the assumptions of the approximation formula.
On the other hand, if we take the limit δ → 0 before taking the zero-gradient limit p, q → 0, we see that two zero modes appear: one is proportional to (1, 1, 1)^{⊤} and the other to (p + q, q − 2p, p − 2q)^{⊤}. This is bad news because the second zero mode, which remains even in the limit p, q → 0, is never taken into account when deriving the approximation formula: the derivation essentially depends on how the zero mode behaves, and our formula loses its justification if such unexpected zero modes exist.
These considerations show that the two limits, lim_{δ→0} and lim_{p,q→0}, are not exchangeable in the TV Hessian. The derivation of our approximation formula assumes lim_{δ→0} lim_{p,q→0}, and thus the algorithmic implementation should reflect this order of limits in a certain way. A simple way is to keep δ small but finite, which is actually a common technique to enhance the numerical stability when using the TV [14]. The choice of the amplitude of δ is related to the numerical precision when solving the optimization problem (3). A practical choice is stated in the next subsection.
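The three-variable example can be reproduced numerically. The sketch below codes the Hessian of Eq (23) directly and evaluates it in the two orders of limits; the specific values of δ, p, and q are arbitrary illustrative choices.

```python
import numpy as np

def tv_hessian(p, q, delta):
    """Hessian of sqrt(p^2 + q^2 + delta^2) with respect to (x1, x2, x3),
    where p = x2 - x1 and q = x3 - x1, as written in Eq (23)."""
    s = p**2 + q**2 + delta**2
    d2 = delta**2
    H = np.array([
        [(p - q)**2 + 2 * d2, p * q - q**2 - d2, p * q - p**2 - d2],
        [p * q - q**2 - d2,   q**2 + d2,         -p * q],
        [p * q - p**2 - d2,   -p * q,            p**2 + d2],
    ])
    return H / s**1.5

# Order assumed by the derivation: zero gradient first (p = q = 0), delta finite.
delta = 1e-4
eig_locked = np.linalg.eigvalsh(tv_hessian(0.0, 0.0, delta)) * delta
# eig_locked holds the eigenvalues in units of 1/delta: one zero mode, then 1 and 3.

# Opposite order: delta -> 0 first, at small but nonzero p and q.
p, q = 1e-3, 2e-3
H0 = tv_hessian(p, q, 0.0)
eig_unsoftened = np.linalg.eigvalsh(H0)          # two (numerically) zero eigenvalues
extra_mode = np.array([p + q, q - 2 * p, p - 2 * q])
residual = np.abs(H0 @ extra_mode).max()         # H0 annihilates the second zero mode
```

With δ kept finite the Hessian has a single zero mode, whereas the unsoftened Hessian has rank one, which is the extra zero mode discussed above.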
5.2 Procedures
Here, we state the procedures for implementing Eqs (14 and 21) in a numerical computation. Suppose that we have an algorithm to solve Eq (3) and to provide the solution $\hat{x}$ given y, A, λ_{ℓ1}, and λ_{T}. Using this solution and introducing a finite δ in the Hessian for the reason discussed above, we can assess the LOOE through the following steps:
1. The sets of active and killed variables, S_{A} and S_{K}, are specified from $\hat{x}$.
2. The values of all TV terms $\{t_i^{\delta}(\hat{x})\}_{i=1}^{N}$ are computed.
3. All clusters C and the index sets belonging to the clusters {S_{α}}_{α∈C} are enumerated from $\{t_i^{\delta}(\hat{x})\}_{i=1}^{N}$, as well as the set of isolated variables, S_{I}.
4. The total variation from which the vanishing TV terms are removed is denoted by $\tilde{T}^{\delta}(\hat{x})$, and the regular part of the Hessian is computed as $F=G+\lambda_T\partial_x^2\tilde{T}^{\delta}(\hat{x})$.
5. A new index set S_{R} = {{α}_{α∈C}, S_{I}} is defined.
6. On S_{R}, the merged Hessian $\bar{F}$ is constructed from F, as $\bar{F}_{S_IS_I}=F_{S_IS_I}$, $\bar{F}_{\alpha\beta}=\sum_{i\in S_\alpha,\,j\in S_\beta}F_{ij}$, $\bar{F}_{\alpha S_I}=\sum_{i\in S_\alpha}F_{iS_I}$, and $\bar{F}_{S_I\alpha}=\sum_{i\in S_\alpha}F_{S_Ii}$. Similarly, the merged measurement matrix $\bar{A}$ is defined as $\bar{A}_{\mu S_I}=A_{\mu S_I}$, $\bar{A}_{\mu\alpha}=\sum_{i\in S_\alpha}A_{\mu i}$.
7. Using $\bar{F}$ and $\bar{A}$, the LOOE factor in Eq (14) is computed as $1-a_{\mu S_A}^{\top}\chi_{S_AS_A}a_{\mu S_A}=1-\bar{a}_{\mu S_R}^{\top}\left(\bar{F}_{S_RS_R}\backslash\bar{a}_{\mu S_R}\right)$, where $\bar{a}_\mu^{\top}$ is the μth row vector of $\bar{A}$ and x = A\b is the solution of the linear equation Ax = b.
8. Using the LOOE factor and $\hat{x}$, the LOOE is evaluated from Eq (14).
At step 7, we take the left division $\bar{F}_{S_RS_R}\backslash\bar{a}_{\mu S_R}$ instead of the inverse $\chi=\bar{F}^{-1}$ for numerical stability. The cluster enumeration at step 3 involves a delicate point in the definition of C and {S_{α}}_{α∈C}. Because of the limited precision in the numerics, the TV term $|\hat{x}_j-\hat{x}_i|\ (j\in\partial i)$ never exactly vanishes; therefore, we need a certain threshold to extract the cluster structure from the TV terms. Here, we introduce the threshold θ and enumerate the clusters as follows:
3-1. If $t_i^{\delta}(\hat{x})\le\delta+\theta$, the variables in {i} ∪ ∂i are regarded as “linked.” All the links are enumerated by testing $t_i^{\delta}(\hat{x})\le\delta+\theta$ for all i = 1, ⋯, N. The set of links is denoted by L, and the index set of all variables in L is denoted by S_{L}.
3-2. An empty set C = ϕ is prepared and the cluster index α = 1 is defined.
3-3. The following steps are repeatedly implemented while L is non-empty:
  (a) Two empty sets, S_{tmp} = ϕ and S_{cluster} = ϕ, are prepared;
  (b) One link is selected and removed from L. The variable indices in the link are entered into S_{tmp};
  (c) The following steps are repeatedly implemented while S_{tmp} is non-empty:
    (i) One index i in S_{tmp} is selected and moved from S_{tmp} to S_{cluster};
    (ii) If the above chosen index i exists in S_{L}, all the links to i are removed from L, and S_{L} is updated accordingly. The variables linked to i are entered into S_{tmp};
    (iii) S_{tmp} ← S_{tmp} − S_{cluster}.
  (d) The variables in S_{cluster} constitute a cluster. S_{α} = S_{cluster} is defined and α is entered into C;
  (e) α ← α + 1.
3-4. If S_{α} ∩ S_{K} ≠ ϕ, α is removed from C. This is checked for all α ∈ C.
3-5. C, {S_{α}}_{α∈C}, and S_{I} = S_{A} − ∪_{α∈C}S_{α} are returned.
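The enumeration above amounts to finding the connected components of the graph of links, discarding components that touch a killed variable. The sketch below does this with a union-find structure rather than the set-based traversal described in the steps; the two give the same clusters, and union-find is only a swapped-in implementation choice. The function name, argument names, and the toy data are illustrative assumptions.

```python
import math

def enumerate_clusters(t_delta, neighbors, active, delta, theta):
    """Return (clusters, isolated) from softened TV terms.

    t_delta[i]   : softened TV term of variable i
    neighbors[i] : list of neighbour indices of i (may be empty)
    active[i]    : True if variable i is non-zero in the full solution
    """
    n = len(t_delta)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    linked = set()
    for i in range(n):
        # A vanishing TV term links variable i with all of its neighbours.
        if neighbors[i] and t_delta[i] <= delta + theta:
            for j in neighbors[i]:
                parent[find(j)] = find(i)
                linked.update((i, j))

    groups = {}
    for i in sorted(linked):
        groups.setdefault(find(i), []).append(i)

    # Clusters containing a killed variable are discarded.
    clusters = [g for g in groups.values() if all(active[k] for k in g)]
    in_cluster = {k for g in clusters for k in g}
    isolated = [i for i in range(n) if active[i] and i not in in_cluster]
    return clusters, isolated

# Toy one-dimensional chain: each variable has the next one as its only neighbour.
x_hat = [1.0, 1.0, 1.0, 2.0, 0.0]
neighbors = [[1], [2], [3], [4], []]
delta, theta = 1e-4, 1e-12
t_delta = [math.sqrt(sum((x_hat[j] - x_hat[i]) ** 2 for j in nb) + delta ** 2)
           for i, nb in enumerate(neighbors)]
active = [v != 0.0 for v in x_hat]
clusters, isolated = enumerate_clusters(t_delta, neighbors, active, delta, theta)
# The first three variables form one cluster; the fourth is isolated; the fifth is killed.
```

For an image, `neighbors` would list the right and down pixels of each pixel in a flattened index, following the convention of Section 2.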
The entire procedure presented above implements Eqs (14 and 21).
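Steps 5 to 8 can be sketched compactly once the regular Hessian F and the cluster structure are available. The function below is a minimal illustration under the assumption that F has already been computed on the full index space (step 4); the name `approx_looe` and its signature are hypothetical. Killed variables are excluded simply by not appearing in `clusters` or `isolated`.

```python
import numpy as np

def approx_looe(A, y, x_hat, F, clusters, isolated):
    """Approximate LOOE of Eq (14) using the merged Hessian (steps 5-8).

    A, y     : measurement matrix (M x N) and data (M,)
    x_hat    : full solution of Eq (3), shape (N,)
    F        : regular part of the Hessian, shape (N, N)
    clusters : list of index lists, one per cluster
    isolated : list of indices of isolated active variables
    Returns the LOOE estimate and its M individual terms.
    """
    groups = [list(c) for c in clusters] + [[i] for i in isolated]   # index set S_R
    B = np.zeros((A.shape[1], len(groups)))
    for r, g in enumerate(groups):
        B[g, r] = 1.0                        # membership indicator of group r
    F_bar = B.T @ F @ B                      # step 6: merged Hessian
    A_bar = A @ B                            # step 6: merged measurement matrix
    # Step 7: a_bar^T (F_bar \ a_bar) for all rows, via a linear solve (no explicit inverse).
    leverage = np.einsum("mr,rm->m", A_bar, np.linalg.solve(F_bar, A_bar.T))
    residual = y - A @ x_hat
    terms = 0.5 * (residual / (1.0 - leverage)) ** 2                 # step 8
    return terms.sum(), terms
```

Returning the individual terms as well as their sum makes it straightforward to attach an error bar to the estimate from the spread of the M terms.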
A debatable point would be the values of θ and δ. In most iterative algorithms, such as the one in [8, 9], there is an inevitable finite error of the TV term even when it should vanish. Let us express the “scale” of this error as $t_i(\hat{x})\approx\theta_{\mathrm{num}}>0$. By construction, the threshold θ is related to this numerical error and it is appropriate to choose θ ≈ θ_{num}; the softening constant δ should be sufficiently larger than θ_{num} because it implements the assumed order of the two limits, lim_{δ→0} lim_{p,q→0}, in the derivation of the approximation formula. Overall, the relation
$$\theta\approx\theta_{\mathrm{num}}\ll\delta \tag{24}$$
must be satisfied. We have numerically checked how strict this principled relation is, and found that the approximation result is not sensitive to the choice of θ as long as it is sufficiently smaller than δ. Although the choice of δ involves somewhat more delicate points, we have also found that the approximation result is stable and the cost-function Hessian is safely invertible over a wide range of δ. Based on these observations, in the application of our formula below, the default values are set to δ = 10^{−4} and θ = 10^{−12}. They are chosen according to our datasets and experimental setup: the maximum value of the non-softened TV terms is scaled as $\max_i t_i(\hat{x})\gtrsim10^{-4}$ and the numerical precision is about θ_{num} ≈ 10^{−12}; the former value is reflected in δ and the latter one is used for θ. Coincidentally, this default value of δ accords with the one in [14]. The results of examining the sensitivity to δ and θ will be reported below.
Another noteworthy point is that these procedures can be easily extended to other variants of the TV. For example, for the so-called anisotropic TV [9], T_{ani} = ∑_{i} ∑_{j∈∂i}|x_{j} − x_{i}|, we set F = G in step 4 and modify the definition of the link in step 3-1 accordingly, so as to render our formula applicable. In the case of the square TV, T_{sq} = ∑_{i} ∑_{j∈∂i}(x_{j} − x_{i})^{2} ≡ (1/2)x^{⊤}Jx, the formula can be significantly simpler, because this TV has no sparsifying effect and the formula of the simple ℓ_{1} case can be employed. We can employ Eq (14) with χ_{SASA} = (G_{SASA} + λ_{T}J_{SASA})^{−1} directly, without the need for cluster enumeration.
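For the square TV the statement above can be checked on a toy problem. The sketch below assumes λ_{ℓ1} = 0, so that every variable is active and the cost is purely quadratic; in that case Eq (14) with χ = (G + λ_{T}J)^{−1} should agree with the literally conducted LOO CV. The helper `difference_operator`, the problem sizes, and the random data are illustrative assumptions.

```python
import numpy as np

def difference_operator(h, w):
    """Rows are x_j - x_i for each pixel i and each right/down neighbour j
    of an h x w image flattened in row-major order."""
    rows = []
    for r in range(h):
        for c in range(w):
            i = r * w + c
            nbrs = ([i + 1] if c + 1 < w else []) + ([i + w] if r + 1 < h else [])
            for j in nbrs:
                d = np.zeros(h * w)
                d[i], d[j] = -1.0, 1.0
                rows.append(d)
    return np.array(rows)

rng = np.random.default_rng(1)
h, w, M, lam_T = 3, 3, 12, 0.5
A = rng.standard_normal((M, h * w))
y = rng.standard_normal(M)
Dop = difference_operator(h, w)
J = 2.0 * Dop.T @ Dop                        # T_sq = ||Dop x||^2 = (1/2) x^T J x

def solve(A_, y_):
    """Minimiser of (1/2)||y_ - A_ x||^2 + lam_T * T_sq(x) (no l1 term)."""
    return np.linalg.solve(A_.T @ A_ + lam_T * J, A_.T @ y_)

# Eq (14) evaluated from the full solution only.
x_hat = solve(A, y)
chi = np.linalg.inv(A.T @ A + lam_T * J)
leverage = np.einsum("mi,ij,mj->m", A, chi, A)
looe_formula = 0.5 * np.sum(((y - A @ x_hat) / (1.0 - leverage)) ** 2)

# Literal leave-one-out CV for comparison.
looe_brute = 0.0
for mu in range(M):
    keep = np.arange(M) != mu
    x_loo = solve(A[keep], y[keep])
    looe_brute += 0.5 * (y[mu] - A[mu] @ x_loo) ** 2
```

With a non-zero ℓ_{1} term the same computation would be restricted to the active set S_{A}, and the agreement would be approximate rather than exact.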
6 Application to super-resolution imaging
To test the usefulness of the developed formula, let us apply the derived expression to the super-resolution reconstruction of astronomical images. A number of recent studies have demonstrated that sparse modeling is an effective means of reconstructing astronomical images obtained through radio interferometric observations [15–18]. In particular, the capability of high-fidelity imaging in super-resolution regimes has been shown, which renders this technique a useful choice for the imaging of black holes with the EHT [17–21]. We adopt the same problem setting as [17, 20] and demonstrate the efficacy of our approximation formula through comparison with the literally conducted 10-fold CV result. Here, x_{i} denotes the ith pixel value and A is (part of) the Fourier matrix. The dataset y is generated through the linear process
$$y=Ax_0+\xi, \tag{25}$$
where ξ is a noise vector and x_{0} is the simulated image, which we infer given y and A.
In this work, we use data for simulated EHT observations based on three different astronomical images, which are available as sample data for the EHT Imaging Challenge. Our datasets 1, 2, and 3 correspond to the sample datasets 1, 2, and 5, respectively, available from [22] as of July 2017. The images are reconstructed with N = 10000 = 100 × 100 pixels and with fields of view of 160, 250, and 100 μas, which are identical to those of the original images of Datasets 1, 2, and 3 from the EHT Imaging Challenge, respectively. We test four values for each of λ_{ℓ1} and λ_{T}: λ_{ℓ1} ∈ (M/2) × {1, 10, 100, 1000} and λ_{T} ∈ (M/8) × {1, 10, 100, 1000}. M is 1910, 1910, and 2786 for Datasets 1–3, respectively. Later, we also use data of different sizes from the same datasets, to check the size dependence of the result.
Table 1 shows the mean CVE values for the three datasets, determined by the 10-fold CV and by our approximation formula for varying λ_{T}. λ_{ℓ1} is fixed to the optimal value, which is coincidentally common to all datasets and satisfies 2λ_{ℓ1}/M = 1. It is clear that the approximate CVE values accord well with the 10-fold results, even on the error-bar scale, demonstrating that our approximation formula works very well. Note that the error bar for the approximation is given by the standard deviation of the M terms in Eq (14) divided by $\sqrt{M-1}$.
Table 1. CVE values determined by 10-fold CV and our approximation formula against λ_{T}.
λ_{ℓ1} is fixed to the optimal value (2λ_{ℓ1}/M = 1, coincidentally common to all cases). The number in brackets denotes the error bar to the last digits. The minimum of each row is attained at 8λ_{T}/M = 10 (tied with 8λ_{T}/M = 100 for the 10-fold row of Dataset 3 at the displayed precision). The tuning constants δ and θ are set to be δ = 10^{−4} and θ = 10^{−12}, respectively.

| Dataset   | Method  | 8λ_{T}/M = 1 | 8λ_{T}/M = 10 | 8λ_{T}/M = 100 | 8λ_{T}/M = 1000 |
|-----------|---------|--------------|---------------|----------------|-----------------|
| Dataset 1 | 10-fold | 1.101(47)    | 1.090(44)     | 1.091(44)      | 1.455(108)      |
| Dataset 1 | Approx. | 1.087(35)    | 1.080(35)     | 1.082(35)      | 1.385(49)       |
| Dataset 2 | 10-fold | 1.368(91)    | 1.260(55)     | 1.286(65)      | 2.843(234)      |
| Dataset 2 | Approx. | 1.180(37)    | 1.157(36)     | 1.210(37)      | 2.669(108)      |
| Dataset 3 | 10-fold | 1.026(18)    | 1.020(19)     | 1.020(22)      | 1.235(52)       |
| Dataset 3 | Approx. | 1.028(26)    | 1.018(26)     | 1.020(27)      | 1.226(40)       |
To directly observe the reconstruction quality, in Fig 1 we display the images at all investigated parameters and the reconstructed image at the optimal λ_{ℓ1} and λ_{T} for Dataset 3, as well as the associated errors plotted against λ_{ℓ1} and λ_{T} in Fig 2. Again, we can see that the proposed method approximates the 10-fold result well, and the reconstructed image reasonably resembles the original. The RSS is monotonic with respect to changes in λ_{ℓ1} and λ_{T} but the approximate LOOE is not, which implies that the LOOE factor computed through Eq (21) appropriately reflects the effect of the penalty terms.
Fig 1. Super-resolution imaging results for Dataset 3 based on model image of supermassive black hole at center of nearby elliptical galaxy, M87.
(a) Images for all investigated parameters; the star-marked panel is obtained at the optimum parameters. (b) Original images (top) and reconstructed images (bottom) at optimal parameters ((2λ_{ℓ1}, 8λ_{T})/M = (1, 10)). The images are convolved with a circular Gaussian beam on the right-hand side, the full width at half maximum (FWHM) of which is 25% of the nominal angular resolution of the EHT and corresponds to the diameters of the yellow circles. This coincides with the optimal resolution minimizing the mean square error between them.
10.1371/journal.pone.0188012.g002
<p>(a) 3D plot of mean CVEs against λ<sub>ℓ<sub>1</sub></sub> and λ<sub><italic>T</italic></sub> without error bars. (b) Plot of mean CVEs and RSS against λ<sub><italic>T</italic></sub> at the optimal value of λ<sub>ℓ<sub>1</sub></sub>, 2λ<sub>ℓ<sub>1</sub></sub>/<italic>M</italic> = 1. (c) Plot of mean CVEs and RSS against λ<sub>ℓ<sub>1</sub></sub> at the optimal value of λ<sub><italic>T</italic></sub>, 8λ<sub><italic>T</italic></sub>/<italic>M</italic> = 10. For (c), the RSS is overlapped with the CVEs in the symbol size. In all the cases, the agreement between the approximate LOOE and the 10-fold CVE is fairly good. The tuning constants <italic>δ</italic> and <italic>θ</italic> are set to be <italic>δ</italic> = 10<sup>−4</sup> and <italic>θ</italic> = 10<sup>−12</sup>, respectively.</p>
</caption>
<graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pone.0188012.g002" xlink:type="simple"/>
</fig>
<p>Next, we check the sensitivity of the approximate result to the tuning constants <italic>δ</italic> and <italic>θ</italic>. In <xref ref-type="fig" rid="pone.0188012.g003">Fig 3</xref>, the approximate LOOEs at the optimal λ<sub>ℓ<sub>1</sub></sub> are plotted against λ<sub><italic>T</italic></sub> when changing <italic>δ</italic> (left) and <italic>θ</italic> (right). This indicates that the approximate LOOEs are stable against the change of both <italic>δ</italic> and <italic>θ</italic>. Hence, we may choose these values rather arbitrarily. This is a good news because tuning them makes the problem more numerically amenable: Enlarging <italic>δ</italic> makes the computation of the Hessian inversion more numerically stable; increasing <italic>θ</italic> lowers the effective degrees of freedom. The second property associated with <italic>θ</italic> is really beneficial when treating a large-size dataset, because it can downsize the Hessian and reduce the cost for computing its matrix inversion. In <xref ref-type="table" rid="pone.0188012.t002">Table 2</xref>, the values of the effective degrees of freedom are given when changing <italic>θ</italic>. The reduction of the degree of freedom at large (yet small enough compared to <italic>δ</italic> = 10<sup>−4</sup>) <italic>θ</italic> is significant, which encourages us to apply the proposed formula to larger-size datasets.</p>
<fig id="pone.0188012.g003" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0188012.g003</object-id>
<label>Fig 3</label>
<caption>
<title>Comparative plots of mean approximate LOOEs against λ<sub><italic>T</italic></sub> at 2<italic>M</italic><sup>−1</sup> λ<sub>ℓ<sub>1</sub></sub> = 1 when (a) <italic>δ</italic> varies over 10<sup>−6</sup>–10<sup>−3</sup> with fixed <italic>θ</italic> = 10<sup>−12</sup>; (b) <italic>θ</italic> varies over 10<sup>−12</sup>–10<sup>−6</sup> with fixed <italic>δ</italic> = 10<sup>−4</sup>.</title>
<p>The plots show that the LOOE curves are rather stable against the choice of the tuning constants.</p>
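<table-wrap id="pone.0188012.t002" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0188012.t002</object-id>
<label>Table 2</label>
<caption>
<title>The effective degrees of freedom <inline-formula id="pone.0188012.e104"><alternatives><graphic id="pone.0188012.e104g" mimetype="image" position="anchor" xlink:href="info:doi/10.1371/journal.pone.0188012.e104" xlink:type="simple"/><mml:math display="inline" id="M104"><mml:mrow><mml:mrow><mml:mo>|</mml:mo></mml:mrow> <mml:msub><mml:mover accent="true"><mml:mi>S</mml:mi> <mml:mo>˜</mml:mo></mml:mover> <mml:mrow><mml:mi>I</mml:mi> <mml:mo>+</mml:mo> <mml:mi>Z</mml:mi></mml:mrow></mml:msub> <mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:math></alternatives></inline-formula>, the number of clusters + the number of isolated variables, against <italic>θ</italic> for Dataset 3 at <italic>δ</italic> = 10<sup>−4</sup> and the optimal parameters (2λ<sub>ℓ<sub>1</sub></sub>, 8λ<sub><italic>T</italic></sub>)/<italic>M</italic> = (1, 10).</title>
</caption>
<table>
<thead>
<tr><th><italic>θ</italic></th><th>10<sup>−12</sup></th><th>10<sup>−11</sup></th><th>10<sup>−10</sup></th><th>10<sup>−9</sup></th><th>10<sup>−8</sup></th><th>10<sup>−7</sup></th><th>10<sup>−6</sup></th></tr>
</thead>
<tbody>
<tr><td>|<italic>S̃</italic><sub><italic>I</italic>+<italic>Z</italic></sub>|</td><td>5733</td><td>5524</td><td>5243</td><td>4814</td><td>4112</td><td>2922</td><td>1408</td></tr>
</tbody>
</table>
</table-wrap>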
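To make the role of θ in Table 2 more concrete, the following is a minimal sketch of how a threshold can reduce the number of effective variables. It is a 1D toy and not the procedure used in this paper: the function name `effective_dof_1d` and the merging rule (neighbouring pixels are joined into one cluster when their difference is at most θ) are assumptions made for illustration, and the ℓ<sub>1</sub> sparsity of the solution is ignored.

```python
import numpy as np


def effective_dof_1d(x, theta):
    """Count clusters plus isolated variables in a 1D signal.

    Assumption (illustrative only): neighbours i and i+1 belong to the same
    cluster when |x[i+1] - x[i]| <= theta. A group of size one is an
    isolated variable; a larger group is a cluster.
    """
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        return 0
    # A new group starts wherever the jump to the next pixel exceeds theta.
    jumps = np.abs(np.diff(x)) > theta
    return int(jumps.sum()) + 1


if __name__ == "__main__":
    # Hypothetical signal with nearly flat stretches of different flatness.
    signal = [0.0, 1e-9, 2e-9, 1.0, 1.0000001, 3.0]
    for theta in (1e-12, 1e-8, 1e-6):
        print(theta, effective_dof_1d(signal, theta))
```

In this toy, a larger θ merges more neighbours, so the count can only decrease, which mirrors the trend in Table 2. The size of the matrix to be inverted shrinks accordingly.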
Finally, let us examine how the approximation accuracy and the computational cost depend on the data size; the cost is measured both for solving Eq (3) and for obtaining the approximate LOOE from the solution. The results are shown in Fig 4. The data analyzed here are the same simulated black hole image expressed with different numbers of pixels. To solve Eq (3), we employed the algorithm called “MFISTA” proposed in [8, 9], running on an Intel(R) Core(TM) i7-5820K CPU at 3.30 GHz with 6 cores for N = 50^{2} = 2500 and on an Intel(R) Xeon(R) CPU E5-2699 v3 at 2.30 GHz with 36 cores for N = 100^{2} and 150^{2}. Meanwhile, we evaluated the approximate LOOE on a laptop with a 1.7 GHz Intel Core i7 with two CPUs. Hence the comparison is not fair, and is unfavorable to the approximation formula. The left panel indicates that the approximation accuracy improves for larger sizes. This is reasonable because the perturbation we have employed should become more accurate as the model and data become larger, though the accuracy at N = 50^{2} = 2500 is already good. The right panel clearly shows the advantage of the developed formula: the actual computational time of the approximate LOOE is significantly shorter than the time MFISTA takes to converge when solving Eq (3) over the investigated range of system sizes, even under the unfair comparison mentioned above. However, this advantage will be less prominent if the model becomes very large: our approximation formula requires the Hessian inversion, whose computational cost scales as O((|C| + |S_{I}|)^{3}) ≈ O(N^{3}), while MFISTA costs O(N^{2}) as long as the number of steps to convergence is constant with respect to N. The crossover size at which these two computational costs become comparable is roughly estimated as N_{×} ≈ 10^{6}, though no such crossover tendency is visible yet in Fig 4. For such large systems, a fundamentally new solution should be devised to resolve the computational-cost problem, though tuning θ to a large value in the present method can still be a good first aid.
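The crossover argument can be sketched numerically as follows. If the cost of the approximate LOOE behaves as a·N^{3} and the cost of MFISTA as b·N^{2}, the two are equal at N_{×} = b/a. The helper names `fit_power_law` and `crossover_size` are hypothetical, and the prefactors below are invented purely so that the arithmetic is easy to follow; they are not measurements from Fig 4.

```python
import numpy as np


def fit_power_law(sizes, times):
    """Least-squares fit of log(time) = log(c) + p * log(N); returns (c, p)."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(times), 1)
    return float(np.exp(intercept)), float(slope)


def crossover_size(c_steep, p_steep, c_shallow, p_shallow):
    """Size N at which c_steep * N**p_steep equals c_shallow * N**p_shallow."""
    if p_steep == p_shallow:
        raise ValueError("parallel power laws never cross")
    return (c_shallow / c_steep) ** (1.0 / (p_steep - p_shallow))


if __name__ == "__main__":
    sizes = np.array([50**2, 100**2, 150**2], dtype=float)
    # Invented timings (seconds) that follow exact power laws.
    t_loo = 1e-12 * sizes**3    # hypothetical Hessian-inversion cost
    t_mfista = 1e-6 * sizes**2  # hypothetical MFISTA cost
    c3, p3 = fit_power_law(sizes, t_loo)
    c2, p2 = fit_power_law(sizes, t_mfista)
    print("fitted exponents:", p3, p2)
    print("crossover size:", crossover_size(c3, p3, c2, p2))
```

With measured timings in place of the invented ones, the same fit gives empirical exponents and prefactors. Extrapolating from three sizes is fragile, so any crossover obtained this way is only an order-of-magnitude guide.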
<object-id pub-id-type="doi">10.1371/journal.pone.0188012.g004</object-id>
<label>Fig 4</label>
<p>(a) Plot of mean CVEs at the optimal parameters for different sizes. (b) Log-log plot of the computational times for solving the optimization problem (<xref ref-type="disp-formula" rid="pone.0188012.e004">3</xref>) and for obtaining the approximate value of the CVE against the dataset size.</p>
</caption>
<graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pone.0188012.g004" xlink:type="simple"/>
</fig>
</sec>
<sec id="sec009" sec-type="conclusions">
<title>7 Conclusion</title>
In this paper, we have developed an approximation formula for the CVE of a sparse linear regression penalized by ℓ_{1} and TV terms, and demonstrated its usefulness in the reconstruction of simulated black hole images. Our derivation is based on a perturbation that assumes a small difference between the full and leave-one-out solutions. This assumption will not be fulfilled in some specific cases, e.g. when the measurement matrix is sparse. However, for most dense measurement matrices, such as the Fourier matrix discussed in this paper, our assumption will be reasonably satisfied. Hence we expect the range of application of our formula to be wide enough, and we would like to encourage readers to use this formula in their own work. It is also straightforward to generalize the developed formula to other types of TV, and two examples of such generalization, for the anisotropic and square TVs, have been explained.
The key concept of our formula, the perturbation between the LOO and full systems, is very general and can be applied to more general statistical models and inference frameworks [23]. The development of practical formulas for those cases will facilitate higher levels of modeling and computation.
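As a simpler analogue of obtaining the LOO error from the full solution alone, the following sketch uses ridge regression, for which the relation between the full and leave-one-out residuals is exact rather than perturbative. This is not the ℓ_{1} and TV formula developed in this paper; the function names are hypothetical, and the model (no intercept, a fixed penalty `lam`) is an assumption chosen so that the textbook identity, LOO residual = full residual / (1 − H_{ii}), holds exactly.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge solution without intercept: argmin ||y - X b||^2 + lam ||b||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def loo_error_from_full_fit(X, y, lam):
    """Mean squared LOO error computed from a single fit on all the data."""
    n_features = X.shape[1]
    # Hat matrix H = X (X^T X + lam I)^{-1} X^T maps y to the fitted values.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T)
    residual = y - H @ y
    loo_residual = residual / (1.0 - np.diag(H))
    return float(np.mean(loo_residual**2))


def loo_error_brute_force(X, y, lam):
    """Mean squared LOO error obtained by refitting once per left-out row."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta = ridge_fit(X[keep], y[keep], lam)
        errors.append((y[i] - X[i] @ beta) ** 2)
    return float(np.mean(errors))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))  # synthetic data, for illustration only
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=40)
    print(loo_error_from_full_fit(X, y, lam=1.0))
    print(loo_error_brute_force(X, y, lam=1.0))
```

The two functions should agree up to rounding error, while the first needs one linear solve instead of one refit per data point. For nonsmooth penalties such as ℓ_{1} and TV no such exact identity is available, which is why a perturbative approximation is used in this paper.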
We would like to express our sincere gratitude to Mareki Honma and Fumie Tazaki for their helpful discussions. We thank Katherine L. Bouman for preparing the EHT Imaging Challenge website [22, 24]. We also thank Andrew Chael and Lindy Blackburn for writing the simulation software used to produce the sample data sets [25].
References
1. Rish I, Grabarnik G.
2. Hastie T, Tibshirani R, Wainwright M.
3. http://sparse-modeling.jp/index_e.html
4. Mairal J, Bach F, Ponce J. Sparse modeling for image and vision processing. Available from: arXiv:1411.3230v2.
5. Tibshirani R. Regression shrinkage and selection via the lasso.
6. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression.
7. Rudin LI, Osher S, Fatemi E. Nonlinear total variation based noise removal algorithms.
8. Chambolle A. An algorithm for total variation minimization and applications.
9. Beck A, Teboulle M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems.
10. Obuchi T, Kabashima Y. Cross validation in LASSO and its acceleration.
11. http://www.eventhorizontelescope.org
12. Asada K, Kino M, Honma M, Hirota T, Lu R-S, Inoue M. White Paper on East Asian Vision for mm/submm VLBI: Toward Black Hole Astrophysics down to Angular Resolution of 1 R_{S}. Available from: arXiv:1705.04776.
13. Akiyama K, Lu R, Fish VL, Doeleman SS, Broderick AE, Dexter J, et al. 230 GHz VLBI observations of M87: Event-horizon-scale structure during an enhanced very-high-energy γ-ray state in 2012.
14. Chan TF, Osher S, Shen J. The Digital TV Filter and Nonlinear Denoising.
15. Wiaux Y, Jacques L, Puy G, Scaife AMM, Vandergheynst P. Compressed sensing imaging techniques for radio interferometry.
16. Li F, Cornwell TJ, de Hoog F. The application of compressive sampling to radio astronomy I. Deconvolution.
17. Honma M, Akiyama K, Uemura M, Ikeda S. Super-resolution imaging with radio interferometry using sparse modeling.
18. Honma M, Akiyama K, Tazaki F, Kuramochi K, Ikeda S, Hada K, et al. Imaging black holes with sparse modeling.
19. Ikeda S, Tazaki F, Akiyama K, Hada K. PRECL: A new method for interferometry imaging from closure phase.
20. Akiyama K, Ikeda S, Pleau M, Fish V, Tazaki F, Kuramochi K, et al. Superresolution Full-polarimetric Imaging for Radio Interferometry with Sparse Modeling.
21. Akiyama K, Kuramochi K, Ikeda S, Fish V, Tazaki F, Honma M, et al. Imaging the Schwarzschild-radius-scale Structure of M87 with the Event Horizon Telescope Using Sparse Modeling.
22. http://vlbiimaging.csail.mit.edu/imagingchallenge
23. Kabashima Y, Obuchi T, Uemura M. Approximate cross-validation formula for Bayesian linear regression. Available from: arXiv:1610.07733.
24. Bouman KL, Johnson MD, Zoran D, Fish VL, Doeleman SS, Freeman WT. Computational Imaging for VLBI Image Reconstruction. The IEEE Conference on Computer Vision and Pattern Recognition, 913 (2016).
25. Chael AA, Johnson MD, Narayan R, Doeleman SS, Wardle JFC, Bouman KL. High-resolution Linear Polarimetric Imaging for the Event Horizon Telescope.