tl;dr¶
- Wavelets can be used to capture information about features in a signal or image at different locations and size scales.
- The wavelet transform is performed with two functions: a scaling function which captures the average behavior over some size scale, and a wavelet function which reflects differences from that average.
The field of wavelet analysis is broad and deep, touching many other areas of mathematics such as functional analysis, signal processing, and statistics. The name "wavelet" evokes an image of a "little wave", which isn't a terrible description. We're going to gloss over the technical details here, but the basic idea of a wavelet analysis is to rewrite a function as a linear combination of other functions, like this:
$$ f(x) = a_0 g_0(x) + a_1 g_1(x) + a_2 g_2(x) + ... = \sum_{i=0}^\infty a_i g_i(x) $$
You might be familiar with Taylor series, which allow you to write a function as the sum of polynomials multiplied by constant coefficients. Or perhaps you've heard of Fourier analysis, where we can rewrite a function as the sum of sine and cosine functions with different frequencies.
The basis of such a transformation is the set of functions $g_i(x)$ above. Analysis is the process of finding the $a_i$ coefficients given a function $f(x)$ and a basis. So the basis of a Fourier transform consists of sines and cosines of different frequencies, and Fourier analysis is how we find the coefficients of those basis functions that give us back our original function. A wavelet transform uses two types of basis function: a scaling function, which captures the average behavior over some size scale, and a wavelet function, which reflects differences from that average.
Unlike a Fourier transform, which has a single type of basis function, there is an infinite variety of wavelet bases, each providing some unique combination of mathematical characteristics. Some examples are shown below:
Overkill? Maybe, but hopefully it gives some idea of the richness of wavelets.
tl;dr¶
Wavelets can compress the information in many types of data, so the same information can be expressed with fewer numbers. Compression also allows us to potentially suppress the effects of noise, while still getting a good estimate of the "interesting" information in the data.
Before trying to understand the details of how the wavelet transform works, let's first understand the "why". And the first step there is choosing a particular wavelet basis. Our choice here could be informed by many different things. For example, as you might have guessed from the wavelet menagerie above, you can pick wavelets with just about any sort of functional characteristics you want: smoothness, continuity, differentiability, locality. That choice could be motivated either by your prior knowledge of the data being analyzed, or perhaps a hypothesis you wish to test. Another motivation might be simple practicality: does the choice of wavelet basis make it easier and faster to do the required calculations for my analysis? That turns out to be the driving factor for our analysis of Fermi LAT data, for reasons which will become clear later. We will be employing the simplest wavelet basis, usually denoted the "Haar basis", which is the same as db1 and several others above. The way we're going to write the Haar basis will be the following:
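In the unnormalized form we'll use throughout (and describe in detail in a later section), the scaling function simply sums a pair of adjacent values, and the wavelet function takes their difference:

$$ s = a + b, \qquad w = a - b $$

for adjacent data values $a$ and $b$.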
Digital signal and image processing deals with discrete rather than continuous data, like pixels in an image. So we will focus on the discrete wavelet transform, or DWT. Let's look at some example one-dimensional signals and their corresponding wavelet/scaling coefficients.
Our first example input data (plotted in blue) is distinctly blocky, similar to the Haar basis. Remember that the wavelet transform operates at different size scales, which is represented by the different rows. The scaling coefficients are labeled S and the wavelet coefficients W. The number in each label indicates the size scale for the coefficients. So W1 means the wavelet coefficients at scale $2^1$, spanning 2 data points, while S5 are the scaling coefficients at scale $2^5$, spanning 32 data points. Now, except for S6, the scaling coefficients are intermediate products. The final output of the wavelet transform in this case would be W1-W6 plus S6, for a total of 128 coefficients, equal to the number of data points.
To get W1 and S1, we apply one step of the DWT algorithm. Notice that S1 has half the number of points of the original data, and tends to capture the average features at size scale 2. W1 also has half the number of points compared to the data, and captures the detailed differences between the original and the "averaged" behavior reflected in S1. To get S2 and W2, we again apply one step of the DWT algorithm, but this time to S1 instead of the data. We similarly get S3 and W3 from S2, and so forth until we're done.
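To make the cascade concrete, here's a minimal Python sketch using the unnormalized sums-and-differences convention adopted later in this notebook; the helper names (`haar_step`, `haar_cascade`) and the blocky test signal are just for illustration.

```python
import numpy as np

def haar_step(s):
    """One step of the (unnormalized) Haar DWT: pairwise sums and differences."""
    s = np.asarray(s, dtype=float)
    scaling = s[0::2] + s[1::2]   # "averaged" behavior at the next size scale
    wavelet = s[0::2] - s[1::2]   # differences from that average
    return scaling, wavelet

def haar_cascade(data, levels):
    """Apply haar_step repeatedly to the scaling coefficients, keeping each W."""
    coeffs = {}
    s = np.asarray(data, dtype=float)
    for j in range(1, levels + 1):
        s, w = haar_step(s)
        coeffs[f"W{j}"] = w   # wavelet coefficients at scale 2**j
        coeffs[f"S{j}"] = s   # scaling coefficients (intermediate, except the last)
    return coeffs

# 128 blocky data points -> W1..W6 (64+32+16+8+4+2 values) plus S6 (2 values) = 128
data = np.repeat([0.0, 4.0, -1.0, 2.0], 32)
c = haar_cascade(data, levels=6)
print([c[f"W{j}"].size for j in range(1, 7)], c["S6"].size)
```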
Which brings us to the heart of the matter: Why wavelets? Notice that most of the wavelet coefficients above are effectively zero. We only see a large wavelet coefficient where there is an "interesting" feature in the data at the given size scale. So W1 catches the sharp edges, and is otherwise zero. Out of 128 coefficients, only 33 are non-zero in this result. This is a very important concept called compression: after transforming the signal, the same information is represented by only 25% as much data. Our choice of basis has allowed us to detect "interesting" features in the data, while effectively ignoring the boring stuff. Let's see a practical application of this by adding some noise to the data:
Despite the noise, we still see the interesting features in the data. That inspires an idea: if the interesting stuff in the data corresponds to larger wavelet coefficients, maybe we could remove some of the noise by zeroing any coefficients whose absolute value falls below some threshold.
Not bad! Not perfect, but that's not really possible. "Noise", by definition, is anything that obscures some of the information in the original signal. But if we can transform the data to amplify the interesting features relative to the noise, we can recover more useful information.
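A rough sketch of that thresholding idea, assuming the PyWavelets (`pywt`) package is available; note that `pywt` uses the normalized Haar filters (with factors of $1/\sqrt{2}$) rather than the plain sums and differences used elsewhere here, and the threshold value below is arbitrary rather than the one used for the plots.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
clean = np.repeat([0.0, 4.0, -1.0, 2.0], 32)            # blocky test signal
noisy = clean + rng.normal(scale=0.5, size=clean.size)

# Decimated Haar transform, hard-threshold the wavelet coefficients, invert.
coeffs = pywt.wavedec(noisy, "haar", level=6)            # [S6, W6, W5, ..., W1]
threshold = 1.0                                           # arbitrary cutoff, for illustration
denoised_coeffs = [coeffs[0]] + [
    pywt.threshold(w, threshold, mode="hard") for w in coeffs[1:]
]
denoised = pywt.waverec(denoised_coeffs, "haar")
```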
Now, this blocky data is a great match with the shape of the Haar functions. Let's try something maybe more realistic:
So we're getting pretty good compression still, even though the shapes aren't blocky. Let's try adding some noise and then applying our wavelet-based denoising algorithm:
Again, pretty good. The main issue here is that the "blockiness" of the Haar wavelets starts to leak through in the denoised result. If we chose a different wavelet which better matched our expectations about the underlying signal, we might get better compression and denoising. But as we said earlier, for our gamma-ray analysis we're going to need to stick with Haar for other reasons. We can improve things, however, by using a variation of the wavelet transform algorithm, the stationary or translationally invariant version. We need to know a little more about the wavelet transform algorithm to see what this means and why it helps.
Performing one step of the DWT corresponds to applying two filters to the signal: one which gives the scaling coefficients, and one which gives the wavelet coefficients. Now if we have 128 points, after applying the two filters we have a total of 256. The nature of these filters means that half of those results are redundant. So we drop every other point, getting us back to a total of 128 (64 scaling, 64 wavelet). That last step is called decimation, and it involves a choice of whether we drop the even or the odd points. That choice doesn't matter if we just invert the transformation; we get the same answer regardless. But if we do some thresholding to remove the noise, we may get different answers depending on that choice. So let's just not make the choice. We'll keep all of the points after filtering. For $N$ data points, we'll wind up with roughly $N \log_2 N$ total coefficients. A lot of these are technically redundant, but that redundancy is a good thing when attempting to remove the noise.
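Here's a small sketch of what one Haar analysis step looks like with and without the decimation step, using plain NumPy and a circular boundary; the function name and conventions are just for illustration.

```python
import numpy as np

def haar_filter_step(x, decimate=True):
    """One Haar analysis step: pairwise sums (scaling) and differences (wavelet).

    With decimate=True we keep only the even-indexed outputs (the usual DWT);
    with decimate=False we keep every shift (the stationary, undecimated version).
    Uses a circular boundary for simplicity.
    """
    x = np.asarray(x, dtype=float)
    neighbor = np.roll(x, -1)            # the point to the right of each sample
    scaling = x + neighbor               # "averaging" filter output
    wavelet = x - neighbor               # "differencing" filter output
    if decimate:
        return scaling[0::2], wavelet[0::2]   # N/2 + N/2 = N coefficients
    return scaling, wavelet                    # N + N coefficients, redundant

x = np.arange(8.0)
print(haar_filter_step(x, decimate=True))
print(haar_filter_step(x, decimate=False))
```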
We'll start by looking at the translationally invariant transform of the blocks data.
The difference is obvious, as we have 128 points for both scaling and wavelet coefficients at every size scale. Now let's see how denoising does:
That's a pretty solid result, but again not surprising given the obvious match between the blocks data and Haar functions. Let's check out the bumps data denoised with the translationally invariant transform:
That's a definite improvement, no more "blocky" artifacts.
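For reference, PyWavelets exposes the stationary transform as `pywt.swt` / `pywt.iswt`, so a denoising pass along these lines might look like the following sketch (again with normalized Haar filters and an arbitrary threshold):

```python
import numpy as np
import pywt

rng = np.random.default_rng(1)
clean = np.repeat([0.0, 4.0, -1.0, 2.0], 32)
noisy = clean + rng.normal(scale=0.5, size=clean.size)

# Stationary (undecimated) Haar transform: every level keeps all 128 points.
coeffs = pywt.swt(noisy, "haar", level=6)                 # list of (S_j, W_j) pairs
threshold = 0.5                                            # arbitrary, for illustration
denoised_coeffs = [
    (cA, pywt.threshold(cD, threshold, mode="hard")) for cA, cD in coeffs
]
denoised = pywt.iswt(denoised_coeffs, "haar")
```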
Ultimately we'll need two-dimensional wavelets for analyzing our gamma-ray image maps. But we can get some intuition from what the 1D versions look like by transforming a slice through the Galactic plane. We'll start with our model of expected counts:
And then the data:
We're certainly seeing what we want here, wavelets picking up interesting information at different size scales. The effect of noise is definitely apparent in the transformed data. We might think of attempting the same sort of denoising we did above. That would do...something, but the answer is probably going to be misleading, and like the simple blurring filter method, doesn't leave us a way to say how much we believe the results. The denoising algorithm we used in the above examples works because the noise is both additive and stationary. Shot noise from counting photons is generally neither. The approach we use will be similar in spirit, but will be able to account for the particular statistics of shot noise.
One last concept which will be useful is multiresolution analysis, or MRA. The MRA is really just another way of looking at the wavelet transform. Rather than plot coefficients, we instead plot the inverse transform at each size scale. Here's an example for our count model:
The blue plot is the input data. The green plots correspond to the reconstruction for each set of coefficients. If you add up all the green plots, you get back to the original data. This is handy because it gives us a direct picture of what the wavelet transform "sees" at each size scale. Just for chuckles, let's look at the MRA of the difference between model and data.
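One simple way to compute an MRA like this is to invert each band of coefficients separately, zeroing all the others; a minimal sketch with PyWavelets (the helper name `haar_mra` is just for illustration):

```python
import numpy as np
import pywt

def haar_mra(data, levels):
    """Multiresolution analysis: invert each coefficient band separately.

    Returns one reconstructed array per band; summing them reproduces the input.
    """
    coeffs = pywt.wavedec(data, "haar", level=levels)
    bands = []
    for i in range(len(coeffs)):
        keep_one = [np.zeros_like(c) for c in coeffs]
        keep_one[i] = coeffs[i]                   # keep a single band, zero the rest
        bands.append(pywt.waverec(keep_one, "haar"))
    return bands

data = np.repeat([0.0, 4.0, -1.0, 2.0], 32)
bands = haar_mra(data, levels=6)
print(np.allclose(np.sum(bands, axis=0), data))   # True: the bands add back up
```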
If we want to analyze image data, we need two-dimensional wavelets. Since we're now in 2D, the wavelets not only detect information at a given location and scale, but also have a directional aspect. Examples of the 2D Haar basis we will use are shown below, at a size scale of $4 \times 4$ pixels.
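To make the directional aspect concrete, here's a sketch of one unnormalized 2D Haar step computed on $2 \times 2$ pixel blocks; the band names are a common convention, not something fixed by the transform itself.

```python
import numpy as np

def haar2d_step(img):
    """One unnormalized 2D Haar step on 2x2 pixel blocks.

    Returns the block sums (scaling) and three directional wavelet bands.
    """
    a = img[0::2, 0::2]   # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]   # top-right
    c = img[1::2, 0::2]   # bottom-left
    d = img[1::2, 1::2]   # bottom-right
    scaling    = a + b + c + d        # average behavior of each block
    horizontal = (a + b) - (c + d)    # difference between the two rows of the block
    vertical   = (a + c) - (b + d)    # difference between the two columns
    diagonal   = (a + d) - (b + c)    # diagonal difference
    return scaling, horizontal, vertical, diagonal
```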
tl;dr¶
The version of Haar wavelet transform we will use requires only addition and subtraction to compute the coefficients. Specifically:
- The scaling function coefficient is just the sum of the data values;
- The wavelet coefficient is found by summing adjacent groups of data values, and taking their difference.
This section doesn't discuss the actual algorithms we use. They are simple, and you can read the code to figure out how they work. Instead this is more of a conceptual overview, to give some idea of the arithmetic and patterns involved in calculating the Haar transform. In particular, we are focusing on the variant of the Haar transform which will prove useful in our subsequent analyses, which is a little different from the usual presentation. We also only discuss the one-dimensional version for clarity, but the extension to 2D doesn't really change much except where the $1$'s and $-1$'s show up. You can skip this section without missing out on much, and you will probably get more out of it if you have some basic understanding of linear algebra.
Since the wavelet transform is linear, it is convenient to write it in terms of linear algebra. If you're not familiar with matrices and vectors, don't worry, it's just a shorthand way to write some basic arithmetic. Say the data we wished to transform had only two points, d = [a, b]. We could then write the Haar DWT like this:

$$ w = H d = \left[ {\begin{array}{cc} 1 & -1 \\ 1 & 1 \\ \end{array} } \right] \left[ {\begin{array}{c} a \\ b \\ \end{array} } \right] = \left[ {\begin{array}{c} a - b \\ a + b \\ \end{array} } \right] $$
$H$ is the matrix corresponding to the Haar DWT. The first row of $H$ is the discrete wavelet function, the second is the discrete scaling function. The important point to take away here is that the wavelet coefficients $w$ are obtained only by adding or subtracting values in $d$, with no multiplications (other than by $1$ or $-1$, of course).
Given a basis and coefficients, we can synthesize or reconstruct the original data values via the inverse wavelet transform. Now, many wavelets, including Haar, have the nice property of being orthogonal. One way to think about orthogonality is that the wavelet transform of a wavelet or scaling function always results in a zero coefficient, except where they match. For our example above:
$$ \left[ {\begin{array}{cc} 1 & -1 \\ 1 & 1 \\ \end{array} } \right] \left[ {\begin{array}{c} 1 \\ -1 \\ \end{array} } \right] = \left[ {\begin{array}{c} 2 \\ 0 \\ \end{array} } \right], \left[ {\begin{array}{cc} 1 & -1 \\ 1 & 1 \\ \end{array} } \right] \left[ {\begin{array}{c} 1 \\ 1 \\ \end{array} } \right] = \left[ {\begin{array}{c} 0 \\ 2 \\ \end{array} } \right] $$If a matrix like $H$ is orthogonal, that means its inverse is just the same as the transpose (where you switch rows for columns and vice versa) times some constant. From the above, you can probably guess the number here is $1/2$.
$$ H^{-1} = \frac{1}{2}H^T = \frac{1}{2} \left[ {\begin{array}{cc} 1 & 1 \\ -1 & 1 \\ \end{array} } \right] $$Let's test this with the wavelet coefficients we got above, and see if we get back the original data:
$$ \frac{1}{2} \left[ {\begin{array}{cc} 1 & 1 \\ -1 & 1 \\ \end{array} } \right] \left[ {\begin{array}{c} a - b \\ a + b \\ \end{array} } \right] = \frac{1}{2} \left[ {\begin{array}{c} a - b + a + b \\ -(a-b) + a + b \\ \end{array} } \right] = \frac{1}{2} \left[ {\begin{array}{c} a - b + a + b \\ b - a + a + b \\ \end{array} } \right] = \frac{1}{2} \left[ {\begin{array}{c} 2a \\ 2b \\ \end{array} } \right] = \left[ {\begin{array}{c} a \\ b \\ \end{array} } \right] $$What about longer signals? The decimated (or dyadic) wavelet transform takes $N$ data values and transforms them into $N$ coefficients, with the constraint that $N = 2^J, J = 1, 2, 3, ...$, i.e. $N$ has to be a power of 2. The first step is easy: to make $H$ for longer signals, we can just copy the $2 \times 2$ $H$ matrix above. So for $N = 4$:
$$ \left[ {\begin{array}{cc} 1 & -1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 1 & 1 \\ \end{array} } \right] $$For $N = 8$:
$$ \left[ {\begin{array}{cc} 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{array} } \right] $$And so on.
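If you want to check this pattern programmatically, one way is to build the block-diagonal matrix with a Kronecker product (a quick sketch, with `haar_single_scale` as a made-up helper name):

```python
import numpy as np

H2 = np.array([[1, -1],
               [1,  1]])

def haar_single_scale(n):
    """Block-diagonal matrix: copies of the 2x2 Haar matrix along the diagonal."""
    return np.kron(np.eye(n // 2, dtype=int), H2)

H8 = haar_single_scale(8)
print(H8)
# Orthogonality up to a constant: H8 times its transpose is twice the identity.
print(np.array_equal(H8 @ H8.T, 2 * np.eye(8, dtype=int)))
```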
But this is only one scale. We can see how the Haar transform is going to depend on position, but right now we have only a single scale. We index the scale by $j = 1, 2, 3 ... J$. So the only scale we've looked at so far is for $j = 1$, or $2$ data points. What about the other scales? Let's see how that works for $N = 4$. Start by rewriting $H$ so that the wavelet and scaling coefficients all stick together.
$$ H = \left[ {\begin{array}{cc} 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ \end{array} } \right] $$Now, the magic trick of the wavelet transform is that to get to the next size scale, you just leave the wavelet coefficients alone, and do the trick with the $2 \times 2$ $H$ matrix on the scaling coefficients:
$$ H_{4,2} = \left[ {\begin{array}{cc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 1 & 1 \\ \end{array} } \right] \times \left[ {\begin{array}{cc} 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ \end{array} } \right] = \left[ {\begin{array}{cc} 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & 1 & 1 & 1 \\ \end{array} } \right] $$See how we stuffed the $2 \times 2 H$ in the bottom right of the first matrix? Now you can see that the first two rows of the answer correspond to the wavelets for $j = 1$, or $2$ data points, while the third row is a wavelet spanning $4$ data points, and the last is the scaling function across all 4 points. How about $N = 8$?
$$ \left[ {\begin{array}{cc} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{array} } \right] \left[ {\begin{array}{cc} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{array} } \right] \left[ {\begin{array}{cc} 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{array} } \right] \\ = \left[ {\begin{array}{cc} 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \\ 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 \\ 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ \end{array} } \right] $$Hopefully the pattern is clear. Let's again emphasize the most important points going forward:
Again, only addition and subtraction. This will be important for our gamma-ray data analysis. Even if you don't really get the details of the above, this is really the main thing you need to understand.
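Here's a sketch that builds these multi-scale matrices by composing the stages, so you can check the pattern for any power-of-two $N$; the helper names are made up for illustration.

```python
import numpy as np

def reordered_step(m):
    """Single-scale Haar step on m points, with all the wavelet rows on top."""
    wavelet_rows = np.kron(np.eye(m // 2, dtype=int), np.array([[1, -1]]))
    scaling_rows = np.kron(np.eye(m // 2, dtype=int), np.array([[1, 1]]))
    return np.vstack([wavelet_rows, scaling_rows])

def haar_matrix(n):
    """Full multi-scale (unnormalized) Haar matrix, built by composing the stages."""
    H = np.eye(n, dtype=int)
    m = n                                # size of the scaling block still to transform
    while m >= 2:
        stage = np.eye(n, dtype=int)
        stage[n - m:, n - m:] = reordered_step(m)   # act only on the remaining scaling rows
        H = stage @ H
        m //= 2
    return H

print(haar_matrix(8))   # reproduces the 8x8 matrix above
```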
Just one more note: We don't actually construct these matrices when doing the wavelet transform. Instead there is a fast wavelet transform (FWT) algorithm, which is indeed very fast (basically avoids all the multiplying by zero implied with the full matrix). You might be able to guess what that looks like based on the pattern in the matrices above.
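A minimal sketch of what such a fast, unnormalized Haar transform might look like (not necessarily the exact implementation used in this analysis):

```python
import numpy as np

def fast_haar(data):
    """Unnormalized fast Haar transform via repeated pairwise sums and differences.

    Assumes a power-of-two length. O(N) additions/subtractions in total, and it
    never touches the zeros implied by the full matrix. The output layout matches
    the matrices above: finest-scale wavelet coefficients first, the single
    overall scaling coefficient last.
    """
    x = np.asarray(data, dtype=float)
    out = []
    while x.size > 1:
        out.append(x[0::2] - x[1::2])   # wavelet coefficients at this scale
        x = x[0::2] + x[1::2]           # scaling coefficients feed the next pass
    out.append(x)                        # final overall scaling coefficient
    return np.concatenate(out)

print(fast_haar([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))
```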
© 2023 Dave Dixon