tl;dr
The Skellam distribution models the difference between Poisson variables. Since the sum of Poisson variables is just another Poisson variable, our Haar scaling coefficients will be Poisson distributed. The Haar wavelet coefficients are differences of Poisson variables, and so are modeled by the Skellam distribution.
We made a big deal of choosing the Haar transform, and in particular a formulation where only addition and subtraction are required to compute the scaling and wavelet coefficients. Now it's time for the payoff. Let's look at a small 1D example:
$$ w = H d = \left[ {\begin{array}{cccc} 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & 1 & 1 & 1 \\ \end{array} } \right] \left[ {\begin{array}{c} a \\ b \\ c \\ d \\ \end{array} } \right] = \left[ {\begin{array}{c} a - b \\ c - d \\ (a + b) - (c + d) \\ a + b + c + d \\ \end{array} } \right] $$
Now, you might have guessed by now that we're going to want to play the whole p-value game, but using the coefficients from the wavelet transform rather than operating on pixels. That means we need distributions for those coefficients, so we can compute p-values given the wavelet transforms of the model and the data. Looking at the above, how are the coefficients distributed? We know that the sum of Poisson variables is another Poisson variable, so the scaling coefficient $a + b + c + d$ is itself Poisson distributed, with a rate equal to the sum of the expected rates of the four pixels. Similarly, $a+b$ and $c+d$ are Poisson variables. But what about $a-b$ and the other coefficients involving subtractions?
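Before answering that, here's a quick numerical version of the little example above, just to make the coefficients concrete. The counts for $a, b, c, d$ are made up for illustration:
import numpy
# The 4x4 matrix from the example above: pairwise differences, the difference
# of pair sums, and the total count - addition and subtraction only.
H = numpy.array([[1, -1,  0,  0],
                 [0,  0,  1, -1],
                 [1,  1, -1, -1],
                 [1,  1,  1,  1]])
d = numpy.array([5, 2, 7, 1])  # made-up photon counts a, b, c, d
w = H @ d
print(w)  # [ 3  6 -1 15]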
The difference of Poisson variables turns out to be modeled by the Skellam distribution. Given expected rates $\lambda_1$ and $\lambda_2$, Skellam gives the probability that the difference between the measured counts is $N$. Unlike the Poisson distribution, which is only defined for $N \geq 0$, Skellam is defined for all integers, positive and negative. Here's an example with $\lambda_1 = 3$ and $\lambda_2 = 1$.
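For reference, the probability mass function that scipy.stats.skellam implements can be written in terms of the modified Bessel function of the first kind, $I_{|N|}$:
$$ P(N; \lambda_1, \lambda_2) = e^{-(\lambda_1 + \lambda_2)} \left( \frac{\lambda_1}{\lambda_2} \right)^{N/2} I_{|N|}\!\left( 2\sqrt{\lambda_1 \lambda_2} \right) $$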
from scipy.stats import skellam
import matplotlib.pyplot as plt
import numpy
fig, ax = plt.subplots(1, 1, figsize=(15, 7))
lam1 = 3  # expected rate of the first Poisson variable
lam2 = 1  # expected rate of the second
# Integer support covering essentially all of the probability mass
x = numpy.arange(skellam.ppf(0.000001, lam1, lam2), skellam.ppf(0.99999, lam1, lam2))
ax.set_xticks(x)
# Plot the Skellam pmf for the difference N
_ = ax.vlines(x, 0, skellam.pmf(x, lam1, lam2), colors='b', lw=5)
Going back to our soccer example: if we modeled the goals scored per game by a team as a Poisson variable, we could then model the score difference between two teams using Skellam. Say the Protons average 3 goals per game and the Annihilators average 1. If the two teams played, the distribution of the score difference between the Protons and the Annihilators would be the one shown above. What's the probability that the Annihilators would win? That's just the p-value for $N \leq -1$, i.e. the probability that the Annihilators score at least one more goal than the Protons.
fig, ax = plt.subplots(1, 1, figsize=(15, 7))
lam1 = 3   # Protons' expected goals per game
lam2 = 1   # Annihilators' expected goals per game
N = -1     # "Annihilators win" means a score difference of -1 or less
x = numpy.arange(skellam.ppf(0.000001, lam1, lam2), skellam.ppf(0.99999, lam1, lam2))
ax.set_xticks(x)
# Color the outcomes that count toward the p-value (N <= -1) in red
cs = ['b' if n > N else 'r' for n in x]
_ = ax.vlines(x, 0, skellam.pmf(x, lam1, lam2), colors=cs, lw=5)
The p-value will be the sum of the red lines, continued all the way down to $N = -\infty$. Again, scipy.stats to the rescue:
skellam.cdf(-1, 3, 1)
0.09386311341649485
With this model, we'd estimate about a 9% chance that the Annihilators beat the Protons. More importantly, we can now model the distribution of the Haar wavelet coefficients and compute the associated p-values. Let's adapt the NHST procedure we used for the image pixels, but apply it to the coefficients of the wavelet transform instead. So, given the model's expected photon counts and the measured data counts, we compute the wavelet transform of each, then compute a p-value for each data coefficient under the model's distribution: Poisson for the scaling coefficients, Skellam for the wavelet coefficients. Coefficients with p-values below our chosen false detection rate mark where the data rejects the model.
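Here's a minimal sketch of that per-coefficient test for a single Haar wavelet (detail) coefficient. It's not the full 2D pipeline, just the idea; the helper name, the example numbers, and the two-sided convention are my own choices here. The inputs are the data counts summed over the coefficient's +1 and -1 pixels, and the model's expected counts summed over those same pixels:
from scipy.stats import skellam

def detail_p_value(counts_plus, counts_minus, rate_plus, rate_minus):
    # Measured detail coefficient: (sum over +1 pixels) - (sum over -1 pixels)
    w = counts_plus - counts_minus
    # Under the model, w is Skellam(rate_plus, rate_minus) distributed
    lower = skellam.cdf(w, rate_plus, rate_minus)      # P(W <= w)
    upper = skellam.sf(w - 1, rate_plus, rate_minus)   # P(W >= w)
    return min(1.0, 2 * min(lower, upper))             # two-sided p-value

# e.g. a coefficient where the data shows an excess on the "plus" side
print(detail_p_value(12, 3, 5.0, 4.0))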
Here's what we get for the 1-2 GeV energy band with false detection rate $\alpha = 10^{-6}$ and using the translationally invariant 2D Haar transform:
Let's remind ourselves of the result with the same $\alpha$ when we did the test with pixel values:
Clearly our use of wavelets instead of pixels is spectacularly successful, in the sense that it does a much better job of capturing our prior expectation that the expected counts model shows structure at many different size scales. The wavelet result also has some mathematical oomph compared to the simple filtering result: the filter tends to highlight structure only at the size scale we chose for it, while the wavelets are explicitly multiresolution. In fact, we can gain more insight by looking at the multiresolution analysis (MRA), to see what sort of structure the wavelet transform detected at different scales in the residual:
Based on our FDR of $10^{-6}$, we would expect about 1 out of every million coefficients to be a false detection. There's no way to make a similar statement about the simple filter results (at least not without lengthy Monte Carlo calculations). We can further see how statistically different the data coefficients were from the model by visualizing the p-values. We'll only look at the wavelet coefficients here; the scaling function effectively has only one coefficient value in this case, which is the total number of counts in all the map pixels. Note that the color scale changes for the different size scales.
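A quick aside to make that expectation concrete: a toy sketch with made-up numbers, drawing a million uniform stand-in p-values (as if the model were exactly right everywhere) and thresholding them at $\alpha$. On average, about $\alpha$ times the number of coefficients survive as false detections:
import numpy

alpha = 1e-6
rng = numpy.random.default_rng(0)
p_values = rng.uniform(size=1_000_000)        # stand-in p-values under "model is correct"
detected = p_values < alpha                   # coefficients that would (falsely) make the cut
print(detected.sum(), alpha * p_values.size)  # observed vs expected number of false detections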
To really drive the point home, let's recall the simple filtering for 500-1000 GeV, where we only had a couple of thousand detected photons total:
And for our wavelet thresholding method:
And we found...very little. And this is probably not so surprising, given the sparseness of the data and an FDR of $10^{-6}$. But unlike the simple filtering result, we can state our precise statistical confidence. Again, we can take a look at the p-values for more insight. Unlike the above, we'll keep the color scale constant, capped at a value corresponding to $\alpha/2$. So everything that appears bright white probably made the cut, while everything else was below the threshold.
The p-value maps should provide a better sense of the continuous nature of scientific belief. Our NHST process necessarily involves pinning down some value for the false detection rate to arrive at a final map (note: there's an argument for choosing different FDR values for different wavelet scales). But there's nothing profound or special about that choice. It simply informs our belief that the wavelet coefficients of the difference between data and model correspond to a rejection of the model by the data at our chosen FDR. Provided we keep that in mind when interpreting the results, we are free to choose other values of $\alpha$. Below are examples for a range of $\alpha$, for both the 1-2 and 500-1000 GeV bands.
Let's sum up how we got here:
- We model the sky as expected photon counts per pixel, so the measured counts in each pixel are Poisson distributed.
- We chose the Haar transform, in a formulation that needs only addition and subtraction, so the scaling coefficients of the counts are sums of Poisson variables (and hence Poisson), while the wavelet coefficients are differences of Poisson variables (and hence Skellam).
- Given the wavelet transforms of the model and the data, we can compute a p-value for every coefficient.
- Thresholding those p-values at a chosen false detection rate $\alpha$ tells us where, and at what scales, the data rejects the model.
Hopefully this not only illustrates the details of this method, but also provides an example of how we can use mathematics to reason effectively about the answers we get when posing scientific (or any other) questions of some data.
Here is a more "sciencey" version showing some detailed results for the Fermi bubbles.
© 2023 Dave Dixon