The Poisson Distribution

tl;dr

The Poisson distribution is a statistical model which can be used to estimate the probability of measuring some number of events over a time period, assuming

  • The rate at which events arrive is constant, and,
  • The arrival time of each event is independent of previous events.

The Poisson distribution allows us to answer questions about the data we observed (or might observe), assuming that our model is true.

Most scientific investigations are (or should be) attempting to answer questions in one of two broad categories:

  1. How much do I believe a hypothesis, given that I have observed some data and have other prior information (like knowledge of physics, previous experimental results, etc.)?
  2. How much do I believe that the data I measured would have been seen if some hypothesis were true (again, perhaps including prior knowledge)?

These may sound the same, but they're really not, because the first answers questions about the hypothesis, and the second about the data. For example, if I wanted to test multiple hypotheses and decide which one I believed the most, I'd need to follow the path of (1). Here we are interested in (2). We've measured photons from the gamma-ray sky, we have a model of the expected counts over the same time period, and we want to see where the data disagrees with that model. Those areas of disagreement are targets for new or improved theories.

In both cases we would not be dealing with absolute values of true or false, but a continuum of values between true and false, something like a probability. This is usually constrained to fall between 0 and 1. The closer to 1, the more we believe, with less belief as we move toward 0. We should start by asking a mathematically well-posed question, which will be this:

"If events arrive independently at a constant expected rate $\lambda$, what is the probability of observing exactly $N$ events over a given time period?"

Fortunately for us, this problem has already been solved, the answer being the Poisson distribution. We won't go into depth about the motivation or derivation of Poisson, but if you're interested, this is a good article (open in a private window if you hit the paywall). Below we show an example for $\lambda = 7$:
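Something along these lines produces such a plot using scipy.stats and matplotlib (the plotting details are illustrative, not necessarily the original code):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

lam = 7                  # the expected rate, lambda
n = np.arange(0, 21)     # an interesting range of counts N

# Probability of each count N under a Poisson distribution with rate lambda
plt.bar(n, poisson.pmf(n, lam))
plt.xlabel("N")
plt.ylabel(r"$P(N \mid \lambda = 7)$")
plt.show()
```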

Here we have used the poisson class from the Python module scipy.stats to plot the probabilities of the Poisson distribution over an interesting range of values. Points to note:

  • The distribution is discrete: it is defined only for integer counts $N \geq 0$.
  • The most probable counts sit near the expected rate $\lambda = 7$.
  • The distribution is not symmetric, with a longer tail toward larger $N$.

We can use the pmf function of poisson to calculate the probability that we would see $N$ counts, given $\lambda$. So for $\lambda = 7$ and $N = 9$:
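With scipy.stats this is a one-liner:

```python
from scipy.stats import poisson

poisson.pmf(9, 7)   # probability of N = 9 given lambda = 7, about 0.10
```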

You can interpret this number as follows: if I measured event counts from a Poisson process with $\lambda = 7$, I would expect about 10% of the measurements to be $N = 9$. The Poisson distribution answers question (2) above. It tells us something about the data we observe (or might observe) if we assume our model is true, in this case the value of $\lambda$. By itself, Poisson does not allow us to answer question (1), like "how much do I believe $\lambda = 7$ instead of $\lambda = 7.5$?"

The Poisson distribution can model anything meeting the conditions defined above (a constant event rate and independent arrivals). For example, we could consider using it to model the number of goals scored per game by a soccer team. Is this an accurate model? It's not too difficult to poke some holes in the assumptions:

  • The scoring rate is unlikely to be constant: it depends on the opponent, the score, substitutions, fatigue, and so on.
  • Goals are unlikely to be independent: a team that concedes will often change tactics, which changes the chance of further goals.

Does this mean Poisson is a bad choice for modeling goals per game? There is a famous quote from statistician George Box, often repeated as:

"All models are wrong. Some are useful."

There's no evidence Box actually said this (again, private window if you hit the paywall). The closest actual recorded quote is this:

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”

So the question of whether Poisson is a "good" model for soccer goals really depends on your purpose, and whether the assumptions and approximations are acceptable for that purpose. The big advantage of accepting the Poisson approximations for our model is that we can use a solved problem, and quickly calculate statistical quantities of interest using well-tested software libraries like scipy.stats. If you need more accuracy or don't find the assumptions an acceptable approximation, you'll have to solve a more difficult problem, maybe write your own software, which could have bugs, etc. Regardless, the important point is that we are transparent about our assumptions and approximations. Using Poisson immediately communicates this information, and everyone will at least understand what limitations we have accepted in our model.

We can examine Poisson as a suitable model for our gamma-ray problem. It is, again, an approximation, and in most cases probably technically "wrong". For example, we know that many gamma-ray sources such as quasars have highly variable intensities, very much not constant in time. That said, this approximation is very widely accepted by scientists as "good enough", because the more exact alternatives are generally much harder or impractical to work with, and the benefits in terms of scientific understanding are likely negligible in most cases.

Sum of Poisson Variables

tl;dr

The distribution of the sum of Poisson variables is just another Poisson distribution.

Say we have two radiation detectors. One is counting a radioactive source emitting an expected 1.2 counts per second, while the other source has an expected rate of 2.5 counts per second. We know the distribution of counts per second for each individual detector given the expected rates. But what about the total counts per second from both detectors?

The Poisson distribution has the useful property that the sum of Poisson variables is just another Poisson variable, whose expectation is the sum of the rates. So in the example above, the distribution of the total counts from both detectors would be Poisson with expected rate 1.2 + 2.5 = 3.7 counts per second.
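As a quick check (this simulation is illustrative, not part of the original analysis), we can draw one second's worth of counts from each detector many times and look at the summed totals:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
counts1 = poisson.rvs(1.2, size=100_000, random_state=rng)  # detector 1
counts2 = poisson.rvs(2.5, size=100_000, random_state=rng)  # detector 2
total = counts1 + counts2

# For a Poisson variable the mean equals the variance, so both should be
# close to 3.7 = 1.2 + 2.5 for the summed counts.
print(total.mean(), total.var())
```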

Comparing Data with Model

tl;dr

One way of comparing data with a model is to assume the model is true, and calculate the probability of measuring a data value equal to or more extreme than the observed value. This probability is called the p-value. If the p-value is small enough (where we have to choose what "small enough" means), then we say the data rejects the model; otherwise the data fails to reject the model.

Remember the question we wish to answer with the Fermi LAT data: given a model describing the number of expected gamma-rays in each pixel of the sky, where is that model "consistent" with the data, and where is it rejected? Statisticians answer this sort of question with "null hypothesis significance testing" (NHST), where we have two hypotheses:

  1. The null hypothesis: the observed counts were drawn from the model, i.e. the data is consistent with the model.
  2. The alternative hypothesis: the observed counts were not drawn from the model, i.e. the data is not consistent with the model.

Note that this is very much a true/false question, where we said above that scientific questions do not have such absolute answers, but rather yield a degree of belief between these extremes. Here this enters in the definition of "consistent", which we define via a false detection rate (FDR). The FDR, often denoted with the Greek letter $\alpha$, gives the probability that our data would incorrectly reject the null hypothesis. So if we chose $\alpha = 0.05$, and get the "data is not consistent with the model" answer, we would estimate there is a 5% chance that this data could have been seen if the model were true.

The key statistic for NHST is the p-value, which is defined as the probability we would have seen a value equal to or more extreme than the data, assuming the model is true. "More extreme" is perhaps best illustrated with a picture:
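A sketch that generates the picture, coloring in red the counts at or above the observed value (again, plotting details are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

lam, n_obs = 7, 10
n = np.arange(0, 21)
probs = poisson.pmf(n, lam)

# Color the bars that contribute to the p-value (N >= 10) in red.
colors = np.where(n >= n_obs, "red", "C0")
plt.bar(n, probs, color=colors)
plt.xlabel("N")
plt.ylabel(r"$P(N \mid \lambda = 7)$")
plt.show()
```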

Here, our model is Poisson with $\lambda = 7$ and our data is $N = 10$. The values included in the p-value are shown in red, and actually extend all the way to $N = \infty$. That would be impossible to calculate directly, but fortunately, scipy.stats provides us some more handy functions:
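In particular, the survival function sf, which gives $P(X > N)$, and the cumulative distribution function cdf, which gives $P(X \leq N)$:

```python
from scipy.stats import poisson

poisson.sf(10, 7)    # P(N > 10 | lambda = 7), i.e. the sum from N = 11 to infinity
poisson.cdf(10, 7)   # P(N <= 10 | lambda = 7)
```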

So we can then calculate the p-value for $N = 10$ as
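```python
# P(N >= 10 | lambda = 7) = P(N = 10) + P(N > 10), about 0.17
poisson.pmf(10, 7) + poisson.sf(10, 7)
```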

So there's about a 17% chance that, if we expect a count of $7$, we will see a count of $10$ or more. We have to include pmf(10, 7) because sf(10, 7) actually starts at $N = 11$.

What if $N = 3$?

In this case, we would calculate the p-value of 3 as the total probability that $N \leq 3$, which is just the CDF:
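```python
poisson.cdf(3, 7)   # P(N <= 3 | lambda = 7), about 0.08
```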

Let's write a pvalue function for later use:
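A minimal version, assuming we take the upper tail when $N$ is at or above the expected rate and the lower tail otherwise (which reproduces the numbers used below):

```python
from scipy.stats import poisson

def pvalue(n, lam):
    """One-tailed Poisson p-value for an observed count n, given expected rate lam."""
    if n >= lam:
        # upper tail: P(X >= n) = P(X = n) + P(X > n)
        return poisson.pmf(n, lam) + poisson.sf(n, lam)
    # lower tail: P(X <= n)
    return poisson.cdf(n, lam)
```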

So we have all of the ingredients now to perform the NHST, given model $\lambda$ and data $N$:
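As a sketch (the function name is just for illustration):

```python
def reject_model(n, lam, alpha):
    """True if the observed count n rejects Poisson(lam) at false-detection rate alpha."""
    return pvalue(n, lam) < alpha / 2
```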

The $\alpha/2$ is required because the distribution has two tails, and the total FDR is spread between them.

Let's do some examples. Choose $\lambda = 3.2$, and $\alpha = 0.1$. Is the model rejected at FDR $\alpha$ for $N = 5$?
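Using the pvalue sketch above:

```python
pvalue(5, 3.2)   # about 0.22
```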

No, because 0.22 is greater than $\alpha / 2$. How about for $N = 10$?
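```python
pvalue(10, 3.2)   # about 0.002
```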

Yes, 0.002 is less than $\alpha/2$. $N = 0$?
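```python
pvalue(0, 3.2)   # about 0.04, the lower (N <= 0) tail this time
```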

Yes.

We can also simulate some data for our Poisson model, and get some feel for what false-detection rate really means.
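Here's a sketch of such a simulation, mirroring the tail logic of the pvalue function above (the random seed is arbitrary, so the exact count will differ a little from run to run):

```python
import numpy as np
from scipy.stats import poisson

lam, alpha = 3.2, 0.1
rng = np.random.default_rng(42)
samples = poisson.rvs(lam, size=100_000, random_state=rng)

# p-values for every sample at once, using the same tail choice as pvalue()
upper = poisson.pmf(samples, lam) + poisson.sf(samples, lam)
lower = poisson.cdf(samples, lam)
pvals = np.where(samples >= lam, upper, lower)

# Count how many samples drawn from the true model are (falsely) rejected.
print((pvals < alpha / 2).sum())
```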

So out of 100000 samples, we "got it wrong" 8546 times, in the ballpark of 10% (for small $\lambda$ the discrete nature of Poisson means we aren't going to nail our FDR exactly; the approximation gets better with larger $\lambda$). In other words, 8546 times out of 100000 we got random Poisson data from our model distribution which rejected the model given our choice of $\alpha = 0.1$, i.e. we falsely rejected the model despite it being true. But there's nothing magical about any given choice of $\alpha$, it really just reflects our desire around what confidence we'd like to put in the results. 10% is pretty relaxed. You may often see $\alpha = 0.05$ used as a threshold in scientific papers, but again this is arbitrary, and if you think about it, also not very stringent. Let's pick a stronger FDR, like $\alpha = 10^{-6}$:
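Reusing the p-values from the simulation above:

```python
alpha = 1e-6
print((pvals < alpha / 2).sum())   # zero false detections at this threshold
```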

Zero false detections, which is a little more cozy. Choosing smaller FDRs means we have greater confidence that the results are not simply an accident of noise in the data. The tradeoff is that a smaller FDR means we throw away more things which might be real (in the limit of $\alpha = 0$, we throw away everything).

Let's again emphasize exactly what's being tested, and what you can and can't say about the results. We are rejecting or failing to reject the model based on the observed count $N$, our choice of $\alpha$, and the p-value that we would have seen data equal to or more extreme than $N$ if the model is true. The only firm quantitative statement we can make is that the model was rejected (or not) with FDR $\alpha$. You cannot, for example, say that the model is accepted just because it passed your p-value threshold, only that you failed to reject. A common mistake is to take the result of an NHST as the strength of evidence for the model, e.g. "I chose $\alpha = 0.05$, failed to reject the model, so the model is correct with 95% confidence."

This perhaps sounds like splitting hairs, so let's look at an example to gain a little intuition. Say you selected $\alpha = 0.1$. You would reject the model for any p-value less than $\alpha/2$, whether it was $0.0499999999999$ or $0.0000000000000001$. Similarly, you fail to reject the model for both $0.05000000000001$ or $0.5$. So while the hypothesis test is useful (and we will be using it moving forward), the p-value itself perhaps better captures the "scientific spirit" of quantifying the degree of belief (provided we're careful about exactly which question we're answering). And hopefully these examples have shown that NHST is really focused on questions about the data, given a model. If we want to assess belief in different models given some data, we need to employ a different approach. But that's a topic for another day.

Let's make the approximation that each of our pixels can be treated as a statistically independent measurement (strictly speaking, this isn't true; we'll discuss this further later). Then each model pixel contains an expected number of photons, and thus has its own implied Poisson distribution. Let's calculate the p-values for the observed pixel counts, assuming the model:
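A sketch of the per-pixel calculation; `counts` and `model` here are hypothetical toy arrays standing in for the real Fermi LAT counts map and model map:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

def pvalue_map(counts, model):
    """One-tailed Poisson p-value in each pixel, given the expected counts `model`."""
    upper = poisson.pmf(counts, model) + poisson.sf(counts, model)
    lower = poisson.cdf(counts, model)
    return np.where(counts >= model, upper, lower)

# Toy stand-ins for the real maps: `model` holds the expected counts per pixel
# and `counts` the observed counts (here simply simulated from the model itself).
model = np.full((180, 360), 3.2)
counts = poisson.rvs(model, random_state=42)

pvals = pvalue_map(counts, model)
image = -np.log10(np.clip(pvals, 1e-15, 1.0))   # -log10(p), capped at 15

plt.imshow(image, origin="lower")
plt.colorbar(label=r"$-\log_{10}(p)$")
plt.show()
```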

The intensity scale above is the negative of the base-10 logarithm of the p-value in each pixel. We've also capped the "maximum" value at 15. "15" actually means a very small p-value of $10^{-15}$, meaning a $1$ in $10^{15}$ chance we would have observed data at least that extreme if the model were true, which is pretty unlikely.

Our discussion of the NHST maybe suggests a way of keeping only the "statistically interesting" differences between the data and model, something like: flag only those pixels where the observed counts reject the model at our chosen false-detection rate.

Which after a bit of head scratching, boils down to: keep the data-minus-model difference in any pixel whose p-value is less than $\alpha/2$, and set all other pixels to zero.

The threshold_pixels function below accomplishes this in Python:
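Here's a guess at its shape, reusing the pvalue_map sketch above (treat the details as assumptions): keep the data-minus-model difference wherever the pixel's p-value falls below $\alpha/2$, and zero out the rest.

```python
import numpy as np

def threshold_pixels(counts, model, alpha):
    """Data-minus-model difference where the model is rejected at FDR alpha, zero elsewhere."""
    pvals = pvalue_map(counts, model)     # per-pixel p-values, as sketched above
    return np.where(pvals < alpha / 2, counts - model, 0.0)
```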

To try this, we need to pick a false-detection rate. Let's try $\alpha = 10^{-6}$:

Hmm, not much to look at there. Let's relax that, and maybe go with the "accepted" value of $\alpha = 0.05$:

So we see more stuff, but don't seem to be making a lot of progress in scientific interpretation. And "more stuff" means we also let in more noise. We can estimate how much by multiplying the total number of tests (equal to the number of pixels here) by $\alpha$: $3600 \times 1800 \times 0.05 = 324000$. So over $300K$ pixels have potentially "falsely" rejected the model.

How to proceed? To get a clue, let's compare the above with the raw difference between data and model:

Which image "feels" like it has more useful information: the raw one, or those where we did the p-value thresholding? If we're being honest, the answer might depending on the scientific question. If we're interested in point-like sources, those very bright spots due to quasars, pulsars, and the like, then perhaps the first would be useful (in fact, in this case we'd probably just want to follow the standard likelihood analysis usually performed for Fermi LAT data). But let's remind ourselves what the model actually looks like:

We see lots of stuff which we might label "large scale structure", features spanning many, many pixels. This isn't just some effect of pareidolia; it's exactly what we expect based on prior knowledge. The model is created from observations of the distribution of matter density in the Galaxy. So we certainly expect to see the plane of the Milky Way, along with other structures like various gas clouds. See, for example, this map of hydrogen for comparison:

Scientific interpretation is always performed in the context of prior expectations. There's no such thing as complete objectivity, where the analysis is performed without referencing that context. This is reflected even in the procedure of running the image through a low-pass filter, which reduces the information at the smallest size scales in favor of information at larger scales. Based on our knowledge of physics and the distribution of matter and cosmic rays in the Galaxy, we expect that the larger-scale structure is less likely to be just an "accident of statistics".

When we did the p-value test using pixels, we effectively asserted that each model pixel was independent of all others, in the sense that knowing the expected counts in one pixel would give us no information about the expected (or measured) counts in any other pixels. That clearly isn't the case here. Our prior knowledge of matter and cosmic ray distributions and the associated physics means that looking at any particular model pixel gives us a lot of information about its neighbors. We have a strong expectation that neighboring pixels will not have radically different values. Further, we expect to see such correlations across multiple size scales. We completely ignore this prior information when we do p-value thresholding with pixels, so it shouldn't be surprising that we find the results disappointing, in the sense that they fail to match our expectations. Similarly, even in the raw difference image, we "see" such correlated structure on different size scales, matching our expectations. Is this just pareidolia? Again, we must turn to mathematics, so we can reflect our prior beliefs mathematically, and perform our p-value testing in the appropriate context of those expectations.


© 2023 Dave Dixon