If you don't want to read my personal babbling...

...here are the links to the other parts of this writeup. I recommend reading them in order. Or you can just read the tl;dr version.

  1. Introduction
  2. Poisson Statistics
  3. Wavelets
  4. The Skellam Distribution

The code behind this can be found at https://github.com/sparkofreason/fermi2. It is currently in various states of usability; in particular, the notebooks won't run without the data, which I need to clean up and upload someplace. The core analysis algorithms can be run; that code is located in the fermi folder.

Preface

Data analytics often proceeds as follows:

  1. Start with pile of data;
  2. Fiddle around with various ad hoc algorithms, charts, etc;
  3. Arrive at some results which "make sense", i.e. have some sort of intuitive appeal, or tell a nice story;
  4. Convince others that your "makes sense" really does "make sense".

There are at least two aspects of this which are dubious. Starting with data and proceeding to conclusions is one. Starting at the data end should probably only lead to more questions, which you then try to answer via other means (preferably using other, independently gathered data). But the one I want to focus on here is the "makes sense". People say this all the time, but what does it even mean? I'll grant you that humans are remarkably good at intuitive reasoning (compared, say, to "artificial intelligence" like ChatGPT). But this is a weak foundation upon which to predicate high-dollar business decisions, choices which might cost people's lives, or even just scientific advancement.

I've certainly been guilty of this. In the 1990s I did some research in gamma-ray astrophysics, mostly involving analysis of data from the Compton Gamma Ray Observatory, and for part of that time followed the path above. The statistics education I had received during undergraduate and graduate physics studies amounted to a single semester course, and even that was nothing more than a bunch of recipes with little conceptual underpinning. So I, like many, had a "lies, damned lies, and statistics" viewpoint, and relied mostly on "makes sense" sorts of arguments, often expanded to punishing proportions. I mean, I wrote that paper, and reading it today makes my head hurt. And I wasn't alone in this view. I remember giving a scientific talk and having a prominent PhD physicist actually stand up and say "You can use statistics to get any answer you want", in front of a whole room of other PhDs, and nobody blinked.

Sometime around 1997 I came across this paper by Eric Kolaczyk. Initially it just "made sense", and I thought it would be interesting to extend to the two-dimensional case for working with gamma-ray image data. But talking to Eric was a major eye-opener, and I realized his approach actually "made sense" from a more rigorous mathematical perspective, rather than just some sort of hand-wavy intuition. Eric's method started with a well-defined question, worked through various mathematical steps and approximations, and arrived at the final algorithm. The conclusions you could draw from the results had some real quantitative "oomph", something beyond just an opinion, not just "it seems reasonable and kind of looks like astrophysics".

Anyway, Eric was able to make the extension to 2D, and added a further enhancement which, when applied to data from the EGRET telescope, led to the discovery of the galactic gamma-ray halo. The experience of publicizing this result really drove home the power of the math-based approach. For instance, the announcement of the halo discovery was made at a meeting of the American Astronomical Society (in the same hotel which inspired "The Shining"; yes, that was freaky). During the pre-meeting reception, a senior member of the EGRET team told me something like "I intend to be critical of your result. I've seen this sort of analytical method come and go many times." Which was quite reasonable, given the checkered history of such things which "made sense" at the time and subsequently proved to be...lacking. I suggested we grab a table, and proceeded to walk him through the math, literally scribbling on cocktail napkins. After about 15 minutes he had completely changed his opinion. And that's the true power of math: not only can you use it to solve problems, but more importantly others can dissect and verify that solution, without recourse to opinions.

Not long after all that I left academic science for greener pastures in Silicon Valley, and largely lost touch with the field. Sometime around 2012 I read about the discovery of the Fermi Bubbles, new large-scale gamma-ray features revealed by EGRET's much more powerful successor, the Fermi Space Telescope. And I just assumed they had used the same analysis method as was applied for the gamma-ray halo. I mean, their paper even referenced ours, so the authors were presumably aware of the technique. But no, the approach there was pretty much what is described in the first paragraph above, which clearly had remained entrenched as "acceptable practice" in gamma-ray astrophysics. And I felt like I had failed to spread the word that there really was a better way.

Around Christmas of 2021, I had some extra time due to my employer's generous holiday time-off policy, and finally decided to do something about this. The original code for our algorithm had been long lost (there was no Github in those days), but as a pretty proficient software engineer and data scientist type, I figured I could hack it together in short order. I downloaded 13 years' worth of Fermi data and set to work, and pretty quickly realized why, perhaps, our neato-keen approach had not gained wider traction. I had jumped back into the paper describing the method, and found myself having a pretty difficult time getting my head around it. And I had helped write that paper. You'd think that after a couple more decades of working with more advanced statistics and math, picking it up again would have been a no-brainer. If it wasn't, then perhaps it's not surprising that most scientists, presumably with about the same one-semester course of stats education that I had received, hadn't picked it up either.

So, that's why we're here. Nominally it is to describe in detail the method and the reasoning supporting it. More broadly, I hope this can serve as an example of how you can use math and statistics to ask questions of data and get quantifiable answers supported by actual reasoning, instead of just making 💩 up and trying to convince people that it's really ice cream.

Acknowledgments

This wound up being a much bigger project than I had anticipated, and would not have been successful without some help. A big "thank you" to those below for giving me their valuable time and expertise, particularly given that I'm no longer "in science", and could have easily been dismissed as a crackpot.


© 2023 Dave Dixon