I want this to be a helpful resource for newcomers to the field of *Bayesian*
machine learning. The objective here is to collect relevant literature
that brings insights into modern inference methods. Of course, this requires me
to extract insights myself to be sure that the papers I list here are
meaningful. Therefore, this post remains a living document.

I will post commentary, when I can, on what to expect when reading the
material. Often, however, I will only list the materials in the
recommended reading order. The overall sequence
in which topics should be considered is harder to prescribe. ~~I do,
however, suggest that this not be your first excursion into machine learning.~~
I now encourage making this perspective your first foray into machine learning.

When diving deep into a topic, we often find ourselves too close to the action.
It is important to start with and keep the bigger picture in mind. I recommend
the following to get a feel for the fundamental thesis around being *Bayesian*.
It is not a silver bullet, but a set of common-sense principles to abide by.

- In my introductory article, *The Beauty of Bayesian Learning*, I describe the essence of Bayesian learning using a simple pattern guessing demo.
- *Second opinion needed: communicating uncertainty in medical machine learning* provides a broad survey grounded in real-world applications of the need to quantify uncertainty.
- PRML Chapter 1 ^{@bishop2006pattern} is the place to start for a succinct treatment on the topic.
- The ideas can be further reinforced through DJCM's PhD Thesis, Chapter 2.
- AGW's PhD Thesis Chapter 1 provides a broader background on the big picture.

Less so now, but the subjectivity of the prior is often brought into question. This is unfortunately a misdirected argument because without subjectivity, "learning" cannot happen and is in general an ill-defined problem to tackle. Subjective priors, however, are not the only thing that being Bayesian brings to the table.

Many people, including seasoned researchers, have the wrong idea of what it
means to be Bayesian. Stating prior assumptions *does not* make one a Bayesian.
In that sense, everyone is a Bayesian because they build algorithms starting
with priors, whether they know it or not. I die a little when people equate
Bayesian methods with simply regularizing with the prior. That is an effect often
misconstrued. For instance, take a look at this fun post by Dan Simpson,
"The king must die",
on why simply assuming a Laplace prior does not imply sparse solutions, unlike
its popular *maximum a-posteriori* variant known as the Lasso.
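To make the distinction concrete, here is a minimal one-dimensional sketch. It is not Simpson's argument itself, only a toy under assumed settings: a single observation `y ~ N(w, 1)`, a Laplace prior `p(w)` proportional to `exp(-lam * |w|)`, and illustrative values for `y_obs` and `lam` that do not come from any of the readings. The MAP estimate is the familiar soft-thresholding rule and can be exactly zero, while the posterior mean, computed here by brute-force integration on a grid, is not.

```python
import numpy as np

def soft_threshold(y, lam):
    """MAP estimate of w for y ~ N(w, 1) with prior p(w) proportional to exp(-lam * |w|)."""
    return np.sign(y) * max(abs(y) - lam, 0.0)

def posterior_mean(y, lam, grid):
    """Posterior mean of w under the same model, by numerical integration on a grid."""
    log_post = -0.5 * (y - grid) ** 2 - lam * np.abs(grid)
    weights = np.exp(log_post - log_post.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return float(np.sum(grid * weights))

# Illustrative (assumed) values: a weak observation and a unit-scale Laplace prior.
y_obs, lam = 0.5, 1.0
grid = np.linspace(-10.0, 10.0, 20001)

w_map = float(soft_threshold(y_obs, lam))   # exactly zero, since |y_obs| <= lam
w_mean = posterior_mean(y_obs, lam, grid)   # strictly positive

print(f"MAP estimate:   {w_map}")
print(f"Posterior mean: {w_mean:.3f}")
```

The sparsity here is a property of the optimization (the mode sits on the kink at zero), not of the posterior, which places no point mass at zero.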

When explaining the data using a model, we usually have many competing
hypotheses available, naturally leading to the *model selection* problem.
The *Occam's razor* principle advocates that we must choose the simplest possible
explanation. Bayesian inference shines here as well by automatically embodying
this "principle of parsimony".

- ITILA Chapter 28 ^{@mackay2004information} describes how Bayesian inference handles "automatic Occam's razor" quantitatively.
- Seeing the ever increasing complexity of neural network models, one may doubt the validity of Occam's razor, perhaps sensing a contradiction. Rasmussen & Ghahramani, in their paper titled *Occam's razor*, resolve this through a simple experiment. Maddox & Benton et al. provide an excellent realization of this principle for large models in *Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited*.
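The "automatic Occam's razor" can be sketched numerically with Bayesian polynomial regression, where the marginal likelihood (evidence) is available in closed form. This is only an illustration under assumed settings: the helper `log_evidence`, the prior precision `alpha`, the noise variance, and the synthetic quadratic data are all choices made here, not taken from the readings. With weights `w ~ N(0, I / alpha)` and Gaussian noise, the targets are marginally Gaussian with covariance `noise_var * I + Phi @ Phi.T / alpha`.

```python
import numpy as np

def log_evidence(x, y, degree, alpha=1.0, noise_var=0.01):
    """Log marginal likelihood of a polynomial model of the given degree.

    Assumes w ~ N(0, I / alpha) and y = Phi w + eps, eps ~ N(0, noise_var * I),
    so that marginally y ~ N(0, noise_var * I + Phi Phi^T / alpha).
    """
    Phi = np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    cov = noise_var * np.eye(len(x)) + Phi @ Phi.T / alpha
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (quad + logdet + len(x) * np.log(2 * np.pi))

# Synthetic data from a quadratic (an assumption made for this illustration).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 0.5 - x + 2.0 * x**2 + rng.normal(scale=0.1, size=x.shape)

log_ev = {degree: log_evidence(x, y, degree) for degree in range(8)}
for degree, value in log_ev.items():
    print(f"degree {degree}: log evidence = {value:.2f}")
```

Models that are too simple (degree 0 or 1) are heavily penalized because they cannot explain the data. Higher degrees fit at least as well but spread their prior mass over many more possible datasets, so one would typically expect the evidence to flatten or drop beyond the true degree, although the exact ranking among the larger models depends on the noise draw.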

*Bayesian model averaging* (BMA) is another perk enjoyed by Bayesians, which
allows for *soft model selection*. See *Bayesian Model Averaging: A Tutorial*
for a classic reference. Andrew G. Wilson clarifies the value it adds in a technical report titled *The Case for Bayesian Deep Learning*. Unfortunately, BMA is often misconstrued as
model combination. Minka dispels any misunderstandings
in this regard in his technical note *Bayesian model averaging is not model combination*.

The *Frequentist-vs-Bayesian* debate has unfortunately occupied more minds than
it should have. Any new entrant to the field will undoubtedly still come across
this debate and be forced to take a stand (make sure you don't fall for the trap).
Christian Robert's answer on Cross Validated is the best technical introduction to start with. Then, I highly recommend this
talk by a dominant figure in the field, *Michael Jordan*, titled *Bayesian or Frequentist, Which Are You?* (Part I, Part II). Having read and listened to all this,
one should always keep Robert E. Kass's excellent exposition, *Statistical Inference: The Big Picture*, on their reading list. Every time someone starts this debate again, ask them to read this first.

Gelman and Yao describe *Holes in Bayesian Statistics*, which may be a worthwhile read
at a later stage.

On a concluding note, I would refrain from labelling anyone or any algorithm as exclusively Bayesian. If one is still hell-bent on being labeled, remember that keeping an open mind is the hallmark of a true Bayesian.

Handy references, so that one doesn't always have to remember those tricky identities that come up commonly.

- Sam Roweis provides *Gaussian Identities*, a handy reference. See also PRML Chapter 2.3 ^{@bishop2006pattern}.
- *The Matrix Cookbook* by Kaare Brandt Petersen, Michael Syskind Pedersen

Gaussian Process (GP) research interestingly started as a consequence of the popularity and early success of neural networks.

- DJCM's Introduction, Sections 1-6 ^{@mackay1998introduction} to understand where GPs come from. A single reading before the next should help calibrate the mindset. I also recommend returning to this once more after the next reading.
- GPML Chapters 1, 2, 3 ^{@williams2009gaussian} for a detailed treatment on the usual regression and classification problems.
- LWK Chapter 1 ^{@scholkopf2001learning} is worth a read for a big picture view of kernel machines. It does not, however, present a Bayesian perspective, but an optimization perspective. Nevertheless, it is a useful perspective.
- GPML Chapter 5 ^{@williams2009gaussian} to understand how model selection behaves with GPs, and key caveats to look out for, especially regarding Bayesian Model Averaging. It also has a nice example of a non-trivial composite kernel.
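For readers who prefer to see the regression equations in action before reading, the following is a minimal NumPy sketch of GP regression with a squared-exponential kernel, using the standard Cholesky-based posterior computation. The helper names (`rbf_kernel`, `gp_posterior`), the hyperparameter values, and the toy data are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and covariance of the latent function at x_test."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov

# Toy data (assumed): a few evaluations of sin(x).
x_train = np.array([-2.0, -1.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.array([-1.0, 0.5, 5.0])

mean, cov = gp_posterior(x_train, y_train, x_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
for xs, m, s in zip(x_test, mean, std):
    print(f"x = {xs:+.1f}: mean = {m:+.3f}, std = {s:.3f}")
```

The predictive uncertainty is small at a training input and reverts to the prior standard deviation far from the data, which is the behavior the readings above explain in detail.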

The non-parametric nature of Gaussian Processes is slightly at odds with scalability, but we've made considerable progress through first principles in this regard as well.

Covariance functions are the way we describe our inductive biases in a Gaussian Process model and hence deserve a separate section altogether.

- GPML Chapter 4 ^{@williams2009gaussian} provides a broad discussion around where covariance functions come from.
- DD's PhD Thesis, Chapter 2 contains some basic advice and intuitions. This is more succinctly available as The Kernel Cookbook.
- A quick skim of Section 2 of *Structure Discovery in Nonparametric Regression through Compositional Kernel Search* may be helpful.
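A small sketch may help with the idea of composing kernels: sums and products of valid covariance functions are again valid covariance functions, so inductive biases can be combined. The kernel definitions below follow common textbook forms, and the function names, hyperparameters, and input grid are assumptions chosen for illustration.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel: encodes smoothness."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def periodic(a, b, period=1.0, lengthscale=1.0):
    """Periodic kernel: encodes exact repetition with the given period."""
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)

def linear(a, b):
    """Linear kernel: encodes a linear trend through the origin."""
    return a[:, None] * b[None, :]

x = np.linspace(0.0, 4.0, 50)

# Product: periodic structure that is allowed to drift slowly ("locally periodic").
locally_periodic = rbf(x, x, lengthscale=2.0) * periodic(x, x, period=1.0)
# Sum: a linear trend plus an independent periodic component.
trend_plus_periodic = linear(x, x) + periodic(x, x, period=1.0)

# Both Gram matrices should be positive semi-definite up to round-off.
min_eigs = [np.linalg.eigvalsh(K).min() for K in (locally_periodic, trend_plus_periodic)]
print(min_eigs)
```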

Monte Carlo algorithms are used for asymptotically exact inference in scenarios where closed-form inference is not possible.

- PRML Chapter 11.1 ^{@bishop2006pattern}
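The core idea of simple Monte Carlo fits in a few lines: an expectation is approximated by an average over independent samples, with an error that shrinks as one over the square root of the number of samples. The sketch below uses an assumed toy problem, the second moment of a standard normal (whose true value is 1), and the helper name `monte_carlo_expectation` is hypothetical.

```python
import numpy as np

def monte_carlo_expectation(f, sampler, num_samples):
    """Estimate E[f(x)] from independent samples, with a standard error."""
    samples = sampler(num_samples)
    values = f(samples)
    estimate = values.mean()
    std_error = values.std(ddof=1) / np.sqrt(num_samples)
    return estimate, std_error

rng = np.random.default_rng(0)
estimate, std_error = monte_carlo_expectation(
    f=lambda x: x**2,                         # E[x^2] = 1 under N(0, 1)
    sampler=lambda n: rng.standard_normal(n),
    num_samples=100_000,
)
print(f"estimate = {estimate:.4f} +/- {std_error:.4f}")
```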

The simple Monte Carlo algorithms rely on *independent* samples from a target distribution to be useful. Relaxing the independence assumption leads to
correlated samples via the Markov Chain Monte Carlo (MCMC) family of algorithms.

- IM's PhD Thesis, Chapters 1 and 2, is arguably the best introduction to the topic of MCMC.
- Betancourt's *A Conceptual Introduction to Hamiltonian Monte Carlo* is the best introduction to HMC.
- PRML Chapter 11.2 ^{@bishop2006pattern}
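As a companion to these readings, here is a minimal random-walk Metropolis sampler, one of the simplest members of the MCMC family. It is a sketch under assumptions made here: a one-dimensional standard normal target specified only through its unnormalized log density, a Gaussian proposal, and an illustrative step size. It is not HMC and is not meant to be efficient.

```python
import numpy as np

def random_walk_metropolis(log_density, x0, num_steps, step_size, rng):
    """Draw correlated samples from a 1-D target given its unnormalized log density."""
    samples = np.empty(num_steps)
    x, log_p = x0, log_density(x0)
    num_accepted = 0
    for t in range(num_steps):
        proposal = x + step_size * rng.standard_normal()  # symmetric proposal
        log_p_prop = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if np.log(rng.uniform()) < log_p_prop - log_p:
            x, log_p = proposal, log_p_prop
            num_accepted += 1
        samples[t] = x  # a rejected proposal repeats the current state
    return samples, num_accepted / num_steps

rng = np.random.default_rng(0)
log_density = lambda x: -0.5 * x**2  # standard normal, up to a constant
samples, accept_rate = random_walk_metropolis(
    log_density, x0=0.0, num_steps=20_000, step_size=2.4, rng=rng
)
print(f"mean = {samples.mean():.3f}, var = {samples.var():.3f}, accept = {accept_rate:.2f}")
```

Because consecutive states are correlated, the effective number of samples is smaller than `num_steps`, which is why the error of these estimates shrinks more slowly than in the independent case.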

The following readings are only worth reading after one has played more closely with MCMC algorithms.

- Charles Geyer's Burn-In is Unnecessary

PRML Chapter 10 ^{@bishop2006pattern} shows the zero-forcing behavior of the KL term involved
in variational inference, which as a result underestimates the uncertainty when
unimodal approximations are used for multimodal true distributions. This,
however, should not be considered a law of the universe, but only a rule of
thumb, as clarified by Turner et al. in *Counterexamples to variational free energy compactness folk theorems*.
Rainforth et al. show that tighter variational bounds are not necessarily better.
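The zero-forcing behavior can be checked numerically in one dimension. The sketch below is an illustration under assumed settings: the target is an equal mixture of `N(-3, 1)` and `N(3, 1)`, the approximation is a single Gaussian, and both KL directions are evaluated by brute-force integration on a grid rather than by any variational algorithm. Minimizing `KL(q || p)` is done by a coarse grid search over the mean and standard deviation of `q`, while minimizing `KL(p || q)` over Gaussians reduces to matching the first two moments of `p`.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]

def log_normal(x, mean, std):
    """Log density of N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

# Bimodal target (assumed for illustration): 0.5 * N(-3, 1) + 0.5 * N(3, 1).
log_p = np.logaddexp(log_normal(x, -3.0, 1.0), log_normal(x, 3.0, 1.0)) + np.log(0.5)
p = np.exp(log_p)

def reverse_kl(mean, std):
    """KL(q || p) for q = N(mean, std^2), by numerical integration."""
    log_q = log_normal(x, mean, std)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx

# Coarse grid search over the parameters of q.
means = np.linspace(-4.0, 4.0, 41)
stds = np.linspace(0.5, 4.0, 36)
kl_grid = np.array([[reverse_kl(m, s) for s in stds] for m in means])
i, j = np.unravel_index(np.argmin(kl_grid), kl_grid.shape)
rev_mean, rev_std = means[i], stds[j]

# KL(p || q) over Gaussians is minimized by moment matching.
fwd_mean = np.sum(x * p) * dx
fwd_std = np.sqrt(np.sum((x - fwd_mean) ** 2 * p) * dx)

print(f"reverse KL fit: mean = {rev_mean:+.2f}, std = {rev_std:.2f}")
print(f"forward KL fit: mean = {fwd_mean:+.2f}, std = {fwd_std:.2f}")
```

In this particular setup the reverse-KL fit locks onto a single mode with a narrow width, whereas the moment-matched fit covers both modes. As the references above caution, this is a property of this example and should not be read as a general law.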

Cutting-edge research is a good way to sense where the field is headed. Here are a few venues that I occasionally sift through.

- Bayesian Analysis: An electronic journal by the ISBA.
- Bayesian Deep Learning: A regular NeurIPS workshop.
- Symposium on Advances in Approximate Bayesian Inference: A regular NeurIPS workshop grown into an independent symposium.

I'm inspired by
Yingzhen Li's resourceful
document on *Topics in Approximate Inference* (2017).
Many of the interesting references also come from discussions with my advisor,
Andrew Gordon Wilson.

- Bishop, C.M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics).
- MacKay, D. (2004). Information Theory, Inference, and Learning Algorithms. IEEE Transactions on Information Theory, 50, 2544-2545.
- MacKay, D. (1998). Introduction to Gaussian processes.
- Rasmussen, C., & Williams, C.K. (2009). Gaussian Processes for Machine Learning. Adaptive computation and machine learning.
- Schölkopf, B., & Smola, A. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Journal of the American Statistical Association, 98, 489-489.