I want this to be a helpful resource for newcomers to the field of Bayesian
machine learning. The objective here is to collect relevant literature
that brings insights into modern inference methods. Of course, this requires me
to extract those insights myself, to be sure that the papers I include are
meaningful. Therefore, this post remains a living document.
I will post commentary, when I can, on what to expect when reading the
material. Often, however, I will simply list materials in what I consider a
recommended reading order. An overall sequence in which the topics themselves
should be studied is harder to prescribe. I do,
however, suggest that this not be your first excursion into machine learning.
When diving deep into a topic, we often find ourselves too close to the action.
It is important to start with, and keep in mind, the bigger picture. I recommend
the following to get a feel for the fundamental thesis behind being Bayesian.
It is not a silver bullet but a set of common-sense principles to abide by.
PRML Chapter 1 is the place to start for a succinct treatment of the topic.
The ideas can be further reinforced through DJCM's PhD Thesis, Chapter 2.
AGW's PhD Thesis, Chapter 1, provides a broader background on the big picture.
Less so now, but the subjectivity of the prior is often brought into question.
This is unfortunately a misdirected argument: without subjectivity, "learning"
cannot happen and is in general an ill-defined problem to tackle. Subjective
priors are, however, not the only thing that being Bayesian brings to the
table.
Many people, including seasoned researchers, have the wrong idea of what it
means to be Bayesian. Merely putting prior assumptions on a model does not make
one a Bayesian. In that sense, everyone is a Bayesian, because every algorithm
starts from some implicit priors (not to be confused with statistical biases).
I die a little when people equate Bayesian methods with simply regularizing
using the prior; regularization is a side effect that is often misconstrued as
the point. For instance, take a look at this fun post
by Dan Simpson, "The king must die",
on why assuming a Laplace prior does not by itself imply sparse solutions,
unlike its popular maximum a posteriori variant known as the Lasso.
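To see the distinction concretely, here is a small numerical sketch of my own (not taken from Simpson's post; the observation and scales are made up): for a single Gaussian observation under a Laplace prior, the MAP estimate is exactly zero, while the posterior mean, the fully Bayesian point estimate, is not.

```python
import numpy as np

y, sigma, b = 0.5, 1.0, 1.0            # one observation, noise scale, prior scale
theta = np.linspace(-10, 10, 200001)   # grid for numerical integration
dx = theta[1] - theta[0]

log_lik = -0.5 * (y - theta) ** 2 / sigma**2   # Gaussian likelihood (up to a constant)
log_prior = -np.abs(theta) / b                 # Laplace prior (up to a constant)
log_post = log_lik + log_prior

post = np.exp(log_post - log_post.max())
post /= post.sum() * dx                        # normalize on the grid

theta_map = theta[np.argmax(log_post)]         # the Lasso-style point estimate
theta_mean = (theta * post).sum() * dx         # the Bayesian point estimate

print(f"MAP estimate:   {theta_map:.3f}")      # exactly zero: "sparse"
print(f"Posterior mean: {theta_mean:.3f}")     # non-zero: no sparsity
```

The soft-thresholding behaviour belongs to the MAP optimization, not to the posterior itself, which places no mass at exactly zero.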
When explaining the data using a model, we usually have many competing
hypotheses available, naturally leading to the model selection problem.
Occam's razor principle advocates that we must choose the simplest possible
explanation. Bayesian inference shines here as well by automatically embodying
this "principle of parsimony".
Seeing the ever-increasing complexity of neural network models, one may doubt the
validity of Occam's razor, perhaps sensing a contradiction. Rasmussen & Ghahramani resolve this through a simple experiment. Maddox, Benton et al. provide an excellent realization of this principle for large models.
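The mechanism at work is the marginal likelihood (evidence), which trades data fit against model complexity automatically. Here is a small sketch in the spirit of the Rasmussen & Ghahramani experiment (my own construction, with made-up data and hyperparameters): Bayesian polynomial regression, where the evidence is available in closed form as y ~ N(0, α²ΦΦᵀ + σ²I).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 0.5 * x + 0.1 * rng.standard_normal(x.size)   # data from a simple linear truth

def log_evidence(degree, alpha=1.0, sigma=0.1):
    """Closed-form log marginal likelihood of Bayesian polynomial regression
    with prior w ~ N(0, alpha^2 I) and noise N(0, sigma^2)."""
    Phi = np.vander(x, degree + 1, increasing=True)          # polynomial features
    C = alpha**2 * Phi @ Phi.T + sigma**2 * np.eye(x.size)   # marginal covariance
    sign, logdet = np.linalg.slogdet(2 * np.pi * C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet)

for d in (0, 1, 3, 9):
    print(f"degree {d}: log evidence = {log_evidence(d):.1f}")
```

The degree-1 model attains the highest evidence: the degree-0 model cannot explain the slope, while higher-degree models spread their prior mass over many datasets they never see, and are penalized for it without any explicit complexity term.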
Bayesian model averaging (BMA) is another perk enjoyed by Bayesians, which
allows for soft model selection. Andrew G. Wilson
clarifies the value it adds in a technical report titled The Case for Bayesian Deep Learning. Unfortunately, BMA is often misconstrued as
model combination. Minka dispels any misunderstandings
in this regard.
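A toy sketch of Minka's point (my own construction, with made-up coin biases): as data accumulates, the BMA posterior weights concentrate on a single hypothesis rather than settling on a fixed mixture, which is why BMA is soft model selection, not model combination.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.7
models = [0.5, 0.7]          # two candidate coin biases, equal prior odds

for n in (10, 100, 1000):
    heads = rng.binomial(n, p_true)
    # log p(data | model) for a Bernoulli model with fixed bias p
    log_lik = np.array([heads * np.log(p) + (n - heads) * np.log(1 - p)
                        for p in models])
    w = np.exp(log_lik - log_lik.max())
    w /= w.sum()             # posterior model probabilities
    print(f"n={n:4d}  weights={np.round(w, 3)}")
```

With enough data, essentially all posterior weight sits on the bias-0.7 model; a genuine model combination would instead keep a non-degenerate mixture.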
Gaussian Process (GP) research interestingly started as a consequence of the
popularity and early success of neural networks.
DJCM's Introduction, Sections 1-6, to understand where GPs come from. A single reading before the next item should help calibrate the mindset. I also recommend returning to this once more after the next reading.
GPML Chapter 1, 2, 3  for a detailed treatment on the usual regression and classification problems.
LWK Chapter 1  is worth a read for a big picture view of kernel machines. It does not, however, present a Bayesian perspective, but an optimization perspective. Nevertheless, it is a useful perspective.
GPML Chapter 5  to understand how model selection
behaves with GPs, and key caveats to look out for, especially regarding Bayesian Model Averaging. It also has a nice example of a non-trivial composite kernel.
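As a companion to the readings above, here is a minimal numpy sketch of GP regression following the standard predictive equations from GPML Chapter 2 (the data and kernel hyperparameters are made up for illustration):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(a, b)."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 10)                       # training inputs
y = np.sin(X) + 0.1 * rng.standard_normal(X.size)
Xs = np.linspace(-3, 3, 50)                      # test inputs
noise = 0.1**2

K = rbf(X, X) + noise * np.eye(X.size)           # train covariance + noise
Ks = rbf(X, Xs)                                  # train-test cross-covariance
Kss = rbf(Xs, Xs)                                # test covariance

L = np.linalg.cholesky(K)                        # Cholesky for numerical stability
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = Ks.T @ alpha                              # posterior predictive mean
v = np.linalg.solve(L, Ks)
var = np.diag(Kss - v.T @ v)                     # posterior marginal variance

print("posterior mean at the central test point:", round(float(mean[len(Xs) // 2]), 2))
```

Note how the predictive variance is largest away from the training inputs and never exceeds the prior variance, a direct consequence of conditioning the prior on data.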
Sparse Gaussian Processes
The non-parametric nature of Gaussian Processes is slightly at odds with their
scalability (exact inference scales cubically in the number of data points),
but considerable progress has been made through first principles in this
regard as well.
Covariance functions are the way we describe our inductive biases in a Gaussian
Process model and hence deserve a separate section altogether.
GPML Chapter 4 provides a broad discussion of covariance functions and how they can be constructed and combined.
DD's PhD Thesis, Chapter 2, contains some basic advice and intuitions. This is more succinctly available as The Kernel Cookbook.
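One recurring piece of advice from these readings is that sums and products of valid covariance functions are again valid, which is how structured inductive biases are composed. A small sketch (my own, with made-up hyperparameters) building a "locally periodic" kernel and checking that the resulting covariance matrix is positive semi-definite:

```python
import numpy as np

def rbf(x, z, ell=1.0):
    """Squared-exponential kernel: smooth, local similarity."""
    return np.exp(-0.5 * ((x[:, None] - z[None, :]) / ell) ** 2)

def periodic(x, z, period=1.0, ell=1.0):
    """Exact-periodicity kernel (MacKay's construction)."""
    d = np.pi * np.abs(x[:, None] - z[None, :]) / period
    return np.exp(-2 * np.sin(d) ** 2 / ell**2)

x = np.linspace(-2, 2, 40)
K = rbf(x, x, ell=2.0) * periodic(x, x)   # product: periodicity that decays with distance

# A valid covariance matrix must be symmetric positive semi-definite.
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())
```

The element-wise product of two valid kernel matrices stays positive semi-definite (the Schur product theorem), so the composition is itself a legitimate covariance function.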
Monte Carlo algorithms
Monte Carlo algorithms are used for asymptotically exact inference in
scenarios where closed-form inference is not possible.
Simple Monte Carlo algorithms rely on independent samples from a target
distribution. Relaxing the independence assumption leads to correlated
samples via the Markov chain Monte Carlo (MCMC) family of algorithms.
IM's PhD Thesis, Chapters 1 and 2, is arguably the best introduction to the topic.
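For a concrete feel, here is a bare-bones random-walk Metropolis sampler (a sketch of my own, not taken from the thesis) targeting a standard normal known only up to its normalizing constant:

```python
import numpy as np

def log_target(x):
    """Unnormalized log density of N(0, 1)."""
    return -0.5 * x**2

rng = np.random.default_rng(0)
n_samples, step = 50_000, 1.0
x = 0.0
samples = np.empty(n_samples)

for i in range(n_samples):
    proposal = x + step * rng.standard_normal()   # symmetric random-walk proposal
    # Accept with probability min(1, pi(proposal) / pi(x))
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    samples[i] = x                                # rejected moves repeat the state

print(f"sample mean = {samples.mean():.2f}, sample variance = {samples.var():.2f}")
```

The chain's samples are correlated, unlike simple Monte Carlo, yet their empirical moments still converge to those of the target, which is the whole point of the MCMC construction.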
Bishop, C.M., 2006. Pattern recognition and machine learning, springer.
Duvenaud, D., 2014. Automatic model construction with Gaussian processes. University of Cambridge. Available at: https://www.cs.toronto.edu/~duvenaud/thesis.pdf.
MacKay, D.J., 1998. Introduction to Gaussian processes. NATO ASI Series F Computer and Systems Sciences, 168, pp.133–166. Available at: http://www.inference.org.uk/mackay/gpB.pdf.
MacKay, D.J.C., 2003. Information theory, inference and learning algorithms, Cambridge University Press.
MacKay, D.J.C., 1992. Bayesian Methods for Adaptive Models. Available at: http://www.inference.org.uk/mackay/PhD.html.
Maddox, W.J., Benton, G. & Wilson, A.G., 2020. Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited. arXiv preprint arXiv:2003.02139.
Minka, T.P., 2002. Bayesian model averaging is not model combination. https://tminka.github.io/papers/minka-bma-isnt-mc.pdf.
Murray, I.A., 2007. Advances in Markov chain Monte Carlo methods. University of London. Available at: http://homepages.inf.ed.ac.uk/imurray2/pub/07thesis/murray_thesis_2007.pdf.
Rasmussen, C.E. & Ghahramani, Z., 2001. Occam’s razor. In Advances in neural information processing systems. pp. 294–300.
Schölkopf, B. & Smola, A.J., 2018. Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT Press (Adaptive Computation and Machine Learning series).
Williams, C.K. & Rasmussen, C.E., 2006. Gaussian processes for machine learning, MIT press Cambridge, MA. Available at: http://www.gaussianprocess.org/gpml/.
Wilson, A.G., 2014. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. University of Cambridge. Available at: http://www.cs.cmu.edu/%7Eandrewgw/andrewgwthesis.pdf.