It wasn’t until I entered CMU that I became aware of the great debate between the frequentist world and the Bayesian world. In the simplest terms, the former models the world assuming fixed (but unknown) parameters \( \theta \), while the latter treats those parameters themselves as uncertain. Maximum likelihood estimation is at the heart of machine learning, and hence it deserves a lucid explanation. At first I wanted to write a post just about the Fisher information matrix, but then I stumbled across a writeup by Konstantin Kashin titled Statistical Inference: Maximum Likelihood Estimation. I believe it is self-contained and explains things well. However, for the benefit of the reader, I have added sidenotes to make some of the derivations clearer. Sections 1 and 2 cover the basics of MLE, which beginners should definitely go through. Intermediate and advanced readers can jump to Section 3 for more commentary and notes.
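To make the frequentist idea of a fixed parameter concrete before diving into the writeup, here is a minimal sketch (not part of Kashin's notes) of maximum likelihood estimation for a coin with a constant, unknown bias \( \theta \): we pick the \( \theta \) that makes the observed flips most probable. The data and the grid of candidate values are made up for illustration.

```python
import math

def bernoulli_log_likelihood(theta, flips):
    """Log-likelihood of a fixed coin bias theta given observed flips (1 = heads)."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in flips)

# Hypothetical data: 7 heads in 10 flips.
flips = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# A simple grid search over candidate theta values; the closed-form MLE
# for a Bernoulli parameter is the sample mean, so the search should
# land on sum(flips) / len(flips) = 0.7.
candidates = [i / 100 for i in range(1, 100)]
theta_hat = max(candidates, key=lambda t: bernoulli_log_likelihood(t, flips))
print(theta_hat)  # 0.7
```

A Bayesian would instead put a prior on \( \theta \) and report a posterior distribution over it, rather than a single point estimate like `theta_hat`.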