It wasn’t until I entered CMU that I became aware of the great debate between the *frequentist* world and the *Bayesian* world. In the simplest terms, the former models the world assuming *fixed but unknown* parameters \( \theta \), while the latter treats those parameters themselves as uncertain. Maximum likelihood estimation is at the heart of machine learning, and hence it deserves a lucid explanation. At first I wanted to write a post just about the Fisher information matrix, but then I stumbled across a writeup by Konstantin Kashin titled *Statistical Inference: Maximum Likelihood Estimation*. I believe this piece is self-contained and explains things well. However, for the benefit of the reader, I have added sidenotes to make some of the derivations clearer. Sections 1 and 2 cover the basics of MLE, which beginners should definitely go through. Intermediate and advanced readers can jump to Section 3 for more commentary and notes.
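To make the frequentist idea concrete before diving in: MLE picks the fixed parameter value \( \theta \) that maximizes the likelihood of the observed data. The following is a minimal sketch, not from Kashin's writeup; the Gaussian model, the true mean of 5.0, and the known standard deviation of 2.0 are illustrative assumptions. For this model the MLE of the mean has a closed form: the sample mean.

```python
import math
import random

# Simulate data from a Gaussian with a fixed (frequentist-style) but
# "unknown" mean; sigma is assumed known for simplicity.
random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(1000)]

def neg_log_likelihood(mu, xs, sigma=2.0):
    """Negative log-likelihood of N(mu, sigma^2) for the sample xs."""
    return sum(
        (x - mu) ** 2 / (2 * sigma ** 2)
        + math.log(sigma * math.sqrt(2 * math.pi))
        for x in xs
    )

# For a Gaussian with known sigma, maximizing the likelihood in mu
# is equivalent to minimizing the sum of squares, whose minimizer is
# the sample mean.
mle_mu = sum(data) / len(data)

# Sanity check: the negative log-likelihood at the MLE is no larger
# than at nearby parameter values.
assert neg_log_likelihood(mle_mu, data) <= neg_log_likelihood(mle_mu + 0.5, data)
assert neg_log_likelihood(mle_mu, data) <= neg_log_likelihood(mle_mu - 0.5, data)
```

A Bayesian would instead place a prior on \( \mu \) and report a posterior distribution rather than a single point estimate.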