Hidden Markov Models

Recently, I’ve taken up reading on speech recognition and it’s good to refresh some basic models that have been traditionally a part of ASR. There are great videos on fundamentals of speech and signals in the course on Foundations for Speech Processing. I’m also looking forward to read section E of the Springer Handbook of Speech Processing. Hidden Markov Models have been around for quite some time now and are still being used in conjuction with deep neural networks to achieve state of the art ASR performance. There are numerous (and overwhelming) resources on HMMs but I find the following chapter from the Speech and Language Processing book by Daniel Jurafsky & James H. Martin, an excellent resource. To my understanding, this piece of text is beginner friendly and serves as a good refresher for someone looking to glance through theory behind HMMs.

The chapter is a self-contained resource and I’ve highlighed a few places that need more focussed attention of the reader. Enjoy.

Rotate screen
Icons made by Smashicons from www.flaticon.com is licensed by CC 3.0 BY

Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright c
2016. All
rights reserved. Draft of August 7, 2017.
9Hidden Markov Models
Her sister was called Tatiana.
For the first time with such a name
the tender pages of a novel,
we’ll whimsically grace.
Pushkin, Eugene Onegin, in the Nabokov translation
Alexander Pushkin’s novel in verse, Eugene Onegin, serialized in the early 19th cen-
tury, tells of the young dandy Onegin, his rejection of the love of young Tatiana, his
duel with his friend Lenski, and his later regret for both mistakes. But the novel is
mainly beloved for its style and structure rather than its plot. Among other inter-
esting structural innovations, the novel is written in a form now known as the One-
gin stanza, iambic tetrameter with an unusual rhyme scheme. These elements have
caused complications and controversy in its translation into other languages. Many
of the translations have been in verse, but Nabokov famously translated it strictly
literally into English prose. The issue of its translation and the tension between
literal and verse translations have inspired much commentary—see, for example,
Hofstadter (1997).
In 1913, A. A. Markov asked a less controversial question about Pushkin’s text:
could we use frequency counts from the text to help compute the probability that the
next letter in sequence would be a vowel? In this chapter we introduce a descendant
of Markov’s model that is a key model for language processing, the hidden Markov
model or HMM.
The HMM is a sequence model. A sequence model or sequence classifier is a
sequence model
model whose job is to assign a label or class to each unit in a sequence, thus mapping
a sequence of observations to a sequence of labels. An HMM is a probabilistic
sequence model: given a sequence of units (words, letters, morphemes, sentences,
whatever), they compute a probability distribution over possible sequences of labels
and choose the best label sequence.
Sequence labeling tasks come up throughout speech and language processing,
a fact that isn’t too surprising if we consider that language consists of sequences
at many representational levels. These include part-of-speech tagging (Chapter 10)
named entity tagging (Chapter 20), and speech recognition (Chapter 31) among oth-
In this chapter we present the mathematics of the HMM, beginning with the
Markov chain and then including the main three constituent algorithms: the Viterbi
algorithm, the Forward algorithm, and the Baum-Welch or EM algorithm for unsu-
pervised (or semi-supervised) learning. In the following chapter we’ll see the HMM
applied to the task of part-of-speech tagging.
9.1 Markov Chains
The hidden Markov model is one of the most important machine learning models
in speech and language processing. To define it properly, we need to first introduce
the Markov chain, sometimes called the observed Markov model. Markov chains
and hidden Markov models are both extensions of the finite automata of Chapter 3.
Recall that a weighted finite automaton is defined by a set of states and a set of
transitions between states, with each arc associated with a weight. A Markov chain
Markov chain
is a special case of a weighted automaton in which weights are probabilities (the
probabilities on all arcs leaving a node must sum to 1) and in which the input se-
quence uniquely determines which states the automaton will go through. Because
it can’t represent inherently ambiguous problems, a Markov chain is only useful for
assigning probabilities to unambiguous sequences.
a23 a34
(a) (b)
Figure 9.1 A Markov chain for weather (a) and one for words (b). A Markov chain is specified by the
structure, the transition between states, and the start and end states.
Figure 9.1a shows a Markov chain for assigning a probability to a sequence of
weather events, for which the vocabulary consists of HOT,COLD, and WA R M . Fig-
ure 9.1b shows another simple example of a Markov chain for assigning a probability
to a sequence of words w1wn. This Markov chain should be familiar; in fact, it
represents a bigram language model. Given the two models in Fig. 9.1, we can as-
sign a probability to any sequence from our vocabulary. We go over how to do this
First, let’s be more formal and view a Markov chain as a kind of probabilistic
graphical model: a way of representing probabilistic assumptions in a graph. A
Markov chain is specified by the following components:
Q=q1q2qNa set of Nstates
A=a01a02 ...an1ann atransition probability matrix A, each aij rep-
resenting the probability of moving from state i
to state j, s.t. Pn
j=1aij =18i
q0,qFa special start state and end (final) state that are
not associated with observations
Figure 9.1 shows that we represent the states (including start and end states) as
nodes in the graph, and the transitions as edges between nodes.
A Markov chain embodies an important assumption about these probabilities. In
afirst-order Markov chain, the probability of a particular state depends only on the
Markov chain
previous state:
Markov Assumption: P(qi|q1qi1)=P(qi|qi1)(9.1)
Note that because each aij expresses the probability p(qj|qi), the laws of prob-
ability require that the values of the outgoing arcs from a given state must sum to
aij =18i(9.2)
An alternative representation that is sometimes used for Markov chains doesn’t
rely on a start or end state, instead representing the distribution over initial states and
accepting states explicitly:
π=π1,π2,...,πNan initial probability distribution over states. πiis the
probability that the Markov chain will start in state i. Some
states jmay have πj=0, meaning that they cannot be initial
states. Also, Pn
QA ={qx,qy}a set QA Qof legal accepting states
Thus, the probability of state 1 being the first state can be represented either as
a01 or as π1. Note that because each πiexpresses the probability p(qi|START ), all
the πprobabilities must sum to 1:
Before you go on, use the sample probabilities in Fig. 9.2b to compute the prob-
ability of each of the following sequences:
(9.4) hot hot hot hot
(9.5) cold hot cold hot
What does the difference in these probabilities tell you about a real-world weather
fact encoded in Fig. 9.2b?
9.2 The Hidden Markov Model
A Markov chain is useful when we need to compute a probability for a sequence
of events that we can observe in the world. In many cases, however, the events
we are interested in may not be directly observable in the world. For example, in
Chapter 10we’ll introduce the task of part-of-speech tagging, assigning tags like
Noun and Verb to words.
we didn’t observe part-of-speech tags in the world; we saw words and had to in-
fer the correct tags from the word sequence. We call the part-of-speech tags hidden
because they are not observed. The same architecture comes up in speech recogni-
tion; in that case we see acoustic events in the world and have to infer the presence
of “hidden” words that are the underlying causal source of the acoustics. A hidden
Markov model (HMM) allows us to talk about both observed events (like words
Markov model
(a) (b)
Figure 9.2 Another representation of the same Markov chain for weather shown in Fig. 9.1.
Instead of using a special start state with a01 transition probabilities, we use the πvector,
which represents the distribution over starting state probabilities. The figure in (b) shows
sample probabilities.
that we see in the input) and hidden events (like part-of-speech tags) that we think
of as causal factors in our probabilistic model.
To exemplify these models, we’ll use a task conceived of by Jason Eisner (2002).
Imagine that you are a climatologist in the year 2799 studying the history of global
warming. You cannot find any records of the weather in Baltimore, Maryland, for
the summer of 2007, but you do find Jason Eisner’s diary, which lists how many ice
creams Jason ate every day that summer. Our goal is to use these observations to
estimate the temperature every day. We’ll simplify this weather task by assuming
there are only two kinds of days: cold © and hot (H). So the Eisner task is as
Given a sequence of observations O, each observation an integer cor-
responding to the number of ice creams eaten on a given day, figure
out the correct ‘hidden’ sequence Qof weather states (H or C) which
caused Jason to eat the ice cream.
Let’s begin with a formal definition of a hidden Markov model, focusing on how
it differs from a Markov chain. An HMM is specified by the following components:
Q=q1q2qNa set of Nstates
A=a11a12 ...an1ann atransition probability matrix A, each aij rep-
resenting the probability of moving from state i
to state j, s.t. Pn
j=1aij =18i
O=o1o2oTa sequence of Tobservations, each one drawn
from a vocabulary V=v1,v2,...,vV
B=bi(ot)a sequence of observation likelihoods, also
called emission probabilities, each expressing
the probability of an observation otbeing gen-
erated from a state i
q0,qFa special start state and end (final) state that are
not associated with observations, together with
transition probabilities a01a02 ...a0nout of the
start state and a1Fa2FanF into the end state
As we noted for Markov chains, an alternative representation that is sometimes
Definition of a HMM
used for HMMs doesn’t rely on a start or end state, instead representing the distri-
bution over initial and accepting states explicitly. We don’t use the πnotation in this
textbook, but you may see it in the literature1:
π=π1,π2,...,πNan initial probability distribution over states. πiis the
probability that the Markov chain will start in state i. Some
states jmay have πj=0, meaning that they cannot be initial
states. Also, Pn
QA ={qx,qy}a set QA Qof legal accepting states
A first-order hidden Markov model instantiates two simplifying assumptions.
First, as with a first-order Markov chain, the probability of a particular state depends
only on the previous state:
Markov Assumption: P(qi|q1qi1)=P(qi|qi1)(9.6)
Second, the probability of an output observation oidepends only on the state that
produced the observation qiand not on any other states or any other observations:
Output Independence: P(oi|q1qi,…,qT,o1,…,oi,…,oT)=P(oi|qi)(9.7)
Figure 9.3 shows a sample HMM for the ice cream task. The two hidden states
(H and C) correspond to hot and cold weather, and the observations (drawn from the
alphabet O={1,2,3}) correspond to the number of ice creams eaten by Jason on a
given day.
P(1 | COLD) .5
P(2 | COLD) = .4
P(3 | COLD) .1
P(1 | HOT) .2
P(2 | HOT) = .4
P(3 | HOT) .4
Figure 9.3 A hidden Markov model for relating numbers of ice creams eaten by Jason (the
observations) to the weather (H or C, the hidden variables).
Notice that in the HMM in Fig. 9.3, there is a (non-zero) probability of transition-
ing between any two states. Such an HMM is called a fully connected or ergodic
HMM. Sometimes, however, we have HMMs in which many of the transitions be-
Ergodic HMM
tween states have zero probability. For example, in left-to-right (also called Bakis)
Bakis network
HMMs, the state transitions proceed from left to right, as shown in Fig. 9.4. In a
Bakis HMM, no transitions go from a higher-numbered state to a lower-numbered
state (or, more accurately, any transitions from a higher-numbered state to a lower-
numbered state have zero probability). Bakis HMMs are generally used to model
temporal processes like speech; we show more of them in Chapter 31.
1It is also possible to have HMMs without final states or explicit accepting states. Such HMMs define a
set of probability distributions, one distribution per observation sequence length, just as language models
do when they don’t have explicit end symbols. This isn’t a problem since for most tasks in speech and
language processing the lengths of the observations are fixed.
made in
first order HMM
Figure 9.4 Two 4-state hidden Markov models; a left-to-right (Bakis) HMM on the left and
a fully connected (ergodic) HMM on the right. In the Bakis model, all transitions not shown
have zero probability.
Now that we have seen the structure of an HMM, we turn to algorithms for
computing things with them. An influential tutorial by Rabiner (1989), based on
tutorials by Jack Ferguson in the 1960s, introduced the idea that hidden Markov
models should be characterized by three fundamental problems:
Problem 1 (Likelihood): Given an HMM λ=(A,B)and an observation se-
quence O, determine the likelihood P(O|λ).
Problem 2 (Decoding): Given an observation sequence Oand an HMM λ=
(A,B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence Oand the set of states
in the HMM, learn the HMM parameters Aand B.
We already saw an example of Problem 2 in Chapter 10. In the next three sec-
tions we introduce all three problems more formally.
9.3 Likelihood Computation: The Forward Algorithm
Our first problem is to compute the likelihood of a particular observation sequence.
For example, given the ice-cream eating HMM in Fig. 9.3, what is the probability of
the sequence 313? More formally:
Computing Likelihood: Given an HMM λ=(A,B)and an observa-
tion sequence O, determine the likelihood P(O|λ).
For a Markov chain, where the surface observations are the same as the hidden
events, we could compute the probability of 313just by following the states labeled
313and multiplying the probabilities along the arcs. For a hidden Markov model,
things are not so simple. We want to determine the probability of an ice-cream
observation sequence like 313, but we don’t know what the hidden state sequence
Let’s start with a slightly simpler situation. Suppose we already knew the weather
and wanted to predict how much ice cream Jason would eat. This is a useful part
of many HMM tasks. For a given hidden state sequence (e.g., hot hot cold), we can
easily compute the output likelihood of 313.
Let’s see how. First, recall that for hidden Markov models, each hidden state
produces only a single observation. Thus, the sequence of hidden states and the
sequence of observations have the same length.2
Given this one-to-one mapping and the Markov assumptions expressed in Eq. 9.6,
for a particular hidden state sequence Q=q0,q1,q2,...,qTand an observation se-
quence O=o1,o2,...,oT, the likelihood of the observation sequence is
The computation of the forward probability for our ice-cream observation 313
from one possible hidden state sequence hot hot cold is shown in Eq. 9.9. Figure 9.5
shows a graphic representation of this computation.
P(313|hot hot cold)=P(3|hot)P(1|hot)P(3|cold)(9.9)
1 3
.2 .1
Figure 9.5 The computation of the observation likelihood for the ice-cream events 313
given the hidden state sequence hot hot cold.
But of course, we don’t actually know what the hidden state (weather) sequence
was. We’ll need to compute the probability of ice-cream events 313instead by
summing over all possible weather sequences, weighted by their probability. First,
let’s compute the joint probability of being in a particular weather sequence Qand
generating a particular sequence Oof ice-cream events. In general, this is
The computation of the joint probability of our ice-cream observation 313and
one possible hidden state sequence hot hot cold is shown in Eq. 9.11. Figure 9.6
shows a graphic representation of this computation.
P(313,hot hot cold)=P(hot|start)P(hot|hot)P(cold|hot)
Now that we know how to compute the joint probability of the observations
with a particular hidden state sequence, we can compute the total probability of the
observations just by summing over all possible hidden state sequences:
2In a variant of HMMs called segmental HMMs (in speech recognition) or semi-HMMs (in text pro-
cessing) this one-to-one mapping between the length of the hidden state sequence and the length of the
observation sequence does not hold.
This enumeration makes it