Speech and Language Processing. Daniel Jurafsky & James H. Martin. Draft of August 7, 2017.
9 Hidden Markov Models
Her sister was called Tatiana.
For the first time with such a name
the tender pages of a novel,
we’ll whimsically grace.
Pushkin, Eugene Onegin, in the Nabokov translation
Alexander Pushkin’s novel in verse, Eugene Onegin, serialized in the early
19th century, tells of the young dandy Onegin, his rejection of the love of young Tatiana, his
duel with his friend Lenski, and his later regret for both mistakes. But the novel is
mainly beloved for its style and structure rather than its plot. Among other
interesting structural innovations, the novel is written in a form now known as
the Onegin stanza, iambic tetrameter with an unusual rhyme scheme. These elements have
caused complications and controversy in its translation into other languages. Many
of the translations have been in verse, but Nabokov famously translated it strictly
literally into English prose. The issue of its translation and the tension between
literal and verse translations have inspired much commentary.
In 1913, A. A. Markov asked a less controversial question about Pushkin’s text:
could we use frequency counts from the text to help compute the probability that the
next letter in sequence would be a vowel? In this chapter we introduce a descendant
of Markov’s model that is a key model for language processing, the hidden Markov
model or HMM.
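To make Markov's question concrete, here is a minimal sketch in Python. It is
not Markov's actual procedure, and the sample string is a toy stand-in for
Pushkin's text: each letter is classified as vowel or consonant, and bigram
counts give a relative-frequency estimate of the probability that the next
letter is a vowel, conditioned on the class of the current letter.

    from collections import Counter

    text = "onegin rejects tatiana and duels lenski"  # toy sample, not Pushkin
    vowels = set("aeiou")

    # Keep only letters, then count (current-class, next-class) bigrams.
    letters = [c for c in text if c.isalpha()]
    counts = Counter(
        (cur in vowels, nxt in vowels) for cur, nxt in zip(letters, letters[1:])
    )

    # Estimate P(next letter is a vowel | class of current letter).
    for cur_is_vowel in (True, False):
        total = counts[(cur_is_vowel, True)] + counts[(cur_is_vowel, False)]
        p = counts[(cur_is_vowel, True)] / total
        label = "vowel" if cur_is_vowel else "consonant"
        print(f"P(next is vowel | current is {label}) = {p:.2f}")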
The HMM is a sequence model. A sequence model or sequence classifier is a
model whose job is to assign a label or class to each unit in a sequence, thus mapping
a sequence of observations to a sequence of labels. An HMM is a probabilistic
sequence model: given a sequence of units (words, letters, morphemes, sentences,
whatever), it computes a probability distribution over possible sequences of labels
and chooses the best label sequence.
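As a minimal sketch of what it means to compute a distribution over label
sequences and choose the best one, the Python below enumerates every label
sequence for a toy two-state HMM (the states, observations, and all
probabilities are invented purely for illustration), scores each sequence by
its joint probability with the observations, and keeps the argmax. The Viterbi
algorithm presented later in this chapter computes the same argmax without
brute-force enumeration.

    from itertools import product

    states = ("HOT", "COLD")                      # hidden labels (invented)
    start = {"HOT": 0.8, "COLD": 0.2}             # P(first label)
    trans = {"HOT":  {"HOT": 0.6, "COLD": 0.4},   # P(next label | label)
             "COLD": {"HOT": 0.5, "COLD": 0.5}}
    emit = {"HOT":  {1: 0.2, 2: 0.4, 3: 0.4},     # P(observation | label)
            "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

    obs = [3, 1, 3]                               # the observation sequence

    def joint(labels, observations):
        """Joint probability P(labels, observations) under the toy HMM."""
        p = start[labels[0]] * emit[labels[0]][observations[0]]
        for prev, cur, o in zip(labels, labels[1:], observations[1:]):
            p *= trans[prev][cur] * emit[cur][o]
        return p

    # Enumerate all |states|^len(obs) label sequences and keep the best.
    best = max(product(states, repeat=len(obs)), key=lambda seq: joint(seq, obs))
    print(best, joint(best, obs))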
Sequence labeling tasks come up throughout speech and language processing,
a fact that isn’t too surprising if we consider that language consists of sequences
at many representational levels. These include part-of-speech tagging (Chapter 10),
named entity tagging (Chapter 20), and speech recognition (Chapter 31), among others.
In this chapter we present the mathematics of the HMM, beginning with the
Markov chain and then introducing its three main constituent algorithms: the Viterbi
algorithm, the Forward algorithm, and the Baum-Welch or EM algorithm for
unsupervised (or semi-supervised) learning. In the following chapter we’ll see the HMM
applied to the task of part-of-speech tagging.