Learning What’s in a Name with Graphical Models
“The UK” is a country, but “The UK Department of Transport” is an organization within that country. In a named entity recognition (NER) task, where we want to label each word with a name tag (organization/person/location/other/not a name), how can a computer model know one from the other?
In this case, contextual information is key. The fact that “The UK” is followed by “Department” is compelling evidence that when taken together the phrase refers to an organization. Sequence models — machine learning systems designed to take sequences of data as input, can recognize and put such relationships to productive use. Rather than making isolated predictions based on individual words or word groups, they take the given sequence as a combined unit, model the dependencies between words in that sequences, and depending on the problem, can return the most likely label sequence.
In this article, we’ll explore three families of sequence models that are remarkably successful at NER: Hidden Markov Models (HMMs), Maximum-Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs). All three are probabilistic graphical models, which we’ll cover in the next section. Each simultaneously builds upon the success of the one before and addresses specific, fundamental problems affecting its predecessor. Therein lies a story of iterative development, as models become more capable and theories more generalized and expressive.
Graphical modeling is a robust framework for representing probabilistic models. Complex multivariate probability distributions can be expressed with compact graphs that are easier to understand and interpret.
Let’s start with a simple example with two random variables, and . Assume that is conditionally dependent on . Through a canonical application of the chain rule, the joint distribution of and is:
The shorthand p(a) means p(A = a), that is, the probability of variable A taking value a.
This is a simple example with two factors in the right-hand side. Add more variables, however, and the result can get messy fast. To see this, assume that there are two more variables, and , and that is conditionally dependent on , , and . The factorization becomes:
The relationship between variables is more opaque, hidden behind second-order dependencies. For example, while it’s clear that is directly dependent on , we may miss the fact that there is another, second-order dependency between the two ( is dependent on , which in turn is dependent on ).
Directed Acyclic Graphs, or DAGs, offer a natural remedy to this problem. Each factor in the equation can be represented by a node. An arrow indicates conditional dependence. The resulting graph would look like:
With this graph, it’s easier to construct a story of how , , and are sampled. The process proceeds in topological order, for example → → → , to ensure that all dependencies have been resolved by the time each variable is sampled.
Below is what a sampled population of the given distributions would look like. For the sake of demonstration, many distribution parameters are modifiable — in reality these are the quantities that need to be learned from training data.
For more detailed accounts of probabilistic graphical models, consider reading the textbooks Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman [1] and Probabilistic Reasoning in Intelligent Systems by Judea Pearl [2].
Hidden Markov Models (HMMs) are an early class of probabilistic graphical models representing partially hidden (unobserved) sequences of events. Structurally, they are built with two main layers, one observed () and one hidden ():
HMMs have been successfully applied to a wide range of problems, including gene analysis [3], information extraction [4, 5], speech recognition [6], and named entity recognition [7].
The hidden layer is assumed to be a Markov process: a chain of events in which each event’s probability depends on the state of only the preceding event. More formally, given a sequence of random events , ,…, , the Markov assumption states:
In a graph, this translates to a linear chain of events where each event has one arrow pointing towards it (except for the first event) and one pointing away from it (except for the last):
A second assumption that HMMs make is time-homogeneity: that the probability of transitioning from one state to the next is independent of time. For example, if the last state was “O”, then the probability of the current state being “B-LOC” is the same value regardless of what time step we are at. In formal terms: