Learning What’s in a Name with Graphical Models

I. Probabilistic Graphical Models

Factorizing Joint Distributions

p(a, b) = p(a) \cdot p(b|a)

Notation: the shorthand p(a) means p(A = a), that is, the probability of variable A taking value a.

p(a, b, c, d) = p(a) \cdot p(b|a) \cdot p(c) \cdot p(d|a, b, c)
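
To make the factorization concrete, here is a minimal sketch (with made-up numbers) that checks the two-variable rule p(a, b) = p(a) · p(b|a) on a toy joint distribution over two binary variables:

```python
# A made-up joint distribution over two binary variables A and B.
p_joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.24, (1, 1): 0.36,
}

# Derive the marginal p(a) and the conditional p(b|a) from the joint.
p_a = {a: sum(p for (ai, _), p in p_joint.items() if ai == a) for a in (0, 1)}
p_b_given_a = {(b, a): p_joint[(a, b)] / p_a[a] for a in (0, 1) for b in (0, 1)}

# Multiplying the factors back together reconstructs the joint exactly.
for (a, b), p in p_joint.items():
    assert abs(p - p_a[a] * p_b_given_a[(b, a)]) < 1e-12
```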

Directed Acyclic Graphs

II. Hidden Markov Models

The Hidden Layer

p(s_i | s_1, s_2,…, s_{i-1}) = p(s_i | s_{i-1}) \\ \footnotesize\textrm{for all $i\,{\scriptscriptstyle \in}\,\{2,…, N\}$}
p(s_i | s_{i-1}) = p(s_{i+1} | s_i) \\ \footnotesize\textrm{for all $i\,{\scriptscriptstyle \in}\,\{2,…, N-1\}$}
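
Together, these two assumptions say that a single transition distribution p(s_i | s_{i-1}) governs the entire hidden layer and is reused at every position. A minimal sketch of what such a table might look like for NER tags (the tag set is the one used in the results below, plus the O tag for words outside any entity; the probabilities are illustrative placeholders, not learned from data):

```python
# One stationary transition table p(s_i | s_{i-1}), shared across all positions.
# The numbers are illustrative placeholders, not trained parameters.
TAGS = ["O", "PER", "LOC", "ORG", "MISC"]

transition = {
    "O":    {"O": 0.90, "PER": 0.03, "LOC": 0.03, "ORG": 0.03, "MISC": 0.01},
    "PER":  {"O": 0.60, "PER": 0.35, "LOC": 0.02, "ORG": 0.02, "MISC": 0.01},
    "LOC":  {"O": 0.70, "PER": 0.02, "LOC": 0.25, "ORG": 0.02, "MISC": 0.01},
    "ORG":  {"O": 0.65, "PER": 0.02, "LOC": 0.02, "ORG": 0.30, "MISC": 0.01},
    "MISC": {"O": 0.75, "PER": 0.02, "LOC": 0.02, "ORG": 0.01, "MISC": 0.20},
}

# Each row is a proper distribution over the next state.
for prev, row in transition.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```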

The Observed Layer

p(o_i | s_1, s_2,…, s_N) = p(o_i | s_i) \\ \footnotesize\textrm{for all $i\,{\scriptscriptstyle \in}\,\{1, 2,…, N\}$}
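
Combined with the transition assumptions above, the emission assumption lets the joint probability of a tag sequence and a sentence factor into an initial-state term, one transition term per step, and one emission term per step. A rough sketch, assuming `initial`, `transition`, and `emission` are probability tables shaped like the illustrative transition table above:

```python
# Joint probability of a tag sequence and a sentence under the HMM factorization:
#   p(s, o) = p(s_1) * p(o_1|s_1) * prod_{i>1} p(s_i|s_{i-1}) * p(o_i|s_i)
def hmm_joint(tags_seq, words, initial, transition, emission):
    prob = initial[tags_seq[0]] * emission[tags_seq[0]].get(words[0], 1e-6)
    for i in range(1, len(words)):
        prob *= transition[tags_seq[i - 1]][tags_seq[i]]
        prob *= emission[tags_seq[i]].get(words[i], 1e-6)  # tiny floor for unseen words (illustrative)
    return prob
```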

Representing Named Entities

Representation: rather than labeling each node with the name of the variable it represents (X₁, Y₁), as we have up to this point, we'll instead display the value that variable takes (“O”, “Great”). This makes the graphs easier to read.

Training

Inference
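
Inference here means recovering the single most likely tag sequence for a sentence, which is typically done with the Viterbi algorithm, a dynamic program over the chain. A minimal log-space sketch, assuming `initial`, `transition`, and `emission` tables like the illustrative ones above:

```python
import math

# Viterbi decoding: finds the most probable tag sequence under the HMM.
# Log-space avoids underflow on long sentences; the tiny emission floor
# for unseen words is an illustrative shortcut, not a real smoothing scheme.
def viterbi(words, tags, initial, transition, emission):
    def log_emit(tag, word):
        return math.log(emission[tag].get(word, 1e-6))

    # best[i][t]: best log-probability of any tag sequence for words[:i+1] ending in t.
    # back[i][t]: the tag that precedes t on that best path.
    best = [{t: math.log(initial[t]) + log_emit(t, words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p] + math.log(transition[p][t])) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log_emit(t, words[i])
            back[i][t] = prev_tag

    # Walk backwards from the best final tag to recover the full sequence.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```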

Results

Accuracy: 90.1%
Precision: 64.2%
Recall: 55.8%
F₁ Score: 59.7%

Precision

Tag     No OOV   1+ OOV
ORG     0.8      0.21
PER     0.85     0.62
LOC     0.87     0.06
MISC    0.78     0.12
ALL     0.84     0.39

Recall

Tag     No OOV   1+ OOV
ORG     0.64     0.33
PER     0.58     0.59
LOC     0.71     0.05
MISC    0.54     0.06
ALL     0.64     0.41

Limitations

Tag     Length 1   Length 2   Length 3   Length 4   Length 5
ORG     0.61       0.68       0.3        0.12       0.28
PER     0.96       0.67       0.7        –          –
LOC     0.88       0.36       –          –          –
MISC    0.9        0.46       0.24       0.12       –
ALL     0.77       0.61       0.31       0.12       0.29

Entity length 1: 0.86 at OOV rate 0; 0.3 at OOV rate 1
Entity length 2: 0.8 at OOV rate 0; 0.5 at 0.5; 0.52 at 1
Entity length 3: 0.78 at OOV rate 0; 0.15 at 0.33; 0.06 at 0.66; 0 at 1
Entity length 4: 0.42 at OOV rate 0; 0.22 at 0.25; 0 at 0.5; 0 at 0.75; 0 at 1
Entity length 5: 0.67 at OOV rate 0; 0.22 at 0.2; 0 at 0.4; 0.14 at 0.6

III. Maximum Entropy Markov Models

Discriminative Structure

Word Features

b(o_t) = \begin{cases} \textrm{1 if $o_t$ has shape “Xxx”} \\ \textrm{0 otherwise} \end{cases}
f_{\langle b, s \rangle}(o_t, s_t) = \begin{cases} \textrm{1 if $b(o_t) = 1$ and $s_t = s$} \\ \textrm{0 otherwise} \end{cases}
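
As a concrete sketch, the word-shape test b and the paired feature f⟨b, s⟩ might look like this in code (the regular expression for the “Xxx” shape and the choice of the PER tag are illustrative assumptions):

```python
import re

# b(o_t): does the word have shape "Xxx" (an initial capital followed by lowercase)?
def b(word: str) -> int:
    return 1 if re.fullmatch(r"[A-Z][a-z]+", word) else 0

# f_<b, s>(o_t, s_t): fires only when the observation feature fires AND the state is s.
def make_f(b_fn, s):
    def f(o_t, s_t):
        return 1 if b_fn(o_t) == 1 and s_t == s else 0
    return f

f_cap_per = make_f(b, "PER")
print(f_cap_per("Great", "PER"))  # 1: capitalized word, state PER
print(f_cap_per("great", "PER"))  # 0: shape test fails
print(f_cap_per("Great", "O"))    # 0: state mismatch
```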

State Transitions

p_{s'}(s|o) = \frac{1}{Z(o, s')} \exp\left(\sum_{a}{\lambda_a \, f_a(o, s)}\right)
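
In code, each previous state s′ gets its own exponential model: score every candidate next state by a weighted sum of feature values on the current observation, exponentiate, and normalize. A minimal sketch, where the feature functions and weights are assumed placeholders rather than a trained model:

```python
import math

# p_{s'}(s | o): one softmax over next states per previous state s'.
# weighted_features_by_prev maps s' to a list of (feature_fn, weight) pairs,
# with feature_fn(o_t, s) returning 0 or 1 (e.g. f_<b, s> above).
def memm_transition(o_t, s_prev, tags, weighted_features_by_prev):
    weighted = weighted_features_by_prev[s_prev]
    scores = {s: sum(w * f(o_t, s) for f, w in weighted) for s in tags}
    z = sum(math.exp(v) for v in scores.values())  # Z(o, s')
    return {s: math.exp(v) / z for s, v in scores.items()}
```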

Training & Inference

Results

Accuracy: 93.1%
Precision: 72.9%
Recall: 63.5%
F₁ Score: 67.9%

Precision

Tag     No OOV   1+ OOV
ORG     0.81     0.36
PER     0.82     0.8
LOC     0.82     0.17
MISC    0.74     0.14
ALL     0.8      0.54

Recall

Tag     No OOV   1+ OOV
ORG     0.68     0.12
PER     0.72     0.57
LOC     0.89     0.29
MISC    0.78     0.02
ALL     0.79     0.37

Advantage Over HMMs

Entity length 1: 0.83 at OOV rate 0; 0.34 at OOV rate 1
Entity length 2: 0.76 at OOV rate 0; 0.72 at 0.5; 0.55 at 1
Entity length 3: 0.6 at OOV rate 0; 0.56 at 0.33; 0.59 at 0.66; 0.5 at 1
Entity length 4: 0.16 at OOV rate 0; 0.2 at 0.25; 0 at 0.5
Entity length 5: 0.6 at OOV rate 0; 0.5 at 0.2

Tag     Length 1   Length 2   Length 3   Length 4   Length 5
ORG     0.76       0.69       0.84       0.36       0.8
PER     0.59       0.91       0.66       0.25       –
LOC     0.8        0.33       0.35       0          –
MISC    0.82       0.57       0.29       0.18       –
ALL     0.77       0.7        0.58       0.16       0.43

Label Bias Problem

IV. Conditional Random Fields

Markov Random Fields

\phi(X, Y) = \begin{cases} \textrm{3 if $X = 1$ and $Y = 1$} \\ \textrm{2 if $X = 0$ and $Y = 0$} \\ \textrm{1 otherwise} \end{cases}
p(A,B,C) = \frac{1}{Z}\,\phi(A,B)\,\phi(B,C)\,\phi(C,A) \\ \footnotesize\textrm{where Z is a normalization factor}
p(x_1, x_2,…) = \frac{1}{Z}\prod_{c\,{\scriptscriptstyle \in}\,C}{\phi_c(x_c)} \\ \footnotesize\textrm{where Z is a normalization factor} \\ \footnotesize\textrm{and C is the set of cliques in $\mathcal{G}$}
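
As a sketch, the three-variable example above can be computed directly: apply the pairwise potential φ to each of the cliques (A,B), (B,C), and (C,A), and obtain Z by summing the unnormalized product over all eight assignments:

```python
from itertools import product

# The pairwise potential from the case table above.
def phi(x, y):
    if x == 1 and y == 1:
        return 3
    if x == 0 and y == 0:
        return 2
    return 1

# Unnormalized product of the clique potentials for one assignment.
def unnormalized(a, b, c):
    return phi(a, b) * phi(b, c) * phi(c, a)

# Z sums the unnormalized product over all assignments of the binary variables.
Z = sum(unnormalized(a, b, c) for a, b, c in product((0, 1), repeat=3))

def p(a, b, c):
    return unnormalized(a, b, c) / Z

# Sanity check: the eight probabilities sum to one, and the "agreeing"
# configurations (1,1,1) and (0,0,0) get the most probability mass.
assert abs(sum(p(a, b, c) for a, b, c in product((0, 1), repeat=3)) - 1.0) < 1e-12
print(p(1, 1, 1), p(0, 0, 0))  # 0.54 and 0.16 with these potentials
```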

Conditional Form

p(y|x) = \frac{1}{Z(x)}\prod_{c\,{\scriptscriptstyle \in}\,C}{\phi_c(y_c, x_c)} \\ \footnotesize\textrm{where Z is a normalization factor} \\ \footnotesize\textrm{and C is the set of cliques in the} \\ \footnotesize\textrm{graph $\mathcal{G}$ representing the labels $y$}
Linear-chain CRF where the hidden layer depends on the current, previous, and future observations.

Exponential Factors

\phi_c(y_c, x_c) = \exp\left(\sum_{a}{\lambda_a \, f_a(y_c, x_c)}\right) \\ \footnotesize\textrm{where $f_a$ is a feature function defined for clique $c$} \\ \footnotesize\textrm{and $\lambda_a$ is the weight parameter for $f_a$}
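
Putting the last two equations together for a linear chain: each clique ties a pair of adjacent labels to the observations, its potential is the exponential of a weighted feature sum, and the product of the potentials divided by Z(x) gives p(y|x). A minimal sketch, with the feature functions and weights left as assumed inputs; the brute-force Z(x) is only workable for toy inputs and would be replaced by the forward algorithm in practice:

```python
import math
from itertools import product

# phi_c for a linear-chain clique: exp of a weighted sum of feature functions,
# where each feature_fn(y_prev, y_t, x, t) returns 0 or 1.
def clique_potential(y_prev, y_t, x, t, weighted_features):
    return math.exp(sum(w * f(y_prev, y_t, x, t) for f, w in weighted_features))

# Unnormalized score of a full label sequence: product of the clique potentials.
def unnormalized_score(y, x, weighted_features):
    score = 1.0
    for t in range(1, len(y)):
        score *= clique_potential(y[t - 1], y[t], x, t, weighted_features)
    return score

# p(y | x): normalize by Z(x), the same score summed over every label sequence.
def crf_probability(y, x, tags, weighted_features):
    z = sum(
        unnormalized_score(candidate, x, weighted_features)
        for candidate in product(tags, repeat=len(x))
    )
    return unnormalized_score(y, x, weighted_features) / z
```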

Training & Inference

Label Bias Problem

Results

Accuracy: 94.8%
Precision: 77.8%
Recall: 74.9%
F₁ Score: 76.3%

Precision

Tag     No OOV   1+ OOV
ORG     0.87     0.46
PER     0.87     0.72
LOC     0.9      0.29
MISC    0.81     0.23
ALL     0.87     0.6

Recall

Tag     No OOV   1+ OOV
ORG     0.79     0.4
PER     0.82     0.8
LOC     0.92     0.2
MISC    0.84     0.09
ALL     0.85     0.57
