# Seminal Contributions by IBM

Here are some of IBM's top contributions to speech recognition. The papers listed have been cited more than 10,000 times combined.

### 1. First speech recognition application

In the early 1960s, IBM developed and demonstrated the Shoebox, a forerunner of today's voice recognition systems. The device recognized digits and arithmetic commands and responded to them. The following is a 1961 Times article about the device.

### 2. Introduction of HMMs to speech

IBM was the first to introduce hidden Markov models (HMMs) to the world of speech recognition. Although Rabiner's tutorials on HMMs are more widely cited, the IBM papers came first. The first paper is on the forward-backward algorithm:

[1] L.R. Bahl, J. Cocke, F. Jelinek and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," *IEEE Trans. Inform. Theory,* vol. **IT-20**, pp. 284-287, March 1974.

Here are two papers on the maximum likelihood approach to speech recognition:

[2] L.R. Bahl, F. Jelinek and R.L. Mercer, "A maximum likelihood approach to continuous speech recognition," *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. **5**, no. 2, pp. 179-190, 1983.

[3] F. Jelinek, "Continuous speech recognition by statistical methods," *Proceedings of the IEEE* (Invited Paper), vol. **64**, no. 4, pp. 532-556, April 1976.

### 3. Pioneering Innovations in Language Modeling

The first successful smoothing algorithm for language modeling, deleted interpolation, was invented at IBM. IBM also produced the first application of the maximum entropy principle to language modeling. Here are some of the papers that introduced these techniques:

[4] Bahl, L.R., Brown, P.F., deSouza, P.V., Mercer, R.L., Nahamoo, D., 1991. A fast algorithm for deleted interpolation. In: *Proc. Europ. Conf. Speech Comm. Tech.*, Genova, pp. 1209-1212.

[5] Jelinek, F., Mercer, R.L., 1980. Interpolated estimation of Markov source parameters from sparse data. In: Gelsema, E.S., Kanal, L.N. (Eds.), *Pattern Recognition in Practice*. North-Holland, Amsterdam, pp. 381-397.

[6] Adam Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A Maximum Entropy Approach to Natural Language Processing. *Computational Linguistics*, **22**(1):39-71.

[7] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. 1995. Inducing Features of Random Fields. Technical Report CMU-CS-95-144, School of Computer Science, Carnegie Mellon University.
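
As a rough illustration of the interpolation idea behind [4] and [5], the sketch below mixes a bigram and a unigram maximum likelihood estimate with a single weight and fits that weight by EM on held-out data. It is a simplified sketch, not the algorithm as published: it assumes one global weight and one fixed held-out split, whereas deleted interpolation rotates the held-out portion through the training data and typically ties weights by history count. All function names (`build_model`, `component_probs`, `interpolated_prob`, `estimate_lambda`) and the toy corpus are hypothetical.

```python
from collections import Counter


def build_model(tokens):
    """Collect the counts needed for ML unigram and bigram estimates."""
    return {
        "unigrams": Counter(tokens),
        "bigrams": Counter(zip(tokens, tokens[1:])),
        # Count a token as a history only if another token follows it.
        "histories": Counter(tokens[:-1]),
        "total": len(tokens),
    }


def component_probs(prev, word, model):
    """Return (bigram, unigram) maximum likelihood estimates."""
    p_uni = model["unigrams"][word] / model["total"]
    seen = model["histories"][prev]
    p_bi = model["bigrams"][(prev, word)] / seen if seen else 0.0
    return p_bi, p_uni


def interpolated_prob(prev, word, lam, model):
    """P(word | prev) = lam * bigram + (1 - lam) * unigram."""
    p_bi, p_uni = component_probs(prev, word, model)
    return lam * p_bi + (1.0 - lam) * p_uni


def estimate_lambda(model, heldout, iterations=50):
    """Fit a single interpolation weight by EM on held-out bigram events."""
    events = list(zip(heldout, heldout[1:]))
    lam = 0.5  # arbitrary starting point (assumption of this sketch)
    for _ in range(iterations):
        posterior_sum, used = 0.0, 0
        for prev, word in events:
            p_bi, p_uni = component_probs(prev, word, model)
            mix = lam * p_bi + (1.0 - lam) * p_uni
            if mix == 0.0:
                continue  # word unseen in training: skip the event
            # E-step: posterior that the bigram component produced this event.
            posterior_sum += lam * p_bi / mix
            used += 1
        if used == 0:
            break
        # M-step: the new weight is the average posterior.
        lam = posterior_sum / used
    return lam


if __name__ == "__main__":
    train = "a b a b a b a c".split()
    heldout = "a b a c a b".split()
    model = build_model(train)
    lam = estimate_lambda(model, heldout)
    print(lam, interpolated_prob("a", "b", lam, model))
```

Estimating the weight on data that was not used for the counts is the key design point: fitting it on the training data itself would always favor the higher-order estimate.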

### 4. Statistical Machine Translation

The work on maximum entropy and parsing led to the first purely statistical translation system. Here are some of the earliest important IBM papers on machine translation:

[8] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, P. Roossin, A statistical approach to language translation, *Proceedings of the 12th Conference on Computational Linguistics*, pp. 71-76, August 22-27, 1988.

[9] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer, The mathematics of statistical machine translation: parameter estimation, *Computational Linguistics*, v.**19** n.2, June 1993.

[10] Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, Paul S. Roossin, A statistical approach to machine translation, *Computational Linguistics*, v.**16** n.2, pp. 79-85, June 1990.

[11] A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, J. Gillett, J. Lafferty, H. Printz, L. Ures (1994). The Candide system for machine translation. *ARPA Workshop on Speech and Natural Language*. Morgan Kaufmann Publishers, 157-163.

### 5. Discriminative training of acoustic and language models

The discriminative training craze for acoustic modeling started with Maximum Mutual Information (MMI) estimation in Peter Brown's thesis. The work continued at IBM and was followed by a paper on the extended Baum-Welch algorithm, which gives a recipe for optimizing rational functions that satisfy certain constraints. The real success of MMI, however, came not at IBM but at Cambridge, where it was also extended to the Minimum Phone Error (MPE) criterion. Dan Povey did this work for his dissertation before coming to IBM. At IBM he developed new discriminative features, feature-space MPE (fMPE), as well as an altogether better criterion, boosted MMI (bMMI). The machine learning community has also taken up discriminative classification with Support Vector Machines and large-margin classifiers, and some of the large-margin work was inspired by Povey's work. Below are some of the older papers, along with selected papers from Povey's IBM work.

[12] Nadas, A.: A decision-theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood. *IEEE Transactions on Acoustics, Speech and Signal Processing* **31**(4), pp. 814-817, 1983.

[13] L.R. Bahl, P.F. Brown, P.V. de Souza, R.L. Mercer (1986). Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition, *Proc. ICASSP 86*, pp. 49-52, Tokyo.

[14] Gopalakrishnan, P.S., Kanevsky, D., Nadas, A., Nahamoo, D.: An inequality for rational functions with applications to some statistical estimation problems. *IEEE Trans. Inform. Theory* **37**(1), pp. 107-113, 1991.

[15] Povey, D. et al., "FMPE: Discriminatively trained features for speech recognition", *Proc. of ICASSP 2005*, pp. 961-964, Philadelphia, 2005.

[16] Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., Visweswariah, K.: Boosted MMI for model and feature-space discriminative training. In: *Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing* (ICASSP-08), Las Vegas, NV (2008).
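
For orientation, the MMI and boosted MMI criteria discussed in this section are commonly written as below. The notation is chosen here for illustration and is not taken from this page: $X_r$ is the acoustic observation sequence of training utterance $r$, $W_r$ its reference transcription, $\lambda$ the acoustic model parameters, $\kappa$ an acoustic scale, $P(W)$ the language model probability, $A(W, W_r)$ the accuracy of hypothesis $W$ measured against the reference, and $b \ge 0$ the boosting factor.

```latex
% MMI: maximize the posterior of the reference transcription.
\mathcal{F}_{\mathrm{MMI}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_\lambda(X_r \mid W_r)^{\kappa}\, P(W_r)}
         {\sum_{W} p_\lambda(X_r \mid W)^{\kappa}\, P(W)}

% Boosted MMI: competing hypotheses with lower accuracy get more weight
% in the denominator.
\mathcal{F}_{\mathrm{bMMI}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_\lambda(X_r \mid W_r)^{\kappa}\, P(W_r)}
         {\sum_{W} p_\lambda(X_r \mid W)^{\kappa}\, P(W)\,
          e^{-b\, A(W, W_r)}}
```

Setting $b = 0$ recovers the MMI criterion. Both objectives are ratios of the kind the extended Baum-Welch work in [14] was designed to optimize.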