Foundational Papers in Complexity Science pp. 2699–2731
DOI: 10.37911/9781947864559.85
From the Euclidean to the Natural
Author: Nihat Ay, Hamburg University of Technology
Excerpt
The idea of learning as an optimization process can be traced back to the early years of artificial neural networks. This idea has been very fruitful, ultimately leading to the recent successes of deep neural networks as learning machines. Although consistent with optimization, the first learning algorithms for neurons were inspired by neurophysiological and neuropsychological paradigms, most notably by the celebrated work of Donald Hebb (1949). Building on such paradigms, Frank Rosenblatt (1957) proposed an algorithm for training a simple neuronal model, which Warren McCulloch and Walter Pitts had introduced in their seminal 1943 article. The convergence of this algorithm can be formally proved with elementary arguments from linear algebra (perceptron convergence theorem; see Novikoff 1962). The idea of learning as an optimization process, however, not only offers a unified conceptual foundation for learning but also allows us to study learning from a rich mathematical perspective. In this context, the stochastic gradient descent method plays a fundamentally important role (Widrow 1963; Amari 1967; Rumelhart, Hinton, and Williams 1986). Nowadays, it represents the main instrument for training artificial neural networks, which brings us to Shun-ichi Amari’s article “Natural Gradient Works Efficiently in Learning.” Let us unfold this title and thereby reveal the main insights of Amari’s work.
Bibliography
Amari, S. 1967. “A Theory of Adaptive Pattern Classifiers.” IEEE Transactions on Electronic Computers EC-16 (3): 299–307. https://doi.org/10.1109/PGEC.1967.264666.
Amari, S. 2016. Information Geometry and Its Applications. Vol. 194. Applied Mathematical Sciences. Tokyo, Japan: Springer Tokyo. https://doi.org/10.1007/978-4-431-55978-8.
Amari, S., and H. Nagaoka. 2000. Methods of Information Geometry. Vol. 191. Translations of Mathematical Monographs. Oxford, UK: Oxford University Press.
Ay, N. 2020. “On the Locality of the Natural Gradient for Learning in Deep Bayesian Networks.” Information Geometry 6:1–49. https://doi.org/10.1007/s41884-020-00038-y.
Ay, N., J. Jost, H. V. Lê, and L. Schwachhöfer. 2017. Information Geometry. Vol. 64. Ergebnisse der Mathematik und ihrer Grenzgebiete, 3. Folge / A Series of Modern Surveys in Mathematics. Cham, Switzerland: Springer. https://doi.org/10.1007/978-3-319-56478-4.
Hebb, D. O. 1949. The Organization of Behavior. New York, NY: Wiley.
McCulloch, W. S., and W. Pitts. 1943. “A Logical Calculus of the Ideas Immanent in Nervous Activity.” Bulletin of Mathematical Biophysics 5:115–133. https://doi.org/10.1007/BF02478259.
Novikoff, A. B. 1962. “On Convergence Proofs for Perceptrons.” In Symposium on the Mathematical Theory of Automata, 12:615–622. Brooklyn, NY: Polytechnic Institute of Brooklyn.
Rosenblatt, F. 1957. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65:386–407. https://doi.org/10.1037/h0042519.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1986. “Learning Internal Representations by Error Propagation.” In Parallel Distributed Processing, 1:318–362. Cambridge, MA: MIT Press.
Widrow, B. 1963. A Statistical Theory of Adaptation. Oxford, UK: Pergamon Press.