Abstract Mehler & Sichelschmidt
Re-conceptualizing Latent Semantic Analysis in Terms of Complex Network Theory. A Corpus-Linguistic Approach
Recently, the small world (SW) phenomenon has been investigated in networks of lexical units (Steyvers & Tenenbaum 2005) and of textual units (Mehler 2006). These analyses show a remarkable conformity between the topology of social and biological networks on the one hand and that of linguistic networks on the other. This conformity concerns their macroscopic organization, which is characterized by high cluster values and short average geodesic distances between randomly chosen pairs of nodes. These findings call into question the cognitive plausibility of Latent Semantic Analysis (LSA; Landauer & Dumais 1997) as the predominant model of semantic spaces. The reason is that LSA induces a completely connected graph in which every vertex is directly connected to every other vertex, in contradiction to what is known about the topology of small world graphs.
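The two quantities named above, cluster values and average geodesic distance, can be made concrete with a small sketch. The code below is only an illustration under stated assumptions and is not taken from the cited studies: the function names, the adjacency-set representation, and the two toy graphs (a complete graph and a ring lattice) are hypothetical choices made for this example. It shows that a completely connected graph yields the degenerate values 1.0 for both measures, whereas a sparse graph yields values that actually depend on its topology.

```python
from collections import deque
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient of an undirected graph.

    adj maps each node to the set of its neighbours (assumed representation).
    """
    total = 0.0
    for node, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count the links that exist among the neighbours of this node.
        links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_geodesic_distance(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def complete_graph(n):
    """Toy stand-in for a completely connected graph: every pair is linked."""
    return {i: {j for j in range(n) if j != i} for i in range(n)}


def ring_lattice(n, k):
    """Toy sparse graph: each node is linked to its k nearest neighbours per side."""
    return {i: {(i + d) % n for d in range(-k, k + 1) if d != 0} for i in range(n)}


if __name__ == "__main__":
    for name, graph in [("complete", complete_graph(5)), ("ring", ring_lattice(10, 2))]:
        print(name, average_clustering(graph), average_geodesic_distance(graph))
```

In practice one would compare such values against those of a random graph with the same number of nodes and edges; that comparison is omitted here to keep the sketch short.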
This paper asks for a computational linguistic model that combines the corpus-analytic stance of LSA with a cognitively more adequate representation format, as an alternative to the predominant model of semantic spaces. Starting from Steyvers & Tenenbaum's findings, it additionally asks whether the underlying text corpora are themselves networked in an SW-like manner. From a corpus-linguistic point of view, the SW property of text networks can be seen as an argument in favour of representative samples as input to computing cognitively plausible models of lexical association. Although it is known from quantitative linguistics that such samples are hardly attainable, this property can be used as a necessary condition that corpora have to fulfil in order to be judged reliable databases for computing lexical memory models which show the SW property themselves. Consequently, we propose to reconstruct LSA in terms of the SW-like networking of lexical and textual units. The paper thus argues for a model of semantic spaces that combines restrictions on corpus-internal networking with constraints on the networking of the cognitive units derived from these corpora. It examines the validity of lexical association models from the point of view of their time and space complexity and, vice versa, sheds light on the validity of computational models from the point of view of their SW characteristics. A central outcome of the paper is that complex network theory allows us to derive necessary conditions which corpora of natural language texts have to fulfil in order to serve as reliable input for computing lexical associations.
Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato’s problem. Psychological Review, 104(2):211–240.
Mehler, A. (2006). Text linkage in the wiki medium – a comparative study. In Proc. of the EACL Workshop on New Text – Wikis and blogs and other dynamic text sources, Trento, April 3–7.
Steyvers, M. and Tenenbaum, J. (2005). The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1):41–78.