Abstract Hiltula

On the use of linguistic corpora in connectionist modelling

The connectionist literature on language acquisition has paid little attention to the criteria that make a language corpus, and a neural network training set based on that corpus, representative for the purposes of modelling.

In constructing a training set, modellers often use lexical and frequency data from linguistic corpora such as the Brown Corpus, which is the starting point of, for example, Rumelhart and McClelland (1987). Usually the aim is to obtain data about the actual shape of the input, and thus to mimic the acquisition situation of a young native learner (MacWhinney et al. 1989, p. 263).

As I interpret it, the input is a collection of utterances the learner hears and perhaps understands, but the knowledge of language the learner eventually comes to master cannot be equated with or replaced by that collection. The distinction is essentially that between theoretical or contemplative “observer’s knowledge” and practical “agent’s knowledge” (Itkonen 2005, p. 187).

Plunkett and Marchman (1991, p. 46) suggest that the performance of Rumelhart and McClelland's (1987) past tense network could be characterized at a more abstract level, as modelling a hypothetical, internal system-building process. The training set constructed for the model incorporates an assumption about which patterns in the environment are salient enough to be entrenched in this process. In other words, the process takes place in the mind of the learner, but it is still strongly mediated through “observer’s knowledge”.

Connectionist approaches such as those described above have not given much consideration to the question of meaning. Mandler (2004, p. 251) has noted that without any conceptual interface, language learning models can become unrealistic. If pattern association is all there is, then under a connectionist account the child could seem to gain command of a task without knowing what s/he has gained command of.

The real language learning environment necessarily contains material from diverse sources, both spoken and written, to which the children's own utterances also contribute. A different conception of a learning model, one that takes conceptual and semantic meaning into account, might require training data which stems from the learners themselves, a step away from viewing the learning environment as solely external to them.



