Periodic Reporting for period 1 - ThReDS (A Theory of Reference for Distributional Semantics)
Reporting period: 2017-07-01 to 2019-06-30
Before an artificial agent can talk, it needs to learn about the world, just as a child does. In linguistic and computational terms, this means acquiring *representations* of the things the agent is exposed to. In the field of Distributional Semantics, such computational representations have traditionally been built from raw text data (sometimes enriched with visual information) and take the form of a 'vector', that is, a mathematical model of the way a particular word is used by human beings, as experienced by the agent. Such vectors power many everyday applications, including search engines, recommendation systems and conversational agents.

So far, however, vectors have only been constructed for *concepts* (e.g. 'student', 'owl', 'broom') rather than for individual entities ('Harry Potter', 'Hedwig', 'Harry's Nimbus 2000'). This is because current algorithms need considerable amounts of data to learn properly, and references to individual entities are much less frequent in raw text than generic occurrences of words. Further, raw vector representations are not suitable for referring, because they do not explicitly encode the properties of the concept or individual that a human would use to identify it (e.g. 'wearing glasses' for Harry Potter). To make vectors compatible with so-called 'Referring Expression Generation' systems, that is, algorithms that can produce successful references to things in the world, a translation must be found to a more formal and structured representation of meaning, which in theoretical linguistics takes the form of 'Model-theoretic Semantics'.
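To make the idea of a distributional 'vector' concrete, here is a minimal, illustrative sketch (not the project's code): word vectors built as simple co-occurrence counts over a toy corpus, compared with cosine similarity. The corpus sentences are invented for illustration.

```python
# Minimal sketch of count-based distributional vectors:
# each word is represented by the counts of words appearing near it.
from collections import Counter, defaultdict
import math

corpus = [
    "the student reads a book",
    "the owl reads a book",       # toy data: 'owl' used like 'student'
    "the student writes an essay",
]

window = 2  # co-occurrence window size
vectors = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                vectors[word][tokens[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

# Words used in similar contexts end up with similar vectors.
print(cosine(vectors["student"], vectors["owl"]))   # high: shared contexts
print(cosine(vectors["student"], vectors["book"]))  # lower: different usage
```

Real systems replace raw counts with dense vectors learned by neural networks, but the underlying intuition is the same: a word's meaning is modelled by the contexts in which it occurs.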
ThReDS tackled two challenges: a) the computational extraction of representations of entities from raw text, concentrating on the small data issue; b) the theoretical account of how raw exposure to linguistic data (distributional semantics information) can shape the agent's representation of the world (their model-theoretic semantics).
It is hoped that Nonce2Vec, a system for learning word representations from very little data, will pave the way for extracting representations of single entities from text, as well as of new concepts. To give an example, if a computer were to simulate a human reading the Harry Potter series, it should be able to learn very quickly who Harry or Hermione are, and to follow their development throughout the text, modifying its representation of the characters as it reads. It should similarly be able to learn new concepts like 'quidditch'. I have started setting up an experiment to test the ability of the software to acquire high-quality representations of individuals. This new experiment involves simulating the broad picture that a human might acquire of a person after reading a Wikipedia article about them. A balanced dataset of individuals has been produced, together with an experimental setup for eliciting individual properties from human subjects. This preliminary work will allow me to conduct a series of behavioural and computational experiments in the future.
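The core idea of learning from a single exposure can be sketched with the simple additive baseline that systems like Nonce2Vec refine: initialise the new word's vector as the sum of the vectors of its context words. The background vectors and the example sentence below are invented for illustration only.

```python
# Hedged sketch of one-shot word learning (additive baseline, not the
# project's actual algorithm): a new word's vector is initialised from
# the vectors of the words surrounding its single occurrence.
import numpy as np

# Pretend these vectors were learned beforehand from large amounts of text.
background = {
    "play":  np.array([0.9, 0.1, 0.0]),
    "sport": np.array([0.8, 0.2, 0.1]),
    "broom": np.array([0.1, 0.9, 0.3]),
}

def additive_nonce_vector(context_words, space):
    """Initialise a new word's vector as the sum of its context vectors."""
    vecs = [space[w] for w in context_words if w in space]
    return np.sum(vecs, axis=0)

# Single exposure: "they play the sport quidditch on a broom"
quidditch = additive_nonce_vector(["play", "sport", "broom"], background)
print(quidditch)  # inherits features of both 'sport' and 'broom'
```

In practice this initial estimate is then refined incrementally as further mentions of the word are encountered, which is what allows a representation to evolve as the agent keeps reading.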
Finally, the linguistic theory behind ThReDS has been laid out in a draft paper. The paper highlights how what is *said* about an entity or concept (i.e. the noisy, observable data that a human or computer might learn from) can be related to the *properties* of that entity/concept in the world (a 'cleaner', database-like representation). This translation from observable data to properties is essential to explain how humans acquire the formal models necessary to discriminate between individuals and refer to them accurately.
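One very simple way to picture such a translation (an illustrative assumption, not the formalism of the draft paper) is to threshold an entity's graded association scores with candidate properties, yielding a crisp, database-like property set of the kind Referring Expression Generation systems operate over. The scores below are invented.

```python
# Hedged sketch: from noisy, graded distributional evidence to a crisp,
# model-theoretic property set. The association scores are hypothetical.
scores = {
    "wears_glasses": 0.92,
    "plays_quidditch": 0.85,
    "is_a_teacher": 0.10,
}

def to_model(scores, threshold=0.5):
    """Keep only the properties whose evidence exceeds the threshold."""
    return {prop for prop, s in scores.items() if s >= threshold}

harry = to_model(scores)
print(sorted(harry))  # properties a speaker could use to refer to Harry
```

The interesting theoretical questions lie precisely in what this toy thresholding glosses over: how graded, noisy evidence should be aggregated, and when a property can be taken to truly hold of an entity.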
On a more general level, the contributions made by ThReDS are expected to help build more intelligent artificial agents and, crucially, to provide an understanding of *how* and *what* such agents learn through their exposure to linguistic data.