
Probabilistic Argumentation on the Web

Periodic Reporting for period 1 - ARGUE_WEB (Probabilistic Argumentation on the Web)

Reporting period: 2016-03-01 to 2017-12-31

A review of the current literature on the extraction of arguments and their components revealed that no existing tools enabled this task. Recent research in the classification of argumentative text using traditional machine learning algorithms focused on the recognition of arguing subjectivity (a brief review of this research follows below) and showed that it is possible to identify subjectivity in arguing text found in online debates (blog posts and editorials). The methods employed in these cases were similar to those used for sentiment recognition in text. Although research on the subjectivity of arguing represents significant progress in the recognition of arguments, it is still not possible to extract, represent and use natural language arguments in KRR (Knowledge Representation and Reasoning) systems. To overcome this obstacle, the Researcher was asked to investigate Deep Learning neural network algorithms for the extraction of data. The lack of sufficiently large annotated datasets was a major problem in the experimental evaluation of different neural network architectures. The Researcher tried to overcome this problem by using NLP tools. Although NLP tools are widely used in traditional machine learning tasks, their accuracy is very low when used for the extraction of specific types of text. The main goal of this project can be defined as follows: to investigate algorithms and tools capable of identifying argumentation text in online textual resources. The term 'argumentation text' in this context refers to text intended to persuade the participants in a debate.
The work covered so far falls under the following two categories:
(I) Natural Language and Linguistics for the identification of patterns in text.
(II) Deep Learning neural network algorithms for the classification of text (ongoing).

(I) In terms of Natural Language and Linguistics, during the period of my fellowship I became acquainted with and implemented various methods falling under the following categories:

1. Extraction of data from various public political forums on the web and from the AraucariaDB files. In particular, I have used the Scrapy [22] library to extract data from the forums www.debatepolitics.com, www.politicsforum.ac.uk and www.discoursedb.org; a minimal spider sketch follows.
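The sketch below shows the general shape of such a spider. The CSS selectors ('div.post', 'div.post-content', 'a.next') are hypothetical placeholders, since each forum's HTML markup differs and must be inspected before writing real selectors.

import scrapy

class ForumPostSpider(scrapy.Spider):
    name = "forum_posts"
    start_urls = ["http://www.debatepolitics.com"]  # one of the forums listed above

    def parse(self, response):
        # 'div.post' / 'div.post-content' are assumed selector names
        for post in response.css("div.post"):
            yield {"text": " ".join(post.css("div.post-content ::text").getall())}
        next_page = response.css("a.next::attr(href)").get()  # follow pagination, if any
        if next_page:
            yield response.follow(next_page, callback=self.parse)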

2. Clustering algorithms (e.g. K-Means) showed the difficulty of identifying categories when the semantic relevance of the information cannot be modeled adequately; a small illustration follows.
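As a minimal illustration of this difficulty, the following sketch clusters TF-IDF vectors with K-Means; the toy corpus and the use of scikit-learn are assumptions made for the example. Because TF-IDF treats documents as bags of words, the clusters reflect vocabulary overlap rather than argumentative role.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus standing in for the collected forum posts (illustrative only)
posts = [
    "The bill was an excuse by the EU",
    "The president argued against the new bill",
    "I love hiking in the mountains",
    "Mountain trails are beautiful in spring",
]
vectors = TfidfVectorizer(stop_words="english").fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: grouped by shared words, not by argumentative content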

3. Parsing techniques and grammars.
(a) I have implemented various parsing and lexicographic methods in order to extract the syntactic patterns involved in argumentative text and to assess their effectiveness when used for the automatic extraction of argumentative text. For example, the following lexicographic pattern (expressed via regular expressions over part-of-speech tags) shows that even for very simple syntactic patterns, regular expressions can become quite complex.

import re
import nltk  # pos_tag requires the 'averaged_perceptron_tagger' resource

president = "The President of Cyprus argued that the bill was an excuse by the EU"
# POS-tag the sentence and join the token/tag pairs into one ';'-separated string
toksTags = ';'.join(['/'.join(list(t)) for t in nltk.pos_tag(president.split())])
# Named groups capture the arguing entity (ARGUER) and the arguing verb (ARG_EVENT)
arguer = re.compile(r"(?P<ARGUER>the/DT;[A-Za-z]+(/NN|/NNS|/NNP);((of|in)/IN;[A-Za-z]+/(NN|NNS|NNP);)?)(?P<ARG_EVENT>(argue/VBP;|argued/VBD;))", re.I)
print("RESULT:", arguer.search(toksTags).groupdict())
> RESULT: {'ARG_EVENT': 'argued/VBD;', 'ARGUER': 'The/DT;President/NNP;of/IN;Cyprus/NNP;'}

(b) nltk libraries enable the representation of text in the form of Tree structures. The intention behind this labeling of textual data is that it can be used to query the data, but it can also serve as input to classification algorithms, such as the recursive deep learning algorithm used for sentiment analysis.
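As a minimal illustration, a labeled sentence can be represented directly as an nltk Tree. The ARGUER and ARG_EVENT labels follow the regex example above, while CLAIM is an assumed label used here for illustration only.

from nltk import Tree

t = Tree("ARGUMENT", [
    Tree("ARGUER", ["The", "President", "of", "Cyprus"]),
    Tree("ARG_EVENT", ["argued"]),
    Tree("CLAIM", ["that", "the", "bill", "was", "an", "excuse", "by", "the", "EU"]),
])
print(t)  # bracketed representation; t.draw() renders it graphically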

(c) Grammars and chunking: the part-of-speech tags, although useful in determining the syntactic type of individual tokens, were not adequate to describe sequences of tokens referring to a single syntactic category, e.g. Noun Phrases (NP) vs Nouns (NN, NNS) in Penn Treebank trees, but also tokens that jointly refer to the subject of a verb. Grammars have been defined via the nltk library, e.g. grammar1 = r"""CLAUSE: {<NN.*><MD><VB><DT><JJ>*<NN.*>}""", and parsed via RegexpParser (a minimal sketch follows the Pattern example below). Available libraries (like Pattern) provide the ability to perform chunking on nested tagging structures; chunks are derivable as a by-product of Pattern's primary parsing functionality. For example, let's say we have a subjective lexicon; then we can extract subjective parts of Noun Phrases as follows (the example is rather simplistic and is used for illustration purposes):

from collections import namedtuple
from pattern.en import parse, Sentence

Annotation = namedtuple('Annotation', ['text', 'label'])

txt = "The new bill is a bad excuse."  # illustrative input ('txt' was undefined in the original)
parsed = parse(txt, lemmata=True, relations=True, chunks=True)
s = Sentence(parsed)
subjective_lexicon = ['bad', 'good', 'beautiful', 'ugly']
for c in s.chunks:
    if c.type.startswith('NP'):
        # Collect the adjectives inside the noun phrase
        jj_s = [w.string for w in c if w.type == 'JJ']
        if jj_s:
            # Keep only adjectives found in the subjective lexicon
            subj = [w for w in jj_s if w.lower() in subjective_lexicon]
            if subj:
                annot = Annotation(text=c.string, label='SUBJECTIVE')
                print(annot)
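For completeness, the following is a minimal sketch of chunking with grammar1 via nltk's RegexpParser; the CLAUSE rule label and the hand-tagged tokens are illustrative assumptions.

import nltk

# 'CLAUSE' is an assumed rule label (the original grammar string omitted one)
grammar1 = r"""CLAUSE: {<NN.*><MD><VB><DT><JJ>*<NN.*>}"""
parser = nltk.RegexpParser(grammar1)
# Hand-tagged tokens keep the example self-contained
tokens = [("parliament", "NN"), ("will", "MD"), ("be", "VB"),
          ("a", "DT"), ("big", "JJ"), ("circus", "NN")]
print(parser.parse(tokens))  # the whole sequence is grouped under CLAUSE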

4. Deriving the entity types of textual components can be helpful in cases where textual components with the same syntactic structure but different entity types should be categorized differently. Examples of libraries used for entity recognition are: StanfordNERTagger, ne_chunk (nltk), and spaCy.
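For example, a minimal entity-recognition sketch with spaCy, assuming the small English model has been installed via "python -m spacy download en_core_web_sm":

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The President of Cyprus argued that the bill was an excuse by the EU")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Cyprus -> GPE, EU -> ORG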


5. Sentiment and Subjectivity are related to the notions of Opinion Mining and Arguing and can be used within other algorithms to determine the nature of the subjectivity involved. Valuable tools that provide information about this aspect of text, and that were used to obtain the sentiment of sentences or individual words, are: VADER, the CoreNLP annotation tool, MPQA, and spaCy.
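For instance, a minimal sketch of sentence-level sentiment scoring with VADER through nltk, assuming the lexicon has been fetched via nltk.download('vader_lexicon'):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The bill was a bad excuse"))
# -> dict with 'neg', 'neu', 'pos' and a 'compound' score in [-1, 1]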

6. Wordnet libraries – the use of synonyms or antonyms is important for determining the polarity of unknown or synonymous words. Each word has many different senses and a hierarchy of related senses. The wordnet library (nltk.corpus) provides, via very simple commands, information about a word, its hypernyms and hyponyms, as well as some similarity measures. It has been used in the literature to establish the relation between different words: starting from a small set of seed words, one can expand the vocabulary with new words.
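A minimal sketch of these queries, assuming the corpus has been fetched via nltk.download('wordnet'):

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print(dog.definition())    # gloss of this sense
print(dog.hypernyms())     # more general concepts
print(dog.hyponyms())      # more specific concepts
print(wn.synset('good.a.01').lemmas()[0].antonyms())  # antonym lookup, e.g. 'bad'
print(dog.path_similarity(wn.synset('cat.n.01')))     # hierarchy-based similarity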

(II) Deep Learning:
1. A recursive deep learning algorithm was developed for the Araucaria data. However, the Araucaria dataset is small and the algorithm suffered from vanishing gradients. With the advent of new architectures and possibilities, research in this area is likely to produce useful results.
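As an illustration of the architecture (not the exact implementation used), the following numpy sketch shows the forward pass of a recursive network over a toy binary tree; the repeated tanh squashing at every tree level is one source of the vanishing-gradient problem mentioned above.

import numpy as np

d = 4                                        # toy embedding dimension
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d)) * 0.1    # shared composition weights
b = np.zeros(d)

def compose(left, right):
    # A parent vector is a tanh composition of its two children; repeated
    # squashing at every level shrinks gradients during backpropagation
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Toy binary tree ((w1 w2) w3) over random stand-ins for word vectors
w1, w2, w3 = (rng.standard_normal(d) for _ in range(3))
root = compose(compose(w1, w2), w3)
print(root)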

The advances in Deep Learning over recent years motivate the investigation of Deep Learning methods for the classification, prediction and translation of data.

[22] https://scrapy.org
Various methods have been used, but the work remained dispersed. Due to the technical limitations discussed earlier (e.g. the lack of data), no publication was produced. Given more time, I am optimistic that the continuation of this research will yield results that will benefit my Institute.