Periodic Reporting for period 1 - ARGUE_WEB (Probabilistic Argumentation on the Web)
Reporting period: 2016-03-01 to 2017-12-31
(I) Natural Language Processing and Linguistics for the identification of patterns in text.
(II) Deep Learning algorithms for the classification of text (ongoing).
(I) In terms of NLP and Linguistics, during the period of my fellowship I became acquainted with, and implemented, various methods falling under the following categories:
1. Extraction of data from various public political forums on the web and from the AraucariaDB files. In particular, I used the Scrapy library [22] to extract data from the forums www.debatepolitics.com, www.politicsforum.ac.uk and www.discoursedb.org.
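For illustration, a minimal Scrapy spider sketch (the spider name, start URL and CSS selectors are hypothetical and would need to be adapted to each forum's markup):

import scrapy

class DebateSpider(scrapy.Spider):
    name = "debate_posts"                                    # hypothetical name
    start_urls = ["https://www.debatepolitics.com/forums/"]

    def parse(self, response):
        # Yield the text of each post on the page (hypothetical selector).
        for post in response.css("div.post-content"):
            yield {"text": " ".join(post.css("::text").getall())}
        # Follow pagination links, if present (hypothetical selector).
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)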
2. Clustering algorithms (e.g. K-Means) showed the difficulty of identifying categories when semantic relevance cannot be modeled adequately by surface features alone.
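A minimal sketch of such an experiment, assuming scikit-learn and a toy corpus (the documents and cluster count below are hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "The president argued that the bill was an excuse",
    "The senator claimed the law protects citizens",
    "I love hiking in the mountains every summer",
    "Mountain trails are beautiful in the summer",
]

# TF-IDF features ignore word order and most semantics, which is one reason
# purely lexical clusters can miss argumentative categories.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)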
3. Parsing techniques and grammars.
(a) I have implemented various parsing and lexicographic methods in order to extract the syntactic patterns involved in argumentative text and to assess their effectiveness for the automatic extraction of such text. For example, the following lexicographic pattern (expressed via regular expressions over POS-tagged tokens) shows that even for very simple syntactic patterns, regular expressions can become quite complex.
import re
import nltk

president = "The President of Cyprus argued that the bill was an excuse by the EU"
# Join each token with its POS tag, separating the pairs with semicolons.
toksTags = ';'.join(['/'.join(list(t)) for t in nltk.pos_tag(president.split())])
# Reconstructed pattern (consistent with the output below): an optional determiner
# followed by proper nouns/prepositions (ARGUER), then a past-tense verb (ARG_EVENT).
arguer = re.compile(r"(?P<ARGUER>(\w+/DT;)?(\w+/NNP;|\w+/IN;)+)(?P<ARG_EVENT>\w+/VBD;)")
print("RESULT:", arguer.search(toksTags).groupdict())
> RESULT: {'ARG_EVENT': 'argued/VBD;', 'ARGUER': 'The/DT;President/NNP;of/IN;Cyprus/NNP;'}
(b) NLTK libraries that enable the representation of text in the form of Tree structures. The intention behind labeling textual data in this way is that it can be used both to query the data and as input to classification algorithms, such as the recursive deep learning algorithm used for sentiment analysis.
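A brief sketch of the NLTK Tree API referred to above (the parse is a hand-written toy example, not project output):

from nltk import Tree

t = Tree.fromstring("(S (NP (DT The) (NNP President)) (VP (VBD argued)))")
print(t.label())   # S
print(t.leaves())  # ['The', 'President', 'argued']
# Query the tree for all noun-phrase subtrees.
for np in t.subtrees(lambda st: st.label() == 'NP'):
    print(np)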
(c) Grammars and chunking: the part-of-speech tags, although useful in determining the syntactic type of individual tokens, were not adequate to describe sequences of tokens referring to a single syntactic category, e.g. Noun Phrases (NP) vs Nouns (NN, NNS) in Penn Treebank trees, but also tokens that jointly refer to the subject of a verb. Grammars have been defined via the NLTK library, e.g. grammar1 = r"""ARG: {<NN.*><MD><VB><DT><JJ>*<NN.*>}""", and parsed via RegexpParser (a minimal sketch follows the chunking example below). Available libraries (like Pattern) provide the ability to perform chunking on nested tagging structures; chunks are derivable as a by-product of Pattern's primary parsing functionality. For example, let's say we have a subjective lexicon; we can then extract subjective Noun Phrases as follows (the example is rather simplistic and is used for illustration purposes):
from collections import namedtuple
from pattern.en import parse, Sentence

Annotation = namedtuple('Annotation', ['text', 'label'])
subjective_lexicon = ['bad', 'good', 'beautiful', 'ugly']
txt = "The ugly bill was an excuse."  # toy input for illustration

parsed = parse(txt, lemmata=True, relations=True, chunks=True)
s = Sentence(parsed)
for c in s.chunks:
    if c.type.startswith('NP'):
        # Adjectives inside the noun phrase.
        jj_s = [w.string for w in c if w.type == 'JJ']
        # Keep only the adjectives found in the subjective lexicon.
        subj = [w for w in jj_s if w.lower() in subjective_lexicon]
        if subj:
            annot = Annotation(text=c.string, label='SUBJECTIVE')
            print(annot)
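The grammar shown above can be applied with NLTK's RegexpParser; a minimal sketch (the example sentence is hypothetical):

import nltk

grammar1 = r"""ARG: {<NN.*><MD><VB><DT><JJ>*<NN.*>}"""
cp = nltk.RegexpParser(grammar1)
tagged = nltk.pos_tag("Parliament will pass the controversial bill".split())
# Token sequences matching the grammar are grouped under an ARG node.
print(cp.parse(tagged))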
4. Deriving the entity types of textual components can be helpful in cases where components with the same syntactic structure but different entity types should be categorized differently. Examples of libraries used for entity recognition are StanfordNERTagger, ne_chunk (NLTK) and spaCy.
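A short entity-recognition sketch with spaCy (assumes the en_core_web_sm model has been installed):

import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
doc = nlp("The President of Cyprus argued that the bill was an excuse by the EU")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Cyprus GPE, EU ORG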
5. Sentiment and Subjectivity are related to the notions of Opinion Mining and Arguing, and can be used within other algorithms to determine the nature of the subjectivity involved. Valuable tools that were used to provide information about the sentiment of sentences or individual words are VADER, the CoreNLP annotation tool, MPQA and spaCy.
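A minimal VADER sketch via NLTK (assumes the vader_lexicon resource has been downloaded):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time resource download
sia = SentimentIntensityAnalyzer()
# Per-sentence scores; 'compound' aggregates the positive/negative evidence.
print(sia.polarity_scores("The bill was a beautiful piece of legislation"))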
6. WordNet libraries: the use of synonyms or antonyms is important for determining the polarity of unknown or synonymous words. Each word has many different senses and a hierarchy. The WordNet library (nltk.corpus) provides, via very simple commands, information about a word, its hypernyms and hyponyms, as well as some similarity measures. It has been used in the literature to establish the relation between different words: starting from a small set of seed words, one can expand the vocabulary with new words.
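For illustration, a few of the WordNet calls referred to above (requires NLTK's wordnet corpus):

from nltk.corpus import wordnet as wn

dog = wn.synsets('dog')[0]        # first sense of 'dog'
print(dog.definition())
print(dog.hypernyms())            # more general senses
print(dog.hyponyms())             # more specific senses
cat = wn.synsets('cat')[0]
print(dog.path_similarity(cat))   # simple path-based similarity
# Antonyms are attached to lemmas, useful for polarity expansion.
print(wn.synsets('good')[0].lemmas()[0].antonyms())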
(II) Deep Learning:
1. A recursive deep learning algorithm was developed for the Araucaria data; a sketch of its recursive composition step follows. However, the Araucaria dataset is small and the algorithm suffered from vanishing gradients. With the advent of new architectures and possibilities, research in this area is likely to produce useful results.
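A minimal numpy sketch of the recursive composition step that such an algorithm repeats up the parse tree (dimensions and initialization are hypothetical); the tanh squashing at every level is one source of the vanishing gradients mentioned above:

import numpy as np

rng = np.random.default_rng(0)
D = 10                                       # embedding size (hypothetical)
W = rng.normal(scale=0.01, size=(D, 2 * D))  # shared composition weights
b = np.zeros(D)

def compose(left, right):
    # Parent vector from two child vectors; tanh saturation at each
    # tree level shrinks the gradients flowing back to the leaves.
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Toy binary tree over three leaf embeddings: ((w1 w2) w3)
w1, w2, w3 = (rng.normal(size=D) for _ in range(3))
root = compose(compose(w1, w2), w3)
print(root.shape)  # (10,)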
The advances in Deep Learning in recent years motivate the investigation of Deep Learning methods for the classification, prediction and translation of data.
[22] https://scrapy.org