Cross-Lingual Embeddings for Less-Represented Languages in European News Media

Deliverables

Initial cross-lingual and multilingual embeddings technology (T1.1)

Initial embeddings and transformations between a selection of all targeted languages (Estonian, Finnish, Swedish, Latvian, Lithuanian, Croatian, Slovene, English, Russian) (report and source code) (T1.1)

Initial cross-lingual semantic enrichment technology (T2.1)

Initial approach to named entity (NE) extraction and disambiguation and event detection, covering multiple domains and languages (report and source code) (T2.1).

Datasets, benchmarks and evaluation metrics for cross-lingual content analysis (T4.4)

Gathering and preprocessing training and testing data (Estonian, Latvian, Lithuanian, Russian, Croatian, Finnish and English) provided by the media partners (report and dataset) (T4.4).

Initial deep network architecture (T1.3)

Deep neural networks will be adapted to morphologically rich languages by using character-level inputs and additional information on morphology (suffixes, prefixes, separately trained POS tags) (report and source code) (T1.3).

Interim report on ethics and responsible science and journalism (T6.5)

Interim report on ethics and responsible science and journalism, with analysis of news production and new tool development (T6.5).

Initial interpretability and visualisation technology (T1.4)

Initial approaches to explaining deep learning models by adapting perturbation-based explanation methods, grounded in coalitional game theory, to text classification, and initial development of visual tools for explaining the classification process (report and source code) (T1.4).

Initial context-dependent and dynamic embeddings technology (T1.2)

Context-aware cross-lingual embeddings which will enable improved understanding of short texts, such as user comments, in the context of an emerging comment thread and the news story being commented on (report and source code) (T1.2).

Report on user needs and challenges for news media industry (T6.1).

Initial report on the identification and analysis of the needs of different stakeholders in the news media industry. We will arrange a workshop to identify in detail the challenges that are specific to the operations of the different media partners, and prepare specification documentation (T6.1).

Recommendations on avoiding gender and other biases (T6.4)

The means to detect and avoid gender and other biases in news media content creation will be developed in T6.4. This deliverable will propose recommendations for avoiding gender bias (T6.4).

Initial cross-lingual context and opinion analysis technology (T3.1)

Report on initial developed technology for a range of user comment analyses, including topic modelling, conversation structure and context modelling, sentiment, stance and opinion detection and effect and information spread measurement (report and source code) (T3.1).

Initial multilingual news linking technology (T4.1)

Development of initial tools for linking news stories across languages based on their topics and contents (report and source code) (T4.1).

Initial keyword extraction techniques (T2.2)

Initial keyword extraction by application of statistical approaches (based on heuristics), machine learning approaches, as well as graph-based approaches (report and source code) (T2.2).

Initial dynamic news generation technology (T5.2)

Development of a novel method for automatically organising news articles, considering the domain of the article, effects of time and news repetition (report and source code) (T5.2).

Refined analysis of news media partners’ needs and challenges (T6.1).

Refined report of news media partners’ needs and challenges and their analysis with regard to the state of the art in NLP for news media (T6.1).

Datasets, benchmarks and evaluation metrics for cross-lingual user generated content filtering and analysis (T3.4)

Evaluation and development of algorithms requires relevant, annotated, and multilingual datasets (report and dataset) (T3.4).

Multilingual language generation approach (T2.3)

Incorporating hybrid techniques in the architecture, to take advantage of the robustness of machine learning techniques and transparency of rule-based techniques. Adaptation of the context-aware word-embeddings developed in T1.2 to improve fluency and variability in the generated texts (report and source code) (T2.3).

Initial news generation technology (T5.1)

Based on the analysis of newsrooms (WP6), the NLG technology will be adapted for the requirements of news generation. The task will develop mechanisms for (i) determining what is interesting or important in the given data and deciding what to report, and for (ii) rendering that information in an accurate manner (iii) in multiple languages (report and source code) (T5.1).

Platform requirements documentation and platform design (T6.2)

The EMBEDDIA Toolkit will incorporate the different tools and resources developed in WP1–WP5, and the EMBEDDIA Media Assistant platform will be built on top of it. The platform will be built as a series of base microservices, functional microservices and task-oriented APIs. This deliverable will report on platform requirements and platform design (T6.2).

Initial cross-lingual news viewpoints identification technology (T4.3)

Initial approaches for detecting viewpoints and sentiments based on media sources (report and source code) (T4.3).

Datasets, benchmarks and evaluation metrics for advanced cross-lingual NLP technology (T2.4)

Report on existing evaluation datasets and benchmarks for NER, NEL and event detection (for instance, ACE, Meantime and TAC KBP’s Entity Discovery and Linking tasks) (report and dataset) (T2.4).

Initial cross-lingual comment filtering technology (T3.2)

Report on developed tools for automatic flagging or filtering of user comments, specifically targeted at the use cases defined by end user partners in WP6, e.g., detection of hate speech and political trolling, attempts to elicit extreme reactions and influence others’ opinions (report and source code) (T3.2).

Datasets, benchmarks and evaluation metrics for multilingual text generation (T5.4)

Texts (news stories) and structured datasets from which news can be generated will be collected from the news partners (report and datasets), and an evaluation methodology will be defined (T5.4).

Initial cross-lingual news summarisation and visualisation technology (T4.2)

Development of textual and visual language-independent multi-document news summarisation (report and source code) (T4.2).

Datasets, benchmarks and evaluation metrics for cross-lingual word embeddings (T1.5)

A repository of training and evaluation data, stored in a dedicated GitHub repository (report and datasets) (T1.5).

Project website and social media accounts (T7.1)

A project website, which will function both as a project dissemination tool and as a point of access to the technical outcomes produced by the project, and social media accounts/pages on relevant social networks will be created (T7.1).

Publications

Cross-lingual Transfer of Twitter Sentiment Models Using a Common Vector Space

Author(s): Robnik-Šikonja, Marko; Reba, Kristijan; Mozetič, Igor
Published in: In Proceedings of the Conference on Language Technologies and Digital Humanities, JTDH2020, 2020, Page(s) 87-92
DOI: 10.5281/zenodo.4059725

Know your Neighbors: Efficient Author Profiling via Follower Tweets

Author(s): Koloski, Boško; Pollak, Senja; Škrlj, Blaž
Published in: Notebook for PAN at CLEF 2020, 2020
DOI: 10.5281/zenodo.4059641

Robust Named Entity Recognition and Linking on Historical Multilingual Documents

Author(s): Boros, Emanuela; Linhares Pontes, Elvys; Cabrera-Diego, Luis Adrián; Hamdi, Ahmed; Moreno, Jose G.; Sidère, Nicolas; Doucet, Antoine
Published in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum (CLEF-HIPE 2020), 2020
DOI: 10.5281/zenodo.4059652

Linking Named Entities across Languages using Multilingual Word Embeddings

Author(s): Elvys Linhares Pontes, Jose G. Moreno, Antoine Doucet
Published in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 2020, Page(s) 329-332
DOI: 10.1145/3383583.3398597

Embeddia at SemEval-2019 Task 6: Detecting hate with neural network and transfer learning approaches

Author(s): Andraž Pelicon, Matej Martinc, and Petra Kralj Novak
Published in: Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval), 2019

Generating Data using Monte Carlo Dropout

Author(s): Kristian Miok, Dong Nguyen-Doan, Daniela Zaharie, and Marko Robnik-Šikonja
Published in: IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP 2019), 2019

Detecting Depression with Word-Level Multimodal Fusion

Author(s): Morteza Rohanian, Julian Hough, Matthew Purver
Published in: Interspeech 2019, 2019, Page(s) 1443-1447
DOI: 10.21437/interspeech.2019-2283

Word Clustering for Historical Newspapers Analysis

Author(s): Lidia Pivovarova, Elaine Zosa, and Jussi Kurunmäki
Published in: Proceedings of the Workshop on Language Technology for Digital Historical Archives, 2019

Clustering Ideological Terms in Historical Newspaper Data with Diachronic Word Embeddings

Author(s): Jani Marjanen, Lidia Pivovarova, Elaine Zosa, and Jussi Kurunmäki
Published in: Proceedings of the 5th International Workshop on Computational History, 2019

Karst exploration: Extracting terms and definitions from karst

Author(s): Senja Pollak, Andraž Repar, Matej Martinc, and Vid Podpečan
Published in: Proceedings of the 6th biennial conference on electronic lexicography, eLex 2019, 2019

Who is hot and who is not? Profiling celebs on Twitter

Author(s): Martinc, Matej; Škrlj, Blaž; Pollak, Senja
Published in: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Issue 6, 2019

Fake or Not: Distinguishing Between Bots, Males and Females

Author(s): Martinc, Matej; Škrlj, Blaž; Pollak, Senja
Published in: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Issue 2, 2019

Pooled LSTM for Dutch cross-genre gender classification

Author(s): Matej Martinc, Senja Pollak
Published in: Proceedings of the Shared Task on Cross-Genre Gender Detection in Dutch at the Computational Linguistics in the Netherlands (CLIN 2019) conference, 2019

Methods for Generating Colourful and Factual Multilingual News Headlines

Author(s): Alnajjar, Khalid; Leppänen, Leo; Toivonen, Hannu
Published in: In Proceedings of the 10th International Conference on Computational Creativity (ICCC 2019), Issue 1, 2019, Page(s) 258-265

TLR at BSNLP2019: A Multilingual Named Entity Recognition System

Author(s): Jose G. Moreno, Elvys Linhares Pontes, Mickael Coustaty, Antoine Doucet
Published in: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, 2019, Page(s) 83-88
DOI: 10.18653/v1/w19-3711

Generating Data using Monte Carlo Dropout

Author(s): Miok, Kristian; Nguyen-Doan, Dong; Zaharie, Daniela; Robnik-Šikonja, Marko
Published in: Issue 1, 2019
DOI: 10.5281/zenodo.3559060

Clustering Ideological Terms in Historical Newspaper Data with Diachronic Word Embeddings

Author(s): Jani Marjanen; Lidia Pivovarova; Elaine Zosa; Jussi Kurunmäki
Published in: HistoInformatics 2019: International Workshop on Computational History 2019, 2019
DOI: 10.5281/zenodo.3689467

A Corpus Study on Questions, Responses and Misunderstanding Signals in Conversations with Alzheimer's Patients

Author(s): Shamila Nasreen; Matthew Purver; Julian Hough
Published in: Proceedings of the 23rd Workshop on the Semantics and Pragmatics of Dialogue, Issue 13, 2019
DOI: 10.5281/zenodo.3689456

Word Clustering for Historical Newspapers Analysis

Author(s): Pivovarova, Lidia; Marjanen, Jani; Zosa, Elaine
Published in: Proceedings of the Workshop on Language Technology for Digital Historical Archives in conjunction with RANLP-2019, 2019, Page(s) 3-10
DOI: 10.5281/zenodo.3402940

TeMoCo: A Visualization Tool for Temporal Analysis of Multi-party Dialogues in Clinical Settings

Author(s): Shane Sheehan, Pierre Albert, Saturnino Luz, Masood Masoodian
Published in: 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), 2019, Page(s) 690-695
DOI: 10.1109/CBMS.2019.00140

Gender, language, and society: word embeddings as a reflection of social inequalities in linguistic corpora

Author(s): Supej, Anka; Plahuta, Marko; Purver, Matthew; Mathioudakis, Michael; Pollak, Senja
Published in: In Znanost in družbe prihodnosti, Slovensko sociološko srečanje [Annual meeting of the Slovenian Sociological Association: Science and future societies], 2019
DOI: 10.5281/zenodo.3894466

No Time Like the Present: Methods for Generating Colourful and Factual Multilingual News Headlines

Author(s): Alnajjar, Khalid; Leppänen, Leo; Toivonen, Hannu
Published in: Proceedings of the 10th International Conference on Computational Creativity (ICCC2019), 2019

Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders

Author(s): Kristian Miok, Dong Nguyen-Doan, Marko Robnik-Sikonja, Daniela Zaharie
Published in: 2019 E-Health and Bioengineering Conference (EHB), 2019, Page(s) 1-4
DOI: 10.1109/EHB47216.2019.8969940

High Quality ELMo Embeddings for Seven Less-Resourced Languages

Author(s): Ulčar, Matej; Robnik-Šikonja, Marko
Published in: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, Page(s) 4731–4738
DOI: 10.5281/zenodo.3894535

Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift

Author(s): Martinc, Matej; Kralj Novak, Petra; Pollak, Senja
Published in: Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020), 2020, Page(s) 4811‑4819
DOI: 10.5281/zenodo.3894557

Multilingual Culture-Independent Word Analogy Datasets

Author(s): Ulčar, Matej; Vaik, Kristiina; Lindström, Jessica; Dailidėnaitė, Milda; Robnik-Šikonja, Marko
Published in: Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020), Issue 1, 2020, Page(s) 4074‑4080
DOI: 10.5281/zenodo.3894553

Dataset for Temporal Analysis of English-French Cognates

Author(s): Frossard, Esteban; Coustaty, Mickael; Doucet, Antoine; Jatowt, Adam; Hengchen, Simon
Published in: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, Page(s) 855-859
DOI: 10.5281/zenodo.3693651

A Dataset for Multi-lingual Epidemiological Event Extraction

Author(s): Mutuvi, Stephen; Doucet, Antoine; Lejeune, Gael; Odeo, Moses
Published in: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, Page(s) 4139–4144
DOI: 10.5281/zenodo.3709626

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context

Author(s): Carlos Santos Armendariz; Matthew Purver; Matej Ulčar; Senja Pollak; Nikola Ljubešič; Marko Robnik-Šikonja; Mark Granroth-Wilding; Kristiina Vaik
Published in: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, Page(s) 5878–5886
DOI: 10.5281/zenodo.3894565

Text Visualization for the Support of Lexicography-Based Scholarly Work

Author(s): Sheehan, Shane; Luz, Saturnino
Published in: Proceedings of the 6th biennial conference on electronic lexicography, eLex 2019, 2019, Page(s) 694-725
DOI: 10.5281/zenodo.3894619

Mining semantic relations from comparable corpora through intersections of word embeddings.

Author(s): Vintar, Špela; Grčič Simeunovič, Larisa; Martinc, Matej; Pollak, Senja; Stepišnik, Uroš
Published in: Proceedings of the LREC 2020 13th Workshop on Building and Using Comparable Corpora, 2020, Page(s) 29-34
DOI: 10.5281/zenodo.3894635

Interaction Patterns in Conversations with Alzheimer's Patients

Author(s): Nasreen, Shamila; Purver, Matthew; Hough, Julian
Published in: Poster presentation at the 7th International Conference on Statistical Language and Speech Processing. Ljubljana, Slovenia, 2019
DOI: 10.5281/zenodo.3894637

Multilingual Dynamic Topic Model

Author(s): Elaine Zosa, Mark Granroth-Wilding
Published in: Proceedings - Natural Language Processing in a Deep Learning World, 2019, Page(s) 1388-1396
DOI: 10.26615/978-954-452-056-4_159

The NetViz terminology visualization tool and the use cases in karstology domain modeling

Author(s): Pollak, Senja; Podpečan, Vid; Miljkovic, Dragana; Stepišnik, Uroš; Vintar, Špela
Published in: Proceedings of the 6th International Workshop on Computational Terminology (COMPUTERM 2020), 2020, Page(s) 55-61
DOI: 10.5281/zenodo.3894686

Communities of related terms in Karst terminology co-occurrence network

Author(s): Miljkovic, Dragana; Kralj, Jan; Stepišnik, Uroš; Pollak, Senja
Published in: Proceedings of the 6th biennial conference on electronic lexicography, eLex 2019, 2019, Page(s) 357-373
DOI: 10.5281/zenodo.3894684

A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval

Author(s): Zosa, Elaine; Granroth-Wilding, Mark; Pivovarova, Lidia
Published in: Proceedings of the Cross-Language Search and Summarization of Text and Speech Workshop, 2020, Page(s) 32-37
DOI: 10.5281/zenodo.3898384

Capturing Evolution in Word Usage: Just Add More Clusters?

Author(s): Matej Martinc, Syrielle Montariol, Elaine Zosa, Lidia Pivovarova
Published in: Companion Proceedings of the Web Conference 2020, 2020, Page(s) 343-349
DOI: 10.1145/3366424.3382186

Zaznavanje sentimenta v novicah z globokimi nevronskimi mrežami

Author(s): Arhar Holdt, Špela; Pollak, Senja; Robnik-Šikonja, Marko; Krek, Simon
Published in: In Proceedings of the Conference on Language Technologies and Digital Humanities, JTDH2020, 2020, Page(s) 10-15
DOI: 10.5281/zenodo.4059729

Evaluation of related news recommendations using document similarity methods

Author(s): Pranjić, Marko; Podpečan, Vid; Robnik-Šikonja, Marko; Pollak, Senja
Published in: In Proceedings of the Conference on Language Technologies and Digital Humanities, JTDH2020, 2020, Page(s) 81-86
DOI: 10.5281/zenodo.4059710

Dimenzija spola v slovenskih vektorskih vložitvah besed: primerjava modelov prek analogij poklicev

Author(s): Supej, Anka; Ulčar, Matej; Robnik-Šikonja, Marko; Pollak, Senja
Published in: In Proceedings of the Joint Conference on Digital Libraries (JCDL 2020), 2020, Page(s) 93-100
DOI: 10.5281/zenodo.4059700

Zero-Shot Learning for Cross-Lingual News Sentiment Classification

Author(s): Andraž Pelicon, Marko Pranjić, Dragana Miljković, Blaž Škrlj, Senja Pollak
Published in: Applied Sciences, Issue 10/17, 2020, Page(s) 5993, ISSN 2076-3417
DOI: 10.3390/app10175993

Nazaj v prihodnost: avtomatizacija in preobrazba novinarske epistemologije

Author(s): Igor Vobič, Marko Robnik Šikonja, Monika Kalin Golob
Published in: Javnost - The Public, Issue 26/sup1, 2019, Page(s) S41-S61, ISSN 1318-3222
DOI: 10.1080/13183222.2019.1696600

Re-Representing Metaphor: Modeling Metaphor Perception Using Dynamically Contextual Distributional Semantics

Author(s): Stephen McGregor, Kat Agres, Karolina Rataj, Matthew Purver, Geraint Wiggins
Published in: Frontiers in Psychology, Issue 10, 2019, ISSN 1664-1078
DOI: 10.3389/fpsyg.2019.00765

Towards Robust Text Classification with Semantics-Aware Recurrent Neural Architecture

Author(s): Blaž Škrlj, Jan Kralj, Nada Lavrač, Senja Pollak
Published in: Machine Learning and Knowledge Extraction, Issue 1/2, 2019, Page(s) 575-589, ISSN 2504-4990
DOI: 10.3390/make1020034

Predicting Slovene Text Complexity Using Readability Measures

Author(s): Tadej Škvorc, Simon Krek, Senja Pollak, Špela Arhar Holdt, Marko Robnik-Šikonja
Published in: In Contributions to Contemporary History, 2019, ISSN 2463-7807

Combining n-grams and deep convolutional features for language variety classification

Author(s): Matej Martinc, Senja Pollak
Published in: Natural Language Engineering, Issue 25/5, 2019, Page(s) 607-632, ISSN 1351-3249
DOI: 10.1017/S1351324919000299

TermEnsembler

Author(s): Andraž Repar, Vid Podpečan, Anže Vavpetič, Nada Lavrač, Senja Pollak
Published in: Terminology, Issue 25/1, 2019, Page(s) 93-120, ISSN 0929-9971
DOI: 10.1075/term.00029.rep

Reproduction, replication, analysis and adaptation of a term alignment approach

Author(s): Andraž Repar, Matej Martinc, Senja Pollak
Published in: Language Resources and Evaluation, 2019, ISSN 1574-020X
DOI: 10.1007/s10579-019-09477-1

‘Our task is to demystify fears’: Analysing newsroom management of automation in journalism

Author(s): Marko Milosavljević, Igor Vobič
Published in: Journalism, 2019, Page(s) 146488491986159, ISSN 1464-8849
DOI: 10.1177/1464884919861598

Methods and visualization tools for the analysis of medical, political and scientific concepts in Genealogies of Knowledge

Author(s): Saturnino Luz, Shane Sheehan
Published in: Palgrave Communications, Issue 6/1, 2020, ISSN 2055-1045
DOI: 10.1057/s41599-020-0423-6

Exploring the Relations Between Net Benefits of IT Projects and CIOs’ Perception of Quality of Software Development Disciplines

Author(s): Damjan Vavpotič, Marko Robnik-Šikonja, Tomaž Hovelja
Published in: Business & Information Systems Engineering, 2019, ISSN 2363-7005
DOI: 10.1007/s12599-019-00612-4

Data Journalism as a Service: Digital Native Data Journalism Expertise and Product Development

Author(s): Ester Appelgren, Carl-Gustav Lindén
Published in: Media and Communication, Issue 8/2, 2020, Page(s) 62, ISSN 2183-2439
DOI: 10.17645/mac.v8i2.2757

How Furiously Can Colorless Green Ideas Sleep? Sentence Acceptability in Context

Author(s): Jey Han Lau, Carlos Armendariz, Shalom Lappin, Matthew Purver, Chang Shu
Published in: Transactions of the Association for Computational Linguistics, Issue 8, 2020, Page(s) 296-310, ISSN 2307-387X
DOI: 10.1162/tacl_a_00315

Compressive approaches for cross-language multi-document summarization

Author(s): Elvys Linhares Pontes, Stéphane Huet, Juan-Manuel Torres-Moreno, Andréa Carneiro Linhares
Published in: Data & Knowledge Engineering, Issue 125, 2020, Page(s) 101763, ISSN 0169-023X
DOI: 10.1016/j.datak.2019.101763

Computational generation of slogans

Author(s): Khalid Alnajjar, Hannu Toivonen
Published in: Natural Language Engineering, 2020, Page(s) 1-33, ISSN 1351-3249
DOI: 10.1017/S1351324920000236

In the Name of the Right to be Forgotten: New Legal and Policy Issues and Practices regarding Unpublishing Requests in Slovenian Online News Media

Author(s): Marko Milosavljević, Melita Poler, Rok Čeferin
Published in: Digital Journalism, 2020, Page(s) 1-17, ISSN 2167-0811
DOI: 10.1080/21670811.2020.1747942

(Mis)Information Operations: An Integrated Perspective

Author(s): Cinelli, Matteo; Conti, Mauro; Finos, Livio; Grisolia, Francesco; Kralj Novak, Petra; Peruzzi, Antonio; Tesconi, Maurizio; Zollo, Fabia; Quattrociocchi, Walter
Published in: Journal of Information Warfare, Issue 18(3), 2020, ISSN 1445-3312

A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

Author(s): Linhares Pontes, Elvys; Huet, Stéphane; Torres Moreno, Juan Manuel; Gouveia da Silva, Thiago; Carneiro Linhares, Andréa
Published in: Computación y Sistemas, Issue 24(2), 2020, ISSN 1405-5546

Automated Journalism as a Source of and a Diagnostic Device for Bias in Reporting

Author(s): Leo Leppänen, Hanna Tuulonen, Stefanie Sirén-Heikel
Published in: Media and Communication, Issue 8/3, 2020, Page(s) 39, ISSN 2183-2439
DOI: 10.17645/mac.v8i3.3022

tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification

Author(s): Blaž Škrlj, Matej Martinc, Jan Kralj, Nada Lavrač, Senja Pollak
Published in: Computer Speech & Language, Issue 65, 2021, Page(s) 101104, ISSN 0885-2308
DOI: 10.1016/j.csl.2020.101104

Automating News Comment Moderation with Limited Resources: Benchmarking in Croatian and Estonian

Author(s): Shekhar, Ravi; Pranjić, Marko; Pollak, Senja; Pelicon, Andraž; Purver, Matthew
Published in: Journal for Language Technology and Computational Linguistics, Issue 2, 2020, Page(s) 49-79, ISSN 2190-6858
DOI: 10.5281/zenodo.4032371

Recycling a genre for news automation

Author(s): Lauri Haapanen, Leo Leppänen
Published in: AILA Review, Issue 33, 2020, Page(s) 67-85, ISSN 1461-0213
DOI: 10.1075/aila.00030.haa

RaKUn: Rank-based Keyword Extraction via Unsupervised Learning and Meta Vertex Aggregation

Author(s): Blaž Škrlj, Andraž Repar, Senja Pollak
Published in: Statistical Language and Speech Processing - 7th International Conference, SLSP 2019, Ljubljana, Slovenia, October 14–16, 2019, Proceedings, Issue 11816, 2019, Page(s) 311-323
DOI: 10.1007/978-3-030-31372-2_26

Language Comparison via Network Topology

Author(s): Blaž Škrlj, Senja Pollak
Published in: Statistical Language and Speech Processing - 7th International Conference, SLSP 2019, Ljubljana, Slovenia, October 14–16, 2019, Proceedings, Issue 11816, 2019, Page(s) 112-123
DOI: 10.1007/978-3-030-31372-2_10

Prediction Uncertainty Estimation for Hate Speech Classification

Author(s): Kristian Miok, Dong Nguyen-Doan, Blaž Škrlj, Daniela Zaharie, Marko Robnik-Šikonja
Published in: Statistical Language and Speech Processing - 7th International Conference, SLSP 2019, Ljubljana, Slovenia, October 14–16, 2019, Proceedings, Issue 11816, 2019, Page(s) 286-298
DOI: 10.1007/978-3-030-31372-2_24

Symbolic Graph Embedding Using Frequent Pattern Mining

Author(s): Blaž Škrlj, Nada Lavrač, Jan Kralj
Published in: Discovery Science - 22nd International Conference, DS 2019, Split, Croatia, October 28–30, 2019, Proceedings, Issue 11828, 2019, Page(s) 261-275
DOI: 10.1007/978-3-030-33778-0_21

Cross-lingual embeddings for hate speech detection in comments

Author(s): Marinšek, Rok
Published in: 2019
DOI: 10.5281/zenodo.3894645

Cross-lingual approach to abstractive summarization

Author(s): Žagar, Aleš
Published in: MSc Thesis, 2020
DOI: 10.5281/zenodo.3967214

Datasets

Evaluation results for When a Computer Cracks a Joke: Automated Generation of Humorous Headlines

Author(s): Alnajjar, Khalid; Hämäläinen, Mika
Published in: Zenodo

Dataset of Slovene idiomatic expressions SloIE

Author(s): Škvorc, Tadej; Gantar, Polona; Robnik-Šikonja, Marko
Published in: Faculty of Computer and Information Science, University of Ljubljana

Multilingual Culture-Independent Word Analogy Datasets

Author(s): Ulčar, Matej; Vaik, Kristiina; Lindström, Jessica; Linde, Dace; Dailidėnaitė, Milda; Šumakov, Andrei
Published in: Faculty of Computer and Information Science, University of Ljubljana

Sentiment Annotated Dataset of Croatian News

Author(s): Pelicon, Andraž; Pranjić, Marko; Miljković, Dragana; Škrlj, Blaž; Pollak, Senja
Published in: Jožef Stefan Institute

ELMo embeddings models for seven languages

Author(s): Ulčar, Matej
Published in: Faculty of Computer and Information Science, University of Ljubljana

List of single-word male and female occupations in Slovenian

Author(s): Supej, Anka; Ulčar, Matej; Robnik-Šikonja, Marko; Pollak, Senja
Published in: Jožef Stefan Institute

ELMo embeddings model, Slovenian

Author(s): Ulčar, Matej
Published in: Faculty of Computer and Information Science, University of Ljubljana

A Resource for Evaluating Graded Word Similarity in Context: CoSimLex

Author(s): Armendariz, Carlos; Purver, Matthew; Ulčar, Matej; Pollak, Senja; Ljubešić, Nikola; Robnik-Šikonja, Marko; Granroth-Wilding, Mark; Vaik, Kristiina
Published in: Queen Mary University

Reference List of Slovene Frequent Common Words

Author(s): Pollak, Senja; Arhar Holdt, Špela; Krek, Simon; Robnik-Šikonja, Marko
Published in: Jožef Stefan Institute

SimLex-999 Slovenian translation SimLex-999-sl 1.0

Author(s): Pollak, Senja; Vulić, Ivan; Pelicon, Andraž; Repar, Andraž; Armendariz, Carlos; Purver, Matthew; Ljubešić, Nikola
Published in: University of Ljubljana

SemEval-2020 Task 3: Graded Word Similarity in Context

Author(s): Carlos S. Armendariz; Matthew Purver; Senja Pollak; Nikola Ljubešić; Matej Ulčar; Ivan Vulić; Mohammad Taher Pilehvar
Published in: Zenodo

Data for "Dataset for Temporal Analysis of English-French Cognates"

Author(s): Frossard, Esteban; Coustaty, Mickaël; Doucet, Antoine; Jatowt, Adam; Hengchen, Simon
Published in: Zenodo

Data for "A Dataset for Multi-lingual Epidemiological Event Extraction"

Author(s): Mutuvi, Stephen; Doucet, Antoine; Lejeune, Gaël; Odeo, Moses
Published in: Zenodo

Software

Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0

Author(s): Ulčar, Matej; Robnik-Šikonja, Marko
DOI: 11356/1387; 11356/1397
Publisher: Faculty of Computer and Information Science, University of Ljubljana

CroSloEngual BERT

Author(s): Ulčar, Matej; Robnik-Šikonja, Marko
DOI: 11356/1317; 11356/1330
Publisher: Faculty of Computer and Information Science, University of Ljubljana