Multi-language text summarization

Periodic Reporting for period 1 - ML-TEXTSUM (Multi-language text summarization)

Reporting period: 2017-09-01 to 2019-08-31

In our daily lives, we are submerged in huge amounts of text coming from different sources such as emails, news, reports, analyses, and so on. The availability of unprecedented volumes of data represents both a challenge and an opportunity. On one hand, information overload can have severe consequences. For example, scientists might miss a relevant reference to their work due to the increased pace of publication, losing months (if not years) of work; intelligence systems might miss a security threat buried in vast amounts of data, etc. On the other hand, there is widespread agreement that the effective harnessing of text and data mining techniques is important to the performance of advanced economies, such as those of the European Union (EU). With respect to the development of text and data mining techniques, the EU faces an added challenge due to its rich cultural heritage. Multilingualism is a core value of the European Union, as integral to Europe as the freedom of movement, the freedom of residence and the freedom of expression. Hence, it is essential to tackle text understanding challenges from a multi-language perspective, in order to ensure that knowledge is distributed independently of the language spoken.


The objective of this project is to develop a system for efficient and accurate multi-lingual text summarization. That is, given a text document as input, the system will output a summary of the document in the same or in a different language. The availability of such a system will allow citizens, regardless of their language, to better handle the information overload and to gain access to critically distilled information (e.g. how has a certain newspaper’s opinion on a given topic evolved this year? Are male and female athletes portrayed differently by the media?).


Conclusion
----------

This project was terminated after 13 months. In this time, significant progress was made on the foundations of the problem and on its computational aspects. This resulted in several publications in top venues, which I describe below.

Compared to the plan outlined in the proposal, the work accomplished in this time is more theoretical in nature. We proposed novel optimization techniques that pave the way for more refined modeling techniques for the text summarization problem, although, due to the project's early termination, this perspective remains to be fully exploited.
"Research, training and career development
-----------------------------------------

*Research*. We identified as a central issue behind existing text summarization techniques the lack of a meaningful quality criterion or loss function. Existing systems essentially minimize the discrepancy between the generated text and a reference summary. This is not a good measure, since different summaries might convey the same meaning with different words or different phrasing. In order to take into account the different aspects that make up semantic similarity, it is necessary to optimize several criteria simultaneously.
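To make this concrete, here is a toy illustration (the criteria, weights and example sentences are placeholders chosen for this report, not the models developed in the project): a purely lexical discrepancy measure scores two summaries with the same meaning as very dissimilar, which motivates combining several complementary criteria into a single objective.

```python
# Toy illustration only: placeholder criteria, not the project's models.

def token_overlap(a, b):
    """Jaccard overlap between the word sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

reference = "the committee rejected the proposal"
candidate = "the panel turned down the plan"

# Same meaning, almost no shared words: a purely lexical measure
# scores this candidate poorly.
print(token_overlap(reference, candidate))  # 0.125

def combined_loss(candidate, reference, weights=(0.7, 0.3)):
    """Weighted sum of several (toy) criteria, to be minimized jointly."""
    lexical = 1.0 - token_overlap(candidate, reference)
    length = abs(len(candidate.split()) - len(reference.split()))
    length /= max(len(reference.split()), 1)
    return weights[0] * lexical + weights[1] * length

print(combined_loss(candidate, reference))
```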

The study of these problems led to the development of more efficient optimization algorithms [1, 2, 3] (see references below) that solve a wide class of problems. A collaboration with Prof. El Ghaoui led to the development of new deep learning models with favorable optimization properties.


Dissemination and communication
-------------------------------

The work outlined above led to several publications in top machine learning venues. The articles [1] and [2] (detailed below) were published at the International Conference on Machine Learning (ICML), a highly selective conference with an acceptance rate below 30%, and article [3] was published at the International Conference on Artificial Intelligence and Statistics (AISTATS).
ICML was held in Stockholm in July 2018 and AISTATS in Lanzarote, Canary Islands. All publications are open access.

[1] Pedregosa, F. & Gidel, G. (2018). Adaptive Three Operator Splitting. Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR 80. http://proceedings.mlr.press/v80/pedregosa18a.html

[2] Kerdreux, T., Pedregosa, F. (equal contribution) & d’Aspremont, A. (2018). Frank-Wolfe with Subsampling Oracle. Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR 80. http://proceedings.mlr.press/v80/kerdreux18a.html

[3] Gidel, G., Pedregosa, F. & Lacoste-Julien, S. (2018). Frank-Wolfe Splitting via Augmented Lagrangian Method. Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR.


The following articles are under review but already accessible through preprint servers:

[4] Pedregosa, F., Fatras, K., & Casotto, M. (2018). Variance Reduced Three Operator Splitting. arXiv preprint arXiv:1806.07294.

[5] Pedregosa, F., Askari, A., Negiar, G., & Jaggi, M. (2018). Step-Size Adaptivity in Projection-Free Optimization. arXiv preprint arXiv:1806.05123.

[6] Leblond, R., Pedregosa, F., & Lacoste-Julien, S. (2018). Improved asynchronous parallel optimization analysis for stochastic incremental methods. arXiv preprint arXiv:1801.03749.

The following article was presented at a workshop without proceedings:

[6] ""Lifted Neural Networks for Weight Initialization"" (2017), Geoffrey Negiar, Armin Askari, Fabian Pedregosa, Laurent El Ghaoui. https://people.eecs.berkeley.edu/~elghaoui/pdffiles/NIPS_Opt_workshop_2017.pdf


Furthermore, I maintain a blog with technical content geared towards a scientific audience (http://fa.bianp.net).

Finally, a 3-minute video was made on the occasion of the Neural Information Processing Systems (NIPS) conference held in Long Beach, California, in December 2017: https://youtu.be/JnqhV0KO-1I


Public Engagement
-----------------

With the goal of raising awareness of my research among the general public, I participated in the following non-academic events:

* Science Hack Day San Francisco (http://sf.sciencehackday.org/) October 14-15, 2017.

* Brainhack (https://sfbrainhack.github.io/) May 03-05, 2018.

* Scikit-learn sprint (Berkeley, May 28-June 1, 2018). A report about my work there: http://matthewrocklin.com/blog/work/2018/08/07/incremental-saga


"
I highlight two aspects in which the work performed during this fellowship improves over the state of the art.

* In [1], we proposed a method to solve a specific class of saddle point problems. From a practitioner's point of view, this method is much more practical than existing ones, since it eliminates the need to set a step-size parameter and instead estimates it from quantities that arise during the optimization. This makes the method more broadly applicable and easier to use. Numerous benchmarks highlight the practical advantages of the proposed method for solving problems with multiple non-smooth terms that often arise in machine learning; an illustrative sketch of this adaptive step-size idea is given below.

A poster of this work, presented at the International Conference on Machine Learning (ICML) and showing benchmarks against other methods, is attached as an image to this report.

* In [4], we propose a method that can solve a class of saddle point problems with potentially millions of smooth terms. The development of such methods is a stepping stone towards solving large-scale saddle point problems, and we believe that (variants of) this method will be essential for the construction of text summarization models.
Attachment: adatos-poster.png (ICML poster for [1])
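To give a flavour of the adaptive step-size idea behind [1], the following is a minimal sketch (my own simplified illustration, not the authors' implementation) of a three operator splitting iteration in which the step size is adjusted by backtracking rather than tuned by hand; the problem instance (least squares with an l1 penalty and a box constraint) is an arbitrary illustrative choice.

```python
import numpy as np

# Simplified sketch, not the algorithm exactly as published in [1]:
# minimize f(x) + g(x) + h(x) with f smooth and g, h "proximable",
# using a three operator splitting iteration whose step size is found
# by backtracking instead of manual tuning.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
lam = 0.1

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)                   # smooth term
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)  # l1
prox_h = lambda x, t: np.clip(x, -1.0, 1.0)                    # box constraint

z = np.zeros(A.shape[1])
step = 1.0  # rough initial guess; backtracking shrinks it if too large

for _ in range(300):
    x = prox_h(z, step)
    fx, gx = f(x), grad_f(x)
    while True:
        y = prox_g(2 * x - z - step * gx, step)
        d = y - x
        # Sufficient-decrease test: accept the step only if a local
        # quadratic upper bound on f holds between x and y.
        if f(y) <= fx + gx @ d + d @ d / (2 * step) + 1e-12:
            break
        step *= 0.5
    z = z + y - x

print("objective value:", f(y) + lam * np.abs(y).sum())
```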