Periodic Reporting for period 2 - FashionBrain (Understanding Europe’s Fashion Data Universe)

Reporting period: 2018-07-01 to 2019-12-31

Summary of the context and overall objectives of the project

The primary goal of any retailer is to understand the customer and predict upcoming demands and trends. However, even a complete record of past purchases (and returns) is insufficient to fully understand how items from a product catalogue align with customers’ general tastes, lifestyle choices and aspirations. Additionally, from a business perspective, any efficiency gains in the logistics of supplier management, shipping and handling are minor compared to the gains one could obtain from a better understanding of customers’ personalities and habits.

In this project we consolidate and extend existing European technologies in the areas of database management, data mining, machine learning, image processing, information retrieval, and crowdsourcing to strengthen the position of European fashion retailers among their world-wide competitors. The action concluded successfully, with novel online shopping experiences, the detection of influencers, and the prediction of upcoming fashion trends.

The main innovations of the FashionBrain project are:
- MonetDB with improved support for time series and unstructured data processing
- Flair: A New State-of-the-Art Library For Natural Language Processing
- End-to-End Text-to-Image Search with Neural Information Retrieval
- Human-in-the-loop Fashion Influencer Discovery
- PredTS: prediction of fashion trends in multiple incomplete fashion time series

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

Firstly, we worked on building domain knowledge:
- Developed a fashion taxonomy that aggregates various sources, such as the Fashwell taxonomy complemented with publicly available sources.
- Produced software requirements for time series analysis and developed a Probabilistic RNN for sequential data with missing values.
- Extended work on the FashionBrain taxonomy visualisation tools.

Secondly, we have worked on semantic data integration from three different perspectives:
- Developed techniques for entity extraction from text and images. An important outcome is FashionNLP: a natural language processing tool for fashion-related text.
- Provided initial solutions to store and share the FashionBrain taxonomy, common datasets, extracted entities and links, as well as in-database methods and solutions.
- Provided the FashionBrain integrated architecture for fashion data integration.
- Produced a demo paper entitled “RecovDB: accurate and efficient missing blocks recovery for large time series”.
- Released SQL window functions in MonetDB.
- Extended Flair from handling only text to also handling images.

Thirdly, we have developed human computation and crowdsourcing tools to improve the quality of training data and perform annotation at scale:
- Improved crowdsourcing agreement measures, with a publication at HCOMP-17, a live demo and open source code.
- Released the open source ModOp browser plugin to improve crowdsourcing interfaces.
- Analysed the vulnerabilities of crowdsourcing interfaces and the related potential biases, with a publication at HCOMP-18 (winning the Best Paper Award).
- Performed a study on perceived bias in crowdsourcing, with a publication at SIGIR 2018 and a publicly available dataset.
- Presented a poster paper on rating systems in crowdsourcing at HCOMP-16.
- Presented a WWW 2020 paper on OpenCrowd: A Human-AI Collaborative Approach for Finding Social Influencers via Open-Ended Answers Aggregation.
- Built a new annotated dataset (“FashionTweets”).
- Provided a new methodology to measure the difficulty of a crowdsourcing task, tested on the FashionTweets dataset.
- Developed a new innovation, “Human-in-the-loop Fashion Influencer Discovery”, based on UNIFR OpenCrowd and on Flair trained on the FashionTweets dataset.
- Organised the First Symposium on Biases in Human Computation and Crowdsourcing in synergy with another H2020 partner (Qrowd).
- Conducted a new set of experiments on biases in crowdsourcing for fashion data, published in the Journal of Artificial Intelligence Research (JAIR).
- Published a study on payment biases in crowdsourcing, “Platform-related Factors in Repeatability and Reproducibility of Crowdsourcing Tasks”, at HCOMP 2019.
- Published two papers on task abandonment in crowdsourcing, at WSDM 2019 and in IEEE Transactions on Knowledge and Data Engineering (TKDE).

Fourthly, we have developed In-Database-Mining and Deep Learning methods:
- Performed a study on integrating entity linkage in a main-memory database system (IDEL), integrated with MonetDB.
- Undertook work on Neural Paragraph Retrieval (SMART-MD).
- Realised an in-database machine learning approach in MonetDB.
- Collected and annotated a new Fashion Corpus, in collaboration with the Hong Kong Polytechnic, to be used in relation extraction experiments.
- Published a paper, “Analysing Errors of Open Information Extraction Systems”, at the “Workshop on Building Linguistically Generalizable NLP Systems” at EMNLP 2017.
- Published a paper on layer-wise analysis of Transformer representations at CIKM 2019.
- Published a demonstrator for layer-wise Transformer representations at ACM WWW 2020.
- Published a paper on contextualized document representations at ACM WWW 2020.
- Published a work on neural models for topic segmentation and classification in TACL 2019.

Fifthly, we have worked on social media streams:
- Developed a tool for the recovery of missing values and implemented it within MonetDB, with a demo paper published at ICDE’19.
- Created a method for the prediction of user preferences in fashion data.
- Developed a tool for the prediction of trends in time series data, implemented on top of MonetDB.
- Completed the integration of MonetDB and Flair for advanced analysis of terabytes of Twitter data.
- Wrote a research paper that has been accepted at the Very Large Data Bases conference (VLDB’20).

Finally, we have worked on text-to-product and image-to-product search:
- Developed an image-to-product entity linkage data model.
- Developed a state-of-the-art general NLP framework, Flair.
- Published an NLDL 2019 paper on multilingual language modeling and its application to sequence labeling.
- Published an EMNLP 2019 paper on evolving word representations.
- Published an EMNLP 2019 demonstration paper on the Flair framework.
- Improved the Flair framework, which now includes support for (1) a two-tower search architecture, (2) image and text embeddings, and (3) the FEIDEGGER dataset.
- Built a live demonstration on the FashionBrain website showcasing text-image search capabilities.

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

Our work in neural language modeling for sequence labelling has produced new state-of-the-art results in a number of core natural language processing (NLP) tasks, including Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. The approach has been bundled into a framework called Flair, which is available as open source on GitHub.
The library ships with state-of-the-art pre-trained models for a range of NLP tasks and includes options for training custom models. It can handle both text and images and has been integrated into MonetDB.

We also developed an image entity linkage data model that outperforms Google’s state of the art on the academic DeepFashion consumer-to-shop benchmark datasets: Google (Song et al. 2017) 39.2%, Fashwell 40.1%. Moreover, Fashwell technology ...

These achievements significantly improve the quality of existing technologies.
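
As a rough illustration of the two-tower search architecture mentioned above for text-to-image search, the sketch below ranks images by the cosine similarity between a query embedding (the "text tower" output) and precomputed image embeddings (the "image tower" output). It is not the Flair API or the project's implementation: the encoders are replaced by hand-written toy vectors, and the names `cosine_similarity`, `rank_images`, `IMAGE_EMBEDDINGS` and `QUERY_EMBEDDING` are hypothetical, chosen only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(query_embedding, image_embeddings, top_k=3):
    """Return the top_k (image_id, score) pairs, best match first."""
    scored = [
        (image_id, cosine_similarity(query_embedding, embedding))
        for image_id, embedding in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Assumption: toy stand-ins for the image tower's output. In a real system
# these vectors would come from a trained image encoder and be indexed offline.
IMAGE_EMBEDDINGS = {
    "red_dress.jpg": [0.9, 0.1, 0.0],
    "blue_jeans.jpg": [0.0, 0.2, 0.9],
    "red_scarf.jpg": [0.7, 0.6, 0.1],
}

# Assumption: toy stand-in for the text tower's output for a query such as "red dress".
QUERY_EMBEDDING = [1.0, 0.2, 0.0]

if __name__ == "__main__":
    for image_id, score in rank_images(QUERY_EMBEDDING, IMAGE_EMBEDDINGS):
        print(f"{image_id}: {score:.3f}")
```

The point of the two-tower design is that the two encoders are independent, so image embeddings can be computed once and only the query needs to be encoded at search time.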