Community Research and Development Information Service - CORDIS


SSIX Report Summary

Project ID: 645425
Funded under: H2020-EU.

Periodic Reporting for period 1 - SSIX (Social Sentiment analysis financial IndeXes)

Reporting period: 2015-03-01 to 2016-02-29

Summary of the context and overall objectives of the project

The SSIX (Social Sentiment analysis financial IndeXes) project is a Horizon 2020 Innovation Action under the topic ICT-15-2014 - Big data and Open Data Innovation and take-up. The aim of this action is to improve the ability of European companies to build innovative multilingual data products and services, in order to turn large data volumes into semantically interoperable data assets and knowledge. SSIX aims to help meet this objective by creating a collection of adaptable tools that can be used to produce actionable analytics from large multilingual data sets, yielding sentiment metrics that European companies can utilise to make better-informed business decisions.
The sentiment metrics that the SSIX platform produces will be the end result of the project's challenging task of extracting relevant, valuable and economically significant signals from a huge variety of increasingly influential social media platforms, such as Twitter, Google+, Facebook, StockTwits and LinkedIn. Social media data represents a collective barometer of thoughts and ideas touching every facet of society. The platform will also be capable of extracting these signals from the most reliable and authoritative newswires, news feeds and blogs. One of the key benefits SSIX brings is the ability to carry out multilingual analysis; non-English language support is underrepresented in the current market offering.
The SSIX pipeline uses state-of-the-art natural language processing tools to create a collection of qualitative and quantitative parameters called X-Scores. This natural language processing pipeline will be trained with the specific goal of interpreting significant sentiment signals in the project's main domain of finance. Using these X-Scores, SSIX partners will create commercially viable and exploitable social sentiment metrics, regardless of language, locale and data format. These custom sentiment metrics can be combined into custom indices to support research and investment decision-making, enabling end users to analyse and leverage real-time social media sentiment data and to create innovative products and services that support revenue growth. For the finance domain it is anticipated that these sentiment metrics can assist with alpha generation, which has already been tested in research examining the wisdom-of-the-crowd concept applied to social media conversations and their predictive power for future stock market performance.
To achieve this task, a consortium of eight partner institutions from six countries has been formed. The consortium is led by the Insight Centre for Data Analytics at the National University of Ireland Galway (NUIG); NUIG is providing big data analytics and natural language processing expertise. The University of Passau, Fakultät für Informatik und Mathematik (Faculty of Computer Science and Mathematics, Germany), is providing distributional semantics and quantitative (statistical) analysis. In addition to the two academic institutions, six industry partners are involved to guarantee a strong market orientation of the project. Peracton Ltd (Ireland) is providing financial technology software and is a pilot use case partner through its MAARS platform, which provides investors with a ranking of equities to support investment decisions. Redlink (Austria) has expertise in content and big data analysis and linked data publishing; it is part of the technical team developing SSIX, offering conceptual and technological know-how for the implementation of knowledge-based information systems. Handelsblatt Research Institute (Germany) is a research institute with a focus on scientific research and economic analysis; it will lead the commercialisation of the project. 3rdPlace (Italy) is a digital strategy consultancy, contributing its experience in collecting and processing large amounts of data from different sources. EurActiv (UK) is a multilingual media network providing information on the “Community of EU Actors”; with its extensive experience, it is responsible for the dissemination strategy of the project, an important prerequisite for successful commercialisation. Lionbridge (Finland) is providing translation and application testing solutions and is crucially involved in the technical development of SSIX, catering for multilingual language resources.
As SSIX is an Innovation Action with the aim of producing new or improved products or services, there is a dedicated work package focussed on the commercialisation and exploitation of the project's outcomes. The consortium is actively engaged in preparing for this endeavour and has already received interest from potential trial partners. While it is very encouraging to have commercial interest in the first year of the project, the consortium is aware that, if the project is to have a positive commercial outcome, the SSIX platform must be able to show that it can provide actionable analytics.

SSIX Platform
The SSIX Platform can be broken down into its three main components. Figure 1 illustrates the envisioned architecture of these components, with their inputs and outputs:

Stage One - Data Ingestion and Filtering
The main activity of WP3 is the implementation of the processes dedicated to gathering data and metadata from several platforms and websites: the assorted information needed to calculate the SSIX indices that form the core logic of the platform. These processes allow applications to interact with different social platforms, blogs and newsfeeds, and thus require the implementation of complex pieces of software dedicated to collecting and processing increasing amounts of data.

Stage Two - Natural Language Processing Pipeline
The SSIX Pipeline provides a scalable infrastructure for analysing data within the scope of the SSIX project. The architecture is based on existing, state-of-the-art open source technology. One of the primary goals of the SSIX project is to apply opinion mining in order to interpret stock market behaviour based on assorted social sources. The role of the SSIX pipeline, the main outcome of WP4, is to provide automatic execution planning and a standardised API for NLP analysis components as a homogeneous software artifact.

Stage Three - Sentiment Metrics: X-Scores & Indices
WP5 will be responsible for the creation of the final sentiment metrics data streams (X-Scores and Indices). A generic dashboard will be available, and an API will enable data to be conveniently incorporated into end users' existing platforms, such as Peracton's MAARS software.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

This section gives an up-to-date description of the activities and work progress for the first year (1 March 2015 - 29 February 2016) of the SSIX project. A short description is given of the activities carried out by the consortium within each work package. All seven work packages have been active in this phase of the project. WP1, WP6 and WP7 are of a horizontal nature as they deal with Project Management (WP1), Technology Transfer and Dissemination (WP6) and Exploitation and Commercialisation (WP7) respectively. The overall methodology of the project has been designed to address the project's objectives successfully. In this context, work is divided into three phases: the Groundwork phase, the Divergent phase and the Convergent phase.
The bulk of the work has concerned the Groundwork phase. In WP2 - Business Requirements, Use Cases and Business Methodology, Deliverable 2.1 - Business Requirements' and Business Cases' Definitions was produced; this deliverable identifies, refines and documents the business requirements for each industrial partner (Peracton, 3rdPlace, Lionbridge) and defines each case study driven by the industrial partners. Deliverable 2.2 - Business Methodology Definition was also completed and submitted in M12; in it the consortium agreed on the software configuration that the SSIX platform requires, both generic and personalised for the case studies.
In WP3 - Data Management, Deliverable 3.1 - Data Requirement Analysis and Data Management Plan (DMP) V1 was produced. This document contains the results of the analysis of the project's data sources, including possible technical limits and constraints. Deliverable 3.4 - Data Collection and Analysis was completed in M8. This document contains a detailed description of the outcomes of the technical activities carried out during the first part of year one. It provides a technical description of the architecture developed in WP3 to implement the procedures for ingesting data from the external Web sources initially identified by the SSIX project. The operations described, implemented by multiple components, cover all activities performed on data from its first entry point to the system: data gathering, data filtering and storage of smart data.
In WP4 - NLP Services and Analysis Pipeline, Deliverable 4.1 - NLP Service and Analysis Architecture (Initial Version) was delivered in M7. The deliverable provides the initial architecture for the NLP Service and Analysis Pipeline; this first iteration is based on existing, state-of-the-art technology. Deliverable 4.3 - 1st Catalogue of SSIX Language Resources was completed in M12; for this document, an extensive analysis of the different language resources and technologies that could fulfil the needs of the project was collected. Deliverable 4.5 - NLP Service and Analysis Pipeline (Proof-of-Concept) was completed in M12; the proof-of-concept comes with support for four different sentiment analysers based on open source text mining tools (StanfordNLP, GATE, Redlink API and NLTK) that are orchestrated and executed by an Apache Spark cluster.
WP5 - SSIX Platform Deployment, Validation and Evaluation builds on the work of the previous work packages. During this period the majority of the effort has gone towards Deliverable 5.3 - SSIX Technical Validation Plan, completed in M8, which outlines the overall approach to testing and quality assurance for the SSIX platform. Deliverable 5.1 - SSIX Process Specification was completed in M12. This document is concerned with the main SSIX software process flows and the related business processes that are interlinked with the SSIX software architecture. Deliverable 5.1 also defines the SSIX Templates, general-purpose frameworks that can be personalised and then modified in a unique way, customising the functionality and process flow of the SSIX software. Deliverable 5.2 - SSIX Architecture Specification was completed in M12. This document describes the SSIX system architecture. It consists of a review of the business and software requirements, a high-level overview of the proposed system architecture and structure, development tools and techniques, and links to the corresponding software repositories.
In WP6, for Task 6.2 - Project website (incl. forum) and wiki, two deliverables were completed during this period: Deliverable 6.2 - Project Web site, Wiki, LinkedIn and Training Materials V1.0, delivered in M2, and Deliverable 6.3 - Project Web site, Wiki, LinkedIn and Training Materials V2.0, delivered in M12. Both deliverables describe the ongoing work on the project website, the various social media accounts and the progress on the training materials relevant to the project. Deliverable 6.7 - Technology Transfer and Dissemination Plan 1st Version was completed in M12. The document contains a definition of the target groups, the activities and planned dissemination events that each partner is committed to carrying out, the means to be adopted and the expected results of the dissemination.
WP7's first deliverable will be completed in M24, but several ongoing activities have taken place over the first year. The first draft of Deliverable 7.4 - Exploitation and Go-to-Market Strategy was composed in mid-2015, with a major and comprehensive update undertaken in January and February 2016 based on the results of the SSIX General Assembly Meeting 2015 and teleconferences of the Commercial Users Group (CUG) held early in 2016. Work will continue on the recently commenced WP7 deliverables: 7.2 - API Commercial Toolkit and Services Trials, and 7.3 - Commercialisation Plan (drafted in early 2016).

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

Two major objectives were defined in WP2: 1) the definition of business requirements and business use cases and 2) the definition of the business methodology to be followed by the project. The following findings can be listed as potentially generating impact:
While setting up the business requirements, our market analysis and validation found that most existing solutions dealing with social media analysis are basic or only moderately advanced, informative rather than analytical, and in an incipient phase of development:
- Existing solutions are not always scientifically backed: no clear justification is provided for the algorithmic approaches used, they are rarely empirically validated, and data sampling methodologies are often neither rigorous nor transparent.
- Generic algorithms are used for very specific purposes.
- Financial sentiment classification is binary in most cases (either positive or negative), which limits its real-world usage, especially for generating any kind of sentiment index based on a continuous value.
- Some more advanced solutions offer additional nuances of sentiment, but these are still insufficient for proper analytical processing.
- The chain of processing social sentiment data, from raw data extraction to final signal generation, is quite long; each component and process down the chain (if not scientifically designed) can introduce a bias that could render the final sentiment values meaningless.
All the above findings show that while there are many tools that attempt sentiment analysis on open data sources and social networks, the risk of using them for financial or other types of decision making ranges from unknown to high. This can be summarised simply as ‘nice to watch, but would one put his/her money on it?’. We have constructed a process flow that has the potential to become a reference standard for sentiment data extraction, analysis and generation.
A set of X-Scores has been defined that has the potential to be adopted by the financial industry and to reach the same level of utility as other stock financial parameters such as P/E (price-earnings ratio), MA (moving average) and MACD (moving average convergence-divergence).
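For context, the conventional parameters mentioned above (MA, MACD) have compact textbook definitions. The sketch below is purely illustrative and is not part of the SSIX platform:

```python
def moving_average(prices, window):
    """Simple moving average (MA) over a fixed window of closing prices."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def ema(prices, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def macd(prices, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA of the price series."""
    return [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]
```

An X-Score intended to sit alongside these parameters would be consumed in the same way: a numeric series per stock or topic, updated as new data arrives.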

The activities performed within the tasks of WP3 mainly dealt with data ingestion and data storage procedures. The implemented architecture involves many state-of-the-art technologies used in the Big Data field, such as Apache Spark[1], Apache Kafka[2] and Cassandra[3]. The first performance tests highlighted stability issues due to high volumes of concurrent data when listening to the most discussed financial markets (such as the NASDAQ 100[4] and the FTSE 100[5]); these can be overcome by scaling hardware resources.
Further experiments are being conducted with cloud technologies provided by the Google Cloud Platform[6] (BigQuery, Dataproc) that could reduce data storage and extraction times and help scale the parallel computing processes.

During the second phase of the first year, part of the effort has gone into studying and identifying sampling methodologies that can be applied to the data extraction layer of the architecture. A stratified sampling technique has been adopted to extract the content used for the creation of a gold standard for language processing, while further studies of the Shannon Sampling Theorem[7] are being conducted in order to find a proper way to sample data delivered at high frequency.
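The stratified sampling step described above can be sketched as follows. This is an illustrative sketch, not the project's actual code; it assumes documents are already tagged with a stratum such as source platform or language:

```python
import random
from collections import defaultdict

def stratified_sample(docs, get_stratum, fraction, seed=42):
    """Draw the same fraction of documents from every stratum, so that
    minority strata (e.g. low-volume languages) stay represented in the
    gold standard instead of being swamped by the dominant stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for doc in docs:
        by_stratum[get_stratum(doc)].append(doc)
    sample = []
    for stratum_docs in by_stratum.values():
        k = max(1, round(len(stratum_docs) * fraction))
        sample.extend(rng.sample(stratum_docs, k))
    return sample
```

A plain random sample of the same size would, in expectation, mirror the skew of the raw stream; stratifying guarantees per-stratum coverage for annotation.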
Interesting research topics related to the performance of the sampling system emerged from these sampling activities and are currently being investigated; in particular, one challenge will be calculating entropy and information gain on big data volumes, while cosine similarity is being considered as a technique for measuring performance.
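Both measures mentioned here have standard definitions; as a rough illustration (a hedged sketch, not the project's implementation, where the challenge is computing these at big data scale):

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a discrete distribution given as
    raw counts, e.g. counts of documents per category."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors;
    1.0 means identical direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Information gain is then the reduction in entropy obtained by conditioning on an attribute, which is why entropy is the primitive worth scaling first.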

During the first year, the focus of this work package has been primarily on evaluating existing state-of-the-art language resources and technologies against laboratory data samples. The conclusion is that the available multilingual domain-specific and sentiment lexica may not provide the expected features for the opinion mining needs of this project. These results set the baseline from which to continue working in the second year towards achieving the expected outcomes.
In parallel, many Big Data analysis infrastructures that could serve as a foundation for the pipeline architecture have been analysed. This task has produced a proof-of-concept implementation based on Apache Spark [1] that integrates the different NLP engines through a common API, able to deal with both native Java implementations and any other language using a RESTful integration pattern. The pipeline implementation has been tested from very early development phases in real scenarios (e.g. against the data provided by the Twitter stream for NASDAQ cashtags).
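A common-API wrapper of the kind described might look like the following. This is a hypothetical sketch: the class and method names are illustrative, not SSIX's actual interface, and the lexicon analyser is a toy stand-in for a real engine:

```python
from abc import ABC, abstractmethod

class SentimentAnalyzer(ABC):
    """Common interface so native and remote engines are interchangeable."""
    @abstractmethod
    def analyze(self, text: str) -> float:
        """Return a sentiment score in [-1.0, 1.0]."""

class LexiconAnalyzer(SentimentAnalyzer):
    """Toy native implementation: counts polarity words from a lexicon."""
    POSITIVE = {"gain", "bullish", "up"}
    NEGATIVE = {"loss", "bearish", "down"}

    def analyze(self, text):
        tokens = text.lower().split()
        score = sum((t in self.POSITIVE) - (t in self.NEGATIVE) for t in tokens)
        return max(-1.0, min(1.0, score / max(1, len(tokens))))

class RestAnalyzer(SentimentAnalyzer):
    """Adapter for an HTTP engine; the `post` callable is injected so any
    RESTful service (or a stub, as in tests) can sit behind the interface."""
    def __init__(self, endpoint, post):
        self.endpoint, self.post = endpoint, post

    def analyze(self, text):
        return float(self.post(self.endpoint, {"text": text})["score"])
```

The value of the pattern is that an orchestrator (a Spark job, for instance) only ever sees `SentimentAnalyzer.analyze`, regardless of whether the engine is in-process or remote.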

The approach to this work package over the first year of the project focused on analysing the business requirements of consortium partners and defining the processes and systems required to realise these business needs in the SSIX software platform. Foresight was given to the need for scalability and efficiency: the major system components are being designed to operate independently of each other, so they can be distributed or centralised depending on the deployment scenario and the load on the system.

Some challenges that the project may face in the next year include scalability issues and data throughput bottlenecks. The overall system architecture has been designed to allow each system component to scale up and down independently in anticipation of such issues. The data pipeline between major system components has been designed to use networking resources efficiently and work has been done to establish the minimum amount of data required to produce effective results. Areas of potential innovation include testing of new classification models, building a system for statistical calculations and NLP classification using massively parallel computing and researching new visualisations to aid end users in the decision making process.

The metadata and sentiment data generated by the SSIX platform are used to create several different sentiment metrics which can be used to analyse changes in sentiment behaviour. Given the project's case study requirements, we present an example of a small set of metrics (X-Scores) to be used by the case studies:
- Rolling Sentiment for a given topic - continuous calculation of the sentiment value for a given topic.
- Rolling Sentiment Volatility - continuous calculation of the sentiment volatility for a given topic.
- Rolling Smart-data Sentiment Polarity Volume - volume of sentiment generated for a particular topic or stock (similar to the volume of stocks sold within a particular timeframe), where metadata, user influence score and other added-value data will be considered.
- Rolling Sentiment Polarity Volume - continuous calculation of the sentiment volume associated with a topic.
- Custom Index - an index (collection) of various sentiment metrics designed to follow different topics or stocks.
Additional metrics will be outlined in year two of the project, such as adjusting the sentiment weighting based on user reputation.
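The first two rolling metrics can be illustrated with a simple windowed computation. This is a sketch under the assumption that sentiment arrives as a stream of numeric scores per topic; the class and property names are illustrative, not the platform's API:

```python
import math
from collections import deque

class RollingSentiment:
    """Maintains a rolling mean (sentiment) and rolling standard
    deviation (sentiment volatility) over the last `window` scores."""

    def __init__(self, window):
        # deque with maxlen silently evicts the oldest score on overflow
        self.scores = deque(maxlen=window)

    def update(self, score):
        self.scores.append(score)

    @property
    def sentiment(self):
        """Rolling sentiment: mean of the scores currently in the window."""
        return sum(self.scores) / len(self.scores)

    @property
    def volatility(self):
        """Rolling sentiment volatility: population standard deviation
        of the scores currently in the window."""
        mean = self.sentiment
        return math.sqrt(sum((s - mean) ** 2 for s in self.scores)
                         / len(self.scores))
```

A production version would key one such accumulator per topic and use time-based rather than count-based windows, but the shape of the computation is the same.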

[7] R.J. Marks II: Introduction to Shannon Sampling and Interpolation Theory, Springer-Verlag, 1991.


Record Number: 190215 / Last updated on: 2016-11-09