Community Research and Development Information Service - CORDIS

Using Apache Storm for Trend Detection in the Social Media

Contributed by: ATC SA

The DICE project has made a significant contribution to Apache Storm topologies that support and enhance trend detection in social media.
As is widely known, especially in the media industry, messages posted on social media contain valuable information about events and trends in the real world. Industries and brands that analyze social media gain insights that they use in a number of operations.

For example, in the news industry, trend detection is useful for:
(i) identifying emerging news based on the popularity of a certain topic; and
(ii) defining areas of great public interest that should be closely monitored, as even a small development affects many people and leads to emerging news.

As another example, in the financial sector, trends may have both short-term and long-term consequences, affecting anything from the daily price of a stock to a country’s macroeconomic indicators. For instance, a trend demanding military action in the Middle East in response to a terrorist attack may affect oil prices and subsequently decrease car sales.

To this end, and taking into account the large scale of this type of content, it is essential to develop methods for efficient trend detection in real time.

For example, in recent years the pace of decision-making in breaking-news journalism has significantly increased. This is due to the multiplication of digital sources and incoming data streams, digital production processes, automation, real-time publishing and largely mobile news audiences.

The topology can take different inputs: candidate spouts include the Twitter streaming API and queues that inject messages into the topology (Redis, Apache Kafka). The first processing bolt extracts entities and keywords from the incoming messages.

Trivial keywords (e.g. stop-words) are discarded, while the rest are forwarded to the next bolt. The Timeline Generation bolt aggregates keyword-timestamp tuples and creates a set of statistics for each keyword. In other words, this bolt calculates a background model of expected frequencies based on historical data. Tuples associated with the same keyword are aggregated in the same worker of the Timeline Generation bolt, in a fashion similar to map-reduce.
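
The keyword filtering and per-keyword aggregation described above can be sketched in plain Python, outside of Storm. This is only an illustration under stated assumptions: the stop-word list, the tokenizing regular expression, the fixed time windows and the names extract_keywords, route_to_worker and build_timelines are all hypothetical and are not taken from the DICE topology. Routing by a hash of the keyword stands in for the way a Storm fields grouping sends tuples with the same field value to the same task.

```python
import re
import zlib
from collections import Counter, defaultdict

# Illustrative stop-word list (assumption); a real deployment would use a
# fuller, language-specific list.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "on", "for"}


def extract_keywords(message):
    """Lower-case the message, split it into tokens, and drop stop-words."""
    tokens = re.findall(r"[a-z0-9#@']+", message.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def route_to_worker(keyword, num_workers):
    """Map a keyword to a worker index deterministically, so that all
    tuples for the same keyword land on the same worker."""
    return zlib.crc32(keyword.encode("utf-8")) % num_workers


def build_timelines(messages, window_seconds, num_workers):
    """Aggregate (timestamp, text) pairs into per-keyword timelines.

    Returns one dict per worker, mapping keyword -> Counter of
    {window index: number of occurrences in that window}.
    """
    workers = [defaultdict(Counter) for _ in range(num_workers)]
    for timestamp, text in messages:
        window = int(timestamp // window_seconds)
        for keyword in extract_keywords(text):
            worker = workers[route_to_worker(keyword, num_workers)]
            worker[keyword][window] += 1
    return workers


messages = [
    (0, "Earthquake in the city"),
    (30, "earthquake felt on the coast"),
    (70, "Traffic is slow"),
]
timelines = build_timelines(messages, window_seconds=60, num_workers=2)
```

Because the routing is deterministic, each keyword's full timeline lives on exactly one worker, which is what allows the statistics for that keyword to be computed locally.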

The resulting baseline model is forwarded to the next bolt each time it is updated. The Bursty Keywords Detection bolt then compares current frequencies to the baseline model and detects keywords whose current frequency differs extraordinarily from the expected one.
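
One simple way to compare a current frequency with a baseline is a z-score test, sketched below. The article does not say which statistical test the Bursty Keywords Detection bolt applies, so the z-score, the threshold of 3 standard deviations, the min_std floor and the function names are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def is_bursty(history, current, threshold=3.0, min_std=1.0):
    """Return True when the current count exceeds the historical mean by
    more than `threshold` standard deviations.

    `min_std` is a floor on the standard deviation (assumption) so that
    keywords with a nearly constant history do not trigger on tiny changes.
    """
    if not history:
        # No baseline yet: there is nothing to compare against.
        return False
    baseline = mean(history)
    spread = max(pstdev(history), min_std)
    return (current - baseline) / spread > threshold


def detect_bursty_keywords(histories, current_counts, threshold=3.0):
    """Return the sorted keywords whose current count is bursty.

    histories:      keyword -> list of counts in past windows
    current_counts: keyword -> count in the current window
    """
    return sorted(
        keyword
        for keyword, count in current_counts.items()
        if is_bursty(histories.get(keyword, []), count, threshold)
    )


histories = {"earthquake": [1, 0, 2, 1], "weather": [10, 12, 11, 9]}
current = {"earthquake": 15, "weather": 12, "new": 50}
bursty = detect_bursty_keywords(histories, current)
```

In this sketch a keyword with no history is never flagged, since no expected frequency exists for it yet; a production system would need an explicit policy for such cold-start keywords.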

Finally, the detected bursty keywords are clustered together in the final bolt of the topology based on keyword co-occurrences. The extracted trends are stored in a database.
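
Clustering by co-occurrence can be illustrated by linking two bursty keywords whenever they appear together in enough messages and then taking the connected groups as trends. The article does not name the clustering algorithm used in the topology, so the union-find approach, the min_cooccurrence parameter and the function name cluster_keywords below are assumptions for the sake of a small example.

```python
from collections import Counter
from itertools import combinations


def cluster_keywords(bursty, message_keywords, min_cooccurrence=2):
    """Group bursty keywords that co-occur in at least `min_cooccurrence`
    messages, and return the groups as a sorted list of sorted lists."""
    bursty = set(bursty)

    # Count how many messages contain each pair of bursty keywords.
    pair_counts = Counter()
    for keywords in message_keywords:
        present = sorted(bursty.intersection(keywords))
        pair_counts.update(combinations(present, 2))

    # Union-find over the bursty keywords.
    parent = {keyword: keyword for keyword in bursty}

    def find(keyword):
        while parent[keyword] != keyword:
            parent[keyword] = parent[parent[keyword]]  # path halving
            keyword = parent[keyword]
        return keyword

    # Merge the groups of every pair that co-occurs often enough.
    for (first, second), count in pair_counts.items():
        if count >= min_cooccurrence:
            parent[find(first)] = find(second)

    clusters = {}
    for keyword in bursty:
        clusters.setdefault(find(keyword), set()).add(keyword)
    return sorted(sorted(cluster) for cluster in clusters.values())


bursty = ["earthquake", "tsunami", "coast", "election"]
message_keywords = [
    ["earthquake", "tsunami"],
    ["earthquake", "tsunami", "coast"],
    ["tsunami", "coast"],
    ["election", "vote"],
    ["earthquake", "election"],
]
trends = cluster_keywords(bursty, message_keywords)
```

Here "earthquake", "tsunami" and "coast" end up in one trend because the pairs linking them each co-occur twice, while "election" stays on its own because it shares only a single message with "earthquake".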

During the DICE project, we are conducting experiments on an innovative trend detection topology and trying out changes that may improve the quality of the results obtained from social media.

Keywords

Social media, trend detection, Apache