Computing power has increased greatly over the last few decades. Despite this increase, some applications have requirements that exceed the computing power of standard machines. To serve them, specialised machines called high performance computers (HPC) are periodically constructed with the best available technology, concentrating a very large amount of compute power to give the best possible answers for such demanding applications. The next generation of HPC, expected sometime after 2020, is called Exascale.
Traditional users of HPC have run simulations of one type or another. In the last decade, a new breed of HPC user has appeared: those concerned with Big Data. Simulation is mostly about performing large amounts of computation to observe the behaviour of a sophisticated model with few parameters. Big Data problems, by contrast, deal with less sophisticated models that have many more parameters, and try to choose those parameters by analysing large amounts of data. Folk wisdom in this field states that the ability to capture and analyse more data is more valuable than building more sophisticated models, and this works well when data are cheap and easy to get. However, there are problems for which the data are very expensive to collect. In this case it becomes important to use more sophisticated models in order to squeeze as much knowledge as possible out of the expensive data. Such problems sit at the juncture of HPC and Big Data: they have large data sets to analyse, yet should exploit more sophisticated models through computation to make the most of the available data.
The ExCAPE project is about how to tackle such problems. The core of the project is maths and software, and how they work on HPC machines. However, to advance the state of the art it helps to have a concrete problem to tackle. For this we take the chemogenomics problem: predicting the activity of compounds in the drug discovery phase of medical research. Hence the project name: Exascale Compound Activity Prediction Engines. Chemogenomics is a Machine Learning problem.
The objectives of the project are to find methods and systems that can tackle large and complex machine learning problems, such as chemogenomics. This will require algorithms and software that make efficient use of the latest HPC machines. Creating these, along with preparing the data to give the system something to work on, is the main work of the project.
The project team is composed of 9 top research institutions and companies from 8 different countries in Europe, working together to advance the state of technology.
The final conclusions of the project are that it is possible to improve the results for this type of problem by using more sophisticated machine learning techniques that require large amounts of compute power, and furthermore that some of these techniques can be adapted to work more efficiently on HPC machines. We have shown these results through large scale tests using over four million core hours, compute power equivalent to roughly 35 years of run time on a modern server.
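The conversion from core hours to server run time can be checked with a short calculation. The sketch below is illustrative only: the text does not state how many cores the "modern server" has, so the code works backwards from the two quoted figures to the core count they imply. The function names and the 365-day year are assumptions made for this example.

```python
# Illustrative arithmetic only; the server's core count is NOT given in the
# text, so it is derived here from the two figures that are quoted.
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap days (assumption)


def implied_cores(core_hours: float, server_years: float) -> float:
    """Core count a server would need to spend `core_hours` in `server_years`."""
    return core_hours / (server_years * HOURS_PER_YEAR)


def server_years(core_hours: float, cores_per_server: int) -> float:
    """Years of continuous run time on one server with `cores_per_server` cores."""
    return core_hours / (cores_per_server * HOURS_PER_YEAR)


if __name__ == "__main__":
    core_hours = 4_000_000  # "over four million core hours", used as a lower bound
    print(f"implied cores per server: {implied_cores(core_hours, 35):.1f}")
    print(f"years on a 13-core server: {server_years(core_hours, 13):.1f}")
```

Under these assumptions the quoted figures correspond to a server with roughly 13 cores running continuously. Since the tests used more than four million core hours, the actual core count implied would be somewhat higher.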