Optimising cloud computing
Computer users are increasingly faced with the challenge of storing vast amounts of data. Larger hard drives meet some of these needs, but there is a growing trend towards saving data on off-site storage systems. Within just a few years, companies have switched from in-house hardware to such third-party cloud services. The advent of cloud infrastructures has also made it feasible to analyse massive data sets with parallel processing integrated into the new virtual environment. The 'Cloud-based indexing and query processing' (CLOUDIX) project adopted MapReduce to process and generate large data sets, and the cutting-edge research work conducted during the two-year project significantly increased the performance of MapReduce.

MapReduce is a programming model widely used for special-purpose computations involving large amounts of data, such as web request logs. It is also used to derive various kinds of data, including inverted indices. A "map" function is applied to each logical "record" to compute a set of intermediate key/value pairs. A "reduce" function then merges all intermediate values that share the same key, combining the derived data appropriately (a toy sketch of the model appears below).

The CLOUDIX researchers provided mechanisms for accessing only a subset of the input data, instead of scanning all of it, while still producing the same result. Specifically, advanced algorithms support early termination of data processing once sufficient data for producing the correct result has been accessed (the general principle is illustrated below). Decisive first steps have also been made towards integrating efficient ranking techniques that sort results according to their relevance.

During the CLOUDIX project, different approaches were combined to address the shortcomings of MapReduce, the most prominent framework for parallel query processing in the cloud, while preserving its merits: scalability, fault tolerance, load balancing and, most importantly, simplicity. The CLOUDIX results, published in peer-reviewed scientific journals, are expected to help scientists and professionals save working hours when analysing large data sets.
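To make the MapReduce model described above concrete, the following minimal sketch builds a toy inverted index in plain Python. The map function emits an intermediate (word, document) pair per word occurrence, a shuffle step groups pairs by key, and the reduce function merges each group into a posting list. The corpus, function names and single-process driver are illustrative stand-ins; a real deployment distributes the same phases across many machines.

```python
from collections import defaultdict

# Toy corpus: each record is (document id, text). In the real model this
# input would be partitioned across many machines; a list stands in here.
RECORDS = [
    ("doc1", "cloud data processing"),
    ("doc2", "parallel data analysis"),
    ("doc3", "cloud query processing"),
]

def map_fn(doc_id, text):
    """Map: emit an intermediate (key, value) pair per word occurrence."""
    for word in text.split():
        yield word, doc_id

def reduce_fn(word, doc_ids):
    """Reduce: merge all values sharing the same key into a posting list."""
    return word, sorted(set(doc_ids))

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)          # shuffle: group values by key
    for doc_id, text in records:
        for key, value in mapper(doc_id, text):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

index = map_reduce(RECORDS, map_fn, reduce_fn)
print(index["cloud"])   # ['doc1', 'doc3']
print(index["data"])    # ['doc1', 'doc2']
```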
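CLOUDIX's exact early-termination algorithms are detailed in the project's publications; as an illustration of the general principle only, the sketch below uses the classic threshold algorithm of Fagin et al. over two score-sorted lists. Scanning stops as soon as the k best aggregate scores seen so far provably cannot be beaten by any unseen item, so only part of the input is ever read. The lists and scores are invented for the example.

```python
import heapq

# Invented example data: two attribute indices, each a list of
# (score, item) pairs sorted in descending score order.
LIST_A = [(9, "x"), (8, "y"), (3, "z"), (1, "w")]
LIST_B = [(7, "y"), (6, "w"), (5, "x"), (2, "z")]

def threshold_top_k(list_a, list_b, k):
    """Return the k items with the highest summed score, reading as few
    rows of the sorted lists as correctness allows."""
    score_a = {item: s for s, item in list_a}  # stands in for random access
    score_b = {item: s for s, item in list_b}
    seen = {}                                  # item -> aggregate score
    rows_read = 0
    for (sa, item_a), (sb, item_b) in zip(list_a, list_b):
        rows_read += 2
        for item in (item_a, item_b):
            if item not in seen:
                seen[item] = score_a[item] + score_b[item]
        # No unseen item can beat the sum of the scores at the current
        # scan depth; once k seen items reach it, terminate early.
        threshold = sa + sb
        best = heapq.nlargest(k, seen.values())
        if len(best) == k and best[-1] >= threshold:
            break
    top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
    return top, rows_read

top, rows_read = threshold_top_k(LIST_A, LIST_B, k=2)
print(top)        # [('y', 15), ('x', 14)]
print(rows_read)  # 4 -- only half of the 8 rows were accessed
```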
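The source does not specify how ranking was integrated, but one common pattern is to rank inside the reduce step itself, emitting only the k most relevant results per key rather than every value. The sketch below assumes a hypothetical intermediate list of (document, relevance score) pairs for a single key.

```python
import heapq

# Hypothetical intermediate values for one reduce key (a query term):
# (document, relevance score) pairs as a map phase might emit them.
INTERMEDIATE = [("doc1", 0.42), ("doc2", 0.91), ("doc3", 0.17),
                ("doc4", 0.77), ("doc5", 0.63)]

def ranking_reduce(pairs, k):
    """Reduce step that emits only the k most relevant results for its
    key, ranked by score, instead of materialising every value."""
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])

print(ranking_reduce(INTERMEDIATE, k=3))
# [('doc2', 0.91), ('doc4', 0.77), ('doc5', 0.63)]
```

Because heapq.nlargest maintains a bounded heap, the reducer's memory footprint stays proportional to k regardless of how many values share the key.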