Transforming Raw Data into Information through Virtualization

Final Report Summary - VIDA (Transforming Raw Data into Information through Virtualization)

At the heart of every business infrastructure are database products. Although data management technology has evolved impressively over the past forty years, deriving actionable information and new knowledge through data analytics remains a technologically challenging and sometimes prohibitively expensive task. Existing database infrastructures require long preparation times before data are ready for analysis, and this overhead grows with data volume and heterogeneity. Consequently, to save time and resources during analysis, valuable data are often ignored, which severely undermines discovery in science and business.

The ViDa project facilitates the transformation of data into useful information by eliminating preparation time, thereby minimizing the interval between the moment the data arrive and the moment the user can perform analytical tasks. ViDa uses data virtualization to homogenize varying data formats, so that all data look alike to each application. Furthermore, ViDa uses real-time code generation to produce continuously optimized filtering operations customized for each analytical query. As a result, the user can write a query that combines any kind of data without any data preparation, and ViDa synthesizes the infrastructure needed to answer the query on the fly. ViDa delivers high performance by creating and maintaining intelligent caches that hold frequently used data and code and return them quickly when needed.

ViDa enables much faster development and deployment of data-intensive applications over heterogeneous data sources in scientific and business domains. A business analyst can ask a question, using a common query language, that combines an Excel spreadsheet residing in a local repository with a server-stored machine log from a sensor and a remote operational relational database at the company’s headquarters. An experimental neuroscientist can derive insights from the log of last night’s simulations by comparing it with a derivative of public anonymized patient records and a remote public MRI database, while at the same time introducing custom cleaning and normalizing transformations through the same analytical query. The paradigm shift brought about by ViDa marks the beginning of a new era of intelligent, real-time, seamless data management that will scale efficiently with the demands of new applications.