Periodic Reporting for period 2 - BioPIM (Processing-in-memory architectures and programming libraries for bioinformatics algorithms)
Reporting period: 2023-05-01 to 2024-10-31
Currently, all biological data are analyzed on general-purpose computation platforms, i.e. platforms that aim to solve a wide range of problems. This means that all current compute grids, servers, and cloud computing platforms are designed to provide solutions for all “computable” problems with amortized efficiency. Analyzing massive amounts of biological data in large clusters and cloud platforms poses two problems. First, transferring the data from where it is generated (hospitals, clinics, or even small villages in the case of virus tracking) to these computer centers is both time- and energy-consuming and requires a stable and fast internet connection. Second, these computer platforms are themselves energy-hungry: a considerable amount of energy is spent as data moves between the processing unit and the memory within the same computer system.
The BioPIM project aims to co-develop algorithms and specialized hardware to improve the speed and cost of various bioinformatics analyses. The project focuses on two families of algorithms: combinatorial algorithms such as alignment, pattern matching, genome assembly, and other uses of graphs; and methods based on deep learning, machine learning, and AI, such as genomic variation discovery. To achieve energy-efficient, cost-efficient, and ultra-fast bioinformatics analysis, the BioPIM project leverages the emerging processing-in-memory (PIM) architectures that couple processing capability with memory and storage devices, thereby minimizing the time and energy spent on data transfer. We will also design our hardware to perform some of these analyses on mobile devices, thereby enabling edge computing. BioPIM addresses the current inability to perform genome analysis on the go, supporting the timely investigation of clinical and research data, including viral and bacterial typing in remote locations with little or no access to conventional large-scale computing platforms.
BioPIM’s proposed research is flexible as it aims to develop PIM acceleration for various algorithms. Although the methods the project focuses on will be within the bioinformatics domain, most of these algorithms originated decades ago, and they are also being used for non-bioinformatics applications such as:
● String search and pattern matching (e.g. in natural language processing, data mining)
● Graph theory (e.g. data analytics, web indexing)
● General machine learning
● Specifically, neuromorphic computing (e.g. many applications of deep learning and artificial intelligence)
Additionally, most of our PIM developments will benefit data centers in terms of performance gain and energy efficiency; therefore, the project’s impact is expected to extend far beyond our specific aims.
In the first year of the project, we evaluated the performance and behavior of several tools and data structures commonly used in bioinformatics. The aim of this analysis was to understand the computational requirements of these tools and how to improve them using PIM architectures. We then selected several algorithms implemented in these tools as targets for our new hardware/algorithm co-design. In the remainder of the project, we will optimize these algorithms for PIM architectures.
M13-M30:
In the second period, we finished the performance analysis of the commonly used bioinformatics tools and classified them as targets for either WP3 or WP4. We also identified several tools that will likely not benefit from PIM acceleration and removed them from further consideration. We developed PIM acceleration for several applications using either processing-near-memory (WP3) or processing-using-memory (WP4). We also developed simulation platforms (WP5) and an application programming interface library (WP6).
We have characterized the memory utilization, memory-boundedness, and computational requirements of several algorithms and tools commonly used in bioinformatics. This characterization will guide us in WP3 and WP4 in designing novel algorithms and hardware architectures.
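As an illustration of what a memory-boundedness characterization can look like, the sketch below applies the standard roofline test: a workload is memory-bound when its arithmetic intensity (operations per byte moved) falls below the machine balance (peak compute divided by peak memory bandwidth). This is not the project's actual methodology; the `Kernel` class, the function names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    """Hypothetical workload profile; the values are illustrative, not measured."""
    name: str
    ops: float          # arithmetic operations performed
    bytes_moved: float  # bytes transferred between memory and processor


def arithmetic_intensity(k: Kernel) -> float:
    """Operations performed per byte of memory traffic."""
    return k.ops / k.bytes_moved


def is_memory_bound(k: Kernel, peak_ops_per_s: float, peak_bytes_per_s: float) -> bool:
    """Roofline test: memory-bound if intensity is below the machine balance."""
    machine_balance = peak_ops_per_s / peak_bytes_per_s
    return arithmetic_intensity(k) < machine_balance


def attainable_ops_per_s(k: Kernel, peak_ops_per_s: float, peak_bytes_per_s: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_ops_per_s, arithmetic_intensity(k) * peak_bytes_per_s)


# Assumed machine: 1e12 ops/s peak compute, 1e11 bytes/s peak memory bandwidth.
PEAK_OPS = 1e12
PEAK_BW = 1e11

# Two made-up profiles: a lookup-heavy kernel and a compute-heavy kernel.
seed_lookup = Kernel("seed_lookup", ops=1e9, bytes_moved=8e9)
dense_model = Kernel("dense_model", ops=1e12, bytes_moved=1e10)

for k in (seed_lookup, dense_model):
    bound = "memory-bound" if is_memory_bound(k, PEAK_OPS, PEAK_BW) else "compute-bound"
    print(f"{k.name}: intensity={arithmetic_intensity(k):.3f} ops/byte, {bound}")
```

Under these assumed numbers, the lookup-heavy profile falls well below the machine balance, which is the kind of workload where moving computation closer to memory is expected to help most.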
M13-M30:
We have developed several PIM-accelerated tools for bioinformatics applications. We have also improved the UPMEM DPU design to better suit the needs of genomics analysis tools and machine learning workloads, and we provided simulation tools for the new design. Additionally, we provide the BPL (BioPIM Library) as an easy-to-program API for PIM devices.