

Final Report Summary - DOPPLER (Domain-optimised parallelisation by polymorphic language embeddings and rewritings)

How can we build high-performance software with low effort when CPU clock speeds are stagnating and multi-core CPUs, SIMD units and GPUs are commonplace from data centers all the way down to mobile phones? Project DOPPLER set out to solve this challenge through cutting-edge programming-language technology (*polymorphic embedding*) that enables parallelization and optimization of embedded domain-specific languages (DSLs).

Towards this goal we have made remarkable progress. Through DSLs and infrastructure built together with our collaborators at Stanford and elsewhere, and evidenced by publications in top venues (journals: CACM, HOSC, TECS, IEEE Micro; conferences: POPL, PLDI, ICML, ECOOP 3x, PACT, PPoPP, GPCE 3x, DSL, Onward!, Euro-Par), we have validated the project's two primary claims:

1. Compiled embedded DSLs achieve performance on par with hand-optimized code: orders of magnitude faster than ordinary libraries, while retaining the same high-level programming style.

2. Performance-oriented DSLs do not need to be built from scratch: even though DSLs target a particular domain, there are many commonalities that can be leveraged through extensible compiler frameworks.

We have developed Lightweight Modular Staging (LMS), an extensible runtime compilation framework that serves as a common basis for all our DSLs. LMS considerably extends the polymorphic embedding idea as well as previous multi-stage programming approaches and was selected as a "research highlight" by CACM.
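To give a flavor of the approach, here is a minimal, self-contained sketch of the LMS idea (interfaces simplified for illustration; this is not the exact LMS API): programs written against the staged type Rep[T] construct an intermediate representation instead of computing values directly.

```scala
// A miniature model of LMS staging: Rep[T] stands for "a staged value of
// type T". Operations on Rep values build IR nodes rather than computing.
object MiniLMS {
  trait Exp[T]                                         // staged expression IR
  case class Const[T](x: T) extends Exp[T]
  case class Sym[T](name: String) extends Exp[T]
  case class Times(a: Exp[Double], b: Exp[Double]) extends Exp[Double]

  type Rep[T] = Exp[T]                                 // LMS convention
  import scala.language.implicitConversions
  implicit def lift(x: Double): Rep[Double] = Const(x) // lift constants
  implicit class Ops(a: Rep[Double]) {
    def *(b: Rep[Double]): Rep[Double] = Times(a, b)
  }

  // Looks like ordinary Scala, but running it emits IR, not a number:
  def square(x: Rep[Double]): Rep[Double] = x * x

  def main(args: Array[String]): Unit =
    println(square(Sym[Double]("x")))                  // Times(Sym(x),Sym(x))
}
```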

The Delite framework, developed jointly with the Stanford Pervasive Parallelism Lab (PPL), provides common parallel patterns and heterogeneous code generation on top of LMS.
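The following schematic conveys the core idea (hypothetical types, not the actual Delite API): a DSL operation is declared as a structured parallel pattern rather than as a loop, which leaves the framework free to fuse patterns and to generate code per target.

```scala
// Parallel-pattern sketch in the spirit of Delite: operations are IR nodes
// describing *what* to compute in parallel, so the backend decides *how*
// (multi-core, cluster, GPU).
sealed trait ParallelPattern[B]
case class Input[A](data: Vector[A]) extends ParallelPattern[A]
case class MapOp[A, B](in: ParallelPattern[A], f: A => B) extends ParallelPattern[B]
case class ReduceOp[A](in: ParallelPattern[A], zero: A, op: (A, A) => A)
  extends ParallelPattern[A]

// "Sum of squares" as a pattern pipeline; a backend may fuse the map into
// the reduce before generating target code.
def sumOfSquares(v: Vector[Double]): ParallelPattern[Double] =
  ReduceOp(MapOp(Input(v), (x: Double) => x * x), 0.0, _ + _)
```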

What makes LMS particularly attractive for DSL development is deep linguistic reuse: in many cases DSL programs can reuse functionality of the host language (functions, objects, classes) without explicit support for these features on the DSL level. The program that the DSL compiler sees is much simpler than the original multi-stage program, and easier to optimize. Hence the slogan "abstraction without regret".
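The classic staged power function illustrates this reuse, continuing the MiniLMS sketch above: the recursion and conditional are plain Scala and run at staging time, so the DSL compiler only ever sees straight-line code.

```scala
import MiniLMS._

// An ordinary recursive Scala function over a staged base: since n is a
// static Int, the recursion unfolds during staging and leaves no calls or
// branches in the generated IR ("abstraction without regret").
def power(b: Rep[Double], n: Int): Rep[Double] =
  if (n == 0) 1.0 else b * power(b, n - 1)

// power(Sym[Double]("x"), 3)
//   == Times(Sym(x), Times(Sym(x), Times(Sym(x), Const(1.0))))
```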

Using LMS and Delite, we and our collaborators have developed DSLs for various application domains: OptiML for machine learning, OptiQL for query processing, OptiMesh for mesh-based PDE solvers, OptiGraph for graph processing, StagedSAC for multidimensional array programming, Jet for distributed batch processing, Odds for probabilistic programming, Spiral/S for generating numeric kernels, as well as staged SQL, parser combinators and shortcut fusion for efficient data processing. These DSLs compile to different parallel architectures, matching the performance of hand-optimized code and outperforming pure-library baseline implementations by up to several orders of magnitude. We have recently extended these facilities to hardware generation.

To achieve these results, several key innovations were necessary.
The most important ones are:

1. Embedded DSLs need to redefine if-then-else statements and other control structures. To this end, we have introduced the concept of language virtualization and added corresponding facilities to the Scala language (see the first sketch after this list).
2. A DSL embedding should maintain the order of DSL statements as they appear in the program. This is not guaranteed in embedding approaches based on quasi-quotation; LMS handles it through a graph-based intermediate representation.
3. Embracing functional programming in DSLs makes parallelization easy, but a naive implementation would be prohibitively inefficient due to the many intermediate objects being created. We designed a uniform representation for collection-like DSL operations that lends itself to aggressive fusion, which eliminates most such intermediate results.
4. Efficient parallelization demands precise reasoning about effects to ensure flexibility in scheduling. We have developed models that track effects and aliasing information in a fine-grained way, making trade-offs that appear useful in practice.
5. For many DSL operations there is a trade-off between a symbolic representation and staging. The former enables optimization based on rewriting, whereas the latter removes abstraction overhead. We overcome this trade-off based on the observation that staging a DSL program interpreter yields a program transformer (see the second sketch after this list). We use LMS to build sophisticated translation pipelines of multiple successive DSLs. Thus, we can first apply symbolic optimizations and then stage away abstraction overhead, instead of transforming a low-level representation directly.
6. Realistic programs need to use multiple DSLs. We are able to perform optimizations across different DSL blocks by performing DSL-specific optimizations per block and then translating all blocks to a single, lower-level representation.
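Point 1, sketched: in the Scala-Virtualized extension of the language, the compiler rewrites built-in control flow such as if (c) a else b into the overloadable method call __ifThenElse(c, a, b) (names as in the Scala-Virtualized publications). A DSL can then reify conditionals into its IR; shown below as an explicit call, since plain Scala performs no such rewriting.

```scala
// Language virtualization (a minimal sketch, reusing MiniLMS from above):
// a staged conditional becomes an IR node instead of an immediate branch.
// In Scala-Virtualized the call to __ifThenElse is inserted by the compiler;
// here we write it by hand.
case class IfNode[T](c: MiniLMS.Rep[Boolean],
                     t: MiniLMS.Rep[T],
                     e: MiniLMS.Rep[T]) extends MiniLMS.Exp[T]

def __ifThenElse[T](c: MiniLMS.Rep[Boolean],
                    t: => MiniLMS.Rep[T],
                    e: => MiniLMS.Rep[T]): MiniLMS.Rep[T] =
  IfNode(c, t, e)

// __ifThenElse(MiniLMS.Sym[Boolean]("c"), MiniLMS.lift(1.0), MiniLMS.lift(2.0))
//   == IfNode(Sym(c), Const(1.0), Const(2.0))
```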
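Point 5, sketched: the interpreter below for a tiny source language returns staged Rep values instead of numbers, so "running" it on a source term performs the translation, with the interpretive overhead (pattern matching, environment lookups) already gone from the output. (A minimal illustration, not the project's actual pipelines.)

```scala
import MiniLMS._

// A tiny source language and its interpreter. Because the result type is
// Rep[Double], evaluation emits target IR: the staged interpreter *is* a
// source-to-target translator.
sealed trait SrcExp
case class Lit(x: Double) extends SrcExp
case class Var(name: String) extends SrcExp
case class Mul(a: SrcExp, b: SrcExp) extends SrcExp

def eval(e: SrcExp, env: Map[String, Rep[Double]]): Rep[Double] = e match {
  case Lit(x)    => x                        // lifted to a staged constant
  case Var(n)    => env(n)                   // resolved at staging time
  case Mul(a, b) => eval(a, env) * eval(b, env)
}

// eval(Mul(Var("x"), Lit(2.0)), Map("x" -> Sym[Double]("x")))
//   == Times(Sym(x), Const(2.0)): the match and lookup ran during staging.
```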

Two projects seek to improve the experience of DSL designers and users. Forge, developed at Stanford, is a meta-DSL that generates DSL implementations from a declarative specification. Yin-Yang, developed at EPFL, is a novel macro-based front-end for embedded DSLs, which enables systematic conversion from a shallow to a deep embedding. Yin-Yang eliminates any residual linguistic mismatch between the host language and the deep embedding, at the cost of ruling out programming patterns that rely on a more explicit stage distinction. Both projects facilitate automatic generation of boilerplate backend code, customized error reporting, and better IDE integration.
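The shallow/deep distinction that Yin-Yang bridges, in a hand-written sketch (hypothetical toy API, for illustration only):

```scala
// A shallow embedding computes values directly; a deep embedding reifies
// the same operations into an IR that a compiler can optimize.
object Shallow {
  def add(a: Int, b: Int): Int = a + b            // runs immediately
}

object Deep {
  sealed trait Exp
  case class Const(x: Int) extends Exp
  case class Add(a: Exp, b: Exp) extends Exp      // records the operation
  def add(a: Exp, b: Exp): Exp = Add(a, b)        // builds IR instead
}

// Users write against the shallow interface; Yin-Yang converts such
// programs to the deep embedding behind the scenes using Scala macros.
```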

We have developed a sound design for Dependent Object Types (DOT), a type-theoretic foundation for languages like Scala. We derive DOT from System F<: in small steps:

1. translucency -- we add a lower bound to each type variable, in addition to its usual upper bound,
2. System D -- we turn each type variable into a regular term variable containing a type,
3. for a full subtyping lattice, we add intersection and union types,
4. for objects, we consolidate all values into records,
5. for objects that close over a self, we introduce a recursive type, binding a self term variable,
6. for recursive types, we first extend the theory in typing and then also in subtyping.

Through this bottom-up exploration, we discovered a sound, uniform yet powerful design for DOT, and have anchored the foundations of Scala in known territory.
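Several of these ingredients are visible directly in Scala source (standard Scala syntax): a type member with both a lower and an upper bound (translucency), carried by a term, and used through a path-dependent type.

```scala
// DOT ingredients as they surface in Scala: bounded abstract type members
// and path-dependent types. Runs as-is in the Scala REPL.
trait Animal
trait Dog extends Animal
class Puppy extends Dog

trait Keeper {
  type Pet >: Puppy <: Animal        // lower and upper bound (translucency)
  def adopt(p: Pet): Pet
}

val dogKeeper: Keeper { type Pet = Dog } = new Keeper {
  type Pet = Dog                     // a term variable "containing" a type
  def adopt(p: Dog): Dog = p
}

val pet: dogKeeper.Pet = dogKeeper.adopt(new Puppy)  // path-dependent type
```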

The project also resulted in the development of two high-performance data-parallel collection frameworks, namely the Scala parallel collections library and Scala Blitz (a usage sketch follows the list). The following innovations proved key to this development:

1. we developed new scheduling techniques for performing data-parallel operations, providing significant speedups for non-uniform workloads and removing the need for manual tuning of a scheduler;

2. we introduced work-stealing iterators that allow fine-grained and efficient work-stealing for common data structures;

3. we overcame abstraction penalties through call-site specialization of data-parallel operation instances.
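Using the resulting frameworks requires only a one-word change to a sequential program; for example, with the Scala parallel collections (Scala 2.12 syntax; since 2.13 the .par API lives in the separate scala-parallel-collections module):

```scala
// A data-parallel sum of squares: .par switches to the parallel collections,
// which split the work across available cores and merge the partial results.
val xs = (1 to 10000000).toArray

val sumSeq = xs.map(x => x.toLong * x).sum        // sequential baseline
val sumPar = xs.par.map(x => x.toLong * x).sum    // data-parallel version

assert(sumSeq == sumPar)                          // same result, less time
```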