Langlet Style DSL Design and Implementation
EuroDSL will provide a convenient, easy-to-learn front end for users who want to write code within a particular area of expertise, or around a standard pattern that is common in programming (for example, GUIs or data visualization).
Take fluid dynamics as an example. A fluid dynamics Domain Specific Language (DSL) would encapsulate the hardware-specific parts of the program, such as moving data within the machine and scheduling work onto compute nodes. The user would contribute the details of the computation, such as the physics model applied at each point in the finite-element graph. The key is the decomposition into aspects that are hardware-specific and aspects that are end-user-specific. Hardware aspects are captured as a "framework", while end-user aspects plug into this framework, each one representing the content of a small piece of work. (As an example, see the HWSim langlet, which provides a framework for event-driven simulations. It was created for simulating computation architectures, but applies to any event-driven simulation.)
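The split between framework and plug-in can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `GridFramework` and `relax`, the one-dimensional chain of elements, and the averaging rule are hypothetical and are not taken from any EuroDSL or HWSim code.

```python
from typing import Callable, List, Sequence

# Type of the user-supplied plug-in: (element value, neighbour values) -> new value.
ElementUpdate = Callable[[float, Sequence[float]], float]


class GridFramework:
    """Framework side (hypothetical): owns the data and the iteration order."""

    def __init__(self, values: List[float]) -> None:
        self.values = list(values)

    def neighbours(self, i: int) -> List[float]:
        # Elements form a 1-D chain; the two end elements have one neighbour.
        return [self.values[j] for j in (i - 1, i + 1) if 0 <= j < len(self.values)]

    def step(self, update: ElementUpdate) -> None:
        # A real framework would decide here how to divide and place the work.
        # This sketch just visits every element sequentially, reading old values.
        self.values = [update(v, self.neighbours(i)) for i, v in enumerate(self.values)]


def relax(value: float, neighbours: Sequence[float]) -> float:
    """User side (hypothetical): the computation plugged in at each element."""
    return (value + sum(neighbours)) / (1 + len(neighbours))


grid = GridFramework([0.0, 0.0, 9.0, 0.0, 0.0])
grid.step(relax)
print(grid.values)
```

The user writes only `relax`. Replacing the sequential loop inside `step` with a parallel or accelerator-specific version would not require any change to the user's function, which is the point of the decomposition.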
The langlet DSL presents the framework in a way that is simple to understand, and provides the opportunity to "plug in" the behavior that is the heart of the computation (the meaning of what gets computed). The DSL provides the plumbing; the user plugs in the meaning. Behind the DSL lie a number of implementations of the framework, one for each hardware class. These implementations are connected to the rest of the system through a set of internal interfaces defined by the EuroDSL project, and the DSL provides an implementation of each interface. The internal interfaces expose the aspects most critical to scheduling work, which are:
- the ability to choose or modify, on the fly, the size of a chunk of work that is then assigned to a given compute resource (and to let that resource further sub-divide the chunk)
- the ability to predict the time a given resource needs to complete a given chunk of work, or to state that the time is inherently unpredictable (performing this prediction is part of the EuroDSL research)
- the ability to predict the time required to move the chunk of work to the resource, as well as the rate at which the chunk of work uses communication resources.
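The three aspects listed above could be expressed as an interface along the following lines. This is a sketch under stated assumptions, not the EuroDSL interface itself: the names `WorkChunk`, `CommEstimate`, and `ElementRange`, the resource labels, and the cost tables are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional

BYTES_PER_ELEMENT = 8                                  # assumed element size
SECONDS_PER_ELEMENT = {"cpu": 1e-6, "gpu": 1e-7}       # made-up cost table
LINK_BYTES_PER_S = {"cpu": 1e10, "gpu": 1e9}           # made-up link speeds


@dataclass
class CommEstimate:
    transfer_seconds: float        # predicted time to move the chunk to the resource
    comm_bytes_per_second: float   # predicted communication rate while it runs


class WorkChunk(ABC):
    """Hypothetical interface that a langlet implementation would provide."""

    @abstractmethod
    def split(self, parts: int) -> List["WorkChunk"]:
        """Divide this chunk on the fly; the pieces can be split again."""

    @abstractmethod
    def predict_seconds(self, resource: str) -> Optional[float]:
        """Predicted run time, or None when it is inherently unpredictable."""

    @abstractmethod
    def predict_comm(self, resource: str) -> CommEstimate:
        """Predicted transfer time and communication rate on this resource."""


@dataclass
class ElementRange(WorkChunk):
    """Toy implementation: a contiguous range [start, stop) of grid elements."""
    start: int
    stop: int

    def split(self, parts: int) -> List["WorkChunk"]:
        size = self.stop - self.start
        parts = max(1, min(parts, size))   # never more pieces than elements
        bounds = [self.start + size * k // parts for k in range(parts + 1)]
        return [ElementRange(a, b) for a, b in zip(bounds, bounds[1:])]

    def predict_seconds(self, resource: str) -> Optional[float]:
        per_element = SECONDS_PER_ELEMENT.get(resource)
        if per_element is None:
            return None                    # no model for this resource
        return (self.stop - self.start) * per_element

    def predict_comm(self, resource: str) -> CommEstimate:
        n = self.stop - self.start
        link = LINK_BYTES_PER_S.get(resource, 1e9)
        run = self.predict_seconds(resource)
        # Assume only the two boundary elements are exchanged during the run.
        rate = 2 * BYTES_PER_ELEMENT / run if run else 0.0
        return CommEstimate(n * BYTES_PER_ELEMENT / link, rate)


pieces = ElementRange(0, 10).split(3)
print([(p.start, p.stop) for p in pieces])
```

Returning `None` from `predict_seconds` is one possible way to "state the inherent non-predictability" of a chunk. A scheduler built on this interface would then need a fallback policy for such chunks.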
We don't expect this approach to win Gordon Bell Awards! Its goal is not to achieve the absolute highest possible performance. Rather, it is to make parallel hardware easily accessible to those who would otherwise be locked out, whether by the conceptual hurdle, by the practical cost of tuning software for a variety of hardware targets, or by the cost of parallel development relative to sequential development. Our target users are those who do not have the technical ability or the resources to perform heroic coding efforts. Most cannot write even the simplest parallel code. The benefit is that such users can harness a variety of parallel hardware targets while expending even less effort than they would to write sequential code.
To gain adoption, we are targeting an ultra-easy-to-learn form that we call "langlets". Each langlet has only a few constructs, which conceptually match the patterns present in the user's domain. Several langlets can be mixed into a sequential code base, so they fit the normal sequential development process and, during a run, coherently share the computation resources.
In the case of the fluid dynamics example, the langlet would present a finite element grid as one of its core concepts, and provide a means for the user to plug in the computation that takes place at each element. This langlet would include a pattern for one element to communicate with other elements in the grid.
Internally, the langlet implementation is written in terms of infrastructure created as part of EuroDSL. The infrastructure is patterned around the key elements mentioned above: on-the-fly division of work, prediction of execution time, and prediction of communication needs. The langlet code base provides implementations of those interfaces. The EuroDSL infrastructure then includes scheduling approaches that use the provided predictors to search for the best choice of work sizes and assignment of work to hardware resources.
A given langlet will have a number of implementations, one for each class of target hardware. Each implementation exports the interface needed by the EuroDSL infrastructure, and internally takes advantage of tricks available in the hardware. Finally, EuroDSL provides a number of toolchains, one for each class of target hardware. A given toolchain takes the user-supplied code, combines it with the langlet implementation for that class of hardware, and compiles it into an executable form suitable for the target hardware.
For some types of hardware, such as GPUs, the difficult part lies in the on-the-fly choice of work size. Progress has been made on this (see the paper on DKU combined with polyhedral techniques, and the polyhedral JIT work at Passau University and elsewhere). We don't expect to solve all the hard problems within the EuroDSL project, but we do feel confident that we will succeed at providing a number of langlets that are easy to use for end users who have no parallel programming experience, or who do not have the resources for standard parallel code development.
We are also confident that we will succeed at creating an infrastructure that lets a single end-user code base automatically take advantage of several different kinds of hardware, with on-the-fly choice of which hardware is employed. We expect to deliver this for langlets that fit patterns on which past research has already made strides. Going beyond this to more difficult cases will be part of the work performed within the EuroDSL project.
We plan to include a variety of hardware targets, including cloud-based racks of servers, some with GPU accelerators and some with Phi-based accelerators, as well as desktop workstations with accelerators. We are also considering targeting HPC machines, either via the FutureGrid project, on which we have an account, or via one of the EU-based research platforms. Finally, we are considering including hardware such as the Parallella card and the Kalray chips. The goal is to tie all project platforms together with a scheduler that can choose, on the fly, how to break up the work and assign pieces to each of the platforms (which will only be reasonable for certain kinds of problems, and only when they are large enough).
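A scheduler that uses predictors to search for work sizes and placements could look like the following minimal Python sketch. It is not the EuroDSL scheduler: the `Resource` record, the linear cost model, the greedy placement rule, and all the numbers are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

Placement = List[Tuple[int, str]]   # (chunk size in elements, resource name)


@dataclass(frozen=True)
class Resource:
    name: str
    seconds_per_element: float            # stand-in for an execution-time predictor
    transfer_seconds_per_element: float   # stand-in for a communication predictor
    startup_seconds: float                # fixed per-chunk overhead (launch, latency)


def chunk_cost(n: int, r: Resource) -> float:
    """Predicted time for resource r to receive and run a chunk of n elements."""
    return r.startup_seconds + n * (r.seconds_per_element + r.transfer_seconds_per_element)


def assign(sizes: List[int], resources: List[Resource]) -> Tuple[float, Placement]:
    """Greedy placement: each chunk goes where it is predicted to finish first."""
    busy_until = {r.name: 0.0 for r in resources}
    placement: Placement = []
    for n in sizes:
        best = min(resources, key=lambda r: busy_until[r.name] + chunk_cost(n, r))
        busy_until[best.name] += chunk_cost(n, best)
        placement.append((n, best.name))
    return max(busy_until.values()), placement


def search(total: int, resources: List[Resource],
           candidate_counts: Iterable[int]) -> Optional[Tuple[float, Placement]]:
    """Try several chunk counts and keep the one with the smallest predicted makespan."""
    best: Optional[Tuple[float, Placement]] = None
    for k in candidate_counts:
        base, extra = divmod(total, k)
        sizes = [base + 1] * extra + [base] * (k - extra)   # near-equal chunks
        makespan, placement = assign(sizes, resources)
        if best is None or makespan < best[0]:
            best = (makespan, placement)
    return best


resources = [
    Resource("cpu", seconds_per_element=1.0, transfer_seconds_per_element=0.0,
             startup_seconds=0.0),
    Resource("gpu", seconds_per_element=0.25, transfer_seconds_per_element=0.25,
             startup_seconds=4.0),
]
makespan, placement = search(16, resources, candidate_counts=[1, 2, 4])
print(makespan, placement)
```

With these made-up numbers, splitting the work into two chunks wins: one chunk is too coarse to use both resources, while four chunks pay the accelerator's startup overhead too often. This mirrors the remark above that tying platforms together only pays off for certain problems of sufficient size.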