The Case for Embedded-Style Domain-Specific Languages (eDSLs)
Approaches to parallel programming include auto-parallelising compilers, full languages designed for parallel programming, library-based embedded approaches such as threading libraries, and the newly emerging embedded-style domain-specific languages.
On auto-parallelising, programmers want it to work so badly that it keeps being attempted, then dropped as the realization sets in that the programmer has information in their head that the language never captured: how the work is organized, how to divide it up, and how the pieces should communicate. Attempting to re-create that missing information has so far proven insurmountable, though there may yet be hope in dependency detection coupled with machine-learning techniques. See http://groups.inf.ed.ac.uk/pasta/pub_PACT10.html
The underlying issue, though, is performance. Performance on parallel hardware is far more non-linear than on sequential hardware: small changes in code can cause dramatically large changes in performance, to the tune of several orders of magnitude. Sequential hardware never exhibited this degree of sensitivity, so the programming community has little experience with it yet. But it's at the heart of the difference between sequential and parallel.
This sensitivity forces rearranging the code so that patterns in the code match patterns in the hardware. So far, this has been done by hand-tuning applications.
To facilitate this hand-tuning, many languages expose the hardware in some form, whether as a "place" in UPC, Charm++, X10, and so on, as a location in MPI, or simply as "virtual processors" (which are threads) in pthreads, OpenMP, TBB, and the rest. Myself, I would put all of these languages in the same category: each places some degree of explicit hardware management into the application.
A second category would be languages that fully remove hardware implications from the application. The only successful ones I know of in this category apply to a narrow range of patterns within applications. High Performance Fortran took this approach: it handles only matrix-based patterns, but for those it can fully remove hardware details from the application. CnC seems to be in this category also, but it hasn't really caught on, I would say because its abstract mental model feels unnatural. Functional languages fall in this category too, but their semantics give them high overhead, and they also don't capture the needed parallel-performance information.
eDSLs as a possible solution
So, how to solve this?
One promising approach appears to be embedded-style domain-specific languages (eDSLs), such as the work being done at Stanford (http://ppl.stanford.edu) and the work done here, at the Open Source Research Institute.
eDSLs identify common patterns within a domain and encode those as language constructs. They create custom syntax for those constructs, which embeds into a base sequential language such as C or Java. Code is therefore mainly sequential, with the custom constructs intermixed. A pre-processor then translates the custom syntax into the base language, whichever it is, and inserts calls to a runtime system that implements the behaviors.
In this way, the programmer thinks in terms of domain patterns, which feel natural to them. At the same time, those patterns carry the parallel-performance information needed by the toolchain and runtime system, which rearrange the domain patterns to match the hardware's patterns. But this rearrangement happens inside the language implementation, hidden from application programmers.
In the end, the programmer has a familiar sequential language environment. The custom syntax they must learn is minimal and natural. The tools capture all the information they need to make the code efficient on a wide array of hardware. So it's easy to learn, highly productive to use, and independent of hardware (with caveats).
Sounds ideal, so what's the catch? First, this approach is just getting started. Second, these languages have to be designed very carefully, balancing three widely different aspects, and even parallelism experts normally have expertise in only one of them. The language must be free of hardware implications; it must have convenient syntax that is easy to learn and productive to use; and it must capture all the information needed for high performance across all hardware. Language creators are usually good at only one of the three.
The other caveat is that some forms of hardware have patterns so different from others that completely different algorithms must be used; simple rearrangement of patterns isn't enough. That is the case for GPU hardware versus CPUs. Despite the claims made for OpenCL, code may function on both kinds, but performance is a different story.
I see a bright future for this approach, especially when combined with infrastructure that supports creating such languages quickly and distributing the different executable versions needed by different machines.
With such distribution, the code lives in one place, and the tools rearrange its patterns for all the different kinds of hardware, automatically creating a different executable for each hardware category. Then, to install an app, an end user connects, uploads their machine details, and receives back the executable tuned to their machine.