CloudDSM.WorkPackages History


March 25, 2014, at 11:49 AM by 80.114.134.224 -
Changed lines 37-42 from:
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
Changed lines 46-50 from:
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --
to:
Changed lines 50-54 from:
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --
to:
Changed lines 54-58 from:
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --
to:
Changed lines 58-63 from:
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --
to:
Changed lines 66-70 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 70-74 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 74-78 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 156-157 from:
* Deliverable D6.: Month 36 -- final set of tools needed by DC's main customers in order to develop CloudDSM applications
to:
* Deliverable D6.: Month 36 -- final set of tools needed by DC's main customers in order to develop CloudDSM applications
March 25, 2014, at 11:48 AM by 80.114.134.224 -
Changed lines 110-114 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 114-118 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 118-123 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 126-130 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 130-134 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 134-138 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 138-143 from:
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Changed lines 159-162 from:
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
to:
Deleted lines 162-165:
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
March 25, 2014, at 11:47 AM by 80.114.134.224 -
Changed lines 5-8 from:
* WP 3: Work Division and Scheduling -- ICL
* WP 4: DSM Runtime System -- Sean
* WP 5: Application visible interface Design -- Sean
* WP 6: Toolchain -- IBM
to:
* WP 3: DSM Runtime System -- Sean
* WP 4: Application visible interface Design -- Sean
* WP 5: Toolchain -- IBM
* WP 6: Work Division and Scheduling -- ICL
Changed lines 42-52 from:
!!! WP 3: Work Division and Scheduling

* Task1: Develop scheduler (ICL, XLab, Sean)
* Milestone M2.5: Month 12 -- First, simple, version of the full scheduler integrated into portal (replaces "dummy" version inside prototype portal)
* Deliverable D2.5: Month 12 -- code and related artifacts of the working scheduler
* Milestone M2.6: Month 24 -- Subset of advanced scheduler features functioning and integrated into portal.
* Deliverable D2.6: Month 24 -- code and related artifacts of the working second-stage scheduler
* Milestone M2.7: Month 36 -- All advanced scheduler features working within the integrated system.
* Deliverable D2.7: Month 36 -- code and related artifacts of the working final, advanced scheduler
to:
Changed lines 168-169 from:
!!! WP 6: End User Application -- Douglas Connect
to:
!!! WP 6: Work Division and Scheduling -- ICL
Added lines 171-183:
* Task1: Develop scheduler (ICL, XLab, Sean)
* Milestone M2.5: Month 12 -- First, simple, version of the full scheduler integrated into portal (replaces "dummy" version inside prototype portal)
* Deliverable D2.5: Month 12 -- code and related artifacts of the working scheduler
* Milestone M2.6: Month 24 -- Subset of advanced scheduler features functioning and integrated into portal.
* Deliverable D2.6: Month 24 -- code and related artifacts of the working second-stage scheduler
* Milestone M2.7: Month 36 -- All advanced scheduler features working within the integrated system.
* Deliverable D2.7: Month 36 -- code and related artifacts of the working final, advanced scheduler



!!! WP 7: End User Application -- Douglas Connect
[[CloudDSM.WP7 | details of WP7]]
Changed lines 208-209 from:
!!! WP 7: Dissemination and Exploitation -- Gigas
[[CloudDSM.WP7 | details of WP7]]
to:
!!! WP 8: Dissemination and Exploitation -- Gigas
[[CloudDSM.WP8 | details of WP8]]
March 25, 2014, at 11:40 AM by 80.114.134.224 -
Changed lines 5-11 from:
* WP 3: DSM Runtime System -- Sean
* WP 4: Application visible interface Design -- Sean
* WP 5: Toolchain -- INRIA
* WP 6: End User Application -- Douglas Connect
* WP 7: Dissemination and Exploitation -- TBD
to:
* WP 3: Work Division and Scheduling -- ICL
* WP 4: DSM Runtime System -- Sean
* WP 5: Application visible interface Design -- Sean
* WP 6: Toolchain -- IBM
* WP 7: End User Application -- Douglas Connect
* WP 8: Dissemination and Exploitation -- TBD
Deleted lines 33-40:
* Task3: Develop scheduler (ICL, XLab, Sean)
* Milestone M2.5: Month 12 -- First, simple, version of the full scheduler integrated into portal (replaces "dummy" version inside prototype portal)
* Deliverable D2.5: Month 12 -- code and related artifacts of the working scheduler
* Milestone M2.6: Month 24 -- Subset of advanced scheduler features functioning and integrated into portal.
* Deliverable D2.6: Month 24 -- code and related artifacts of the working second-stage scheduler
* Milestone M2.7: Month 36 -- All advanced scheduler features working within the integrated system.
* Deliverable D2.7: Month 36 -- code and related artifacts of the working final, advanced scheduler
Added lines 42-50:
!!! WP 3: Work Division and Scheduling

* Task1: Develop scheduler (ICL, XLab, Sean)
* Milestone M2.5: Month 12 -- First, simple, version of the full scheduler integrated into portal (replaces "dummy" version inside prototype portal)
* Deliverable D2.5: Month 12 -- code and related artifacts of the working scheduler
* Milestone M2.6: Month 24 -- Subset of advanced scheduler features functioning and integrated into portal.
* Deliverable D2.6: Month 24 -- code and related artifacts of the working second-stage scheduler
* Milestone M2.7: Month 36 -- All advanced scheduler features working within the integrated system.
* Deliverable D2.7: Month 36 -- code and related artifacts of the working final, advanced scheduler
March 24, 2014, at 07:13 AM by 92.76.168.74 -
Changed lines 79-87 from:
* Task4: binary runtime specializer
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task7: Integration testing (Gigas, DC, INRIA, XLAB)
to:
* Task4: Integration testing (Gigas, DC, INRIA, XLAB)
March 24, 2014, at 05:32 AM by 80.99.199.138 -
Added lines 205-210:

* Task4: Develop a software solution for de-novo genome assembly using de Bruijn graphs, which utilizes the CloudDSM system. (LarkBio)

* Task5: Validate the solution on a dataset which can be solved using currently available tools. (LarkBio)

* Task6: Validate the solution on a large dataset, which is not feasible without CloudDSM. (LarkBio)
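The de Bruijn decomposition Task4 relies on reduces to a simple rule: every k-mer of a read contributes one edge from its (k-1)-prefix to its (k-1)-suffix. The C sketch below shows only that decomposition step for a single read; a real assembler hashes and merges such edges across millions of reads. This is an illustration, not LarkBio's planned implementation.

[@
/* Sketch: emit the de Bruijn edges contributed by one read.  K is kept
 * tiny for readability; real assemblers use k around 21-63. */
#include <stdio.h>
#include <string.h>

#define K 4

void emit_edges(const char *read)
{
    size_t len = strlen(read);
    for (size_t i = 0; i + K <= len; i++) {
        char prefix[K], suffix[K];
        memcpy(prefix, read + i, K - 1);      prefix[K - 1] = '\0';
        memcpy(suffix, read + i + 1, K - 1);  suffix[K - 1] = '\0';
        printf("%s -> %s\n", prefix, suffix); /* one graph edge per k-mer */
    }
}

int main(void)
{
    emit_edges("ACGTGACG");
    return 0;
}
@]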
March 23, 2014, at 05:36 AM by 88.71.191.47 -
Added line 13:
[[CloudDSM.WP1 | details of WP1]]
Changed line 220 from:
* Deliverable D6.: Month 36 -- final set of tools needed by DC's main customers in order to develop CloudDSM applications
to:
* Deliverable D6.: Month 36 -- final set of tools needed by DC's main customers in order to develop CloudDSM applications
March 20, 2014, at 08:48 AM by 192.16.201.181 -
Changed line 147 from:
!!! WP 5: Compiler Toolchain -- INRIA
to:
!!! WP 5: Compiler Toolchain -- IBM
March 19, 2014, at 06:04 AM by 80.114.135.137 -
Changed line 18 from:
!!! WP 2: CloudDSM Portal -- XLAB
to:
!!! WP 2: Web Portal for CloudDSM -- XLAB or ICL
March 19, 2014, at 06:01 AM by 80.114.135.137 -
Deleted lines 1-2:

List of workpackages, and leader for each WP
March 19, 2014, at 06:01 AM by 80.114.135.137 -
Added lines 3-4:
List of workpackages, and leader for each WP
Changed lines 21-51 from:
[[CloudDSM.WP2 | link to details of WP2]]

!!! WP 3: DSM Runtime System -- Sean leads WP (at INRIA, or Bell Labs or CWI)

* Task1: Architecture of hierarchical DSM runtime system. The delivered runtime will be a federation. On each virtual machine, or in some cases physical machine, an individual runtime system will be in operation. Each of these individual runtime systems will interact with the others to form a collective, cooperative, overall runtime system, which presents a single address space abstraction. This task will define the architecture, interfaces, and protocols for this overall runtime system. Then for each class of hardware an individual task will implement the interfaces and protocols for that particular kind of hardware, as described below.
-- Sean leads task, partners for 2 through 6, plus Imperial, plus INRIA, plus XLAB participate on aspects that involve their individual piece. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLAB will advise on how the architecture impacts the deployment tool. Partners for 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLAB)

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLAB)

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLAB

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLAB

* Task6: binary runtime specializer -- IBM (Q: any chance to get an open source release? Or to apply it to other ISAs?). Use fat binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing for other ISAs as well.

* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, report to the partners who developed the individual runtimes in tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads -- partners for 2 through 6, plus XLAB are involved.

* Question: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.


[[CloudDSM.RuntimeSpecializationWP]]
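To make Task1 above concrete, here is a minimal, purely hypothetical sketch of the kind of per-node interface the federation architecture would have to standardize so that each hardware-specific runtime (Tasks 2 through 6) can cooperate in presenting a single address space. Every type and function name is invented for illustration.

[@
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dsm_addr_t;  /* address in the shared global space */

typedef struct dsm_node_ops {
    /* join or leave the federation of per-machine runtimes */
    int  (*join)(const char *federation_endpoint);
    void (*leave)(void);

    /* coherence hooks: fetch a region before reading, acquire
     * ownership before writing, publish after writing */
    int  (*fetch)(dsm_addr_t addr, size_t len, void *local_buf);
    int  (*acquire)(dsm_addr_t addr, size_t len);
    int  (*release)(dsm_addr_t addr, size_t len);
} dsm_node_ops;

/* each hardware-specific runtime registers its implementation */
int dsm_register_node_ops(const dsm_node_ops *ops);
@]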

!!! WP 4: Application visible interface design
* Task 1: Define more precisely the class of applications that CloudDSM targets
* Task 2: Define the needs of the toolchain, what degrees of freedom it needs in order to accomplish the desired transforms of the source code.
* Task 3: Define the needs of the runtime system, what characteristics it needs in the code generated by the toolchain in order to deliver high performance.
* Task 4: Define the needs of the application developer, what mental models and what syntax, and what debugging and code-checking support they desire.
* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer. There will be two levels of code annotation.. one that high level application developers see and use, of which there will be many variations. For example, Reo will have a different high level user interface than the pragma system for OpenMP. The second, lower level will be common to all versions of the higher level interface. This will be used directly by the toolchain to perform code transforms. Each of the higher level forms will be translated into the same, common, lower level form. This task only considers the top level forms of the code. WP 5 separately defines the common lower level form.
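A hypothetical illustration of the two annotation levels Task 5 describes: the high-level form below is ordinary OpenMP, while the "dsm" pragma syntax is invented solely to show the translation idea; defining the real common lower level form is the subject of this WP and of WP 5.

[@
/* high-level form, as an application developer writes it */
void scale_high(double *a, int n, double s)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= s;
}

/* one conceivable common lower level form after translation: the work is
 * named, its data footprint is declared, and the division into chunks is
 * left as a parameter for the scheduler (pragma syntax is invented) */
void scale_low(double *a, int n, double s)
{
    #pragma dsm work_unit(scale) reads(a[0:n]) writes(a[0:n]) chunkable(i)
    for (int i = 0; i < n; i++)
        a[i] *= s;
}
@]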
to:
[[CloudDSM.WP2 | details of WP2]]
* Task1: Develop portal (XLab, ICL, DC, Gigas, Sean)
* Milestone M2.1: Month 6 -- working, simple, prototype of portal, with simple implementations of each subsystem
* Deliverable D2.1: Month 6 -- code and related artifacts of a working simple portal prototype (ready for use by DC and INRIA)
* Milestone M2.2: Month 24 -- Upgrades to portal that support changes made in the Scheduler and Specialization harness
* Deliverable D2.2: Month 24 -- Technical Report, code, and other artifacts related to the implementation of the portal and supporting subsystems.
* Milestone M2.3: Month 36 -- Upgrades to portal that support changes made in the Scheduler and Specialization harness
* Deliverable D2.3: Month 36 -- Technical Report, code, and other artifacts related to the implementation of the portal and supporting subsystems.
Changed lines 30-46 from:
Tasks 1 through 5 will be performed iteratively, with multiple revisions during the first six months of the project. Each of the tasks will have an impact on the other tasks, and it will require a large amount of communication, via iterations, in order to find a suitable common ground interface that supports all the aspects well.

* Task 6: Development tools to support the writing of application code. This includes code checkers that enforce restrictions on what coding practices are allowed inside the portions of the application that employ the CloudDSM system. It also includes debugging aids, to detect bugs and to narrow down the portion of the application code causing discrepancies from specified behavior.

-- Sean and INRIA and Douglas Connect will lead, with input from XLAB, Imperial, and partners for WP3 tasks 2 through 6.


Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.

Comment and Question: "It may provide benefit desired by end-users if location-aware high-level annotations were also provided, such as regions, effects, etc. With these, the programmers will be able to communicate placement information. We found that for Myrmics, the programmer can know more about placement than the runtime/compiler can infer, and once placement/allocation is done properly, the shared-mem abstraction still helps with coding, except that the manual placement provides superior locality and performance. The programmer knows about and can say things about locality (like the X10 places, or regions in Fortress)." Question: "does it make sense to expose location within a Cloud-system that automatically and dynamically changes the number, type, and location of machines assigned to the computation?" Answer: you have nailed the heart of what this workpackage is all about. This will be an on-going discussion during the first six to nine months of the project. Indeed, one desire is to capture the understanding that the programmer has in their head and uses during the process of specifying pieces of work and choosing, themselves, where to place each piece. The goal of the WP is to discover an encoding of the process that the programmer does in their head, so that they encode that mental process, in a parameterized way. The automation then chooses the parameter values, and plugs them into what the programmer provided, which delivers pieces of work and the affinities among them. The WP content is the work of discovering programmer abstractions that get us as close to there as possible, in a way that we know how to implement..


Question: "any hints on what is the common low-level form of the source that is produced by the development toolchain? is it annotated source code?" Answer: Figuring this out is the content of WP4. Albert would like a source code annotation form for project logistics reasons. In that case by-hand modification of existing OpenMP libraries can begin at once, and act as a test bed for rapid iterations of what is the best low-level form. At the same time, compiler techniques can be tried, and also at the same time high level end-user annotations can be tested for how the "feel" to the application programmers. The DSM-specializer can be worked on in tight iterations with figuring out what the best low-level representation should be.. any desired changes in representation are just done quickly by hand -- no need to fiddle with IR, and reasonably decoupled from high level annotation form..

Question: "How does the IBM fat-binary specializer interact with the DSM runtime system?" Answer: AFAIU, the re-optimizer interrupts work in progress, changes the executable, then resumes work in progress. But it doesn't controls what work is assigned to what core. The optimizations it performs are single-thread optimizations, and also the re-optimizer may be told by the DSM runtime or by the portal to adjust the code such that the size of chunk of work performed between DSM calls is adjusted, or the layout or access pattern of data is adjusted. It is not clear yet whether the re-optimizer tool will make decisions on its own about the best chunk size. It might communicate performance feedback to the DSM runtime, and optimal chunk size decisions are made there. Or those decisions may be passed along to the portal. Wherever they are ultimately made, it will be up to the Dorit tool to modify the code such that it actually performs work in the chosen size of chunk. It still remains to work through the details of how the DSM runtime will interact with the Dorit tool.
to:
* Task2: Define Scheduling component within portal (ICL, XLab, Sean)
* Milestone M2.4: Month 6 -- detailed description of the final full featured scheduler
* Deliverable D2.4: Month 6 -- technical report that describes the interfaces to the scheduler and what functions it performs

* Task3: Develop scheduler (ICL, XLab, Sean)
* Milestone M2.5: Month 12 -- First, simple, version of the full scheduler integrated into portal (replaces "dummy" version inside prototype portal)
* Deliverable D2.5: Month 12 -- code and related artifacts of the working scheduler
* Milestone M2.6: Month 24 -- Subset of advanced scheduler features functioning and integrated into portal.
* Deliverable D2.6: Month 24 -- code and related artifacts of the working second-stage scheduler
* Milestone M2.7: Month 36 -- All advanced scheduler features working within the integrated system.
* Deliverable D2.7: Month 36 -- code and related artifacts of the working final, advanced scheduler

* Task4: Testing and performance tuning of full CloudDSM system (IBM, DC, INRIA, CWI, Gigas, ICL, XLab)
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --



!!! WP 3: DSM Runtime System (Sean, INRIA, CWI, ICL, XLab)
[[CloudDSM.WP3 | details of WP3]]


* Task1: Architecture of hierarchical DSM runtime system.
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task2: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLAB
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task3: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLAB
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task4: binary runtime specializer
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task7: Integration testing (Gigas, DC, INRIA, XLAB)
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 -- compilation report of which integrations passed and which failed, and actions taken to fix the failures
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --


!!! WP 4: Application visible interface design
[[CloudDSM.WP4 | details of WP4]]

* Task 1: Define more precisely the class of applications that CloudDSM targets
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 2: Define the needs of the toolchain, what degrees of freedom it needs in order to accomplish the desired transforms of the source code.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 3: Define the needs of the runtime system, what characteristics it needs in the code generated by the toolchain in order to deliver high performance.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 4: Define the needs of the application developer, what mental models and what syntax, and what debugging and code-checking support they desire.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 6: Development tools to support the writing of application code.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Added lines 150-151:
[[CloudDSM.WP5 | details of WP5]]
Added lines 153-159:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Added lines 161-167:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Changed lines 169-181 from:
* Task 4: Create transform tools that translate from the common lower level form into the final C form of the code. The final C form includes OS calls, DSM runtime system calls, and synchronization calls that are inserted by the tool. The final C code has a form of the application that performs large chunks of work in-between calls to the DSM system. Each target hardware platform will require its own variation of the transform tool, which is tuned to the details of that hardware, especially communication details. The tool may produce a single multi-versioned binary, or it may include a runtime specializer, or it may generate many independent versions of the binary. A large portion of the research will involve determining the best approach.
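A hypothetical sketch of what the "final C form" emitted by such a transform tool might look like: a large chunk of pure computation between DSM runtime calls. The dsm_acquire/dsm_release names are invented placeholders for whatever API the DSM runtime WP actually defines.

[@
#include <stddef.h>

/* assumed runtime API -- placeholder names, not the real interface */
void dsm_acquire(void *addr, size_t len);
void dsm_release(void *addr, size_t len);

void transformed_kernel(double *a, size_t n, size_t chunk)
{
    for (size_t base = 0; base < n; base += chunk) {
        size_t end = (base + chunk < n) ? base + chunk : n;

        /* may suspend this context while the DSM moves data */
        dsm_acquire(a + base, (end - base) * sizeof *a);

        /* large chunk of pure computation between DSM calls */
        for (size_t i = base; i < end; i++)
            a[i] = a[i] * a[i];

        /* publish results back to the shared address space */
        dsm_release(a + base, (end - base) * sizeof *a);
    }
}
@]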

FORTH will contribute their compiler work that recreates locality and placement information when the programmer doesn't explicitly declare it (this is used to replace remote accesses with DMA-ing of whole pages for performance).


Comment and Question: "For the IBM fat binary specializer, there are three stages (1) development stage: static generic compilation on developer machine which produces a custom IR form plus generic executable (2) static specialization compilation on a server, or during load, which generates a Power executable specialized to a specific HW (3) runtime fat-binary based recompilation on the actual deployed HW" Question: "How does this fit into CloudDSM?" Answer: Stage 1 will remain on the developer machine, stage 2 will take place inside the CloudDSM portal, and stage 3 inside the Cloud server during execution.

Question: "how will stage 2 fit with the DSM specific specializations?" Answer: this is an open question, to be resolved during the WP. We need some pictures, to figure out what tools do what at which point..

Comment: "Stage 1 happens inside the development environment on a desktop machine. The low-level annotated source is then sent to the CloudDSM portal by the developer. This process registers the application and makes it available for the end-user to run. This registration process also causes the low-level annotated source to be given to a specialization 'harness'. That harness invokes a number of specializer modules. One specializer module is provided by IBM. This module re-runs stage 1 and then runs stage 2 several times, once for each potential Power HW configuration that the CloudDSM system could send the fat-binary to (the module may, in fact cause stage 1 and stage 2 to run remotely on Power ISA machines, inside their own Cloud VM). Lastly, after the user starts the application and issues a request for computation, the portal deploys a unit of work to a Cloud VM running on a Power ISA machine. That Cloud VM has the DSM runtime in it, and that is given the unit of work. The unit of work includes a function within the fat binary to perform. The fat binary is dynamically linked to the DSM runtime. During execution, the work suspends and the binary optimizer takes over, modifies the code, then resumes the work. When the work reaches a DSM call, the DSM runtime suspends the execution context. That context will remain suspended while communication of data takes place. The DSM runtime will switch the CPU to a different context, whose communication has completed and is ready to resume."

to:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 4: Create transform tools that translate from the common lower level form into the final C form of the code.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Added lines 187-188:
[[CloudDSM.WP6 | details of WP6]]
Added lines 190-195:
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
Added lines 197-203:
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
Changed lines 205-210 from:
!!! WP 7: Dissemination and Exploitation
* Gigas will make the results of the project available to its customers
* Douglas Connect will use the results of the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.

Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts?
to:
* Milestone M7.: Month

!!! WP 7: Dissemination and Exploitation -- Gigas
[[CloudDSM.WP7 | details of WP7]]

* Task1: Make working system available to end customers (Gigas, DC)
* Milestone M7.: Month 3 -- Gigas makes rapid prototype system available to small set of alpha customers
* Milestone M7.: Month 6 -- Gigas makes final prototype system available to small set of beta customers
* Milestone M7.: Month 12 -- DC deploys its first prototype application on the prototype system hosted by Gigas
* Milestone M7.: Month 24 -- DC offers a fully developed application based on CloudDSM to its customers
* Milestone M7.: Month 36 -- DC offers the CloudDSM system for use by its customers and incubated companies to develop their own applications

* Task2: Create Development tools to enhance exploitation by DC customers and incubated companies
* Milestone M6.: Month 24 -- Completion of minimal set of development tools
* Deliverable D6.: Month 24 -- minimum set of tools needed by DC's alpha customers in order to develop CloudDSM applications
* Milestone M6.: Month 36 -- Completion of final set of development tools
* Deliverable D6.: Month 36 -- final set of tools needed by DC's main customers in order to develop CloudDSM applications
March 19, 2014, at 05:54 AM by 80.114.135.137 -
Deleted lines 2-3:
List of workpackages, and leader for each WP
Changed lines 19-26 from:
[[CloudDSM.WP2 | details of WP2]]
* Task1: Develop portal (XLab, ICL, DC, Gigas, Sean)
* Milestone M2.1: Month 6 -- working, simple, prototype of portal, with simple implementations of each subsystem
* Deliverable D2.1: Month 6 -- code and related artifacts of a working simple portal prototype (ready for use by DC and INRIA)
* Milestone M2.2: Month 24 -- Upgrades to portal that support changes made in the Scheduler and Specialization harness
* Deliverable D2.2: Month 24 -- Technical Report, code, and other artifacts related to the implementation of the portal and supporting subsystems.
* Milestone M2.3: Month 36 -- Upgrades to portal that support changes made in the Scheduler and Specialization harness
* Deliverable D2.3: Month 36 -- Technical Report, code, and other artifacts related to the implementation of the portal and supporting subsystems.
to:
[[CloudDSM.WP2 | link to details of WP2]]

!!! WP 3: DSM Runtime System -- Sean leads WP (at INRIA, or Bell Labs or CWI)

* Task1: Architecture of hierarchical DSM runtime system. The delivered runtime will be a federation. On each virtual machine, or in some cases physical machine, an individual runtime system will be in operation. Each of these individual runtime systems will interact with the others to form a collective, cooperative, overall runtime system, which presents a single address space abstraction. This task will define the architecture, interfaces, and protocols for this overall runtime system. Then for each class of hardware an individual task will implement the interfaces and protocols for that particular kind of hardware, as described below.
-- Sean leads task, partners for 2 through 6, plus Imperial, plus INRIA, plus XLAB participate on aspects that involve their individual piece. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLAB will advise on how the architecture impacts the deployment tool. Partners for 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLAB)

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLAB)

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLAB

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLAB

* Task6: binary runtime specializer -- IBM (Q: any chance to get an open source release? Or to apply it to other ISAs?). Use fat binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing for other ISAs as well.

* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, report to the partners who developed the individual runtimes in tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads -- partners for 2 through 6, plus XLAB are involved.

* Question: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.


[[CloudDSM.RuntimeSpecializationWP]]

!!! WP 4: Application visible interface design
* Task 1: Define more precisely the class of applications that CloudDSM targets
* Task 2: Define the needs of the toolchain, what degrees of freedom it needs in order to accomplish the desired transforms of the source code.
* Task 3: Define the needs of the runtime system, what characteristics it needs in the code generated by the toolchain in order to deliver high performance.
* Task 4: Define the needs of the application developer, what mental models and what syntax, and what debugging and code-checking support they desire.
* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer. There will be two levels of code annotation.. one that high level application developers see and use, of which there will be many variations. For example, Reo will have a different high level user interface than the pragma system for OpenMP. The second, lower level will be common to all versions of the higher level interface. This will be used directly by the toolchain to perform code transforms. Each of the higher level forms will be translated into the same, common, lower level form. This task only considers the top level forms of the code. WP 5 separately defines the common lower level form.
Changed lines 51-169 from:
* Task2: Define Scheduling component within portal (ICL, XLab, Sean)
* Milestone M2.4: Month 6 -- detailed description of the final full featured scheduler
* Deliverable D2.4: Month 6 -- technical report that describes the interfaces to the scheduler and what functions it performs

* Task3: Develop scheduler (ICL, XLab, Sean)
* Milestone M2.5: Month 12 -- First, simple, version of the full scheduler integrated into portal (replaces "dummy" version inside prototype portal)
* Deliverable D2.5: Month 12 -- code and related artifacts of the working scheduler
* Milestone M2.6: Month 24 -- Subset of advanced scheduler features functioning and integrated into portal.
* Deliverable D2.6: Month 24 -- code and related artifacts of the working second-stage scheduler
* Milestone M2.7: Month 36 -- All advanced scheduler features working within the integrated system.
* Deliverable D2.7: Month 36 -- code and related artifacts of the working final, advanced scheduler

* Task4: Testing and performance tuning of full CloudDSM system (IBM, DC, INRIA, CWI, Gigas, ICL, XLab)
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --



!!! WP 3: DSM Runtime System (Sean, INRIA, CWI, ICL, XLab)
[[CloudDSM.WP3 | details of WP3]]


* Task1: Architecture of hierarchical DSM runtime system.
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task2: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLAB
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task3: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLAB
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task4: binary runtime specializer
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task7: Integration testing (Gigas, DC, INRIA, XLAB)
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 -- compilation report of which integrations passed and which failed, and actions taken to fix the failures
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --


!!! WP 4: Application visible interface design
[[CloudDSM.WP4 | details of WP4]]

* Task 1: Define more precisely the class of applications that CloudDSM targets
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 2: Define the needs of the toolchain, what degrees of freedom it needs in order to accomplish the desired transforms of the source code.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 3: Define the needs of the runtime system, what characteristics it needs in the code generated by the toolchain in order to deliver high performance.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 4: Define the needs of the application developer, what mental models and what syntax, and what debugging and code-checking support they desire.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 6: Development tools to support the writing of application code.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
to:
Tasks 1 through 5 will be performed iteratively, with multiple revisions during the first six months of the project. Each of the tasks will have an impact on the other tasks, and it will require a large amount of communication, via iterations, in order to find a suitable common ground interface that supports all the aspects well.

* Task 6: Development tools to support the writing of application code. This includes code checkers that enforce restrictions on what coding practices are allowed inside the portions of the application that employ the CloudDSM system. It also includes debugging aids, to detect bugs and to narrow down the portion of the application code causing discrepancies from specified behavior.

-- Sean and INRIA and Douglas Connect will lead, with input from XLAB, Imperial, and partners for WP3 tasks 2 through 6.


Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.

Comment and Question: "It may provide benefit desired by end-users if location-aware high-level annotations were also provided, such as regions, effects, etc. With these, the programmers will be able to communicate placement information. We found that for Myrmics, the programmer can know more about placement than the runtime/compiler can infer, and once placement/allocation is done properly, the shared-mem abstraction still helps with coding, except that the manual placement provides superior locality and performance. The programmer knows about and can say things about locality (like the X10 places, or regions in Fortress)." Question: "does it make sense to expose location within a Cloud-system that automatically and dynamically changes the number, type, and location of machines assigned to the computation?" Answer: you have nailed the heart of what this workpackage is all about. This will be an on-going discussion during the first six to nine months of the project. Indeed, one desire is to capture the understanding that the programmer has in their head and uses during the process of specifying pieces of work and choosing, themselves, where to place each piece. The goal of the WP is to discover an encoding of the process that the programmer does in their head, so that they encode that mental process, in a parameterized way. The automation then chooses the parameter values, and plugs them into what the programmer provided, which delivers pieces of work and the affinities among them. The WP content is the work of discovering programmer abstractions that get us as close to there as possible, in a way that we know how to implement..


Question: "any hints on what is the common low-level form of the source that is produced by the development toolchain? is it annotated source code?" Answer: Figuring this out is the content of WP4. Albert would like a source code annotation form for project logistics reasons. In that case by-hand modification of existing OpenMP libraries can begin at once, and act as a test bed for rapid iterations of what is the best low-level form. At the same time, compiler techniques can be tried, and also at the same time high level end-user annotations can be tested for how the "feel" to the application programmers. The DSM-specializer can be worked on in tight iterations with figuring out what the best low-level representation should be.. any desired changes in representation are just done quickly by hand -- no need to fiddle with IR, and reasonably decoupled from high level annotation form..

Question: "How does the IBM fat-binary specializer interact with the DSM runtime system?" Answer: AFAIU, the re-optimizer interrupts work in progress, changes the executable, then resumes work in progress. But it doesn't controls what work is assigned to what core. The optimizations it performs are single-thread optimizations, and also the re-optimizer may be told by the DSM runtime or by the portal to adjust the code such that the size of chunk of work performed between DSM calls is adjusted, or the layout or access pattern of data is adjusted. It is not clear yet whether the re-optimizer tool will make decisions on its own about the best chunk size. It might communicate performance feedback to the DSM runtime, and optimal chunk size decisions are made there. Or those decisions may be passed along to the portal. Wherever they are ultimately made, it will be up to the Dorit tool to modify the code such that it actually performs work in the chosen size of chunk. It still remains to work through the details of how the DSM runtime will interact with the Dorit tool.
Deleted lines 68-69:
[[CloudDSM.WP5 | details of WP5]]
Deleted lines 69-75:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Deleted lines 70-76:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Changed lines 72-88 from:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 4: Create transform tools that translate from the common lower level form into the final C form
of the code.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5
.: Month 36 --
to:
* Task 4: Create transform tools that translate from the common lower level form into the final C form of the code. The final C form includes OS calls, DSM runtime system calls, and synchronization calls that are inserted by the tool. The final C code has a form of the application that performs large chunks of work in-between calls to the DSM system. Each target hardware platform will require its own variation of the transform tool, which is tuned to the details of that hardware, especially communication details. The tool may produce a single multi-versioned binary, or it may include a runtime specializer, or it may generate many independent versions of the binary. A large portion of the research will involve determining the best approach.

FORTH will contribute their compiler work that recreates locality and placement information when the programmer doesn't explicitly declare it (this is used to replace remote accesses with DMA-ing of whole pages for performance).


Comment and Question: "For the IBM fat binary specializer, there are three stages (1) development stage: static generic compilation on developer machine which produces a custom IR form plus generic executable (2) static specialization compilation on a server, or during load, which generates a Power executable specialized to a specific HW (3) runtime fat-binary based recompilation on the actual deployed HW" Question: "How does this fit into CloudDSM?" Answer: Stage 1 will remain on the developer machine, stage 2 will take place inside the CloudDSM portal, and stage 3 inside the Cloud server during execution.

Question: "how will stage 2 fit with the DSM specific specializations?" Answer: this is an open question, to be resolved during the WP. We need some pictures, to figure out what tools do what at which point..

Comment: "Stage 1 happens inside the development environment on a desktop machine. The low-level annotated source is then sent to the CloudDSM portal by the developer. This process registers the application and makes it available for the end-user to run. This registration process also causes the low-level annotated source to be given to a specialization 'harness'. That harness invokes a number of specializer modules. One specializer module is provided by IBM. This module re-runs stage 1 and then runs stage 2 several times, once for each potential Power HW configuration that the CloudDSM system could send the fat-binary to (the module may, in fact cause stage 1 and stage 2 to run remotely on Power ISA machines, inside their own Cloud VM). Lastly, after the user starts the application and issues a request for computation, the portal deploys a unit of work to a Cloud VM running on a Power ISA machine. That Cloud VM has the DSM runtime in it, and that is given the unit of work. The unit of work includes a function within the fat binary to perform. The fat binary is dynamically linked to the DSM runtime. During execution, the work suspends and the binary optimizer takes over, modifies the code, then resumes the work. When the work reaches a DSM call, the DSM runtime suspends the execution context. That context will remain suspended while communication of data takes place. The DSM runtime will switch the CPU to a different context, whose communication has completed and is ready to resume."

Deleted lines 85-86:
[[CloudDSM.WP6 | details of WP6]]
Deleted lines 86-91:
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
Deleted lines 87-93:
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
Changed lines 89-107 from:
* Milestone M7.: Month

!!! WP 7: Dissemination and Exploitation -- Gigas
[[CloudDSM.WP7 | details of WP7]]

* Task1: Make working system available to end customers (Gigas, DC)
* Milestone M7.: Month 3 -- Gigas makes rapid prototype system available to small set of alpha customers
* Milestone M7.: Month 6 -- Gigas makes final prototype system available to small set of beta customers
* Milestone M7.: Month 12 -- DC deploys its first prototype application on the prototype system hosted by Gigas
* Milestone M7.: Month 24 -- DC offers a fully developed application based on CloudDSM to its customers
* Milestone M7.: Month 36 -- DC offers the CloudDSM system for use by its customers and incubated companies to develop their own applications

* Task2: Create Development tools to enhance exploitation by DC customers and incubated companies
* Milestone M6.: Month 24 -- Completion of minimal set of development tools
* Deliverable D6.: Month 24 -- minimum set of tools needed by DC's alpha customers in order to develop CloudDSM applications
* Milestone M6.: Month 36 -- Completion of final set of development tools
* Deliverable D6.: Month 36 -- final set of tools needed by DC's main customers in order to develop CloudDSM applications

Open questions:

-] which WP to place the research regarding best division of work and best deployment of it onto available hardware?
to:
!!! WP 7: Dissemination and Exploitation
* Gigas will make the results of the project available to its customers
* Douglas Connect will use the results of the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.

Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts
?
March 19, 2014, at 05:49 AM by 80.114.135.137 -
Changed line 80 from:
* Task4: binary runtime specializer -- IBM (Q: any chance to get an open source release? Or to apply it to other ISAs?). Use fat binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing for other ISAs as well.
to:
* Task4: binary runtime specializer
Changed lines 101-107 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Changed lines 109-115 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Changed lines 117-123 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Deleted lines 124-160:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --

* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer. There will be two levels of code annotation.. one that high level application developers see and use, of which there will be many variations. For example, Reo will have a different high level user interface than the pragma system for OpenMP. The second, lower level will be common to all versions of the higher level interface. This will be used directly by the toolchain to perform code transforms. Each of the higher level forms will be translated into the same, common, lower level form. This task only considers the top level forms of the code. WP 5 separately defines the common lower level form.
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --


Tasks 1 through 5 will be performed iteratively, with multiple revisions during the first six months of the project. Each of the tasks will have an impact on the other tasks, and it will require a large amount of communication, via iterations, in order to find a suitable common ground interface that supports all the aspects well.

* Task 6: Development tools to support the writing of application code. This includes code checkers that enforce restrictions on what coding practices are allowed inside the portions of the application that employ the CloudDSM system. It also includes debugging aids, to detect bugs and to narrow down the portion of the application code causing discrepancies from specified behavior.

-- Sean and INRIA and Douglas Connect will lead, with input from XLAB, Imperial, and partners for WP3 tasks 2 through 6.


Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.

Comment and Question: "It may provide benefit desired by end-users if location-aware high-level annotations were also provided, such as regions, effects, etc. With these, the programmers will be able to communicate placement information. We found that for Myrmics, the programmer can know more about placement than the runtime/compiler can infer, and once placement/allocation is done properly, the shared-mem abstraction still helps with coding, except that the manual placement provides superior locality and performance. The programmer knows about and can say things about locality (like the X10 places, or regions in Fortress)." Question: "does it make sense to expose location within a Cloud-system that automatically and dynamically changes the number, type, and location of machines assigned to the computation?" Answer: you have nailed the heart of what this workpackage is all about. This will be an on-going discussion during the first six to nine months of the project. Indeed, one desire is to capture the understanding that the programmer has in their head and uses during the process of specifying pieces of work and choosing, themselves, where to place each piece. The goal of the WP is to discover an encoding of the process that the programmer does in their head, so that they encode that mental process, in a parameterized way. The automation then chooses the parameter values, and plugs them into what the programmer provided, which delivers pieces of work and the affinities among them. The WP content is the work of discovering programmer abstractions that get us as close to there as possible, in a way that we know how to implement..


Question: "any hints on what is the common low-level form of the source that is produced by the development toolchain? is it annotated source code?" Answer: Figuring this out is the content of WP4. Albert would like a source code annotation form for project logistics reasons. In that case by-hand modification of existing OpenMP libraries can begin at once, and act as a test bed for rapid iterations of what is the best low-level form. At the same time, compiler techniques can be tried, and also at the same time high level end-user annotations can be tested for how the "feel" to the application programmers. The DSM-specializer can be worked on in tight iterations with figuring out what the best low-level representation should be.. any desired changes in representation are just done quickly by hand -- no need to fiddle with IR, and reasonably decoupled from high level annotation form..

Question: "How does the IBM fat-binary specializer interact with the DSM runtime system?" Answer: AFAIU, the re-optimizer interrupts work in progress, changes the executable, then resumes work in progress. But it doesn't controls what work is assigned to what core. The optimizations it performs are single-thread optimizations, and also the re-optimizer may be told by the DSM runtime or by the portal to adjust the code such that the size of chunk of work performed between DSM calls is adjusted, or the layout or access pattern of data is adjusted. It is not clear yet whether the re-optimizer tool will make decisions on its own about the best chunk size. It might communicate performance feedback to the DSM runtime, and optimal chunk size decisions are made there. Or those decisions may be passed along to the portal. Wherever they are ultimately made, it will be up to the Dorit tool to modify the code such that it actually performs work in the chosen size of chunk. It still remains to work through the details of how the DSM runtime will interact with the Dorit tool.


!!! WP 5: Compiler Toolchain -- INRIA
[[CloudDSM.WP5 | details of WP5]]

* Task 1: participate in WP 4 task 2, as part of arriving at the interface that WP 5 will take as input.
Changed lines 131-132 from:
* Task 2: Define intermediate, low level form of code annotation. The interfaces defined in WP 4 will be translated into this common lower level form.
to:
* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer.
Changed line 140 from:
* Task 3: Create tools that transform from each form of higher level code annotation into the common lower level code annotation form.
to:
* Task 6: Development tools to support the writing of application code.
Changed lines 148-152 from:
* Task 4: Create transform tools that translate from the common lower level form into the final C form of the code. The final C form includes OS calls, DSM runtime system calls, and synchronization calls that are inserted by the tool. In the final C form, the application performs large chunks of work in between calls to the DSM system. Each target hardware platform will require its own variation of the transform tool, tuned to the details of that hardware, especially communication details. The tool may produce a single multi-versioned binary, it may include a runtime specializer, or it may generate many independent versions of the binary. A large portion of the research will involve determining the best approach.
to:
!!! WP 5: Compiler Toolchain -- INRIA
[[CloudDSM.WP5 | details of WP5]]

* Task 1: participate in WP 4 task 2, as part of arriving at the interface that WP 5 will take as input.
Changed lines 160-170 from:
FORTH will contribute their compiler work that recreates locality and placement information when the programmer doesn't explicitly declare it (this is used to replace remote accesses with DMA-ing of whole pages for performance).


Comment and Question: "For the IBM fat binary specializer, there are three stages (1) development stage: static generic compilation on developer machine which produces a custom IR form plus generic executable (2) static specialization compilation on a server, or during load, which generates a Power executable specialized to a specific HW (
3) runtime fat-binary based recompilation on the actual deployed HW" Question: "How does this fit into CloudDSM?" Answer: Stage 1 will remain on the developer machine, stage 2 will take place inside the CloudDSM portal, and stage 3 inside the Cloud server during execution.

Question: "How will stage 2 fit with the DSM specific specializations?" Answer: this is an open question, to be resolved during the WP. We need some pictures, to figure out which tools do what at which point.

Comment: "Stage 1 happens inside the development environment on a desktop machine. The low-level annotated source is then sent to the CloudDSM portal by the developer.
This process registers the application and makes it available for the end-user to run. This registration process also causes the low-level annotated source to be given to a specialization 'harness'. That harness invokes a number of specializer modules. One specializer module is provided by IBM. This module re-runs stage 1 and then runs stage 2 several times, once for each potential Power HW configuration that the CloudDSM system could send the fat-binary to (the module may, in fact cause stage 1 and stage 2 to run remotely on Power ISA machines, inside their own Cloud VM). Lastly, after the user starts the application and issues a request for computation, the portal deploys a unit of work to a Cloud VM running on a Power ISA machine. That Cloud VM has the DSM runtime in it, and that is given the unit of work. The unit of work includes a function within the fat binary to perform. The fat binary is dynamically linked to the DSM runtime. During execution, the work suspends and the binary optimizer takes over, modifies the code, then resumes the work. When the work reaches a DSM call, the DSM runtime suspends the execution context. That context will remain suspended while communication of data takes place. The DSM runtime will switch the CPU to a different context, whose communication has completed and is ready to resume."
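A minimal skeleton of the harness-invokes-modules structure just described, with every name hypothetical (the real registration flow and module interface are still to be designed):

[@
/* Hypothetical specialization harness.  Each partner supplies a module;
 * the IBM module would internally re-run stage 1 and run stage 2 once
 * per candidate Power HW configuration. */
typedef struct {
    const char *name;
    int (*specialize)(const char *annotated_src,
                      const char *hw_config,
                      const char *out_path);
} specializer_module;

int run_harness(const specializer_module *mods, int nmods,
                const char *src,
                const char *const *hw_configs, int nhw)
{
    for (int m = 0; m < nmods; m++)
        for (int h = 0; h < nhw; h++)
            if (mods[m].specialize(src, hw_configs[h], "out.bin") != 0)
                return -1;  /* registration fails if any specialization fails */
    return 0;  /* all specialized binaries now sit in the repository */
}
@]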

to:
* Task 2: Define intermediate, low level form of code annotation. The interfaces defined in WP 4 will be translated into this common lower level form.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --

* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --

* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --


* Task 3: Create tools that transform from each form of higher level code annotation into the common lower level code annotation form.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --

* Task 4: Create transform tools that translate from the common lower level form into the final C form of the code.
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Added line 203:
Changed lines 205-212 from:
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --

!!! WP 7: Dissemination and Exploitation
to:
!!! WP 7: Dissemination and Exploitation -- Gigas
Changed lines 208-221 from:
* Task1:
* Milestone M7.: Month 12 --
* Deliverable D7.: Month 12 --
* Milestone M7.: Month 24 --
* Deliverable D7.: Month 24 --

* Milestone M7.: Month 36 --
* Deliverable D7.: Month 36 --


* Gigas will make the results of the project available to its customers
* Douglas Connect will use the results of the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.

Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts?
to:
* Task1: Make working system available to end customers (Gigas, DC)
* Milestone M7.: Month 3 -- Gigas makes a rapid prototype system available to a small set of alpha customers
* Milestone M7.: Month 6 -- Gigas makes the final prototype system available to a small set of beta customers
* Milestone M7.: Month 12 -- DC deploys its first prototype application on the prototype system hosted by Gigas
* Milestone M7.: Month 24 -- DC offers a fully developed application based on CloudDSM to its customers
* Milestone M7.: Month 36 -- DC offers the CloudDSM system for use by its customers and incubated companies to develop their own applications

* Task2: Create development tools to enhance exploitation by DC customers and incubated companies
* Milestone M6.: Month 24 -- Completion of minimal set of development tools
* Deliverable D6.: Month 24 -- minimum set of tools needed by DC's alpha customers in order to develop CloudDSM applications
* Milestone M6.: Month 36 -- Completion of final set of development tools
* Deliverable D6.: Month 36 -- final set of tools needed by DC's main customers in order to develop CloudDSM applications
Changed lines 223-225 from:
-] which WP to place development tools, such as Eclipse plug-in and code style checker?

-] which WP to place the research regarding best division of work and best deployment of it onto available hardware?
to:
-] which WP to place the research regarding best division of work and best deployment of it onto available hardware?
March 19, 2014, at 05:26 AM by 80.114.135.137 -
Changed line 64 from:
* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLAB)
to:
* Task2: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLAB
Changed line 72 from:
* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLAB)
to:
* Task3: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLAB
Changed line 80 from:
* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLAB
to:
* Task4: binary runtime specializer -- IBM (Q: any chance of an open-source release? Or of applying it to other ISAs?). Use fat-binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing it for other ISAs as well.
Changed line 88 from:
* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLAB
to:
* Task7: Integration testing (Gigas, DC, INRIA, XLAB)
Changed line 90 from:
* Deliverable D3.: Month 12 --
to:
* Deliverable D3.: Month 12 -- compilation report of which integrations passed and which failed, and actions taken to fix the failures
Deleted lines 95-113:
* Task6: binary runtime specializer -- IBM (Q: any chance of an open-source release? Or of applying it to other ISAs?). Use fat-binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing it for other ISAs as well.
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task7: Integration testing (Gigas, DC, INRIA, XLAB)
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 -- compilation report of which integrations passed and which failed, and actions taken to fix the failures
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --



[[CloudDSM.RuntimeSpecializationWP]]
March 19, 2014, at 05:21 AM by 80.114.135.137 -
Changed lines 181-186 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Changed lines 188-193 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Changed lines 196-201 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Changed lines 204-210 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M5.: Month 12 --
* Deliverable D5.: Month 12 --
* Milestone M5.: Month 24 --
* Deliverable D5.: Month 24 --
* Milestone M5.: Month 36 --
* Deliverable D5.: Month 36 --
Changed lines 226-231 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
Changed lines 233-238 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
Changed lines 240-246 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
Changed lines 251-256 from:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
to:
* Milestone M7.: Month 12 --
* Deliverable D7.: Month 12 --
* Milestone M7.: Month 24 --
* Deliverable D7.: Month 24 --
* Milestone M7.: Month 36 --
* Deliverable D7.: Month 36 --
March 19, 2014, at 05:17 AM by 80.114.135.137 -
Added lines 54-112:


* Task1: Architecture of hierarchical DSM runtime system.
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLAB)
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLAB)
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLAB
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLAB
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task6: binary runtime specializer -- IBM (Q: any chance of an open-source release? Or of applying it to other ISAs?). Use fat-binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing it for other ISAs as well.
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --

* Task7: Integration testing (Gigas, DC, INRIA, XLAB)
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 -- compilation report of which integrations passed and which failed, and actions taken to fix the failures
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --
March 19, 2014, at 05:10 AM by 80.114.135.137 -
Deleted lines 53-115:

* Task1: Architecture of hierarchical DSM runtime system.
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --


* Task1: Architecture of hierarchical DSM runtime system. The delivered runtime will be a federation. On each virtual machine, or in some cases physical machine, an individual runtime system will be in operation. Each of these individual runtime systems will interact with the others to form a collective, cooperative, overall runtime system, which presents a single address space abstraction. This task will define the architecture, interfaces, and protocols for this overall runtime system. Then, for each class of hardware, an individual task will implement the interfaces and protocols for that particular kind of hardware, as described below.
-- Sean leads the task; the partners for tasks 2 through 6, plus Imperial, INRIA, and XLAB, participate on aspects that involve their individual piece. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLAB will advise on how the architecture impacts the deployment tool. Partners for tasks 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.
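As a sketch of what the federation protocol might have to carry (message names invented for illustration; defining the real set is the content of this task), each per-VM runtime instance would serve region requests from its peers so that together they present one address space:

[@
/* Hypothetical inter-instance message set for the federated DSM. */
enum dsm_msg_kind {
    DSM_FETCH_REQ,     /* ask a region's home instance for its data  */
    DSM_FETCH_REPLY,   /* data plus version for the requested region */
    DSM_INVALIDATE,    /* revoke cached copies before a write        */
    DSM_ACK
};

struct dsm_msg {
    enum dsm_msg_kind kind;
    unsigned long region_id;    /* global id in the shared space      */
    unsigned long version;      /* coherence bookkeeping              */
    unsigned payload_len;       /* bytes of data following the header */
};
@]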

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLAB)
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLAB)
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLAB
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLAB
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --

* Task6: binary runtime specializer -- IBM (Q: any chance of an open-source release? Or of applying it to other ISAs?). Use fat-binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing it for other ISAs as well.
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --

* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, and report the results to the partners who developed the individual runtimes in tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for tasks 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads; the partners for tasks 2 through 6, plus XLAB, are involved.
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --

* Question: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.
March 19, 2014, at 05:10 AM by 80.114.135.137 -
Changed lines 3-4 from:
List of workpackages, and leader for the WP
to:
List of workpackages, and leader for each WP
Changed lines 53-54 from:
to:
[[CloudDSM.WP3 | details of WP3]]
Added lines 121-122:
[[CloudDSM.WP4 | details of WP4]]
Added lines 182-183:
[[CloudDSM.WP5 | details of WP5]]
Added lines 225-226:
[[CloudDSM.WP6 | details of WP6]]
Added lines 250-251:
[[CloudDSM.WP7 | details of WP7]]
March 19, 2014, at 05:07 AM by 80.114.135.137 -
Changed lines 23-24 from:
* Milestone M2.1: Month 6 -- working, simple, prototype of portal, with simple implementations of each subsystem
* Deliverable D2.1: Month 6 -- code and related artifacts of a working simple portal prototype (ready for use by DC and INRIA)
to:
* Milestone M2.1: Month 6 -- working, simple, prototype of portal, with simple implementations of each subsystem
* Deliverable D2.1: Month 6 -- code and related artifacts of a working simple portal prototype (ready for use by DC and INRIA)
* Milestone M2.2: Month 24 -- Upgrades to portal that support changes made in the Scheduler and Specialization harness
* Deliverable D2.2: Month 24 -- Technical Report, code, and other artifacts related to the implementation of the portal and supporting subsystems.
* Milestone M2.3: Month 36 -- Upgrades to portal that support changes made in the Scheduler and Specialization harness
* Deliverable D2.3: Month 36 -- Technical Report, code, and other artifacts related to the implementation of the portal and supporting subsystems.
Changed lines 31-35 from:
* Milestone M2.2: Month 6 -- detailed description of the final full featured scheduler

Deliverable D2.3: (27 Months after the start of the project) Technical Report and Open Source Software that implements the PT.
* Deliverable D2.2: Month 6 -- technical report that describes the interfaces to the scheduler and what functions it performs
to:
* Milestone M2.4: Month 6 -- detailed description of the final full featured scheduler
* Deliverable D2.4: Month 6 -- technical report that describes the interfaces to the scheduler and what functions it performs
Changed lines 35-47 from:
* Milestone M2.3: Month 12 -- First, simple, version of the full scheduler integrated into portal (replaces "dummy" version inside prototype portal)
* Deliverable D2.2: Month 12 -- code and related artifacts of the working scheduler
* Milestone M2.4: Month 24 -- Subset of advanced scheduler features functioning and integrated into portal.
* Deliverable D2.1: Month 24 -- code and related artifacts of the working second-stage scheduler
* Milestone M2.5: Month 36 -- All advanced scheduler features working within the integrated system.
* Deliverable D2.1: Month 36 -- code and related artifacts of the working final, advanced scheduler

* Task5: Testing and performance tuning (IBM, DC, INRIA, CWI, Gigas, ICL, XLab)



!!! WP 3: DSM Runtime System -- Sean leads WP (at INRIA, or Bell Labs, or CWI)
to:
* Milestone M2.5: Month 12 -- First, simple, version of the full scheduler integrated into portal (replaces "dummy" version inside prototype portal)
* Deliverable D2.5: Month 12 -- code and related artifacts of the working scheduler
* Milestone M2.6: Month 24 -- Subset of advanced scheduler features functioning and integrated into portal.
* Deliverable D2.6: Month 24 -- code and related artifacts of the working second-stage scheduler
* Milestone M2.7: Month 36 -- All advanced scheduler features working within the integrated system.
* Deliverable D2.7: Month 36 -- code and related artifacts of the working final, advanced scheduler

* Task4: Testing and performance tuning of full CloudDSM system (IBM, DC, INRIA, CWI, Gigas, ICL, XLab)
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --



!!! WP 3: DSM Runtime System (Sean, INRIA, CWI, ICL, XLab)

* Task1: Architecture of hierarchical DSM runtime system.
* Milestone M3.: Month 12 --
* Deliverable D3.: Month 12 --
* Milestone M3.: Month 24 --
* Deliverable D3.: Month 24 --
* Milestone M3.: Month 36 --
* Deliverable D3.: Month 36 --
Changed lines 67-73 from:
to:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Changed lines 75-81 from:
to:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Changed lines 83-89 from:
to:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Changed lines 91-97 from:
to:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Changed lines 99-105 from:
to:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Changed lines 107-113 from:
to:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 121-127:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 129-135:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 137-143:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 145-151:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Changed lines 153-159 from:
to:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 180-185:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 187-192:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 194-199:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Changed lines 201-207 from:
to:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 221-226:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 228-233:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Changed lines 235-241 from:
to:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
Added lines 243-250:
* Task1:
* Milestone M2.: Month 12 --
* Deliverable D2.: Month 12 --
* Milestone M2.: Month 24 --
* Deliverable D2.: Month 24 --
* Milestone M2.: Month 36 --
* Deliverable D2.: Month 36 --
March 19, 2014, at 04:49 AM by 80.114.135.137 -
Changed lines 13-20 from:
Open questions:

-] which WP to place development tools, such as Eclipse plug-in and code style checker?

-] which WP to place the research regarding best division of work and best deployment of it onto available hardware?

to:
Added lines 23-24:
* Milestone M2.1: Month 6 -- working, simple, prototype of portal, with simple implementations of each subsystem
* Deliverable D2.1: Month 6 -- code and related artifacts of a working simple portal prototype (ready for use by DC and INRIA)
Changed lines 26-28 from:
* Milestone M2.2: Month 12 -- Working simple prototype scheduler and detailed description of the final full featured scheduler
to:
* Milestone M2.2: Month 6 -- detailed description of the final full featured scheduler

Deliverable D2.3: (27 Months after the start of the project) Technical Report and Open Source Software that implements the PT.
* Deliverable D2.2: Month 6 -- technical report that describes the interfaces to the scheduler and what functions it performs
Changed lines 32-41 from:
* Milestone M2.3: Month 24 -- Subset of advanced features functioning inside scheduler.
* Milestone M2.4: Month 36 -- All advanced scheduler features working within the integrated system.

* Task4: Develop portal, including all sub-systems (Leader XLab with all partners; ICL 12PMs): The portal contains a specialization harness, a scheduler, a deployer, a repository of specialized binaries, a repository of application characteristics relevant to scheduling, statistics on application resource usage, and continuously updated state of the hardware resources. The portal receives instructions directly from human users, observes resources and interacts with lower system levels, observes running applications, and makes decisions for tasks. As such, the portal contains a scheduler that self-optimises the system in real time. Thus the portal and scheduler must be efficient and avoid complex computations. The portal will accomplish responsiveness by being highly distributed and by exploiting inputs from multiple agents that operate simultaneously around different time scales and around different resources, and that share the gathered information with the scheduler. This will be the philosophy driving the development of the portal, so that it is made of many distributed/concurrent lightweight components with only lightweight synchronisation while maximising information sharing.

Deliverable D2.3: (27 Months after the start of the project) Technical Report and Open Source Software that implements the PT.
* Task5: Testing and performance tuning (IBM with all partners; ICL 6 PMs)
to:
* Milestone M2.3: Month 12 -- First, simple, version of the full scheduler integrated into portal (replaces "dummy" version inside prototype portal)
* Deliverable D2.2: Month 12 -- code and related artifacts of the working scheduler
* Milestone M2.4: Month 24 -- Subset of advanced scheduler features functioning and integrated into portal.
* Deliverable D2.1: Month 24 -- code and related artifacts of the working second-stage scheduler
* Milestone M2.5: Month 36 -- All advanced scheduler features working within the integrated system.
* Deliverable D2.1: Month 36 -- code and related artifacts of the working final, advanced scheduler

* Task5: Testing and performance tuning (IBM, DC, INRIA, CWI, Gigas, ICL, XLab)
Changed lines 116-123 from:
Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts?
to:
Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts?


Open questions:

-] which WP to place development tools, such as Eclipse plug-in and code style checker?

-] which WP to place the research regarding best division of work and best deployment of it onto available hardware?
March 19, 2014, at 04:30 AM by 80.114.135.137 -
Changed lines 28-43 from:
[[CloudDSM.WP2 | link to details of WP2]]
to:
[[CloudDSM.WP2 | details of WP2]]
* Task1: Develop portal (XLab, ICL, DC, Gigas, Sean)
* Task2: Define Scheduling component within portal (ICL, XLab, Sean)

* Milestone M2.2: Month 12 -- Working simple prototype scheduler and detailed description of the final full featured scheduler

* Task3: Develop scheduler (ICL, XLab, Sean)
* Milestone M2.3: Month 24 -- Subset of advanced features functioning inside scheduler.
* Milestone M2.4: Month 36 -- All advanced scheduler features working within the integrated system.

* Task4: Develop portal, including all sub-systems (Leader XLab with all partners; ICL 12PMs): The portal contains a specialization harness, a scheduler, a deployer, a repository of specialized binaries, a repository of application characteristics relevant to scheduling, statistics on application resource usage, and continuously updated state of the hardware resources. The portal receives instructions directly from human users, observes resources and interacts with lower system levels, observes running applications, and makes decisions for tasks. As such, the portal contains a scheduler that self-optimises the system in real time. Thus the portal and scheduler must be efficient and avoid complex computations. The portal will accomplish responsiveness by being highly distributed and by exploiting inputs from multiple agents that operate simultaneously around different time scales and around different resources, and that share the gathered information with the scheduler. This will be the philosophy driving the development of the portal, so that it is made of many distributed/concurrent lightweight components with only lightweight synchronisation while maximising information sharing.

Deliverable D2.3: (27 Months after the start of the project) Technical Report and Open Source Software that implements the PT.
* Task5: Testing and performance tuning (IBM with all partners; ICL 6 PMs)
March 19, 2014, at 04:22 AM by 80.114.135.137 -
Changed lines 28-75 from:
* Task1: Develop portal -- gather specifications, produce design, make prototype, test with end-user, iterate (Leader XLab, major contribution by ICL, with input by DC, Gigas, and Sean; ICL 2 PMs)
* Task2: Define Scheduling component within portal -- interfaces with other components inside the portal, which supply information upon which to base decisions, and produces a schedule to be carried out (Leader ICL, major contribution by XLab, with input by DC, Gigas, and Sean; ICL 21PMs): The scheduler will exploit a self-aware dynamic analysis approach which, in an on-line manner and in real time, will receive as input the state of all system resources, and will calculate resource availability and expected execution times for tasks currently active within the portal and for tasks expected to be launched, based on predictions derived from usage patterns. Scheduling decisions will combine detailed status information regarding expected response times, internal network delays, possible security risks, and possible reliability problems. The self-aware dynamic analysis will return a "short list" of the best instantaneous provisioning decisions, ranked according to performance, security, reliability, energy consumption, and other relevant metrics. Based on the task to be provisioned, the scheduler will be able to decide on provisioning rapidly. The portal and the collection of DSM runtime systems include monitoring of whether the performance objectives of the task are being met; if the observations indicate an unsatisfactory outcome, the scheduler will be re-triggered to make a new decision, based on the new system state as well as the overhead related to any changes in provisioning.
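As a sketch of the interface implied above (all types and names are placeholders, not a committed design), the self-aware analysis can be pictured as a function from system state to a ranked short list of provisioning options:

[@
struct system_state;   /* opaque: resource states, loads, predictions */
struct task_descr;     /* opaque: the task to be provisioned          */

/* One provisioning option, scored along the metrics named above. */
typedef struct {
    int    provider_id;       /* which cloud location          */
    int    vm_count;          /* how many VMs to (re)provision */
    double expected_latency;  /* predicted response time       */
    double energy_cost;       /* relative energy estimate      */
    double risk;              /* security/reliability penalty  */
} provisioning_option;

/* Fills 'out' with at most k options, ranked best-first, and returns
 * how many were produced; the caller provisions the first acceptable
 * one, and monitoring can re-trigger the call later. */
int rank_provisioning(const struct system_state *st,
                      const struct task_descr *task,
                      provisioning_option *out, int k);
@]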

Milestone M2.2: (12 Months after the start of the project) A working simple prototype scheduler, plus a detailed description of the final full featured scheduler. The prototype will implement all interfaces and successfully generate a schedule, but using simple algorithms. The targeted final scheduler will be described in a Technical Report defining the scheduler functionality, structure, and all of its components, and describing its interfaces to and integration with the portal subsystems that monitor hardware and application features and with the DSM runtime system.

* Task3: Develop scheduler (Leader ICL with contributions to integration by XLab and Sean)
Milestone M2.3: (24 Months after the start of the project) A working scheduler that includes some advanced features. Measured performance improvements over the simple prototype scheduler.
Milestone M2.4: (36 Months after the start of the project) All advanced features of the scheduler added and demonstrated working within the system, with measured performance improvements delivered to end-user applications, due to the dynamic monitoring and re-provisioning.

* Task4: Develop portal, including all sub-systems (Leader XLab with all partners; ICL 12PMs): The portal contains a specialization harness, a scheduler, a deployer, a repository of specialized binaries, a repository of application characteristics relevant to scheduling, statistics on application resource usage, and continuously updated state of the hardware resources. The portal receives instructions directly from human users, observes resources and interacts with lower system levels, observes running applications, and makes decisions for tasks. As such, the portal contains a scheduler that self-optimises the system in real time. Thus the portal and scheduler must be efficient and avoid complex computations. The portal will accomplish responsiveness by being highly distributed and by exploiting inputs from multiple agents that operate simultaneously around different time scales and around different resources, and that share the gathered information with the scheduler. This will be the philosophy driving the development of the portal, so that it is made of many distributed/concurrent lightweight components with only lightweight synchronisation while maximising information sharing.

Deliverable D2.3: (27 Months after the start of the project) Technical Report and Open Source Software that implements the PT.
* Task5: Testing and performance tuning (IBM with all partners; ICL 6 PMs)


Question: "Why would a single Cloud VM be used to perform computation tasks from multiple applications?" Answer: from an overhead perspective, sharing a single "worker" VM among multiple applications is the most efficient way to go. But there must be isolation between applications, so this raises security concerns. Those have to be addressed, which requires extra development effort. So there's a balance between performance (overhead), security, and effort (IE, add security features to CloudDSM runtime is extra effort).

Note that when a DSM command is issued by application code, the application context is suspended. If there are no ready contexts within the application, then the proto-runtime system can switch to work from a different application within nanoseconds. This allows fine-grained interleaving of work with communication. In contrast, if the time to switch among applications is on the order of hundreds of microseconds, which it would be if the CPU had to switch to a different Cloud level VM, then the processor is better off simply sitting idle for any non-overlapped communication that is shorter than the switch-application time. In many situations this will be the case, causing much lost performance and lost hardware utilization, with the customer being charged for idle CPU time. It is this loss that will be prevented by allowing a single VM, with its single DSM runtime instance inside it, to run work for multiple applications, and having the DSM runtime switch among the applications in its very fast way.

Question: "Doesn't the hypervisor automatically handle suspending an idle VM under the covers? Why is an explicit sleep/wake command needed in the CloudAPI?" Answer: When an application is started, the portal may over-provision, starting VMs on many machines, and then putting them to sleep. As soon as the application issues a high computation request, the VMs are woken and given work to do, with expected duration on the order of seconds to at most a minute. When the computation is done, most of the VMs are put back to sleep until the next time. The time to suspend and resume is less than the time to start the VM.

Comment: "A VM can yield a physical processor through a POWER Hypervisor call (instruction?). It puts idle VMs into a hibernation state so that they do not consume any physical CPU resources."

Comment: "For IBM, our binary-reoptimizer tool has a special loader that runs the application as a thread within the same process as the optimizer and kicks off dynamic monitoring and recompilation. So there is one dynamic optimization process for each application running in a VM."

Comment: "In summary, there are two cases: (1) A given application has no work ready, with which to overlap communication (2) The whole DSM runtime system has no work ready. In case 1, the sharing of the hardware is more efficiently performed inside the DSM runtime, where it happens on the order of nanoseconds, which is why it's better for the DSM runtime to handle switching among applications. In case 2, the hypervisor has no way to know that the DSM runtime has no work! It sees the polling the DSM does while waiting for new incoming work as a busy application, so the hypervisor keeps the DSM going. The DSM runtime needs a way to tell the hypervisor to suspend the idle VM. After all, the polling consumes cpu time which the end-user pays money for! At the moment, proto-runtime simply uses pthread yield when it is idle, polling for new incoming work.. but it's not clear that will be enough to get the VM to stop re-scheduling it.. Also, from a higher level, the deployment tool knows periods when there's no work for a specific DSM runtime instance, and can issue a sleep/yield to the VM, which is faster than a full suspend-to-disk..

Comment: "On IBM there are two virtualization layers: Power hypervisor, and Cloud stack hypervisor. The Power hypervisor is not visible to the cloud stack, and is also much faster."


Question: "Why is it advantageous to have Gigas's KVM based Cloud stack, which allows an entire machine to be given exclusively to a given computation task. Isn't this wasteful?" Answer: This allows the DSM runtime to manage the assignment of work onto cores, which happens on the order of nano-seconds, far faster than a hypervisor could manage assignment of work onto cores. This assignment of whole machine to a VM is how EC2 from Amazon works, as well, for example. It doesn't waste, because the machine is fully occupied by the computation.


Question: "How does the cloud VM layer relate to the DSM runtime and the CloudDSM portal?" Answer: There is one instance of the DSM runtime inside each Cloud level VM. The DSM might directly use hypervisor commands to cause the VM it is inside of to fast-yield/sleep, at the point the DSM runtime detects that it has no ready work. The portal, though, decides when the VM should be long-term suspended or shutdown. The Cloud VM is given all the cores of a machine whenever possible, then the DSM within directly manages assigning application work to the cores, which includes suspending execution contexts at the point they perform a DSM call or synchronization. The portal runs inside its own Cloud VMs. It performs Cloud level control of creating VMs, suspending them to disk, starting DSM runtimes inside them, receiving command requests from application front-ends, and starting work-units within chosen VMs. The portal knows about the physical providers, and what hardware is at each, and what Cloud API commands to use to start VMs at them, and what Cloud API commands to use to learn how busy each location is.. hopefully the Cloud stack API has some way of informing the portal about uptime, or load, within a given physical location. The portal has a set of VMs that run code that performs the decision making about how to divide the work of a request among the providers and among the VMs created within a given provider. It may be advantageous for the portal to have Cloud APIs available that expose the characteristics of the hardware within the provider, and allows a measure of control over assignment of VMs to the hardware. The DSMs report status to the portal, which may decide to take work away from poor performing VMs and give it to others, perhaps even VMs in a different physical location. (It is unlikely that a VM itself will be migrated, but rather the work assigned to the DSM runtime inside the VM).

Question: "Is it possible that the cloud would want to move a VM from one core to another?" Answer: control over cores is inside the DSM runtime system. It's too fine grained for the Cloud stack or portal to manage. When a VM is created, all the cores of the physical machine should be given to that VM, and the hypervisor should let that VM own the hardware for as long as possible, ideally until the DSM runtime signals that it has run out of work.

Question: "How do units of work map onto Cloud level entities?" Answer: I see a hierarchical division of work.. at the highest level, work is divided among physical locations.. then at a given location, the work it receives is divided again, among VMs created within that Cloud host. If the Cloud stack at a host allows entire machines to be allocated, in the way Gigas and Amazon EC2 do, then a further division of work is performed, among the CPUs in the machine, which is managed by the DSM runtime. Hence, the portal invokes APIs to manage work chunks starting execution at the level of providers and at the level of Cloud VMs. The DSM system inside a given VM manages work chunks starting on individual cores inside a VM.


Within a given VM, the DSM runtime will create its own "virtual processors", or VPs, which are analogous to threads; all VPs are inside the same hardware-enforced coherent virtual address space. The DSM will switch among these in order to overlap communications to remote DSM instances, which are running inside different VMs.

Question: "What components of the CloudDSM system care about best resource allocation from the cloud perspective?" Answer: The calculation of best cloud level resource allocation is encapsulated inside a module that the portal runs. The portal collects information from the application, which the toolchain packages, and the portal collects information about the hardware at each provider, and about the current load on that hardware, and it collects statistics on each of the commands invoked by a given application, and it gives all of this information to the resource-calculation module. That module determines the best way to break up the work represented by the user command, and distribute the pieces across providers and across VMs. Within a VM, the DSM runtime system independently and dynamically decides allocation among the CPUs. Erol Gelenbe at Imperial College will be handling the algorithms for the Cloud level resource calculations.


Comment and Question: "if the DSM detects that it is running late, it can issue a request to the portal to add more machines to the on-going computation. The resource tool re-calculates the division of work." Question: "What information is the portal going to base its decisions on?" Answer: the DSM runtime running inside a given VM communicates status to the portal. Annotations might be inserted into the low-level source form that help the DSM runtime with this task, or the DSM runtime may end up handling this all by itself.
to:
[[CloudDSM.WP2 | link to details of WP2]]
March 09, 2014, at 09:23 PM by 80.114.135.137 -
Changed line 6 from:
* WP 2: Cloud Deployment tool -- XLAB
to:
* WP 2: Web Portal -- XLAB
Changed lines 31-32 from:
Milestone M2.2: (12 Months after the start of the project) A detailed description of the scheduler, as described in a Technical Report defining the scheduler functionality, structure, and all of its components, and describing its interfaces to and integration with the portal subsystems that monitor hardware and application features and with the DSM runtime system.
to:
Milestone M2.2: (12 Months after the start of the project) A working simple prototype scheduler, plus a detailed description of the final full featured scheduler. The prototype will implement all interfaces and successfully generate a schedule, but using simple algorithms. The targeted final scheduler will be described in a Technical Report defining the scheduler functionality, structure, and all of its components, and describing its interfaces to and integration with the portal subsystems that monitor hardware and application features and with the DSM runtime system.
Changed lines 34-38 from:
* Task4: Develop portal, including all sub-systems -- manipulates each Cloud API (Leader XLab with all partners; ICL 12PMs): The PT itself is middleware that receives instructions from higher system levels and directly from human users, that observes resources and lower system levels, also observes running applications, and makes decisions for other tasks. As such, the PT is self-optimising in real time. Thus the PT must be efficient, avoiding complex computations. However, the PT will make up for this by being highly distributed and by exploiting inputs from multiple agents that operate simultaneously around different time scales and around different resources, sharing the gathered information with the PT. This will be the philosophy driving the development of the PT, so that it is made of many distributed/concurrent lightweight components with hardly any synchronisation but maximising information sharing.
to:
Milestone M2.3: (24 Months after the start of the project) A working scheduler that includes some advanced features. Measured performance improvements over the simple prototype scheduler.
Milestone M2.4: (36 Months after the start of the project) All advanced features of the scheduler added and demonstrated working within the system, with measured performance improvements delivered to end-user applications, due to the dynamic monitoring and re-provisioning.

* Task4: Develop portal, including all sub-systems (Leader XLab with all partners; ICL 12PMs): The portal contains a specialization harness, a scheduler, a deployer, a repository of specialized binaries, a repository of application characteristics relevant to scheduling, statistics on application resource usage, and continuously updated state of the hardware resources. The portal receives instructions directly from human users, observes resources and interacts with lower system levels, observes running applications, and makes decisions for tasks. As such, the portal contains a scheduler that self-optimises the system in real time. Thus the portal and scheduler must be efficient and avoid complex computations. The portal will accomplish responsiveness by being highly distributed and by exploiting inputs from multiple agents that operate simultaneously around different time scales and around different resources, and that share the gathered information with the scheduler. This will be the philosophy driving the development of the portal, so that it is made of many distributed/concurrent lightweight components with only lightweight synchronisation while maximising information sharing.
March 09, 2014, at 09:05 PM by 80.114.135.137 -
Changed lines 31-33 from:
Milestone M2.2: (12 Months after the start of the project) A detailed description of the PT.
Deliverable D2.2: (21 Months after the start of the project) Technical Report defining the PT functionality, structure, and all of its components, and describing its integration as a DSM subsystem
* Task3: Develop provisioning tool -- manipulates each Cloud API (Leader ICL with all partners; ICL 12PMs): The PT itself is middleware that receives instructions from higher system levels and directly from human users, that observes resources and lower system levels, also observes running applications, and makes decisions for other tasks. As such, the PT is self-optimising in real time. Thus the PT must be efficient, avoiding complex computations. However, the PT will make up for this by being highly distributed and by exploiting inputs from multiple agents that operate simultaneously around different time scales and around different resources, sharing the gathered information with the PT. This will be the philosophy driving the development of the PT, so that it is made of many distributed/concurrent lightweight components with hardly any synchronisation but maximising information sharing.
to:
Milestone M2.2: (12 Months after the start of the project) A detailed description of the scheduler, in a Technical Report defining the scheduler functionality, structure, and all of its components, and describing its interfaces to and integration with the portal subsystems that monitor hardware and application features and with the DSM runtime system.
* Task3: Develop scheduler (Leader ICL with contributions to integration by XLab and Sean)
* Task4: Develop portal, including all sub-systems -- manipulates each Cloud API (Leader XLab with all partners; ICL 12PMs): The PT itself is middleware that receives instructions from higher system levels and directly from human users, observes resources and lower system levels, observes running applications, and makes decisions for other tasks. As such, the PT is self-optimising in real time. Thus the PT must be efficient and avoid complex computations. However, the PT will make up for this by being highly distributed and by exploiting inputs from multiple agents that operate simultaneously across different time scales and different resources, sharing the gathered information with the PT. This will be the philosophy driving the development of the PT, so that it is made of many distributed/concurrent lightweight components with hardly any synchronisation while maximising information sharing.
Changed line 35 from:
* Task4: Testing and performance tuning (IBM with all partners; ICL 6 PMs)
to:
* Task5: Testing and performance tuning (IBM with all partners; ICL 6 PMs)
March 09, 2014, at 08:54 PM by 80.114.135.137 -
Changed lines 27-31 from:
!!! WP 2: Cloud Deployment tool -- XLAB
* Task1: Develop portal -- gather specifications, produce design, make prototype, test with end-user, iterate
* Task2: Define provisioning tool -- interface with portal, with DSM runtime system, with end-user client, and with Cloud API for each kind of machine
* Task3: Develop provisioning tool -- manipulates each Cloud API
* Task4: Testing and performance tuning
to:
!!! WP 2: CloudDSM Portal -- XLAB
* Task1: Develop portal -- gather specifications, produce design, make prototype, test with end-user, iterate (Leader XLab, major contribution by ICL, with input by DC, Gigas, and Sean; ICL 2 PMs)
* Task2: Define scheduling component within portal -- interfaces with the other components inside the portal, which supply the information upon which to base decisions, and produces a schedule to be carried out (Leader ICL, major contribution by XLab, with input by DC, Gigas, and Sean; ICL 21PMs): The scheduler will exploit a self-aware dynamic analysis approach which, on-line and in real time, receives as input the state of all system resources, and calculates resource availability and expected execution times both for tasks currently active within the portal and for tasks expected to be launched, based on predictions derived from usage patterns. Scheduling decisions will combine detailed status information regarding expected response times, internal network delays, possible security risks, and possible reliability problems. The self-aware dynamic analysis will return a "short list" of the best instantaneous provisioning decisions, ranked according to performance, security, reliability, energy consumption, and other relevant metrics. Based on the task to be provisioned, the scheduler will be able to decide on provisioning rapidly. The portal and the collection of DSM runtime systems monitor whether the performance objectives of the task are being met; if the observations indicate an unsatisfactory outcome, the scheduler is re-triggered to decide again, based on the new system state as well as the overhead related to any changes in provisioning. (A hedged sketch of this short-list ranking appears just after this task list.)

Milestone M2.2: (12 Months after the start of the project) A detailed description of the PT.
Deliverable D2.2: (21 Months after the start of the project) Technical Report defining the PT functionality, structure, and all of its components, and describing its integration as a DSM subsystem
* Task3: Develop provisioning tool -- manipulates each Cloud API (Leader ICL with all partners; ICL 12PMs): The PT itself is middleware that receives instructions from higher system levels and directly from human users, observes resources and lower system levels, observes running applications, and makes decisions for other tasks. As such, the PT is self-optimising in real time. Thus the PT must be efficient and avoid complex computations. However, the PT will make up for this by being highly distributed and by exploiting inputs from multiple agents that operate simultaneously across different time scales and different resources, sharing the gathered information with the PT. This will be the philosophy driving the development of the PT, so that it is made of many distributed/concurrent lightweight components with hardly any synchronisation while maximising information sharing.
Deliverable D2.3: (27 Months after the start of the project) Technical Report and Open Source Software that implements the PT.
* Task4: Testing and performance tuning (IBM with all partners; ICL 6 PMs)
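As promised above, here is a minimal hedged sketch in C of the "short list" idea: candidate provisioning decisions are scored on the metrics named in Task2 and the best few are kept. The linear weighting, the weight values, and all identifiers are assumptions for illustration only, not the algorithm ICL will deliver.
[@
/* Sketch: score candidate provisioning decisions and keep a short list.
 * The metric set mirrors the text (performance, security, reliability,
 * energy); the linear weighting and all identifiers are illustrative. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *provider;
    double expected_response_s;  /* lower is better        */
    double security_risk;        /* 0..1, lower is better  */
    double reliability;          /* 0..1, higher is better */
    double energy_cost;          /* lower is better        */
    double score;                /* filled by rank_candidates() */
} Candidate;

static double score_candidate(const Candidate *c) {
    /* Illustrative weights; a real scheduler would tune or learn these. */
    return -1.0 * c->expected_response_s
           - 2.0 * c->security_risk
           + 1.5 * c->reliability
           - 0.5 * c->energy_cost;
}

static int by_score_desc(const void *a, const void *b) {
    double sa = ((const Candidate *)a)->score;
    double sb = ((const Candidate *)b)->score;
    return (sa < sb) - (sa > sb);   /* descending order */
}

static size_t rank_candidates(Candidate *c, size_t n, size_t shortlist_len) {
    for (size_t i = 0; i < n; i++)
        c[i].score = score_candidate(&c[i]);
    qsort(c, n, sizeof *c, by_score_desc);
    return n < shortlist_len ? n : shortlist_len;
}

int main(void) {
    Candidate c[] = {
        {"providerA", 2.0, 0.10, 0.990, 1.0, 0},
        {"providerB", 1.2, 0.40, 0.950, 2.0, 0},
        {"providerC", 3.5, 0.05, 0.999, 0.5, 0},
    };
    size_t keep = rank_candidates(c, 3, 2);   /* the "short list" */
    for (size_t i = 0; i < keep; i++)
        printf("%zu: %s (score %.2f)\n", i, c[i].provider, c[i].score);
    return 0;
}
@]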
February 27, 2014, at 08:57 AM by Mariano - WP1 tasks; XLAB
Changed lines 5-6 from:
* WP 1: Coordination and Management -- XLab
* WP 2: Cloud Deployment tool -- XLab
to:
* WP 1: Coordination and Management -- XLAB
* WP 2: Cloud Deployment tool -- XLAB
Changed lines 15-24 from:
-] which wp to place development tools, such as Eclipse plugin and code style checker?

-] which wp to place the research regarding best division of work and best deployment of it onto available hardware?



!!! WP 1: Coordination and Management -- XLab
-] Task1: to be filled in

!!! WP 2: Cloud Deployment tool -- XLab
to:
-] which WP to place development tools, such as Eclipse plug-in and code style checker?

-] which WP to place the research regarding best division of work and best deployment of it onto available hardware?



!!! WP 1: Coordination and Management -- XLAB
* Task1: Project management and logistics - XLAB
* Task2: Administrative management and resource monitoring - XLAB
* Task3: Intellectual property management - ?


!!! WP 2: Cloud Deployment tool -- XLAB
Changed lines 71-80 from:
-- Sean leads task, partners for 2 through 6, plus Imperial, plus INRIA, plus XLab participate on aspects that involve their individual piece. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLab will advise on how the architecture impacts the deployment tool. Partners for 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLab)

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLab)

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLab

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLab
to:
-- Sean leads task, partners for 2 through 6, plus Imperial, plus INRIA, plus XLAB participate on aspects that involve their individual piece. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLAB will advise on how the architecture impacts the deployment tool. Partners for 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLAB)

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLAB)

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLAB

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLAB
Changed lines 83-84 from:
* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, report to the partners who developed the individual runtimes in tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads -- partners for 2 through 6, plus XLab are involved.
to:
* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, report to the partners who developed the individual runtimes in tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads -- partners for 2 through 6, plus XLAB are involved.
Changed line 102 from:
-- Sean and INRIA and Douglas Connect will lead, with input from XLab, Imperial, and partners for WP3 tasks 2 through 6.
to:
-- Sean and INRIA and Douglas Connect will lead, with input from XLAB, Imperial, and partners for WP3 tasks 2 through 6.
February 26, 2014, at 07:05 PM by 192.16.201.181 -
Added line 66:
Deleted lines 93-94:
Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.
Added lines 101-102:

Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.
February 26, 2014, at 07:03 PM by 192.16.201.181 -
Changed line 61 from:
Comment and Question: "if the DSM detects that it is running late, it can issue a request to the portal to add more machines to the on-going computation. The resource tool re-calculates the division of work." Question: "What information is the portal going to base it's decisions on?" Answer: the DSM runtime running inside a given VM communicates status to the portal. Annotations might be inserted into the low-level source form that help the DSM runtime with this task, or the DSM runtime may end up handling this all by itself.
to:
Comment and Question: "if the DSM detects that it is running late, it can issue a request to the portal to add more machines to the on-going computation. The resource tool re-calculates the division of work." Question: "What information is the portal going to base its decisions on?" Answer: the DSM runtime running inside a given VM communicates status to the portal. Annotations might be inserted into the low-level source form that help the DSM runtime with this task, or the DSM runtime may end up handling this all by itself.
February 26, 2014, at 07:01 PM by 192.16.201.181 -
Changed lines 37-39 from:
Comment: "A virtual processor can yield a physical processor through a POWER Hypervisor call. It puts idle virtual processors into a hibernation state so that they do not consume any resources."

Comment: "For IBM, our special loader runs the application as a thread within the same process as the optimizer and kicks off dynamic monitoring and recompilation. So one dynamic optimization process per application in a VM."
to:
Comment: "A VM can yield a physical processor through a POWER Hypervisor call (instruction?). It puts idle VMs into a hibernation state so that they do not consume any physical CPU resources."

Comment: "For IBM, our binary-reoptimizer tool has a special loader that runs the application as a thread within the same process as the optimizer and kicks off dynamic monitoring and recompilation. So there is one dynamic optimization process for each application running in a VM."
Changed lines 43-47 from:
Comment: On IBM there are two virtualization layers: Power hypervisor, and Cloud stack hypervisor. The Power hypervisor is not visible to the cloud stack, and is also much faster.


Question: "Why is it advantageous to have Gigas's KVM based Cloud stack, which allows an entire machine to be given exclusively to a given computation task." Answer: This allows the DSM runtime to manage the assignment of work onto cores, which happens on the order of nano-seconds, far faster than a hypervisor could manage assignment of work onto cores. This is how EC2 from Amazon works, as well, for example. It doesn't waste, because the machine is fully occupied by the computation..
to:
Comment: "On IBM there are two virtualization layers: Power hypervisor, and Cloud stack hypervisor. The Power hypervisor is not visible to the cloud stack, and is also much faster."


Question: "Why is it advantageous to have Gigas's KVM based Cloud stack, which allows an entire machine to be given exclusively to a given computation task. Isn't this wasteful?" Answer: This allows the DSM runtime to manage the assignment of work onto cores, which happens on the order of nano-seconds, far faster than a hypervisor could manage assignment of work onto cores. This assignment of whole machine to a VM is how EC2 from Amazon works, as well, for example. It doesn't waste, because the machine is fully occupied by the computation.
February 26, 2014, at 06:55 PM by 192.16.201.181 -
Changed line 33 from:
Note that when a DSM command is issued by application code, the application context is suspended. If there are no ready contexts within the application, then the proto-runtime system can switch to work from a different application within nano-seconds. This allows fine-grained interleaving of work with communication. In contrast, if the time to switch among applications is on the order of hundreds of micro-seconds, which it would be if the CPU had to switch to a different Cloud level VM, then the processor is better off simply sitting idle for any non-overlapped communication that is shorter than the switch-application time. In many situations, this will be the case, causing much lost performance and lost hardware utilization, and the customer being charged for idle CPU time.
to:
Note that when a DSM command is issued by application code, the application context is suspended. If there are no ready contexts within the application, then the proto-runtime system can switch to work from a different application within nano-seconds. This allows fine-grained interleaving of work with communication. In contrast, if the time to switch among applications is on the order of hundreds of micro-seconds, which it would be if the CPU had to switch to a different Cloud level VM, then the processor is better off simply sitting idle for any non-overlapped communication that is shorter than the switch-application time. In many situations, this will be the case, causing much lost performance and lost hardware utilization, and the customer being charged for idle CPU time. It is this loss that will be prevented by allowing a single VM, with its single DSM runtime instance inside it, to run work for multiple applications, and have the DSM runtime switch among the applications in its very fast way.
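Below is a minimal runnable sketch of the suspend-on-DSM-call behavior just described, written with portable ucontext calls so it actually executes. The real proto-runtime switch is a few assembly instructions and far faster; dsm_read and every other identifier here are illustrative stand-ins, not the real API.
[@
/* Sketch: when application code issues a DSM command, its context
 * suspends and the runtime resumes some other ready context -- possibly
 * from a different application. Portable ucontext calls stand in for
 * proto-runtime's much faster assembly context switch. */
#include <ucontext.h>
#include <stdio.h>

static ucontext_t scheduler_ctx, app_a_ctx, app_b_ctx;

/* Stand-in for a DSM command: start the (remote) operation, then yield
 * back to the runtime so the CPU can run other ready work meanwhile. */
static void dsm_read(ucontext_t *self) {
    puts("dsm_read issued; suspending this application context");
    swapcontext(self, &scheduler_ctx);  /* nanosecond-scale in the real runtime */
}

static void app_a(void) { dsm_read(&app_a_ctx); puts("app A resumed"); }
static void app_b(void) { puts("app B ran while A's DSM read was in flight"); }

int main(void) {
    static char stack_a[64 * 1024], stack_b[64 * 1024];

    getcontext(&app_a_ctx);
    app_a_ctx.uc_stack.ss_sp = stack_a;
    app_a_ctx.uc_stack.ss_size = sizeof stack_a;
    app_a_ctx.uc_link = &scheduler_ctx;
    makecontext(&app_a_ctx, app_a, 0);

    getcontext(&app_b_ctx);
    app_b_ctx.uc_stack.ss_sp = stack_b;
    app_b_ctx.uc_stack.ss_size = sizeof stack_b;
    app_b_ctx.uc_link = &scheduler_ctx;
    makecontext(&app_b_ctx, app_b, 0);

    swapcontext(&scheduler_ctx, &app_a_ctx); /* run A until it suspends   */
    swapcontext(&scheduler_ctx, &app_b_ctx); /* run B's ready work        */
    swapcontext(&scheduler_ctx, &app_a_ctx); /* "read complete": resume A */
    return 0;
}
@]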
February 26, 2014, at 06:51 PM by 192.16.201.181 -
Changed line 33 from:
Note that when a DSM command is issued, the application context is suspended. If there are no ready contexts within the application, then the proto-runtime system can switch to work from a different application within nano-seconds. This allows fine-grained interleaving of work with communication. In contrast, if the time to switch among applications is on the order of hundreds of micro-seconds, which it would be if the CPU had to switch to a different Cloud level VM, then the processor is better off simply sitting idle for any non-overlapped communication that is shorter than the switch-application time. In many situations, this will be the case, causing much lost performance and lost hardware utilization, and the customer being charged for idle CPU time.
to:
Note that when a DSM command is issued by application code, the application context is suspended. If there are no ready contexts within the application, then the proto-runtime system can switch to work from a different application within nano-seconds. This allows fine-grained interleaving of work with communication. In contrast, if the time to switch among applications is on the order of hundreds of micro-seconds, which it would be if the CPU had to switch to a different Cloud level VM, then the processor is better off simply sitting idle for any non-overlapped communication that is shorter than the switch-application time. In many situations, this will be the case, causing much lost performance and lost hardware utilization, and the customer being charged for idle CPU time.
February 26, 2014, at 06:50 PM by 192.16.201.181 - second Dorit question, in WP 5
Changed lines 50-59 from:
6) What do we assume in terms of the relationship between the cloud admin and the DSM system?
Do we view the cloud as just a pool of resources that once assigned to us we have exclusive control of?
The portal knows about the physical providers, and what hardware is at each, and what Cloud API commands to use to start VMs at them, and what Cloud API commands to use to learn how busy each location is. The DSM runtime can only report how much work it has inside it, but cannot know how busy the machines are.. hopefully the CloudStack API has some way of informing the portal about uptime, or load, within a given physical location. Then the portal can choose among the available providers, to decide where best to run the work.

Or might the cloud management layer (either automatically or via the cloud admin) move VMs around or reallocate resources (for workload consolidation, or other considerations)?

I would like for the portal/deployment-tool to have the ability to re-allocate work. I don't know whether CloudStack allows controlling which physical machine a VM is assigned to, via its API.. but the work can always be taken away from a poorly performing VM and given to a better performing one, or one started at a different physical location.

Is it possible that the cloud would want to move a VM from one core to another?
That is more inside the runtime system. It's too fine grained for the portal to manage. This is the reason that the KVM approach of giving the DSM the entire machine is good.. then the DSM runtime controls what work runs on what core.
to:
Question: "How does the cloud VM layer relate to the DSM runtime and the CloudDSM portal?" Answer: There is one instance of the DSM runtime inside each Cloud level VM. The DSM might directly use hypervisor commands to cause the VM it is inside of to fast-yield/sleep, at the point the DSM runtime detects that it has no ready work. The portal, though, decides when the VM should be long-term suspended or shutdown. The Cloud VM is given all the cores of a machine whenever possible, then the DSM within directly manages assigning application work to the cores, which includes suspending execution contexts at the point they perform a DSM call or synchronization. The portal runs inside its own Cloud VMs. It performs Cloud level control of creating VMs, suspending them to disk, starting DSM runtimes inside them, receiving command requests from application front-ends, and starting work-units within chosen VMs. The portal knows about the physical providers, and what hardware is at each, and what Cloud API commands to use to start VMs at them, and what Cloud API commands to use to learn how busy each location is.. hopefully the Cloud stack API has some way of informing the portal about uptime, or load, within a given physical location. The portal has a set of VMs that run code that performs the decision making about how to divide the work of a request among the providers and among the VMs created within a given provider. It may be advantageous for the portal to have Cloud APIs available that expose the characteristics of the hardware within the provider, and allows a measure of control over assignment of VMs to the hardware. The DSMs report status to the portal, which may decide to take work away from poor performing VMs and give it to others, perhaps even VMs in a different physical location. (It is unlikely that a VM itself will be migrated, but rather the work assigned to the DSM runtime inside the VM).

Question: "Is it possible that the cloud would want to move a VM from one core to another?" Answer: control over cores is inside the DSM runtime system. It's too fine grained for the Cloud stack or portal to manage. When a VM is created, all the cores of the physical machine should be given to that VM, and the hypervisor should let that VM own the hardware for as long as possible, ideally until the DSM runtime signals that it has run out of work
.
Changed lines 54-58 from:
Might there be a situation of "ping-pong" between the decisions the DSM makes and the decisions the cloud makes, resulting in a VM being constantly migrated back and forth?
Or are the VMs created and orchestrated by the DSM system not visible to the cloud?
I see a hierarchical division of work.. at the highest level, work is divided among physical locations.. then at a given location, the work it receives is divided among VMs created within that Cloud host. If the host allows entire machines to be allocated, in the way Gigas and Amazon do, then a further division of work can be performed, which the DSM system manages. Hence, only one level of work division takes place at the VM level.

For your tool, the optimization I see is in adjusting the size of work.. it's not clear whether the Dorit tool will make decisions about how to divide the work. Those decisions will likely happen inside the DSM runtime.. then it will be up to the Dorit tool to modify the code such that it actually performs work in the DSM-chosen size of chunk. We still need to work through the details of how the DSM runtime will interact with the Dorit tool :-)
to:
Question: "How do units of work map onto Cloud level entities?" Answer: I see a hierarchical division of work.. at the highest level, work is divided among physical locations.. then at a given location, the work it receives is divided again, among VMs created within that Cloud host. If the Cloud stack at a host allows entire machines to be allocated, in the way Gigas and Amazon EC2 do, then a further division of work is performed, among the CPUs in the machine, which is managed by the DSM runtime. Hence, the portal invokes APIs to manage work chunks starting execution at the level of providers and at the level of Cloud VMs. The DSM system inside a given VM manages work chunks starting on individual cores inside a VM.
Changed lines 57-73 from:

-- If the cloud is aware of our VMs and can play with their allocation/resources, we may want to introduce some APIs between the DSM system and the cloud -- for better collaboration and/or to accept notifications from the cloud upon changes it makes.
Hmmm.. I'm not clear on the "our" in "our VMs"..? There is a portal that is responsible for creating VMs, via the Cloud API. Each VM is within one physical hosting location. The Cloud stacks work such that each location has a particular URL used for creating VMs within that location AFAIU. The portal will only use these Cloud stack supplied mechanisms to create VMs and manage their location/allocation.

Within a given VM, the DSM runtime will create its own "virtual processors", or VPs, which are equivalent to threads.. all VPs are inside
the same hardware-enforced coherent virtual address space. The DSM will switch among these, in order to overlap communications to remote DSM instances, which are running inside different VMs.

There should only be one layer of VMs, which are created by the portal..

Perhaps we should talk more about how you see the binary optimizer interacting with these other parts?

-- It sounds like our focus is only on the application side. We don't care about best resource allocation from the cloud perspective. Such dual optimization target problem is very difficult indeed and may be out of scope. I just wanted to make sure this is an informed decision (to focus only on the application side) and that the call doesn't expect the other aspect (general cloud utilization) to be considered as well.

The resource allocation will be done hierarchically. At the top level, the portal will cooperate with a DSM instance to choose the best division among locations, and then within a location to decide the best division among VMs. Erol Gelenbe will be handling the algorithms for this part of things.

Within a given VM, another level of work-division and resource allocation will be done, in isolation from the rest. This is the core-by-core level of allocation. The DSM runtime will handle this level of work-division, and the allocation of work-units to cores.

The Dorit tool may interact with the DSM system.. it's not clear yet just how they will interact..
to:
Within a given VM, the DSM runtime will create its own "virtual processors", or VPs, which are analogous to threads.. all VPs are inside the same hardware-enforced coherent virtual address space. The DSM will switch among these, in order to overlap communications to remote DSM instances, which are running inside different VMs.

Question: "What components of the CloudDSM system care about best resource allocation from
the cloud perspective?" Answer: The calculation of best cloud level resource allocation is encapsulated inside a module that the portal runs. The portal collects information from the application, which the toolchain packages, and the portal collects information about the hardware at each provider, and about the current load on that hardware, and it collects statistics on each of the commands invoked by a given application, and it gives all of this information to the resource-calculation module. That module determines the best way to break up the work represented by the user command, and distribute the pieces across providers and across VMs. Within a VM, the DSM runtime system independently and dynamically decides allocation among the CPUs. Erol Gelenbe at Imperial College will be handling the algorithms for the Cloud level resource calculations.
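A sketch, in C, of the resource-calculation module's boundary as just described: the portal hands in the toolchain-packaged application info, hardware descriptions, load samples, and per-command statistics, and receives a placement plan back. Every type, field, and function name is a hypothetical illustration, not the interface Imperial College will design.
[@
/* Sketch of the resource-calculation module's boundary. All types and
 * names are hypothetical stand-ins for the real design. */
#include <stddef.h>

typedef struct { const char *provider; int cores; double mem_gb; } HwDescription;
typedef struct { const char *provider; double load; } LoadSample;
typedef struct { const char *command; double mean_runtime_s; long times_run; } CommandStats;

typedef struct {              /* one piece of the user command's work */
    const char *provider;
    int    vm_index;
    double work_units;
} Placement;

/* The module consumes application info packaged by the toolchain plus
 * hardware, load, and per-command statistics, and emits placements;
 * returns the number of placements written into out[]. */
size_t compute_placement(const char *user_command,
                         const HwDescription *hw, size_t n_hw,
                         const LoadSample *load, size_t n_load,
                         const CommandStats *stats, size_t n_stats,
                         Placement *out, size_t max_out);
@]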
Changed lines 62-73 from:
7) About: "if the DSM detects that it is running late, it can issue a request to the deployment tool to
add more machines to the on-going computation. The DSM system re-calculates the division of work."

- What
information is the DSM system going to base it's decisions on? Maybe it would be useful if we provide feedback information (from monitoring the application from within the VM) to help the DSM system in this process ?

That may, indeed, be helpful.. for x86, I was planning to use either Linux perftool performance counters or simply the x86 time-stamp instruction that returns CPU cycles
. The proto-runtime system is architected for such measurements..

I've been viewing the Dorit tool as something that interrupts work in progress, changes the executable, then resumes work in progress.. I haven't been seeing it as a runtime system that controls what work is assigned to what core.. was that how you've been thinking?

Wow.. awesome discussion :-) Thank you :-) This is how progress happens!
to:
Comment and Question: "if the DSM detects that it is running late, it can issue a request to the portal to add more machines to the on-going computation. The resource tool re-calculates the division of work." Question: "What information is the portal going to base it's decisions on?" Answer: the DSM runtime running inside a given VM communicates status to the portal. Annotations might be inserted into the low-level source form that help the DSM runtime with this task, or the DSM runtime may end up handling this all by itself.
Changed lines 109-111 from:
to:
Question: "How does the IBM fat-binary specializer interact with the DSM runtime system?" Answer: AFAIU, the re-optimizer interrupts work in progress, changes the executable, then resumes work in progress. But it doesn't controls what work is assigned to what core. The optimizations it performs are single-thread optimizations, and also the re-optimizer may be told by the DSM runtime or by the portal to adjust the code such that the size of chunk of work performed between DSM calls is adjusted, or the layout or access pattern of data is adjusted. It is not clear yet whether the re-optimizer tool will make decisions on its own about the best chunk size. It might communicate performance feedback to the DSM runtime, and optimal chunk size decisions are made there. Or those decisions may be passed along to the portal. Wherever they are ultimately made, it will be up to the Dorit tool to modify the code such that it actually performs work in the chosen size of chunk. It still remains to work through the details of how the DSM runtime will interact with the Dorit tool.
Changed line 138 from:
Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts?
to:
Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts?
February 26, 2014, at 04:44 PM by 192.16.201.181 - second Dorit question, in WP 5
Changed lines 20-27 from:
On Sun, Feb 23, 2014 at 6:32 AM, Dorit Nuzman <DORIT@il.ibm.com> wrote:



3) About: " 7.
2.1. Sharing a VM among multiple applications." :
I'm wondering why this item is under IBM only? isn't it a generic issue?

Yes. It is a generic issue.. think it was under IBM because OpenStack has an API extension that allows suspend/resume of VMs..
to:

!!! WP 1: Coordination and Management -- XLab
-] Task1: to be filled in

!!! WP 2: Cloud Deployment tool -- XLab
* Task1: Develop portal -- gather specifications, produce design, make prototype, test with end-user, iterate
* Task2: Define provisioning tool -- interface with portal, with DSM runtime system, with end-user client, and with Cloud API for each kind of machine
* Task3: Develop provisioning tool -- manipulates each Cloud API
* Task4: Testing and performance tuning


Question: "Why would a single Cloud VM be used to perform computation tasks from multiple applications?" Answer: from an overhead perspective, sharing a single "worker" VM among multiple applications is the most efficient way to go
. But there must be isolation between applications, so this raises security concerns. Those have to be addressed, which requires extra development effort. So there's a balance between performance (overhead), security, and effort (IE, add security features to CloudDSM runtime is extra effort).

Note that when a DSM command is issued, the application context is suspended. If there are no ready contexts within the application, then the proto-runtime system can switch to work from a different application within nano-seconds. This allows fine-grained interleaving of work with communication. In contrast, if the time to switch among applications is on the order of hundreds of micro-seconds, which it would be if the CPU had to switch to a different Cloud level VM, then the processor is better off simply sitting idle for any non-overlapped communication that is shorter than the switch-application time. In many situations, this will be the case, causing much lost performance and lost hardware utilization, and the customer being charged for idle CPU time.

Question: "Doesn't the hypervisor automatically handle suspending an idle VM under the covers? Why is an explicit sleep/wake command needed in the CloudAPI?" Answer: When an application is started, the portal may over-provision, starting VMs on many machines, and then putting them to sleep. As soon as the application issues a high computation request, the VMs are woken and given work to do, with expected duration on the order of seconds to at most a minute. When the computation is done, most of the VMs are put back to sleep until the next time. The time to suspend and resume is less than the time to start the VM.

Comment: "A virtual processor can yield a physical processor through a POWER Hypervisor call. It puts idle virtual processors into a hibernation state so that they do not consume any resources."

Comment: "For IBM, our special loader runs the application as a thread within the same process as the optimizer and kicks off dynamic monitoring and recompilation. So one dynamic optimization process per application in a VM."
Changed lines 41-50 from:
4) Still on the same topic:

" 7.2.
1 Details still must be worked out around whether a
single VM will be used to perform computation tasks from multiple applications or not
."

-- there's no problem to run different applications on a single VM (I think I don't understand the question... there's a full OS running on the VM which of course can run multiple applications. Each application will be loaded with our special loader that runs the application as a thread within the same process as the optimizer and kicks off dynamic monitoring and recompilation. So one dynamic optimization process per application in a VM).

Sure.. the issue is security.. different end-users want their applications to be secure from each other.. a VM completely isolates from all other VMs, which is what makes Cloud acceptable to corporates..

However, from an overhead perspective, sharing a single "worker" VM among multiple applications is the most efficient way to go.. so it's a balance between performance (overhead), security, and effort (IE, add security features to CloudDSM runtime is extra effort)
to:
Comment: "In summary, there are two cases: (1) A given application has no work ready, with which to overlap communication (2) The whole DSM runtime system has no work ready. In case 1, the sharing of the hardware is more efficiently performed inside the DSM runtime, where it happens on the order of nanoseconds, which is why it's better for the DSM runtime to handle switching among applications. In case 2, the hypervisor has no way to know that the DSM runtime has no work! It sees the polling the DSM does while waiting for new incoming work as a busy application, so the hypervisor keeps the DSM going. The DSM runtime needs a way to tell the hypervisor to suspend the idle VM. After all, the polling consumes cpu time which the end-user pays money for! At the moment, proto-runtime simply uses pthread yield when it is idle, polling for new incoming work.. but it's not clear that will be enough to get the VM to stop re-scheduling it.. Also, from a higher level, the deployment tool knows periods when there's no work for a specific DSM runtime instance, and can issue a sleep/yield to the VM, which is faster than a full suspend-to-disk..

Comment: On IBM there are two virtualization layers: Power hypervisor, and Cloud stack hypervisor. The Power hypervisor is not visible to the cloud stack, and is also much faster.


Question: "Why is it advantageous to have Gigas's KVM based Cloud stack, which allows an entire machine to be given exclusively to a given computation task." Answer: This allows the DSM runtime to manage the assignment of work onto cores, which happens on the order of nano-seconds, far faster than a hypervisor could manage assignment of work onto cores. This is how EC2 from Amazon works, as well, for example. It doesn't waste, because the machine is fully occupied by the computation..



6) What do we assume in terms of the relationship between the cloud admin and the DSM system?
Do we view the cloud as just a pool of resources that once assigned to us we have exclusive control of?
The portal knows about the physical providers, and what hardware is at each, and what Cloud API commands to use to start VMs at them, and what Cloud API commands to use to learn how busy each location is. The DSM runtime can only report how much work it has inside it, but cannot know how busy the machines are.. hopefully the CloudStack API has some way of informing the portal about uptime, or load, within a given physical location. Then the portal can choose among the available providers, to decide where best to run the work.

Or might the cloud management layer (either automatically or via the cloud admin) move VMs around or reallocate resources (for workload consolidation, or other considerations)?

I would like for the portal/deployment-tool to have the ability to re-allocate work. I don't know whether CloudStack allows controlling which physical machine a VM is assigned to, via its API.. but the work can always be taken away from a poorly performing VM and given to a better performing one, or one started at a different physical location.

Is it possible that the cloud would want to move a VM from one core to another?
That is more inside the runtime system. It's too fine grained for the portal to manage. This is the reason that the KVM approach of giving the DSM the entire machine is good.. then the DSM runtime controls what work runs on what core.
Changed lines 61-86 from:

" If not, then hopefully the IBM cloud stack includes a high performance way of suspending/putting to sleep a VM
and then quickly waking it (on the order of microseconds to suspend/resume).
When an application is started, the portal may over-provision, starting many VMs and DSMruntimes
inside them, and then put most of them to sleep.. only when a command comes in from the user-client
that the portal knows will require high computation does it wake up the VMs. A VM remains awake
for the duration of the request, expected to be on the order of seconds to at most a few minutes.
When the computation is done, the VMS is put back to sleep, until the application issues the next
high computation request/command."

-- AFAIU, the Power virtualization mechanism sort of takes care of this automatically under the covers. If there is competition for additional processing capacity among several VMs, the Hypervisor distributes unused processor capacity to the eligible VMs;
Also, to optimize physical processor utilization, a virtual processor (fraction of a physical processor assigned to a VM) will yield a physical processor if it has no work to run. A virtual processor can yield a physical processor through a POWER Hypervisor call. I think this is called Virtual Processor Folding: the ability to put idle virtual processors into a hibernation state so that they do not consume any resources. It is configurable via a set of commands. Every second, the kernel scheduler evaluates the number of virtual processors that should be activated to accommodate the physical utilization of the VM. If the number yields a high virtual processor utilization, the base number of virtual processors required is incremented to enable the workload to expand. If the number is less than the number of virtual processors that are currently activated, a virtual processor is deactivated.


Right.. the issue is the DSM runtime system.. it knows when it has no work, and knows when it can sleep.. but if there is no command to cause the VM to sleep, then the runtime just busy-waits, polling for work that never arrives.. The VM sees that busy-wait polling as application activity.. it doesn't know the difference.. so it consumes VM time for the busy waiting, which is charged to the end-user. And the end-user pays money for it. Cloud is pay-per-cpu-hour!

Perhaps there are other, better, mechanisms by which the DSM runtime can get the VM to turn off.. right now, it simply uses pthread yield.. but it's not clear that will be enough to get the VM to stop re-scheduling it..

From a higher level, the deployment tool knows exactly when there's no work for a specific DSM runtime instance, and can issue a suspend to the VM..


In addition there is a capability to Suspend a VM, which indicates it's in standby/hibernated state, and all of its resources can be used by other partitions. I don't know how long it takes to suspend/resume a VM. And there are also several pre-requisites for this capability that I think we don't have in our lab and we'll have to figure out how to make them available for the project. I'll try to find out... (I also want to understand when the automatic mechanisms I mentioned above are not sufficient, and we really need to actually suspend a VM...).

BTW, what do you think should be the size of our Power cloud for the project? Say we have a Power7 server with 8 cores, 4-way SMT each. Are 2 such servers sufficient? Or is it preferable to obtain many more/much larger servers?

Two servers should be enough to demonstrate the value.. maybe ask during a telco, to get opinions of others..
to:
Might there be a situation of "ping-pong" between the decisions the DSM makes and the decisions the cloud makes, resulting in a VM being constantly migrated back and forth?
Or are the VMs created and orchestrated by the DSM system not visible to the cloud?
I see a hierarchical division of work.. at the highest level, work is divided among physical locations.. then at a given location, the work it receives is divided among VMs created within that Cloud host. If the host allows entire machines to be allocated, in the way Gigas and Amazon do, then a further division of work can be performed, which the DSM system manages. Hence, only one level of work division takes place at the VM level.

For your tool, the optimization I see is in adjusting the size of work.. it's not clear whether the Dorit tool will make decisions about how to divide the work. Those decisions will likely happen inside the DSM runtime.. then it will be up to the Dorit tool to modify the code such that it actually performs work in the DSM-chosen size of chunk. We still need to work through the details of how the DSM runtime will interact with the Dorit tool :-)
Changed lines 68-70 from:
One thing to mention is that any VMs that we create/suspend/resume directly via the Power virtualization layer, and not via the cloud management layer, will not be visible to the cloud. If we did want to create/suspend VMs via the cloud management layer, that would take much longer.

Hmmm.. I must have missed something.. so, there are two separate virtualization layers? Does the CloudStack layer run on top of some IBM-specific virtualization? In any case, it's a performance thing, so we can get to it during the heart of the project.. won't be a fail-point, just be listed as a "todo detail, if want to make the technology commercial"
to:

-- If the cloud is aware of our VMs and can play with their allocation/resources, we may want to introduce some APIs between the DSM system and the cloud -- for better collaboration and/or to accept notifications from the cloud upon changes it makes.
Hmmm.. I'm not clear on the "our" in "our VMs"..? There is a portal that is responsible for creating VMs, via the Cloud API. Each VM is within one physical hosting location. The Cloud stacks work such that each location has a particular URL used for creating VMs within that location AFAIU. The portal will only use these Cloud stack supplied mechanisms to create VMs and manage their location/allocation.

Within a given VM, the DSM runtime will create its own "virtual processors", or VPs, which are equivalent to threads.. all VPs are inside the same hardware-enforced coherent virtual address space. The DSM will switch among these, in order to overlap communications to remote DSM instances, which are running inside different VMs.

There should only be one layer of VMs, which are created by the portal..

Perhaps we should talk more about how you see the binary optimizer interacting with these other parts?

-- It sounds like our focus is only on the application side. We don't care about best resource allocation from the cloud perspective. Such dual optimization target problem is very difficult indeed and may be out of scope. I just wanted to make sure this is an informed decision (to focus only on the application side) and that the call doesn't expect the other aspect (general cloud utilization) to be considered as well.

The resource allocation will be done hierarchically. At the top level, the portal will cooperate with a DSM instance to choose the best division among locations, and then within a location to decide the best division among VMs. Erol Gelenbe will be handling the algorithms for this part of things.

Within a given VM, another level of work-division and resource allocation will be done, in isolation from the rest. This is the core-by-core level of allocation. The DSM runtime will handle this level of work-division, and the allocation of work-units to cores.

The Dorit tool may interact with the DSM system.. it's not clear yet just how they will interact..
Added lines 86-127:
7) About: "if the DSM detects that it is running late, it can issue a request to the deployment tool to
add more machines to the on-going computation. The DSM system re-calculates the division of work."

- What information is the DSM system going to base it's decisions on? Maybe it would be useful if we provide feedback information (from monitoring the application from within the VM) to help the DSM system in this process ?

That may, indeed, be helpful.. for x86, I was planning to use either Linux perftool performance counters or simply the x86 time-stamp instruction that returns CPU cycles. The proto-runtime system is architected for such measurements..

I've been viewing the Dorit tool as something that interrupts work in progress, changes the executable, then resumes work in progress.. I haven't been seeing it as a runtime system that controls what work is assigned to what core.. was that how you've been thinking?

Wow.. awesome discussion :-) Thank you :-) This is how progress happens!


!!! WP 3: DSM Runtime System -- Sean leads WP (at INRIA, or Bell Labs or CWI)
* Task1: Architecture of hierarchical DSM runtime system. The delivered runtime will be a federation. On each virtual machine, or in some cases physical machine, an individual runtime system will be in operation. Each of these individual runtime systems will interact with the others to form a collective, cooperative, overall runtime system, which presents a single address space abstraction. This task will define the architecture, interfaces, and protocols for this overall runtime system. Then, for each class of hardware, an individual task will implement the interfaces and protocols for that particular kind of hardware, as described below.
-- Sean leads task, partners for 2 through 6, plus Imperial, plus INRIA, plus XLab participate on aspects that involve their individual piece. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLab will advise on how the architecture impacts the deployment tool. Partners for 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLab)

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLab)

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLab

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLab

* Task6: binary runtime specializer -- IBM (Q: any chance to get open source release? Or to apply to other ISA?). Use fat binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to IBM Power architecture, but it provides the blue print for implementing for other ISAs as well.

* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, report to the partners who developed the individual runtimes in tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads -- partners for 2 through 6, plus XLab are involved.

* Question: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.


[[CloudDSM.RuntimeSpecializationWP]]

!!! WP 4: Application visible interface design
* Task 1: Define more precisely the class of applications that CloudDSM targets
* Task 2: Define the needs of the toolchain, what degrees of freedom it needs in order to accomplish the desired transforms of the source code.
* Task 3: Define the needs of the runtime system, what characteristics it needs in the code generated by the toolchain in order to deliver high performance.
* Task 4: Define the needs of the application developer, what mental models and what syntax, and what debugging and code-checking support they desire.
* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer. There will be two levels of code annotation.. one that high level application developers see and use, of which there will be many variations. For example, Reo will have a different high level user interface than the pragma system for OpenMP. The second, lower level will be common to all versions of the higher level interface. This will be used directly by the toolchain to perform code transforms. Each of the higher level forms will be translated into the same, common, lower level form. This task only considers the top level forms of the code. WP 5 separately defines the common lower level form.

Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.
Changed lines 129-156 from:
5) About: "- 7.1 Gigas has a KVM based Cloud stack, which allows control over physical machines. An entire machine
can be given exclusive use for a given computation task, if desired. This allows, for example, and
application to allocate all 64 cores and the entire memory space of their largest machine. This is done
via
the Cloud interface Gigas presents.
This will be useful for
the CloudDSM runtime system, allowing it to take full advantage of the
available hardware, and minimize overhead imposed by
the hypervisor switching among multiple users
sharing
the machine."

... obviously this may be wasteful from the general cloud point of view... I assume we're not restricting ourselves to this scenario?



I'm not sure whether it would be wasteful.. This is how EC2 from Amazon works, for example.. it's important to get the entire machine assigned to high computation tasks.. It doesn't waste, because the machine is fully occupied by the computation..

This goes back to the question above.. performance vs security vs effort.. to get the performance, want this capability, to prevent waste, want to share among multiple applications.. but then get security issues into the picture.. and the added effort of security measures to ensure that separate applications can't interact..

This brings me to another question:

6) What do we assume in terms of the relationship between the cloud admin and the DSM system?
Do we view the cloud as just a pool of resources that once assigned to us we have exclusive control of?
The portal knows about the physical providers, and what hardware is at each, and what Cloud API commands to use to start VMs at them, and what Cloud API commands to use to learn how busy each location is. The DSM runtime can only report how much work it has inside it, but cannot know how busy the machines are.. hopefully the CloudStack API has some way of informing the portal about uptime, or load, within a given physical location. Then the portal can choose among the available providers, to decide where best to run the work.

Or might the cloud management layer (either automatically or via the cloud admin) move VMs around or reallocate resources (for workload consolidation, or other considerations)?

I would like for the portal/deployment-tool to have the ability to re-allocate work. I don't know whether CloudStack allows controlling which physical machine a VM is assigned to, via its API.. but the work can always be taken away from a poorly performing VM and given to a better performing one, or one started at a different physical location.

Is it possible that the cloud would want to move a VM from one core to another?
That is more inside the runtime system. It's too fine grained for the portal to manage. This is the reason that the KVM approach of giving the DSM the entire machine is good.. then the DSM runtime controls what work runs on what core.
to:
Tasks 1 through 5 will be performed iteratively, with multiple revisions during the first six months of the project. Each of the tasks will have an impact on the other tasks, and it will require a large amount of communication, via iterations, in order to find a suitable common ground interface that supports all the aspects well.

* Task 6: Development tools to support the writing of application code. This includes code checkers that enforce restrictions on what coding practices are allowed inside the portions of the application that employ the CloudDSM system. It also includes debugging aids, to detect bugs and to narrow down the portion of the application code causing discrepancies from specified behavior.

-- Sean and INRIA and Douglas Connect will lead, with input from XLab, Imperial, and partners for WP3 tasks 2 through 6.


Comment and Question: "It may provide benefit desired by end-users if location-aware high-level annotations were also provided, such as regions, effects, etc. With these, the programmers will be able to communicate placement information. We found that for Myrmics, the programmer can know more about placement than the runtime/compiler can infer, and once placement/allocation is done properly, the shared-mem abstraction still helps with coding, except that the manual placement provides superior locality and performance. The programmer knows about and can say things about locality (like the X10 places, or regions in Fortress)." Question: "does it make sense to expose location within a Cloud-system that automatically and dynamically changes the number, type, and location of machines assigned to the computation?" Answer: you have nailed the heart of what this workpackage is all about. This will be an on-going discussion during the first six to nine months of the project. Indeed, one desire is to capture the understanding that the programmer has in their head and uses during the process of specifying pieces of work and choosing, themselves, where to place each piece. The goal of the WP is to discover an encoding of the process that the programmer does in their head, so that they encode that mental process, in a parameterized way. The automation then chooses the parameter values, and plugs them into what the programmer provided, which delivers pieces of work and the affinities among them. The WP content is the work of discovering programmer abstractions that get us as close to there as possible, in a way that we know how to implement..


Question: "any hints on what is
the common low-level form of the source that is produced by the development toolchain? is it annotated source code?" Answer: Figuring this out is the content of WP4. Albert would like a source code annotation form for project logistics reasons. In that case by-hand modification of existing OpenMP libraries can begin at once, and act as a test bed for rapid iterations of what is the best low-level form. At the same time, compiler techniques can be tried, and also at the same time high level end-user annotations can be tested for how the "feel" to the application programmers. The DSM-specializer can be worked on in tight iterations with figuring out what the best low-level representation should be.. any desired changes in representation are just done quickly by hand -- no need to fiddle with IR, and reasonably decoupled from high level annotation form..


!!! WP 5: Compiler Toolchain -- INRIA
* Task 1: participate in WP 4 task 2, as part of arriving at the interface that WP 5 will take as input.
* Task 2: Define intermediate, low level form of code annotation. The interfaces defined in WP 4 will be translated into this common lower level form.
* Task 3: Create tools that transform from each form of higher level code annotation into the common lower level code annotation form.
* Task 4: Create transform tools that translate from the common lower level form into the final C form of the code. The final C form includes OS calls, DSM runtime system calls, and synchronization calls that are inserted by the tool. The final C code has a form of the application that performs large chunks of work in-between calls to the DSM system. Each target hardware platform will require its own variation of the transform tool, which is tuned to the details of that hardware, especially communication details. The tool may produce a single multi-versioned binary, or it may include a runtime specializer, or it may generate many independent versions of the binary. A large portion of the research will involve determining the best approach.

FORTH will contribute their compiler work that recreates locality and placement information when the programmer doesn't explicitly declare it (this is used to replace remote accesses with DMA-ing of whole pages for performance).
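
To make Task 4's output more concrete: a purely hypothetical sketch of the shape the generated C could take, with a large chunk of work in-between DSM runtime calls. Every dsm_* name and signature here is invented (the real low-level form and runtime interface are exactly what WP 4 and WP 5 are to define), and single-node stubs are included only so the sketch compiles and runs:

[@
/* Hypothetical shape of transform-tool output: one large chunk of work
 * between DSM runtime calls. All dsm_* names are invented, with
 * single-node stubs so the sketch runs. */
#include <stddef.h>
#include <stdio.h>

/* --- invented DSM runtime interface (stubbed for one node) ----------- */
static size_t dsm_my_chunk(size_t total, size_t *begin) {
    *begin = 0;        /* stub: this node gets everything; a real runtime
                          would choose the chunk per load and locality */
    return total;
}
static void *dsm_acquire(void *region, size_t bytes) {
    (void)bytes;       /* real runtime: fetch/pin pages, maybe suspend */
    return region;
}
static void dsm_release(void *region) {
    (void)region;      /* real runtime: publish writes back to the DSM */
}

/* --- generated shape of the application code ------------------------- */
static void scale_vector(double *v, size_t n, double k) {
    size_t begin, len = dsm_my_chunk(n, &begin);
    double *local = dsm_acquire(v + begin, len * sizeof *local);
    for (size_t i = 0; i < len; i++)   /* big chunk between DSM calls */
        local[i] *= k;
    dsm_release(v + begin);
}

int main(void) {
    double v[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    scale_vector(v, 8, 10.0);
    printf("v[3] = %g\n", v[3]);       /* prints 40 */
    return 0;
}
@]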


Comment and Question: "For the IBM fat binary specializer, there are three stages (1) development stage: static generic compilation on developer machine which produces a custom IR form plus generic executable (2) static specialization compilation on a server, or during load, which generates a Power executable specialized to a specific HW (3) runtime fat-binary based recompilation on the actual deployed HW" Question: "How does this fit into CloudDSM?" Answer: Stage 1 will remain on the developer machine, stage 2 will take place inside the CloudDSM portal, and stage 3 inside the Cloud server during execution.

Question: "how will stage 2 fit with the DSM specific specializations?" Answer: this is an open question, to be resolved during the WP. We need some pictures, to figure out what tools do what at which point.
.
Changed lines 155-159 from:
Might there be a situation of "ping-pong" between the decisions the DSM makes and the decisions the cloud makes, resulting in a VM being constantly migrated back and forth?
Or are the VMs created and orchestrated by the DSM system not visible to the cloud?
I see a hierarchical division of work.. at the highest level, work is divided among physical locations.. then at a given location, the work it receives is divided among VMs created within that Cloud host. If the host allows entire machines to be allocated, in the way Gigas and Amazon do, then a further division of work can be performed, which the DSM system manages. Hence, only one level of work division takes place at the VM level.

For your tool, the optimization I see is in adjusting the size of work.. it's not clear whether the Dorit tool will make decisions about how to divide the work. Those decisions will likely happen inside the DSM runtime.. then it will be up to the Dorit tool to modify the code such that it actually performs work in the DSM-chosen size of chunk. We still need to work through the details of how the DSM runtime will interact with the Dorit tool :-)
to:
Comment: "Stage 1 happens inside the development environment on a desktop machine. The low-level annotated source is then sent to the CloudDSM portal by the developer. This process registers the application and makes it available for the end-user to run. This registration process also causes the low-level annotated source to be given to a specialization 'harness'. That harness invokes a number of specializer modules. One specializer module is provided by IBM. This module re-runs stage 1 and then runs stage 2 several times, once for each potential Power HW configuration that the CloudDSM system could send the fat-binary to (the module may, in fact cause stage 1 and stage 2 to run remotely on Power ISA machines, inside their own Cloud VM). Lastly, after the user starts the application and issues a request for computation, the portal deploys a unit of work to a Cloud VM running on a Power ISA machine. That Cloud VM has the DSM runtime in it, and that is given the unit of work. The unit of work includes a function within the fat binary to perform. The fat binary is dynamically linked to the DSM runtime. During execution, the work suspends and the binary optimizer takes over, modifies the code, then resumes the work. When the work reaches a DSM call, the DSM runtime suspends the execution context. That context will remain suspended while communication of data takes place. The DSM runtime will switch the CPU to a different context, whose communication has completed and is ready to resume."
Deleted lines 158-262:

-- If the cloud is aware of our VMs and can play with their allocation/resources, we may want to introduce some APIs between the DSM system and the cloud -- for better collaboration and/or to accept notifications from the cloud upon changes it makes.
Hmmm.. I'm not clear on the "our" in "our VMs"..? There is a portal that is responsible for creating VMs, via the Cloud API. Each VM is within one physical hosting location. The Cloud stacks work such that each location has a particular URL used for creating VMs within that location AFAIU. The portal will only use these Cloud stack supplied mechanisms to create VMs and manage their location/allocation.

Within a given VM, the DSM runtime will create its own "virtual processors", or VPs, which are equivalent to threads.. all VPs are inside the same hardware-enforced coherent virtual address space. The DSM will switch among these, in order to overlap communications to remote DSM instances, which are running inside different VMs.

There should only be one layer of VMs, which are created by the portal..

Perhaps we should talk more about how you see the binary optimizer interacting with these other parts?

-- It sounds like our focus is only on the application side. We don't care about best resource allocation from the cloud perspective. Such a dual-optimization-target problem is very difficult indeed and may be out of scope. I just wanted to make sure this is an informed decision (to focus only on the application side) and that the call doesn't expect the other aspect (general cloud utilization) to be considered as well.

The resource allocation will be done hierarchically. At the top level, the portal will cooperate with a DSM instance to choose the best division among locations, and then within a location to decide the best division among VMs. Erol Gelenbe will be handling the algorithms for this part of things.

Within a given VM, another level of work-division and resource allocation will be done, in isolation from the rest. This is the core-by-core level of allocation. The DSM runtime will handle this level of work-division, and the allocation of work-units to cores.

The Dorit tool may interact with the DSM system.. it's not clear yet just how they will interact..

7) About: "if the DSM detects that it is running late, it can issue a request to the deployment tool to
add more machines to the on-going computation. The DSM system re-calculates the division of work."

- What information is the DSM system going to base its decisions on? Maybe it would be useful if we provide feedback information (from monitoring the application from within the VM) to help the DSM system in this process?

That may, indeed, be helpful.. for x86, I was planning to use either Linux perf tool performance counters or simply the x86 time-stamp instruction that returns CPU cycles. The proto-runtime system is architected for such measurements..

I've been viewing the Dorit tool as something that interrupts work in progress, changes the executable, then resumes work in progress.. I haven't been seeing it as a runtime system that controls what work is assigned to what core.. was that how you've been thinking?

Wow.. awesome discussion :-) Thank you :-) This is how progress happens!




!!! WP 1: Coordination and Management -- XLab
-] Task1: to be filled in

!!! WP 2: Cloud Deployment tool -- XLab
* Task1: Develop portal -- gather specifications, produce design, make prototype, test with end-user, iterate
* Task2: Define provisioning tool -- interface with portal, with DSM runtime system, with end-user client, and with Cloud API for each kind of machine
* Task3: Develop provisioning tool -- manipulates each Cloud API
* Task4: Testing and performance tuning


!!! WP 3: DSM Runtime System -- Sean leads WP (at INRIA, or Bell Labs or CWI)
* Task1: Architecture of hierarchical DSM runtime system. Delivered runtime will be a federation. On each virtual machine, or in some cases physical machine, an individual runtime system will be in operation. Each of these individual runtime systems will interact with the others to form a collective, cooperative, overall runtime system, which presents a single address space abstraction. This task will define the architecture, interfaces, and protocols for this overall runtime system. Then for each class of hardware an individual task will implement the interfaces and protocols for that particular kind of hardware, as described below.
-- Sean leads task, partners for 2 through 6, plus Imperial, plus INRIA, plus XLab participate on aspects that involve their individual piece. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLab will advise on how the architecture impacts the deployment tool. Partners for 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLab)

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLab)

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLab

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLab

* Task6: binary runtime specializer -- IBM (Q: any chance to get an open source release? Or to apply it to other ISAs?). Use fat binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing for other ISAs as well.

* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, report to the partners who developed the individual runtimes in tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads -- partners for 2 through 6, plus XLab are involved.

* Question: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.


[[CloudDSM.RuntimeSpecializationWP]]

!!! WP 4: Application visible interface design
* Task 1: Define more precisely the class of applications that CloudDSM targets
* Task 2: Define the needs of the toolchain, what degrees of freedom it needs in order to accomplish the desired transforms of the source code.
* Task 3: Define the needs of the runtime system, what characteristics it needs in the code generated by the toolchain in order to deliver high performance.
* Task 4: Define the needs of the application developer, what mental models and what syntax, and what debugging and code-checking support they desire.
* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer. There will be two levels of code annotation.. one that high level application developers see and use, of which there will be many variations. For example, Reo will have a different high level user interface than the pragma system for OpenMP. The second, lower level will be common to all versions of the higher level interface. This will be used directly by the toolchain to perform code transforms. Each of the higher level forms will be translated into the same, common, lower level form. This task only considers the top level forms of the code. WP 5 separately defines the common lower level form.

Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.


Tasks 1 through 5 will be performed iteratively, with multiple revisions during the first six months of the project. Each of the tasks will have an impact on the other tasks, and it will require a large amount of communication, via iterations, in order to find a suitable common ground interface that supports all the aspects well.

* Task 6: Development tools to support the writing of application code. This includes code checkers that enforce restrictions on what coding practices are allowed inside the portions of the application that employ the CloudDSM system. It also includes debugging aids, to detect bugs and to narrow down the portion of the application code causing discrepancies from specified behavior.

-- Sean and INRIA and Douglas Connect will lead, with input from XLab, Imperial, and partners for WP3 tasks 2 through 6.


Comment and Question: "It may provide benefit desired by end-users if location-aware high-level annotations were also provided, such as regions, effects, etc. With these, the programmers will be able to communicate placement information. We found that for Myrmics, the programmer can know more about placement than the runtime/compiler can infer, and once placement/allocation is done properly, the shared-mem abstraction still helps with coding, except that the manual placement provides superior locality and performance. The programmer knows about and can say things about locality (like the X10 places, or regions in Fortress)." Question: "does it make sense to expose location within a Cloud-system that automatically and dynamically changes the number, type, and location of machines assigned to the computation?" Answer: you have nailed the heart of what this workpackage is all about. This will be an on-going discussion during the first six to nine months of the project. Indeed, one desire is to capture the understanding that the programmer has in their head and uses during the process of specifying pieces of work and choosing, themselves, where to place each piece. The goal of the WP is to discover an encoding of the process that the programmer does in their head, so that they encode that mental process, in a parameterized way. The automation then chooses the parameter values, and plugs them into what the programmer provided, which delivers pieces of work and the affinities among them. The WP content is the work of discovering programmer abstractions that get us as close to there as possible, in a way that we know how to implement..


Question: "any hints on what is the common low-level form of the source that is produced by the development toolchain? is it annotated source code?" Answer: Figuring this out is the content of WP4. Albert would like a source code annotation form for project logistics reasons. In that case by-hand modification of existing OpenMP libraries can begin at once, and act as a test bed for rapid iterations of what is the best low-level form. At the same time, compiler techniques can be tried, and also at the same time high level end-user annotations can be tested for how the "feel" to the application programmers. The DSM-specializer can be worked on in tight iterations with figuring out what the best low-level representation should be.. any desired changes in representation are just done quickly by hand -- no need to fiddle with IR, and reasonably decoupled from high level annotation form..


!!! WP 5: Compiler Toolchain -- INRIA
* Task 1: participate in WP 4 task 2, as part of arriving at the interface that WP 5 will take as input.
* Task 2: Define intermediate, low level form of code annotation. The interfaces defined in WP 4 will be translated into this common lower level form.
* Task 3: Create tools that transform from each form of higher level code annotation into the common lower level code annotation form.
* Task 4: Create transform tools that translate from the common lower level form into the final C form of the code. The final C form includes OS calls, DSM runtime system calls, and synchronization calls that are inserted by the tool. The final C code has a form of the application that performs large chunks of work in-between calls to the DSM system. Each target hardware platform will require its own variation of the transform tool, which is tuned to the details of that hardware, especially communication details. The tool may produce a single multi-versioned binary, or it may include a runtime specializer, or it may generate many independent versions of the binary. A large portion of the research will involve determining the best approach.

FORTH will contribute their compiler work that recreates locality and placement information when the programmer doesn't explicitly declare it (this is used to replace remote accesses with DMA-ing of whole pages for performance).


Comment and Question: "For the IBM fat binary specializer, there are three stages (1) development stage: static generic compilation on developer machine which produces a custom IR form plus generic executable (2) static specialization compilation on a server, or during load, which generates a Power executable specialized to a specific HW (3) runtime fat-binary based recompilation on the actual deployed HW" Question: "How does this fit into CloudDSM?" Answer: Stage 1 will remain on the developer machine, stage 2 will take place inside the CloudDSM portal, and stage 3 inside the Cloud server during execution.

Question: "how will stage 2 fit with the DSM specific specializations?" Answer: this is an open question, to be resolved during the WP. We need some pictures, to figure out what tools do what at which point..

Comment: "Stage 1 happens inside the development environment on a desktop machine. The low-level annotated source is then sent to the CloudDSM portal by the developer. This process registers the application and makes it available for the end-user to run. This registration process also causes the low-level annotated source to be given to a specialization 'harness'. That harness invokes a number of specializer modules. One specializer module is provided by IBM. This module re-runs stage 1 and then runs stage 2 several times, once for each potential Power HW configuration that the CloudDSM system could send the fat-binary to. Lastly, after the user starts the application and issues a request for computation, the portal deploys a unit of work to a Cloud VM running on a Power ISA machine. That Cloud VM has the DSM runtime in it, and that is given the unit of work. The unit of work includes a function within the fat binary to perform. The fat binary is dynamically linked to the DSM runtime. During execution, the work suspends and the binary optimizer takes over, modifies the code, then resumes the work. When the work reaches a DSM call, the DSM runtime suspends the execution context. That context will remain suspended while communication of data takes place. The DSM runtime will switch the CPU to a different context, whose communication has completed and is ready to resume."


The harness can just invoke your toolchain, from start to final fat binary, in whatever way it is that you do it right now.. except it may have to be either a cross-compiler that runs on x86 but generates Power binary.. or if that's a problem, we can give the specialization harness the ability to run specialization jobs on any of the VMs available to the CloudDSM system..
Changed lines 166-168 from:
* Douglas Connect will use the results of the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.
to:
* Douglas Connect will use the results of the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.

Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts?
February 26, 2014, at 03:21 PM by 192.16.201.181 - second Dorit question, in WP 5
Changed lines 20-60 from:

!!! WP 1: Coordination and Management -- XLab
-] Task1: to be filled in

!!! WP 2: Cloud Deployment tool -- XLab
* Task1: Develop portal -- gather specifications, produce design, make prototype, test with end-user, iterate
* Task2: Define provisioning tool -- interface with portal, with DSM runtime system, with end-user client, and with Cloud API for each kind of machine
* Task3: Develop provisioning tool -- manipulates each Cloud API
* Task4: Testing and performance tuning


!!! WP 3: DSM Runtime System -- Sean leads WP (at INRIA, or Bell Labs or CWI)
* Task1: Architecture of hierarchical DSM runtime system. Delivered runtime will be a federation. On each virtual machine, or in some cases physical machine, an individual runtime system will be in operation. Each of these individual runtime systems will interact with the others to form a collective, cooperative, overall runtime system, which presents a single address space abstraction. This task will define the architecture, interfaces, and protocols for this overall runtime system. Then for each class of hardware an individual task will implement the interfaces and protocols for that particular kind of hardware, as described below.
-- Sean leads task, partners for 2 through 6, plus Imperial, plus INRIA, plus XLab participate on aspects that involve their individual piece. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLab will advise on how the architecture impacts the deployment tool. Partners for 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLab)

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLab)

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLab

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLab

* Task6: binary runtime specializer -- IBM (Q: any chance to get an open source release? Or to apply it to other ISAs?). Use fat binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing for other ISAs as well.

* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, report to the partners who developed the individual runtimes in tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads -- partners for 2 through 6, plus XLab are involved.

* Question: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.


[[CloudDSM.RuntimeSpecializationWP]]

!!! WP 4: Application visible interface design
* Task 1: Define more precisely the class of applications that CloudDSM targets
* Task 2: Define the needs of the toolchain, what degrees of freedom it needs in order to accomplish the desired transforms of the source code.
* Task 3: Define the needs of the runtime system, what characteristics it needs in the code generated by the toolchain in order to deliver high performance.
* Task 4: Define the needs of the application developer, what mental models and what syntax, and what debugging and code-checking support they desire.
* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer. There will be two levels of code annotation.. one that high level application developers see and use, of which there will be many variations. For example, Reo will have a different high level user interface than the pragma system for OpenMP. The second, lower level will be common to all versions of the higher level interface. This will be used directly by the toolchain to perform code transforms. Each of the higher level forms will be translated into the same, common, lower level form. This task only considers the top level forms of the code. WP 5 separately defines the common lower level form.

Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.
to:
On Sun, Feb 23, 2014 at 6:32 AM, Dorit Nuzman <DORIT@il.ibm.com> wrote:



3) About: " 7.2.1. Sharing a VM among multiple applications." :
I'm wondering why this item is under IBM only? isn't it a generic issue?

Yes. It is a generic issue.. I think it was under IBM because OpenStack has an API extension that allows suspend/resume of VMs..
Added lines 29-181:
4) Still on the same topic:

" 7.2.1 Details still must be worked out around whether a
single VM will be used to perform computation tasks from multiple applications or not."

-- there's no problem running different applications on a single VM (I think I don't understand the question... there's a full OS running on the VM which of course can run multiple applications. Each application will be loaded with our special loader that runs the application as a thread within the same process as the optimizer and kicks off dynamic monitoring and recompilation. So there is one dynamic optimization process per application in a VM).

Sure.. the issue is security.. different end-users want their applications to be secure from each other.. a VM is completely isolated from all other VMs, which is what makes Cloud acceptable to corporates..

However, from an overhead perspective, sharing a single "worker" VM among multiple applications is the most efficient way to go.. so it's a balance between performance (overhead), security, and effort (i.e., adding security features to the CloudDSM runtime is extra effort)


" If not, then hopefully the IBM cloud stack includes a high performance way of suspending/putting to sleep a VM
and then quickly waking it (on the order of microseconds to suspend/resume).
When an application is started, the portal may over-provision, starting many VMs and DSMruntimes
inside them, and then put most of them to sleep.. only when a command comes in from the user-client
that the portal knows will require high computation does it wake up the VMs. A VM remains awake
for the duration of the request, expected to be on the order of seconds to at most a few minutes.
When the computation is done, the VMS is put back to sleep, until the application issues the next
high computation request/command."

-- AFAIU, the Power virtualization mechanism sort of takes care of this automatically under the covers. If there is competition for additional processing capacity among several VMs, the Hypervisor distributes unused processor capacity to the eligible VMs.

Also, to optimize physical processor utilization, a virtual processor (a fraction of a physical processor assigned to a VM) will yield its physical processor, through a POWER Hypervisor call, if it has no work to run. I think this is called Virtual Processor Folding: the ability to put idle virtual processors into a hibernation state so that they do not consume any resources. It is configurable via a set of commands. Every second, the kernel scheduler evaluates the number of virtual processors that should be activated to accommodate the physical utilization of the VM. If that evaluation yields a high virtual processor utilization, the base number of virtual processors required is incremented to enable the workload to expand. If it is less than the number of virtual processors that are currently activated, a virtual processor is deactivated.


Right.. the issue is the DSM runtime system.. it knows when it has no work, and knows when it can sleep.. but if there is no command to cause the VM to sleep, then the runtime just busy-waits, polling for work that never arrives.. The VM sees that busy-wait polling as application activity.. it doesn't know the difference.. so it consumes VM time for the busy waiting, which is charged to the end-user. And the end-user pays money for it. Cloud is pay-per-cpu-hour!

Perhaps there are other, better, mechanisms by which the DSM runtime can get the VM to turn off.. right now, it simply uses pthread yield.. but it's not clear that will be enough to get the VM to stop re-scheduling it..
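
To make the contrast concrete, here is a sketch of the two idling strategies, assuming a pthread-based runtime (the work-queue names are invented). The polling loop stays runnable, so the hypervisor keeps seeing activity; the blocking version sleeps in the kernel, which gives mechanisms like virtual processor folding a chance to engage:

[@
/* Two ways for a DSM runtime instance to idle when it has no work.
 * Build with -lpthread. A real runtime would use a proper work queue
 * and atomics; pending_work is a stand-in. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
static volatile int pending_work = 0;

/* Polling idle loop: yields the core, but stays runnable, so the VM
 * (and the hypervisor) still see activity and bill for it. */
void idle_polling(void) {
    while (pending_work == 0)
        sched_yield();
}

/* Blocking idle loop: the thread sleeps until work is enqueued. */
void idle_blocking(void) {
    pthread_mutex_lock(&q_lock);
    while (pending_work == 0)
        pthread_cond_wait(&q_nonempty, &q_lock);
    pthread_mutex_unlock(&q_lock);
}

/* Called by whoever hands this DSM instance a new unit of work. */
void enqueue_work(void) {
    pthread_mutex_lock(&q_lock);
    pending_work++;
    pthread_cond_signal(&q_nonempty);
    pthread_mutex_unlock(&q_lock);
}

static void *worker(void *arg) {
    (void)arg;
    idle_blocking();
    printf("worker: woke with work available\n");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(1);            /* worker is now blocked, consuming no CPU */
    enqueue_work();
    pthread_join(t, NULL);
    return 0;
}
@]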

From a higher level, the deployment tool knows exactly when there's no work for a specific DSM runtime instance, and can issue a suspend to the VM..


In addition there is a capability to Suspend a VM, which indicates it's in standby/hibernated state, and all of its resources can be used by other partitions. I don't know how long it takes to suspend/resume a VM. And there are also several pre-requisites for this capability that I think we don't have in our lab and we'll have to figure out how to make them available for the project. I'll try to find out... (I also want to understand when the automatic mechanisms I mentioned above are not sufficient, and we really need to actually suspend a VM...).

BTW, what do you think should be the size of our Power cloud for the project? Say we have a Power7 server with 8 cores, 4-way SMT each. Are 2 such servers sufficient? Or is it preferable to obtain many more/much larger servers?

Two servers should be enough to demonstrate the value.. maybe ask during a telco, to get opinions of others..


One thing to mention is that any VMs that we create/suspend/resume directly via the Power virtualization layer, and not via the cloud management layer, will not be visible to the cloud. If we did want to create/suspend VMs via the cloud management layer, that would take much longer.

Hmmm.. I must have missed something.. so, there are two separate virtualization layers? Does the CloudStack layer run on top of some IBM-specific virtualization? In any case, it's a performance thing, so we can get to it during the heart of the project.. won't be a fail-point, just be listed as a "todo detail, if we want to make the technology commercial"


5) About: "- 7.1 Gigas has a KVM based Cloud stack, which allows control over physical machines. An entire machine
can be given exclusive use for a given computation task, if desired. This allows, for example, and
application to allocate all 64 cores and the entire memory space of their largest machine. This is done
via the Cloud interface Gigas presents.
This will be useful for the CloudDSM runtime system, allowing it to take full advantage of the
available hardware, and minimize overhead imposed by the hypervisor switching among multiple users
sharing the machine."

... obviously this may be wasteful from the general cloud point of view... I assume we're not restricting ourselves to this scenario?



I'm not sure whether it would be wasteful.. This is how EC2 from Amazon works, for example.. it's important to get the entire machine assigned to high computation tasks.. It doesn't waste, because the machine is fully occupied by the computation..

This goes back to the question above.. performance vs security vs effort.. to get the performance, want this capability, to prevent waste, want to share among multiple applications.. but then get security issues into the picture.. and the added effort of security measures to ensure that separate applications can't interact..

This brings me to another question:

6) What do we assume in terms of the relationship between the cloud admin and the DSM system?
Do we view the cloud as just a pool of resources that once assigned to us we have exclusive control of?
The portal knows about the physical providers, and what hardware is at each, and what Cloud API commands to use to start VMs at them, and what Cloud API commands to use to learn how busy each location is. The DSM runtime can only report how much work it has inside it, but cannot know how busy the machines are.. hopefully the CloudStack API has some way of informing the portal about uptime, or load, within a given physical location. Then the portal can choose among the available providers, to decide where best to run the work.

Or might the cloud management layer (either automatically or via the cloud admin) move VMs around or reallocate resources (for workload consolidation, or other considerations)?

I would like for the portal/deployment-tool to have the ability to re-allocate work. I don't know whether CloudStack allows controlling which physical machine a VM is assigned to, via its API.. but the work can always be taken away from a poorly performing VM and given to a better performing one, or one started at a different physical location.
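
For what it's worth, CloudStack does appear to expose a root-admin API command, migrateVirtualMachine, for moving a running VM to a chosen host. A libcurl sketch of the call follows; the endpoint URL and UUIDs are placeholders, and the API-key/signature parameters that a real request must carry are omitted:

[@
/* Sketch: asking a CloudStack endpoint to migrate a VM to a chosen
 * host. Build with -lcurl. Real requests must also carry apikey and
 * an HMAC-SHA1 signature parameter; omitted here for brevity. */
#include <curl/curl.h>
#include <stdio.h>

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *h = curl_easy_init();
    if (!h) return 1;

    const char *url =
        "http://cloud.example.org:8080/client/api"
        "?command=migrateVirtualMachine"
        "&virtualmachineid=VM-UUID-HERE"   /* placeholder */
        "&hostid=HOST-UUID-HERE"           /* placeholder */
        "&response=json";

    curl_easy_setopt(h, CURLOPT_URL, url);
    CURLcode rc = curl_easy_perform(h);    /* response body -> stdout */
    if (rc != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

    curl_easy_cleanup(h);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
@]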

Is it possible that the cloud would want to move a VM from one core to another?
That is more inside the runtime system. It's too fine grained for the portal to manage. This is the reason that the KVM approach of giving the DSM the entire machine is good.. then the DSM runtime controls what work runs on what core.

Might there be a situation of "ping-pong" between the decisions the DSM makes and the decisions the cloud makes, resulting in a VM being constantly migrated back and forth?
Or are the VMs created and orchestrated by the DSM system not visible to the cloud?
I see a hierarchical division of work.. at the highest level, work is divided among physical locations.. then at a given location, the work it receives is divided among VMs created within that Cloud host. If the host allows entire machines to be allocated, in the way Gigas and Amazon do, then a further division of work can be performed, which the DSM system manages. Hence, only one level of work division takes place at the VM level.

For your tool, the optimization I see is in adjusting the size of work.. it's not clear whether the Dorit tool will make decisions about how to divide the work. Those decisions will likely happen inside the DSM runtime.. then it will be up to the Dorit tool to modify the code such that it actually performs work in the DSM-chosen size of chunk. We still need to work through the details of how the DSM runtime will interact with the Dorit tool :-)



-- If the cloud is aware of our VMs and can play with their allocation/resources, we may want to introduce some APIs between the DSM system and the cloud -- for better collaboration and/or to accept notifications from the cloud upon changes it makes.
Hmmm.. I'm not clear on the "our" in "our VMs"..? There is a portal that is responsible for creating VMs, via the Cloud API. Each VM is within one physical hosting location. The Cloud stacks work such that each location has a particular URL used for creating VMs within that location AFAIU. The portal will only use these Cloud stack supplied mechanisms to create VMs and manage their location/allocation.

Within a given VM, the DSM runtime will create its own "virtual processors", or VPs, which are equivalent to threads.. all VPs are inside the same hardware-enforced coherent virtual address space. The DSM will switch among these, in order to overlap communications to remote DSM instances, which are running inside different VMs.

There should only be one layer of VMs, which are created by the portal..

Perhaps we should talk more about how you see the binary optimizer interacting with these other parts?

-- It sounds like our focus is only on the application side. We don't care about best resource allocation from the cloud perspective. Such a dual-optimization-target problem is very difficult indeed and may be out of scope. I just wanted to make sure this is an informed decision (to focus only on the application side) and that the call doesn't expect the other aspect (general cloud utilization) to be considered as well.

The resource allocation will be done hierarchically. At the top level, the portal will cooperate with a DSM instance to choose the best division among locations, and then within a location to decide the best division among VMs. Erol Gelenbe will be handling the algorithms for this part of things.

Within a given VM, another level of work-division and resource allocation will be done, in isolation from the rest. This is the core-by-core level of allocation. The DSM runtime will handle this level of work-division, and the allocation of work-units to cores.
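
As a minimal sketch of one way such a proportional division could look, reusable at each level of the hierarchy (locations, then VMs, then cores).. the capacity scores and the largest-remainder rounding below are illustrative assumptions, not the project's chosen algorithm:

[@
/* Split a total number of work-units across n targets in proportion to
 * a measured capacity score. The same routine can be applied at each
 * level: locations, then VMs within a location, then cores. */
#include <stddef.h>
#include <stdio.h>

void split_work(size_t total, const double *capacity, size_t n, size_t *out) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += capacity[i];

    size_t assigned = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (size_t)(total * (capacity[i] / sum));  /* floor */
        assigned += out[i];
    }
    /* Hand leftover units (lost to rounding down) to the fastest target. */
    size_t fastest = 0;
    for (size_t i = 1; i < n; i++)
        if (capacity[i] > capacity[fastest]) fastest = i;
    out[fastest] += total - assigned;
}

int main(void) {
    double site_capacity[3] = { 64.0, 16.0, 32.0 };  /* e.g. core counts */
    size_t share[3];
    split_work(1000, site_capacity, 3, share);
    for (int i = 0; i < 3; i++)
        printf("site %d gets %zu units\n", i, share[i]);
    return 0;
}
@]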

The Dorit tool may interact with the DSM system.. it's not clear yet just how they will interact..

7) About: "if the DSM detects that it is running late, it can issue a request to the deployment tool to
add more machines to the on-going computation. The DSM system re-calculates the division of work."

- What information is the DSM system going to base its decisions on? Maybe it would be useful if we provide feedback information (from monitoring the application from within the VM) to help the DSM system in this process?

That may, indeed, be helpful.. for x86, I was planning to use either Linux perf tool performance counters or simply the x86 time-stamp instruction that returns CPU cycles. The proto-runtime system is architected for such measurements..
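
A sketch of both x86 timing options mentioned here, assuming GCC/Clang on x86-64; do_work_chunk is a stand-in for a DSM-assigned unit of work:

[@
/* Timing a chunk two ways: the time-stamp counter via the compiler
 * intrinsic, and (portably) clock_gettime. x86-64, GCC/Clang. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Stand-in for a DSM-assigned unit of work. */
static void do_work_chunk(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 1000000; i++) x += i * 0.5;
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();             /* cycles before the chunk */

    do_work_chunk();

    uint64_t c1 = __rdtsc();             /* cycles after the chunk */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("chunk took %llu cycles (~%.0f ns)\n",
           (unsigned long long)(c1 - c0), ns);
    return 0;
}
@]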

I've been viewing the Dorit tool as something that interrupts work in progress, changes the executable, then resumes work in progress.. I haven't been seeing it as a runtime system that controls what work is assigned to what core.. was that how you've been thinking?

Wow.. awesome discussion :-) Thank you :-) This is how progress happens!




!!! WP 1: Coordination and Management -- XLab
-] Task1: to be filled in

!!! WP 2: Cloud Deployment tool -- XLab
* Task1: Develop portal -- gather specifications, produce design, make prototype, test with end-user, iterate
* Task2: Define provisioning tool -- interface with portal, with DSM runtime system, with end-user client, and with Cloud API for each kind of machine
* Task3: Develop provisioning tool -- manipulates each Cloud API
* Task4: Testing and performance tuning


!!! WP 3: DSM Runtime System -- Sean leads WP (at INRIA, or Bell Labs or CWI)
* Task1: Architecture of hierarchical DSM runtime system. Delivered runtime will be a federation. On each virtual machine, or in some cases physical machine, an individual runtime system will be in operation. Each of these individual runtime systems will interact with the others to form a collective, cooperative, overall runtime system, which presents a single address space abstraction. This task will define the architecture, interfaces, and protocols for this overall runtime system. Then for each class of hardware an individual task will implement the interfaces and protocols for that particular kind of hardware, as described below.
-- Sean leads task, partners for 2 through 6, plus Imperial, plus INRIA, plus XLab participate on aspects that involve their individual piece. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLab will advise on how the architecture impacts the deployment tool. Partners for 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLab)

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLab)

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLab

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLab

* Task6: binary runtime specializer -- IBM (Q: any chance to get an open source release? Or to apply it to other ISAs?). Use fat binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementing for other ISAs as well.

* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, report to the partners who developed the individual runtimes in tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads -- partners for 2 through 6, plus XLab are involved.

* Question: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.


[[CloudDSM.RuntimeSpecializationWP]]

!!! WP 4: Application visible interface design
* Task 1: Define more precisely the class of applications that CloudDSM targets
* Task 2: Define the needs of the toolchain, what degrees of freedom it needs in order to accomplish the desired transforms of the source code.
* Task 3: Define the needs of the runtime system, what characteristics it needs in the code generated by the toolchain in order to deliver high performance.
* Task 4: Define the needs of the application developer, what mental models and what syntax, and what debugging and code-checking support they desire.
* Task 5: Integrate the results of tasks 1 through 4 into a specification of the interfaces used by the application developer. There will be two levels of code annotation.. one that high level application developers see and use, of which there will be many variations. For example, Reo will have a different high level user interface than the pragma system for OpenMP. The second, lower level will be common to all versions of the higher level interface. This will be used directly by the toolchain to perform code transforms. Each of the higher level forms will be translated into the same, common, lower level form. This task only considers the top level forms of the code. WP 5 separately defines the common lower level form.

Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.

Added lines 192-194:
Question: "any hints on what is the common low-level form of the source that is produced by the development toolchain? is it annotated source code?" Answer: Figuring this out is the content of WP4. Albert would like a source code annotation form for project logistics reasons. In that case by-hand modification of existing OpenMP libraries can begin at once, and act as a test bed for rapid iterations of what is the best low-level form. At the same time, compiler techniques can be tried, and also at the same time high level end-user annotations can be tested for how the "feel" to the application programmers. The DSM-specializer can be worked on in tight iterations with figuring out what the best low-level representation should be.. any desired changes in representation are just done quickly by hand -- no need to fiddle with IR, and reasonably decoupled from high level annotation form..
Added lines 202-211:


Comment and Question: "For the IBM fat binary specializer, there are three stages (1) development stage: static generic compilation on developer machine which produces a custom IR form plus generic executable (2) static specialization compilation on a server, or during load, which generates a Power executable specialized to a specific HW (3) runtime fat-binary based recompilation on the actual deployed HW" Question: "How does this fit into CloudDSM?" Answer: Stage 1 will remain on the developer machine, stage 2 will take place inside the CloudDSM portal, and stage 3 inside the Cloud server during execution.

Question: "how will stage 2 fit with the DSM specific specializations?" Answer: this is an open question, to be resolved during the WP. We need some pictures, to figure out what tools do what at which point..

Comment: "Stage 1 happens inside the development environment on a desktop machine. The low-level annotated source is then sent to the CloudDSM portal by the developer. This process registers the application and makes it available for the end-user to run. This registration process also causes the low-level annotated source to be given to a specialization 'harness'. That harness invokes a number of specializer modules. One specializer module is provided by IBM. This module re-runs stage 1 and then runs stage 2 several times, once for each potential Power HW configuration that the CloudDSM system could send the fat-binary to. Lastly, after the user starts the application and issues a request for computation, the portal deploys a unit of work to a Cloud VM running on a Power ISA machine. That Cloud VM has the DSM runtime in it, and that is given the unit of work. The unit of work includes a function within the fat binary to perform. The fat binary is dynamically linked to the DSM runtime. During execution, the work suspends and the binary optimizer takes over, modifies the code, then resumes the work. When the work reaches a DSM call, the DSM runtime suspends the execution context. That context will remain suspended while communication of data takes place. The DSM runtime will switch the CPU to a different context, whose communication has completed and is ready to resume."


The harness can just invoke your toolchain, from start to final fat binary, in whatever way it is that you do it right now.. except it may have to be either a cross-compiler that runs on x86 but generates Power binary.. or if that's a problem, we can give the specialization harness the ability to run specialization jobs on any of the VMs available to the CloudDSM system..
February 26, 2014, at 02:30 PM by 192.16.201.181 -
Changed lines 15-28 from:
-] where do development tools, such as an Eclipse plugin and code style checker, fit?

-] where is the best place for the research regarding the best division of work and the best deployment of it onto available hardware?


Also, on the single-address space abstraction: I understand that DSM abstraction is good for programmers, but it may be better to plan for high-level annotations (like regions, effects, etc) with which the programmers will be able to communicate placement information. We found that for Myrmics, the programmer can know more about placement than the runtime/compiler can infer, and once placement/allocation is done properly, the shared-mem abstraction still helps with coding, except that you also get locality and performance. Of course, it's not exactly shared-mem abstraction, because the programmer knows and can say things about locality (like the X10 places, or regions in Fortress).

Also, maybe we can participate in the compilation part (WP 5); we already have some tools trying to recreate locality and placement information (when the programmer doesn't explicitly declare it), to replace remote accesses with DMA-ing of whole pages for performance.

Cheers,
-polyvios
to:
-] which WP to place development tools in, such as an Eclipse plugin and code style checker?

-] which WP to place the research regarding the best division of work and the best deployment of it onto available hardware?
Changed lines 47-51 from:
Questions:

* Q: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.
to:
* Question: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.
Added line 60:
Added lines 69-71:
Comment and Question: "It may provide a benefit desired by end-users if location-aware high-level annotations were also provided, such as regions, effects, etc. With these, the programmers will be able to communicate placement information. We found that for Myrmics, the programmer can know more about placement than the runtime/compiler can infer, and once placement/allocation is done properly, the shared-mem abstraction still helps with coding, except that the manual placement provides superior locality and performance. The programmer knows about and can say things about locality (like the X10 places, or regions in Fortress)." Question: "Does it make sense to expose location within a Cloud system that automatically and dynamically changes the number, type, and location of machines assigned to the computation?" Answer: You have nailed the heart of what this workpackage is all about. This will be an ongoing discussion during the first six to nine months of the project. Indeed, one desire is to capture the understanding that the programmer has in their head and uses during the process of specifying pieces of work and choosing, themselves, where to place each piece. The goal of the WP is to discover an encoding of that process, so that the programmer expresses those mental steps in a parameterized way. The automation then chooses the parameter values and plugs them into what the programmer provided, which delivers the pieces of work and the affinities among them. The WP content is the work of discovering programmer abstractions that get us as close to there as possible, in a way that we know how to implement.
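As one illustration of "encoding the mental process in a parameterized way", consider this C sketch; every name in it is invented here, since discovering the real abstraction is exactly the content of the WP:

[@
#include <stddef.h>

/* A piece of work plus an affinity hint; hypothetical names. */
typedef struct WorkPiece {
    size_t first_row, last_row;  /* slice of the iteration space         */
    int    affinity_group;       /* pieces in one group want co-location */
} WorkPiece;

/* The programmer writes the division logic once, parameterized; the
   automation later picks num_pieces and pieces_per_group to match the
   machines it has provisioned at that moment. */
void divide_rows(size_t total_rows, int num_pieces, int pieces_per_group,
                 WorkPiece pieces[]) {
    size_t chunk = (total_rows + num_pieces - 1) / num_pieces;
    for (int i = 0; i < num_pieces; i++) {
        size_t end = (size_t)(i + 1) * chunk;
        pieces[i].first_row      = (size_t)i * chunk;
        pieces[i].last_row       = (end < total_rows ? end : total_rows) - 1;
        pieces[i].affinity_group = i / pieces_per_group;
    }
}
@]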
Added lines 77-79:

FORTH will contribute their compiler work that recreates locality and placement information when the programmer doesn't explicitly declare it (this is used to replace remote accesses with DMA transfers of whole pages, for performance).
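The effect of that transformation can be sketched in C as follows; dsm_dma_fetch_page and PAGE_SIZE are placeholder names used only for illustration:

[@
#include <stddef.h>

#define PAGE_SIZE 4096

/* Placeholder: one bulk DMA transfer of a whole remote page. */
extern void dsm_dma_fetch_page(void *local_buf, const void *remote_page);

/* After the transform: instead of many individual remote accesses,
   one DMA brings the page into a local buffer, and every access in
   the loop below is purely local. */
double sum_one_page(const double *remote_page) {
    double local[PAGE_SIZE / sizeof(double)];
    dsm_dma_fetch_page(local, remote_page);
    double s = 0.0;
    for (size_t i = 0; i < PAGE_SIZE / sizeof(double); i++)
        s += local[i];
    return s;
}
@]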
February 26, 2014, at 02:08 PM by 192.16.201.181 -
Added line 14:
Added line 16:
Added lines 20-28:
Also, on the single-address-space abstraction: I understand that the DSM abstraction is good for programmers, but it may be better to plan for high-level annotations (like regions, effects, etc.) with which the programmers will be able to communicate placement information. We found that for Myrmics, the programmer can know more about placement than the runtime/compiler can infer, and once placement/allocation is done properly, the shared-mem abstraction still helps with coding, except that you also get locality and performance. Of course, it's not exactly a shared-mem abstraction, because the programmer knows about locality and can say things about it (like the X10 places, or regions in Fortress).

Also, maybe we can participate in the compilation part (WP 5); we already have some tools that try to recreate locality and placement information (when the programmer doesn't explicitly declare it), to replace remote accesses with DMA-ing of whole pages for performance.

Cheers,
-polyvios

Added lines 55-59:
Questions:

* Q: "Is the runtime going to be virtualized itself? Why not use standard bytecodes?" A: There won't be any bytecodes. CloudDSM is focused on high performance, and so the DSM runtime for many of the machines will be based on proto-runtime, due to its high performance and productivity benefits. Proto-runtime creates the equivalent of user-level "threads", which are called virtual processors because they capture a CPU context, and a CPU context virtualizes the bare processor pipeline. One of these VPs consists of its own stack, plus a data structure that holds a saved stack pointer, frame pointer, and program-counter contents. That set of state equals a point within the CPU's computation, and it can be used to switch the CPU around among different computation timelines. Each of those timelines behaves like its own, virtual, CPU processor pipeline.. hence, the name "virtual processor". But it's the lowest level, most bare bones "virtualization" possible.. there is no bytecode, no translation between ISAs, nothing but multiple CPU contexts.
Added lines 68-69:

Question: "Why use OpenMP?" A: Many existing parallel libraries are written in OpenMP. One early project goal was to hit the ground running with code that is relevant to industry and readily usable by industry. Using OpenMP libraries during development of the low-level interface allows compiler work to begin almost immediately on relevant, ready-to-run code.
Changed line 91 from:
* Douglas Connect will use the results of the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.
to:
* Douglas Connect will use the results of the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.
February 23, 2014, at 03:50 PM by 80.114.134.224 -
Added lines 43-44:

[[CloudDSM.RuntimeSpecializationWP]]
February 23, 2014, at 05:30 AM by 80.114.134.224 -
Added lines 1-71:
!!Work packages:

List of workpackages, and the leader of each WP

* WP 1: Coordination and Management -- XLab
* WP 2: Cloud Deployment tool -- XLab
* WP 3: DSM Runtime System -- Sean
* WP 4: Application visible interface Design -- Sean
* WP 5: Toolchain -- INRIA
* WP 6: End User Application -- Douglas Connect
* WP 7: Dissemination and Exploitation -- TBD

Open questions:
-] where do development tools, such as the Eclipse plugin and the code style checker, fit?
-] where is the best place for the research on the best division of work and the best deployment of that work onto the available hardware?


!!! WP 1: Coordination and Management -- XLab
-] Task1: to be filled in

!!! WP 2: Cloud Deployment tool -- XLab
* Task1: Develop portal -- gather specifications, produce design, make prototype, test with end-user, iterate
* Task2: Define provisioning tool -- interface with portal, with DSM runtime system, with end-user client, and with Cloud API for each kind of machine
* Task3: Develop provisioning tool -- manipulates each Cloud API
* Task4: Testing and performance tuning


!!! WP 3: DSM Runtime System -- Sean leads WP (at INRIA, or Bell Labs or CWI)
* Task1: Architecture of the hierarchical DSM runtime system. The delivered runtime will be a federation. On each virtual machine, or in some cases physical machine, an individual runtime system will be in operation. Each of these individual runtime systems will interact with the others to form a collective, cooperative, overall runtime system, which presents a single-address-space abstraction. This task will define the architecture, interfaces, and protocols for this overall runtime system (a sketch of one possible inter-runtime message format follows this task list). Then, for each class of hardware, an individual task will implement the interfaces and protocols for that particular kind of hardware, as described below.
-- Sean leads the task; the partners for Tasks 2 through 6, plus Imperial, INRIA, and XLab, participate on the aspects that involve their individual pieces. INRIA will advise on how the architecture choices impact the toolchain design and implementation. Imperial will advise on how the choices impact the algorithms for runtime work division and deployment. XLab will advise on how the architecture impacts the deployment tool. The partners for Tasks 2 through 6 will provide input on how the architecture impacts the individual runtime system they are implementing.

* Task2: DSM runtime system on Kalray HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- Kalray (with involvement by Sean and XLab)

* Task3: DSM runtime system on FORTH Cube HW that presents an API that is manipulatable by WP2 Cloud deployment tool -- FORTH (with involvement by Sean and XLab)

* Task4: DSM runtime system on a shared memory x86 based Cloud server instance -- Sean with involvement by XLab

* Task5: DSM runtime system on a shared memory Power based Cloud server instance -- IBM plus Sean with involvement by XLab

* Task6: binary runtime specializer -- IBM (Q: any chance of an open-source release? Or of applying it to other ISAs?). Use fat-binary runtime optimization techniques to adjust the granularity and layout during execution. This is specific to the IBM Power architecture, but it provides the blueprint for implementations on other ISAs as well.

* Task7: Integration testing. Deploy the various individual runtime systems on specific machines. Write test cases whose purpose is to expose performance issues and bugs. Execute the test cases, record the performance, and report to the partners who developed the individual runtimes in Tasks 2 through 6. Work with the partners to determine the reasons for particular performance anomalies, perhaps writing and running specific tests interactively. The partners for Tasks 2 through 6 will use the results to modify their individual runtime system implementations. -- Gigas leads; the partners for Tasks 2 through 6, plus XLab, are involved.
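To give a feel for what the Task1 protocols might contain, here is a minimal C sketch of one possible inter-runtime message format; every name is a hypothetical placeholder for what the task will actually define:

[@
#include <stdint.h>

typedef enum {
    MSG_PAGE_READ_REQ,   /* ask the current owner for a read copy  */
    MSG_PAGE_WRITE_REQ,  /* ask for exclusive ownership            */
    MSG_PAGE_DATA,       /* reply carrying the page contents       */
    MSG_INVALIDATE       /* revoke cached copies before a write    */
} DsmMsgKind;

typedef struct DsmMsg {
    DsmMsgKind kind;
    uint32_t   sender_node;  /* which individual runtime sent this       */
    uint64_t   global_addr;  /* address in the single shared space       */
    uint32_t   length;       /* bytes of payload that follow             */
    uint8_t    payload[];    /* page data; present only in MSG_PAGE_DATA */
} DsmMsg;
@]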

!!! WP 4: Application visible interface design
* Task 1: Define more precisely the class of applications that CloudDSM targets
* Task 2: Define the needs of the toolchain: what degrees of freedom it needs in order to accomplish the desired transforms of the source code.
* Task 3: Define the needs of the runtime system: what characteristics it needs in the code generated by the toolchain in order to deliver high performance.
* Task 4: Define the needs of the application developer: what mental models, what syntax, and what debugging and code-checking support they desire.
* Task 5: Integrate the results of Tasks 1 through 4 into a specification of the interfaces used by the application developer. There will be two levels of code annotation: one that high-level application developers see and use, of which there will be many variations (for example, Reo will have a different high-level user interface than the pragma system for OpenMP); and a second, lower level that is common to all versions of the higher-level interface. This lower level will be used directly by the toolchain to perform code transforms. Each of the higher-level forms will be translated into the same, common, lower-level form. This task only considers the top-level forms of the code; WP 5 separately defines the common lower-level form. (A sketch of the two levels appears at the end of this WP.)

Tasks 1 through 5 will be performed iteratively, with multiple revisions during the first six months of the project. Each of the tasks will have an impact on the other tasks, and it will require a large amount of communication, via iterations, in order to find a suitable common ground interface that supports all the aspects well.

* Task 6: Development tools to support the writing of application code. This includes code checkers that enforce restrictions on what coding practices are allowed inside the portions of the application that employ the CloudDSM system. It also includes debugging aids, to detect bugs and to narrow down the portion of the application code causing discrepancies from specified behavior.

-- Sean, INRIA, and Douglas Connect will lead, with input from XLab, Imperial, and the partners for WP3 Tasks 2 through 6.
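As a purely hypothetical illustration of Task 5's two levels (both spellings are invented here, since defining them is the work of this WP and of WP 5):

[@
/* High level, as an application developer might write it: */
#pragma clouddsm parallel divide(rows) locality(by_block)
void smooth(double *grid, int rows, int cols);

/* Common lower level, as the WP 5 toolchain would consume it: the
   high-level form above is translated into explicit piece-creation
   and affinity declarations. */
#pragma clouddsm_low pieces(divide_rows) affinity(block) sync(barrier)
void smooth_low(double *grid, int rows, int cols);
@]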


!!! WP 5: Compiler Toolchain -- INRIA
* Task 1: participate in WP 4 task 2, as part of arriving at the interface that WP 5 will take as input.
* Task 2: Define intermediate, low level form of code annotation. The interfaces defined in WP 4 will be translated into this common lower level form.
* Task 3: Create tools that transform from each form of higher level code annotation into the common lower level code annotation form.
* Task 4: Create transform tools that translate from the common lower-level form into the final C form of the code. The final C form includes OS calls, DSM runtime system calls, and synchronization calls that are inserted by the tool. The final C code is a form of the application that performs large chunks of work in between calls to the DSM system. Each target hardware platform will require its own variation of the transform tool, which is tuned to the details of that hardware, especially communication details. The tool may produce a single multi-versioned binary, or it may include a runtime specializer, or it may generate many independent versions of the binary. A large portion of the research will involve determining the best approach.
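A minimal sketch of what the final C form of one transformed kernel might look like; every dsm_* call is a placeholder for the runtime interface that WP 3 defines, not an existing API:

[@
/* Placeholders for the DSM runtime calls that the tool inserts. */
extern void dsm_acquire(void *addr, unsigned long bytes); /* fetch region   */
extern void dsm_release(void *addr, unsigned long bytes); /* publish writes */
extern void dsm_barrier(void);                            /* sync pieces    */

/* One piece of a smoothing kernel: a large chunk of purely local work
   bracketed by DSM calls, plus tool-inserted synchronization. */
void smooth_piece(double *grid, long first_row, long last_row, long cols) {
    unsigned long bytes =
        (unsigned long)(last_row - first_row + 1) * cols * sizeof(double);
    dsm_acquire(&grid[first_row * cols], bytes);
    for (long r = first_row; r <= last_row; r++)
        for (long c = 1; c < cols - 1; c++)
            grid[r * cols + c] =
                0.5 * (grid[r * cols + c - 1] + grid[r * cols + c + 1]);
    dsm_release(&grid[first_row * cols], bytes);
    dsm_barrier();
}
@]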

!!! WP 6: End User Application -- Douglas Connect
* Task1: divide the application into a user client, computation kernels, and work-division code
* Task2: mock up using annotations
* Task3: employ the various interfaces

!!! WP 7: Dissemination and Exploitation
* Gigas will make the results of the project available to its customers
* Douglas Connect will use the results of the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.