Work Package 2: CloudDSM portal

  • Task1: Develop portal prototype (XLab, ICL, DC, Gigas, Sean) -- Participate in a rapid prototyping process together with members of WP 3 -- DSM runtime, WP 4 -- annotations, WP 5 -- transform tools, and WP 7 -- end application. Together, get a minimum viable system up and running by iterating on the cycle of gathering specifications, producing a design, implementing, and testing with the end users. This will create a high degree of communication between partners early in the project, to uncover hidden assumptions, develop shared ideas about the approach, define clean interfaces and interactions between partners, workpackages, and deliverable pieces, and increase the smoothness and pace of progress for the rest of the project.

This task includes several revisions of the specifications for the portal's functionality, internal structure, external interfaces, and the protocols for communicating with it. A new incremental revision will be delivered each month, refining what was delivered in the previous month. Each prototype will then be available for the other WPs to work with, and their feedback will direct the next prototype. The first deliverable will consist mainly of interfaces, with dummy modules inside that return "canned" results matching the interfaces, without computing anything. The first prototype's interfaces may be incomplete, leaving out functionality that will be introduced over time in subsequent versions, such as the interface for gathering scheduling-related information. The monthly delivery of new and improved prototypes will continue until month 8, at which point the candidate for the final interfaces will be delivered, with minimum viable functionality implemented inside; the functionality will consist of simple, fast-to-implement versions, taken from previous EU projects or open source, or created fresh, whichever path proves fastest during development.
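
To make the month-1 deliverable concrete, the following is a minimal sketch of what a "canned response" portal stub could look like; the class name, method names, and response fields are illustrative placeholders, not the agreed interface.

```python
# Minimal sketch of a month-1 portal stub: the interface shape is exercised,
# but every call returns a fixed, "canned" result instead of computing anything.
# All names and fields here are illustrative placeholders, not the agreed interface.

class PortalStub:
    def submit_request(self, app_id: str, command: str, payload: dict) -> dict:
        # A real portal would schedule and deploy work; the stub just returns
        # a fixed response so the other work packages can code against the shape.
        return {"request_id": "req-0001", "status": "accepted"}

    def query_status(self, request_id: str) -> dict:
        return {"request_id": request_id, "status": "done", "result_uri": "dummy://result"}

    def report_resources(self) -> dict:
        # Canned view of hardware state; later prototypes replace this with live data.
        return {"providers": [{"name": "provider-A", "free_cores": 32, "load": 0.1}]}


if __name__ == "__main__":
    portal = PortalStub()
    ticket = portal.submit_request("demo-app", "render", {"size": "small"})
    print(ticket, portal.query_status(ticket["request_id"]))
```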

The portal contains a specialization harness, a scheduler, a deployer, a repository of specialized binaries, a repository of application characteristics relevant to scheduling, statistics on application resource usage, and a continuously updated view of the state of the hardware resources. The portal receives instructions directly from human users, and also engages in a protocol with the front-end of applications. It observes resources and interacts with lower system levels, observes running applications, and makes decisions for tasks. As such, the portal contains a scheduler that self-optimises the system in real time. The portal and scheduler must therefore be efficient and avoid complex computations. The portal will achieve responsiveness by being highly distributed and by exploiting inputs from multiple agents that operate simultaneously at different time scales and around different resources, sharing the gathered information with the scheduler. This philosophy will drive the development of the portal: it is made of many distributed, concurrent, lightweight components with only lightweight synchronisation, while maximising information sharing.
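
One possible reading of this philosophy is sketched below: independent lightweight agents observe on their own time scales and publish into a shared store that the scheduler samples. The agent names, update intervals, and published values are illustrative assumptions only.

```python
# Sketch: concurrent lightweight agents publish observations into a shared,
# lock-light store that the scheduler samples. Agent names, intervals and the
# published values are illustrative assumptions.
import threading
import time
import random


class SharedState:
    """Latest-value store; one short lock per update keeps synchronisation light."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def publish(self, key, value):
        with self._lock:
            self._data[key] = value

    def snapshot(self):
        with self._lock:
            return dict(self._data)


def hardware_monitor(state, stop):
    while not stop.is_set():          # fast time scale: machine load
        state.publish("hw_load", random.random())
        time.sleep(0.1)


def app_stats_monitor(state, stop):
    while not stop.is_set():          # slower time scale: per-application statistics
        state.publish("app_runtime_avg_s", 2 + random.random())
        time.sleep(0.5)


def scheduler_tick(state):
    view = state.snapshot()           # scheduler reads a consistent snapshot
    return "scale_out" if view.get("hw_load", 0) > 0.8 else "steady"


if __name__ == "__main__":
    state, stop = SharedState(), threading.Event()
    threads = [threading.Thread(target=f, args=(state, stop), daemon=True)
               for f in (hardware_monitor, app_stats_monitor)]
    for t in threads:
        t.start()
    time.sleep(1)
    print("decision:", scheduler_tick(state))
    stop.set()
```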

  * Deliverable D2.1: Month 1 -- XLab, with input from INRIA, DC, LB, IBM, Passau, and Gigas -- a simple set of interfaces to the portal, where the portal delivers "canned" responses when the interfaces are exercised.  Upon delivery, these interfaces become available to the other workpackages, and feedback from them will drive design of the months 2 and 3 deliverables of this WP.  The interfaces will be documented in a convenient form for the other WP participants to understand and use.

  * Deliverable D2.2: Month 2  -- XLab, with input from INRIA, DC, LB, IBM, Passau, and Gigas -- improved interfaces, with some functionality filled in, as determined by circumstances uncovered during the project.  Upon delivery, these interfaces become available to the other workpackages, and feedback from them will drive design of the months 3 and 4 deliverables of this WP.  The interfaces will be documented in a convenient form for the other WP participants to understand and use.

  * Deliverable D2.3: Month 3  -- XLab, with input from INRIA, DC, LB, IBM, Passau, and Gigas -- improved interfaces, with some actual functionality filled in, as determined by circumstances uncovered during the project.  Upon delivery, these interfaces become available to the other workpackages, and feedback from them will drive design of the months 4 and 5 deliverables of this WP.  The interfaces will be documented in a convenient form for the other WP participants to understand and use.

  * Deliverable D2.4: Month 4  -- XLab, with input from INRIA, DC, LB, IBM, Passau, and Gigas -- interfaces should be starting to stabilize in this version, and more functionality is filled in, as determined by circumstances uncovered during the project.  Upon delivery, the interfaces become available to the other workpackages, and feedback from them will drive design of the months 5 and 6 deliverables of this WP.  The interfaces will be documented in a convenient form for the other WP participants to understand and use.

  * Deliverable D2.5: Month 5  -- XLab, with input from INRIA, DC, LB, IBM, Passau, and Gigas -- interfaces should be nearly stable in this version, and more functionality is filled in, as determined by circumstances uncovered during the project.  Upon delivery, the interfaces become available to the other workpackages, and feedback from them will drive design of the month 6 deliverables of this WP.  Month 6 should be the last deliverable in which the interfaces change in any major way. The interfaces will be documented in a convenient form for the other WP participants to understand and use.

  * Deliverable D2.6: Month 6  -- XLab, with input from INRIA, DC, LB, IBM, Passau, and Gigas -- the last prototype in which the interfaces may change in major ways.  Some version of most functionality should be filled in, missing only where circumstances arising within the project cause delays.  Upon delivery, the interfaces become available to the other workpackages, as a candidate for the final interfaces to the portal.  The interfaces will be documented in a convenient form for the other WP participants to understand and use.

  * Deliverable D2.7: Month 7  -- XLab, with input from INRIA, DC, LB, IBM, Passau, and Gigas -- Nearly all functionality should have some form of implementation, missing only where circumstances arising within the project cause delays.  The interfaces will be documented in a convenient form for the other WP participants to understand and use.

  * Deliverable D2.8: Month 8  -- XLab, with input from INRIA, DC, LB, IBM, Passau, and Gigas -- a working, simple prototype of the portal, with simple implementations of each subsystem -- code and related artifacts of a working simple portal prototype (ready for use by DC and INRIA)
  • Task2: Harden the internals of the portal. Integrate deliverables from other work packages into the portal infrastructure, such as the scheduler delivered by work package 6. Improve the functionality of the simple elements inside the prototype.
    • Deliverable D2.9: Month 18 -- A working portal that includes a working Scheduler, Specialization harness, executable repository, application statistics gathering, hardware statistics monitoring, and deployment onto a given Cloud stack -- Deliver the working code base plus a technical report, and other artifacts related to the implementation of the portal and supporting subsystems.
    • Deliverable D2.10: Month 30 -- Deliver the final portal, ready for integration and testing of the final deliverables from the other work packages, such as the low level annotation transform tool, and the IBM toolchain that generates a fat binary. Deliver a technical report, code, and other artifacts related to the implementation of the portal and supporting subsystems.
  • Task3: Testing and performance tuning of the full CloudDSM system (IBM, DC, LB, INRIA, CWI, Gigas, ICL, XLab). In this task, a working version of the CloudDSM portal is tested and its performance is measured when running applications from DC and LB.
    • Deliverable D2.11: Month 12 -- Report on the results of testing and measuring performance of the initial end-user apps from DC and LB, when run on the prototype system, with appropriate pieces from other work packages integrated.
    • Deliverable D2.12: Month 24 -- Report on the results of testing and measuring performance of the initial end-user apps from DC and LB, when run on the prototype system, with appropriate pieces from other work packages integrated.
    • Deliverable D2.13: Month 36 -- Final Evaluation of the CloudDSM system portal
  • Task4: Modifications to the Cloud software API inside the hosting provider -- (Gigas, XLab, Sean) -- add the following to the Cloud API (an interface sketch follows this task's deliverable):
    • The ability to assign, via the API, which physical machine (inside the IaaS provider's premises) a given VM image is started and run on.
    • The ability to gather information about the CPU and network usage of each physical machine in a given hosting location, via API. Define interfaces to the chosen IaaS providers.
    • The ability for the CloudDSM portal to inform the hypervisor on a given physical machine (via the IaaS provider) that it should stop assigning CPUs to a specific VM that is already loaded and active inside the hypervisor, and the ability to tell the hypervisor to resume assigning CPUs to that VM. (This eliminates the roughly two-minute overhead required to load a VM image into the hypervisor, and also avoids charging the client for CPU time spent busy-waiting for data to arrive or for work to become available.)
    • Deliverable D2.14: Month 10 -- the functionality of this task is implemented and integrated with the prototype portal. All of the functionalities are exercised from the portal, and tests exist to exercise them; these tests are performed in Task 3 -- testing.
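
The interface sketch referred to above shows how the three API extensions of Task 4 might be surfaced to the portal as a thin client wrapper; the endpoint paths, method names, and payload fields are assumptions for illustration, not the actual Gigas or IaaS provider API.

```python
# Sketch of the three proposed Cloud API extensions as a thin client wrapper.
# Endpoint paths, parameter names, and response shapes are illustrative
# assumptions only; the real IaaS API will define its own.

class CloudDSMCloudAPI:
    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def place_vm(self, vm_image: str, physical_host: str) -> dict:
        # Extension 1: start a VM image on a named physical machine.
        return {"POST": f"{self.endpoint}/vms",
                "body": {"image": vm_image, "pin_to_host": physical_host}}

    def host_utilisation(self, location: str) -> dict:
        # Extension 2: per-physical-machine CPU and network usage for a location.
        return {"GET": f"{self.endpoint}/locations/{location}/hosts/utilisation"}

    def pause_vm_scheduling(self, vm_id: str) -> dict:
        # Extension 3a: keep the VM loaded in the hypervisor but stop giving it CPUs.
        return {"POST": f"{self.endpoint}/vms/{vm_id}/pause"}

    def resume_vm_scheduling(self, vm_id: str) -> dict:
        # Extension 3b: resume assigning CPUs to the already-loaded VM.
        return {"POST": f"{self.endpoint}/vms/{vm_id}/resume"}


if __name__ == "__main__":
    api = CloudDSMCloudAPI("https://cloud.example/api")   # placeholder endpoint
    print(api.place_vm("clouddsm-worker.img", "rack3-node12"))
    print(api.pause_vm_scheduling("vm-42"))
```
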
  • Task5: Research and develop the scheduling component within the portal -- (Leader ICL, major contribution by XLab, with input by DC, Gigas, and Sean; ICL 21PMs) -- Once the prototype is complete, the internal interface between the scheduler and the other components inside the portal is defined. The other components supply the information upon which the scheduler bases its decisions and produces a schedule to be carried out. The scheduler will exploit a self-aware dynamic analysis approach which, operating on-line and in real time, will receive as input the state of all system resources, and will calculate resource availability and expected execution times for tasks currently active within the portal and for tasks expected to be launched, based on predictions derived from usage patterns. Scheduling decisions will combine detailed status information regarding expected response times, internal network delays, possible security risks, and possible reliability problems. The self-aware dynamic analysis will return a "short list" of the best instantaneous provisioning decisions, ranked according to performance, security, reliability, energy consumption, and other relevant metrics. Based on the task to be provisioned, the scheduler will be able to decide on provisioning rapidly. The portal and the collection of DSM runtime systems include monitoring of whether the performance objectives of a task are being met; if the observations indicate an unsatisfactory outcome, the scheduler is re-triggered to make a new decision, based on the new system state as well as the overhead of any changes in provisioning.

Deliverable D2.15: Month 12 -- 6PMs -- A working simple prototype scheduler, integrated with the prototype portal, plus a detailed description of the final, full-featured scheduler. The prototype will implement all interfaces and successfully generate a schedule, but using simple algorithms. The targeted final scheduler is described in a delivered technical report defining the scheduler's functionality, structure, and all of its components, and describing its interfaces to and integration with the portal subsystems that monitor hardware and application features, and with the DSM runtime system.
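
As an illustration of the "short list" idea in Task 5, the sketch below scores candidate provisioning options against several metrics with configurable weights and returns the best few. The metric names, weights, and candidate structure are illustrative assumptions, not the algorithms ICL will deliver.

```python
# Sketch: rank candidate provisioning decisions by a weighted combination of
# metrics and return a short list. Metrics, weights, and candidates are
# illustrative assumptions; the real scheduler will use richer models.
from dataclasses import dataclass


@dataclass
class Candidate:
    provider: str
    vms: int
    expected_response_s: float   # lower is better
    security_risk: float         # 0..1, lower is better
    failure_risk: float          # 0..1, lower is better
    energy_kwh: float            # lower is better


WEIGHTS = {"expected_response_s": 0.5, "security_risk": 0.2,
           "failure_risk": 0.2, "energy_kwh": 0.1}


def score(c: Candidate) -> float:
    # Lower score is better; all metrics are treated as costs here.
    return sum(w * getattr(c, m) for m, w in WEIGHTS.items())


def short_list(candidates, k=3):
    return sorted(candidates, key=score)[:k]


if __name__ == "__main__":
    options = [
        Candidate("provider-A", 4, 8.0, 0.1, 0.05, 1.2),
        Candidate("provider-B", 8, 5.0, 0.3, 0.10, 2.0),
        Candidate("provider-C", 2, 12.0, 0.05, 0.02, 0.7),
    ]
    for c in short_list(options, k=2):
        print(c.provider, round(score(c), 2))
```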

Questions and comments related to the WP

Question: "Why would a single Cloud VM be used to perform computation tasks from multiple applications?" Answer: from an overhead perspective, sharing a single "worker" VM among multiple applications is the most efficient way to go. But there must be isolation between applications, so this raises security concerns. Those have to be addressed, which requires extra development effort. So there's a balance between performance (overhead), security, and effort (IE, add security features to CloudDSM runtime is extra effort).

Note that when a DSM command is issued by application code, the application context is suspended. If there are no ready contexts within the application, then the proto-runtime system can switch to work from a different application within nanoseconds. This allows fine-grained interleaving of work with communication. In contrast, if the time to switch among applications is on the order of hundreds of microseconds, which it would be if the CPU had to switch to a different Cloud-level VM, then the processor is better off simply sitting idle for any non-overlapped communication that is shorter than the application-switch time. In many situations this will be the case, causing much lost performance and lost hardware utilization, with the customer being charged for idle CPU time. It is this loss that will be prevented by allowing a single VM, with its single DSM runtime instance inside it, to run work for multiple applications, and having the DSM runtime switch among the applications in its very fast way.
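
The break-even argument can be made concrete with a small calculation using the orders of magnitude quoted above; the specific numbers are illustrative only.

```python
# Illustrative break-even calculation: switching to other work only pays off
# when the communication gap exceeds the cost of the switch itself.
# The numbers are order-of-magnitude illustrations taken from the text above.

dsm_switch_s = 100e-9   # proto-runtime switch between applications: ~nanoseconds
vm_switch_s = 300e-6    # switching the CPU to a different Cloud VM: ~hundreds of microseconds
comm_gap_s = 50e-6      # a typical non-overlapped communication gap

def worth_switching(switch_cost_s, gap_s):
    # Switching is only useful if the gap is longer than the switch overhead.
    return gap_s > switch_cost_s

print("DSM-level switch useful:", worth_switching(dsm_switch_s, comm_gap_s))  # True
print("VM-level switch useful: ", worth_switching(vm_switch_s, comm_gap_s))   # False: better to idle
```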

Question: "Doesn't the hypervisor automatically handle suspending an idle VM under the covers? Why is an explicit sleep/wake command needed in the CloudAPI?" Answer: When an application is started, the portal may over-provision, starting VMs on many machines, and then putting them to sleep. As soon as the application issues a high computation request, the VMs are woken and given work to do, with expected duration on the order of seconds to at most a minute. When the computation is done, most of the VMs are put back to sleep until the next time. The time to suspend and resume is less than the time to start the VM.

Comment: "A VM can yield a physical processor through a POWER Hypervisor call. It puts idle VMs into a hibernation state so that they are not assigned to any physical CPU resources."

Comment: "For IBM, our binary-reoptimizer tool has a special loader that runs the application as a thread within the same process as the optimizer and kicks off dynamic monitoring and recompilation. So there is one dynamic optimization process for each application running in a VM."

Comment: "In summary, there are two cases: (1) A given application has no work ready, with which to overlap communication (2) The whole DSM runtime system has no work ready. In case 1, the sharing of the hardware is more efficiently performed inside the DSM runtime, where it happens on the order of nanoseconds, which is why it's better for the DSM runtime to handle switching among applications. In case 2, the hypervisor has no way to know that the DSM runtime has no work! It sees the polling the DSM does while waiting for new incoming work as a busy application, so the hypervisor keeps the DSM going. The DSM runtime needs a way to tell the hypervisor to suspend the idle VM. After all, the polling consumes cpu time which the end-user pays money for! At the moment, proto-runtime simply uses pthread yield when it is idle, polling for new incoming work.. but it's not clear that will be enough to get the VM to stop re-scheduling it.. Also, from a higher level, the deployment tool knows periods when there's no work for a specific DSM runtime instance, and can issue a sleep/yield to the VM, which is faster than a full suspend-to-disk..

Comment: "On IBM there are two virtualization layers: Power hypervisor, and Cloud stack hypervisor. The Power hypervisor is not visible to the cloud stack, and is also much faster."

Question: "Why is it advantageous to have Gigas's KVM based Cloud stack, which allows an entire machine to be given exclusively to a given computation task. Isn't this wasteful?" Answer: This allows the DSM runtime to manage the assignment of work onto cores, which happens on the order of nano-seconds, far faster than a hypervisor could manage assignment of work onto cores. This assignment of whole machine to a VM is how EC2 from Amazon works, as well, for example. It doesn't waste, because the machine is fully occupied by the computation.

Question: "How does the cloud VM layer relate to the DSM runtime and the CloudDSM portal?" Answer: There is one instance of the DSM runtime inside each Cloud level VM. The DSM might directly use hypervisor commands to cause the VM it is inside of to fast-yield/sleep, at the point the DSM runtime detects that it has no ready work. The portal, though, decides when the VM should be long-term suspended or shutdown. The Cloud VM is given all the cores of a machine whenever possible, then the DSM within directly manages assigning application work to the cores, which includes suspending execution contexts at the point they perform a DSM call or synchronization. The portal runs inside its own Cloud VMs. It performs Cloud level control of creating VMs, suspending them to disk, starting DSM runtimes inside them, receiving command requests from application front-ends, and starting work-units within chosen VMs. The portal knows about the physical providers, and what hardware is at each, and what Cloud API commands to use to start VMs at them, and what Cloud API commands to use to learn how busy each location is.. hopefully the Cloud stack API has some way of informing the portal about uptime, or load, within a given physical location. The portal has a set of VMs that run code that performs the decision making about how to divide the work of a request among the providers and among the VMs created within a given provider. It may be advantageous for the portal to have Cloud APIs available that expose the characteristics of the hardware within the provider, and allows a measure of control over assignment of VMs to the hardware. The DSMs report status to the portal, which may decide to take work away from poor performing VMs and give it to others, perhaps even VMs in a different physical location. (It is unlikely that a VM itself will be migrated, but rather the work assigned to the DSM runtime inside the VM).

Question: "Is it possible that the cloud would want to move a VM from one core to another?" Answer: control over cores is inside the DSM runtime system. It's too fine grained for the Cloud stack or portal to manage. When a VM is created, all the cores of the physical machine should be given to that VM, and the hypervisor should let that VM own the hardware for as long as possible, ideally until the DSM runtime signals that it has run out of work.

Question: "How do units of work map onto Cloud level entities?" Answer: I see a hierarchical division of work.. at the highest level, work is divided among physical locations.. then at a given location, the work it receives is divided again, among VMs created within that Cloud host. If the Cloud stack at a host allows entire machines to be allocated, in the way Gigas and Amazon EC2 do, then a further division of work is performed, among the CPUs in the machine, which is managed by the DSM runtime. Hence, the portal invokes APIs to manage work chunks starting execution at the level of providers and at the level of Cloud VMs. The DSM system inside a given VM manages work chunks starting on individual cores inside a VM.

Within a given VM, the DSM runtime will create its own "virtual processors", or VPs, which are analogous to threads; all VPs are inside the same hardware-enforced coherent virtual address space. The DSM will switch among these in order to overlap communications to remote DSM instances, which are running inside different VMs.
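
As a loose analogy for this switching behaviour (not the proto-runtime implementation, which is a C-level runtime), the sketch below interleaves several "virtual processors" whose remote DSM accesses are simulated by short sleeps.

```python
# Loose analogy only: asyncio tasks stand in for the DSM runtime's "virtual
# processors" (VPs). When a VP waits on a (simulated) remote DSM access, the
# runtime switches to another VP so communication overlaps with computation.
import asyncio


async def virtual_processor(name, remote_delay_s):
    for step in range(2):
        # compute phase (trivial here), then a DSM access to a remote instance
        print(f"{name}: compute step {step}")
        await asyncio.sleep(remote_delay_s)     # suspended; another VP runs meanwhile
        print(f"{name}: remote data arrived")


async def main():
    await asyncio.gather(
        virtual_processor("VP-0", 0.05),
        virtual_processor("VP-1", 0.03),
        virtual_processor("VP-2", 0.01),
    )


if __name__ == "__main__":
    asyncio.run(main())
```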

Question: "What components of the CloudDSM system care about best resource allocation from the cloud perspective?" Answer: The calculation of best cloud level resource allocation is encapsulated inside a module that the portal runs. The portal collects information from the application, which the toolchain packages, and the portal collects information about the hardware at each provider, and about the current load on that hardware, and it collects statistics on each of the commands invoked by a given application, and it gives all of this information to the resource-calculation module. That module determines the best way to break up the work represented by the user command, and distribute the pieces across providers and across VMs. Within a VM, the DSM runtime system independently and dynamically decides allocation among the CPUs. Erol Gelenbe at Imperial College will be handling the algorithms for the Cloud level resource calculations.

Comment and Question: "If the DSM detects that it is running late, it can issue a request to the portal to add more machines to the on-going computation. The resource tool re-calculates the division of work." Question: "What information is the portal going to base its decisions on?" Answer: The DSM runtime running inside a given VM communicates status to the portal. Annotations might be inserted into the low-level source form to help the DSM runtime with this task, or the DSM runtime may end up handling this entirely by itself.

Research regarding the _search_ for the best division of work and the best deployment of it onto available hardware

Suggested solution

ConPaaS[1] is an open source runtime environment for hosting applications in the cloud.

ConPaaS is part of the PaaS family; it provides the full power of the cloud to application developers while shielding them from the complexity of the cloud. It can use different IaaS providers as a backend (EC2, OpenNebula, with plans to fully support OpenStack as well).

It is designed to host HPC scientific applications and web applications. Moreover, it automates the entire life-cycle of an application: development, deployment, performance monitoring, and also automatic scaling. Importantly, it is under further development (as a result of the Contrail project).

It runs on a variety of public and private clouds and is easily extendable, which is where CloudDSM benefits the most. It allows developers to focus on application-specific concerns rather than on cloud-specific details.

Application Model

ConPaaS provides a collection of services, where each service acts as a replacement for a commonly used runtime environment. For example, to replace a MySQL database, ConPaaS provides a cloud-based MySQL service which acts as a high-level database abstraction. The service uses real MySQL databases internally, and therefore makes it easy to port a cloud application to ConPaaS. Unlike a regular centralized database, however, it is self-managed and fully elastic: one can dynamically increase or decrease its processing capacity by requesting the service to reconfigure itself with a different number of virtual machines.

ConPaaS is also designed to facilitate the development of new services. All services derive from a single “generic service” which provides the basic functionalities for starting and stopping VMs, configuring and initializing the right services in them, etc. All the virtual machines in ConPaaS also rely on the same generic VM image. A service implementation therefore consists of a few hundred lines of Python (implementing the service-specific parts of the service manager) and Javascript (extending the front-end GUI with service-specific information and control).
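
The service skeleton implied by this description might look roughly like the sketch below; the GenericService base class, its method names, and the hooks shown are invented for illustration and do not reproduce the actual ConPaaS classes.

```python
# Illustrative-only skeleton of what a ConPaaS-style service might look like.
# The GenericService base class, its method names, and the hooks below are
# invented for this sketch; they do not reproduce the real ConPaaS code.

class GenericService:
    """Stands in for ConPaaS's shared 'generic service' functionality."""
    def start_vms(self, count):
        return [f"vm-{i}" for i in range(count)]

    def stop_vms(self, vms):
        print("stopping", vms)


class CloudDSMWorkerService(GenericService):
    """Hypothetical service-specific part: configure DSM runtimes in the VMs."""
    def __init__(self):
        self.vms = []

    def startup(self, count=1):
        self.vms = self.start_vms(count)
        for vm in self.vms:
            print(f"initialising DSM runtime inside {vm}")

    def resize(self, new_count):
        # Elasticity: reconfigure the service with a different number of VMs.
        self.shutdown()
        self.startup(new_count)

    def shutdown(self):
        if self.vms:
            self.stop_vms(self.vms)
            self.vms = []


if __name__ == "__main__":
    service = CloudDSMWorkerService()
    service.startup(2)
    service.resize(4)
    service.shutdown()
```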

MapReduce and TaskFarming: the two end-user applications make intensive use of the MapReduce and TaskFarming services from ConPaaS as their main runtime for high-performance big-data computations. Both applications also plan to offer Web-based frontends to their users.

[1] http://www.conpaas.eu/