WP 7 End User Applications

The major goal for this workpackage is to provide commercial end-user applications that employ CloudDSM, which allows them to gain a benefit in either response time or in the size of problem that can be handled. Applications are chosen that are otherwise barred from such benefits by the logistics or cost of access to machines of sufficient size or computational power, or by the development cost and advanced coding skills required for distributed-style development. The work will begin with one clean, well-defined application in the area of drug discovery. This application will be used during the initial phases of the project within a rapid prototype development process.

Douglas Connect

The application chosen for the initial development is the creation and cross-validation of statistical models of toxicology data. The data is housed within a system created as part of a previous EU project called ToxBank (www.ToxBank.net) and is accessed via the OpenTox framework (www.OpenTox.org). ToxBank holds complex data and suitable algorithms covering the structure of chemical compounds and related data from a large number of in vitro and in vivo experiments, including large sets of OMICS data. Each experiment has its own sources of error and noise within the collected data. The end user wishes to extract some conclusion about a candidate compound. In the process, they use several of the experiments, creating a statistical model for each experiment. They then fit, or train, the models on the experiment data. Once complete, they must determine whether the models have derived meaningful conclusions, which is accomplished by a technique called cross-validation, in which results from one model are compared to those from the others. Depending upon the cross-validation outcome, the user then modifies the models, or changes their selection of data, and repeats. Algorithms in this area often have super-linear or even exponential time complexity. However, the users wish to perform this work interactively, so that they can try multiple different statistical approaches within a given session, until they reach a set of data and models that gives high confidence. With the complexity of the algorithms and the large amounts of data, the time to train and cross-validate the models is much too high for interactive use when only a single machine is available. A federated cloud system will reduce the time to minutes rather than hours, allowing drug discovery researchers to ask "what if" questions, which improves the likelihood of finding new, useful drugs.
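To make the workflow concrete, the following is a minimal sketch of the train-and-cross-validate loop described above, written in Python. The choice of scikit-learn, the RandomForestRegressor model, and the per-experiment datasets are illustrative assumptions, not the actual ToxBank/OpenTox algorithms or data.

    # Minimal sketch of per-experiment training and pairwise cross-validation.
    # Model choice and data are illustrative, not the real ToxBank workflow.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def train_per_experiment(experiments):
        """Fit one statistical model per experiment dataset."""
        models = {}
        for name, (X, y) in experiments.items():
            model = RandomForestRegressor(n_estimators=200)
            model.fit(X, y)  # the expensive, super-linear step
            models[name] = model
        return models

    def cross_validate(models, experiments):
        """Score each model against the data of every *other* experiment."""
        scores = {}
        for name_a, model in models.items():
            for name_b, (X, y) in experiments.items():
                if name_b != name_a:
                    scores[(name_a, name_b)] = model.score(X, y)  # R^2 agreement
        return scores

    # Hypothetical session: two assays sharing the same compound descriptors.
    rng = np.random.default_rng(0)
    experiments = {
        "assay_liver": (rng.normal(size=(500, 30)), rng.normal(size=500)),
        "assay_kidney": (rng.normal(size=(500, 30)), rng.normal(size=500)),
    }
    models = train_per_experiment(experiments)
    print(cross_validate(models, experiments))

Each training call is independent of the others, which is what makes the workload a natural fit for distribution across a federated cloud while the user waits interactively.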

In the future, ToxBank will also incorporate iPSC models, which are quickly establishing themselves in pharmacology and drug development and will make their way into toxicology. This will add a new dimension to ToxBank and in general to the OpenTox landscape: a new set of rich data and algorithms, increasing the need for cloud storage and parallel processing.

Once the initial minimum viable system is complete, Douglas Connect expects to develop applications that involve NGS platforms such as Illumina HiSeq. NGS systems perform rapid, high-throughput, high-quality whole-genome sequencing for patients, typically producing records of hundreds of gigabytes for a given patient. The purpose of the sequencing is to overcome individual variations in responses to drugs and toxicants, which have been considered the major cause of failure in late clinical phases of drug development. The process also involves generating induced pluripotent stem cells (iPSC; www.ebisc.org) from each patient. The NGS data is cross-validated with iPSC cell response profiles using multiple functional and data-intensive techniques. End users of these systems will benefit when the applications in this area are adapted to the CloudDSM system, which will increase the size of the data that can be computed upon and reduce the time to cross-validate against existing models, related disease areas, clinical metadata, and results published in the literature.

Near the end of the project, Douglas Connect would also like to consider applications in the area of personalized medicine and pathway-driven toxicology, in which drug-related efficacy models are integrated with toxicity models, preferentially utilizing individualized human in vitro models such as iPSC. This approach integrates multiple data sets as part of generating dynamic models of the interaction of gene and protein regulation networks as they behave over time. Simulating such dynamic models consumes large amounts of computation time. End users who investigate such dynamic systems will benefit from interactive use because the biological regulation that must be understood involves signals feeding backward and forward, which impels the user to try multiple stimuli across multiple runs. From this, they perform complex decision making in building appropriate models for prediction. Time is a key asset in decision making in the pharmaceutical industry. Speeding up the prediction of clinical and toxicological outcomes during an end-user session will increase the breadth of possibilities considered, and thereby the number of innovative medicines discovered, and it will also decrease the time to market for such medicines.
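The feedback behaviour that forces repeated interactive runs can be illustrated with a toy example. The following Python sketch simulates a two-component negative feedback loop under several stimuli; the equations, parameters, and stimulus values are invented for illustration and are far smaller than the gene/protein regulation networks actually used.

    # Toy negative feedback loop: a protein induces its own repressor.
    # Purely illustrative; real regulation-network models are far larger.
    import numpy as np
    from scipy.integrate import odeint

    def feedback_loop(state, t, stimulus):
        protein, repressor = state
        d_protein = stimulus / (1.0 + repressor ** 2) - 0.1 * protein  # repressed production
        d_repressor = 0.05 * protein - 0.05 * repressor                # protein induces repressor
        return [d_protein, d_repressor]

    t = np.linspace(0.0, 200.0, 2000)
    # An interactive session tries several stimuli; each run is an independent
    # simulation, so the runs parallelize naturally across cloud nodes.
    for stimulus in (0.5, 1.0, 2.0):
        trajectory = odeint(feedback_loop, [0.0, 0.0], t, args=(stimulus,))
        print(stimulus, trajectory[-1])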

LarkBio

The second contribution to this work package will be made by LarkBio, which develops a software solution for de-novo genome assembly using De Bruijn graphs. This application will be adapted to the CloudDSM system in order to allow larger data sets to be computed upon than are possible on the machines typically available in a cloud setting.

The application involves metagenomic studies, whose aim is to identify novel enzyme candidates in bacterial communities; such studies have applications in medicine, biotechnology, and agriculture. The sequence data comes from the DNA of different species, each of which can amount to hundreds of gigabytes, and the collection used in a computation is preferably in the range of terabytes. More data means a higher probability of covering the least abundant of the species. As part of identifying novel enzyme candidates, de-novo assembly algorithms are used to join the short reads so as to build up longer fragments of DNA. Gene prediction algorithms are then run on these fragments. Current de-novo algorithms build up an in-memory graph from all of the reads, and the size of the available memory determines the number of reads that can be handled this way. The larger the memory, the more enzyme candidates can be discovered in the sample.

Current software that performs de-novo genome assembly includes the Velvet and SOAPdenovo tools, both of which are based on in-memory De Bruijn graphs. Such graphs reach terabyte sizes when assembling large (mammalian) genomes, or when performing optimal-size runs on large metagenomes. The largest practical machines available for performing these computations are limited to about 500 GB of RAM, which is insufficient for acceptable coverage of large metagenomes.
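The memory pressure described above can be seen in a minimal De Bruijn graph construction. The following Python sketch is illustrative only (it is not the Velvet or SOAPdenovo implementation); the reads and the k-mer size are hypothetical, but the memory cost scales the same way: one node per distinct (k-1)-mer and one edge per k-mer seen in the reads.

    # Illustrative De Bruijn graph construction from short reads.
    # Not the Velvet/SOAPdenovo implementation; reads and k are hypothetical.
    from collections import defaultdict

    def de_bruijn_graph(reads, k):
        """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes following it."""
        graph = defaultdict(set)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].add(kmer[1:])  # one edge per k-mer occurrence
        return graph

    reads = ["ACGTACGTGACG", "CGTGACGTTACG"]  # production runs stream billions of reads
    graph = de_bruijn_graph(reads, k=5)
    for prefix, suffixes in sorted(graph.items()):
        print(prefix, "->", sorted(suffixes))

Because every distinct k-mer in the read set must be held in the graph at once, the graph grows with the data volume rather than with the genome size alone, which is why terabyte-scale metagenome runs exceed the RAM of a single machine.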

  * We expect that the CloudDSM-based de-novo solution:
    * will be able to handle larger data sets
    * will produce more precise genome assembly on large sequencing data
    * will reveal candidate enzymes that are not visible with lower data amounts
    * will result in a higher average length of the assembled sequences
    * will yield a higher number of enzyme candidates, and hence a higher probability of success of the project.

  * The impact of the CloudDSM-based solution on our customers is that:
    * they can use the NGS short-read sequencing approach in genome assembly projects where they previously needed traditional, more expensive solutions
    * we can feed a higher data volume into the de-novo assembly step of our analysis pipeline
    * we will be able to deliver more precise results
    * the customer will be able to identify novel enzymes which can be used to catalyze a wide spectrum of industrial biochemical processes, e.g. in photocatalysis, drying & powder technologies, the clean fuel & energy sector, perfume & flavor materials production, or medicinal drug design.

Working with human genetic data is a very sensitive area from an ethical perspective. During the project we will not work with data that contains personal information. A good option is to use metagenomic data for the validation that can be obtained from public sources. An example is the EBI database (http://www.ebi.ac.uk/ena/data/view/ERA000116&display=html), where human gut metagenome data is publicly accessible. It is also possible to use data that does not come from human samples but from soil, sea water, etc.

Tasks

  • Task1: participate in the rapid prototyping process, together with members of WP 2 -- portal, WP 3 -- DSM runtime, WP 4 -- annotations, and WP 5 -- transform tools, to get a minimum viable system up and running. This will create a high degree of communication between partners early in the project, uncovering hidden assumptions, developing shared ideas about the approach, defining clean interfaces and interactions between partners, workpackages, and deliverable pieces, and increasing the smoothness and pace of progress for the rest of the project.
  • Task2: divide the application into user client, computation kernels, and work division
    • Deliverable D6.: Month 12 --
    • Deliverable D6.: Month 24 --
    • Deliverable D6.: Month 36 --
  • Task3: mock up using annotations
    • Deliverable D6.: Month 12 --
    • Deliverable D6.: Month 24 --
    • Deliverable D6.: Month 36 --
  • Task4: employ the various interfaces
    • Deliverable D6.: Month 12 --
    • Deliverable D6.: Month 24 --
    • Deliverable D6.: Month 36 --
  • Task5: Develop the de-novo genome assembly application on CloudDSM
    • Milestone M6.: Month 12 -- Prototype of the CloudDSM-based de-novo genome assembler tool
    • Deliverable D6.: Month 12 -- Source code of the de-novo assembly tool, and the executable properly instrumented to run on the CloudDSM system.
      The input of the tool is the NGS sequencing reads in FASTQ format, plus the algorithm parameters.
      The output of the tool is the assembled genome fragments in FASTA format, together with statistics of the results (average sequence length, N50 value); a minimal sketch of computing these statistics appears after this task list.
  • Task6: Validate the solution on a dataset which can be handled using currently available tools. (LarkBio)
    As validation of the software, we will run the de-novo assembly on a smaller data set using both CloudDSM and a plain EC2 instance, and compare the results.
    • Milestone M6.: Month 24 -- Validation phase 1
    • Deliverable D6.: Month 24 -- Documentation and results of the test run on the smaller dataset, and a comparison of the results with the current solution.
  • Task7: Validate the solution on a large dataset, which is not feasible without CloudDSM. (LarkBio)
    The next step of the validation is to run the algorithm on an extended data set and verify whether novel fragments are assembled this way.
    • Milestone M6.: Month 36 -- Validation phase 2
    • Deliverable D6.: Month 36 -- Documentation and results of the test run on a large metagenomic dataset, and a comparison of the results with the partial dataset analysed with traditional tools.
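As a minimal illustration of the statistics named in the Task5 deliverable (average sequence length and N50), the following Python sketch computes them from a FASTA file of assembled fragments; the file name is hypothetical and this is not LarkBio's actual tooling.

    # Illustrative computation of assembly statistics from a FASTA file.
    # The file name is hypothetical; this is not the project's delivered tool.
    def read_fasta_lengths(path):
        """Return the length of each sequence in a FASTA file."""
        lengths, current = [], 0
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    if current:
                        lengths.append(current)
                    current = 0
                else:
                    current += len(line.strip())
        if current:
            lengths.append(current)
        return lengths

    def n50(lengths):
        """Contig length at which the running total (longest first) reaches half the assembly size."""
        total, running = sum(lengths), 0
        for length in sorted(lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length
        return 0

    lengths = read_fasta_lengths("assembled_fragments.fasta")  # hypothetical output file
    print("average length:", sum(lengths) / len(lengths))
    print("N50:", n50(lengths))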