CloudDSM.WP7 History

April 09, 2014, at 09:21 AM by 80.99.199.138 -
Added lines 36-37:

Working with human genetic data is a very sensitive area from an ethical perspective. During the project we will not work with data that contains personal information. A good option is to use metagenomic data for the validation, which can be obtained from public sources. An example is the EBI database (http://www.ebi.ac.uk/ena/data/view/ERA000116&display=html), where human gut metagenome data is publicly accessible. It is also possible to use data that does not come from human samples but from soil, sea water, etc.
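As a minimal sketch (for illustration only), the cited public record could be retrieved directly over HTTP; only the URL quoted above is taken from the text, and any further parsing or storage steps would be choices left to the implementation.
[@
# Fetch the publicly accessible ENA record cited above (human gut metagenome study).
# Only the URL from the text is used; what to do with the response is left open here.
import urllib.request

url = "http://www.ebi.ac.uk/ena/data/view/ERA000116&display=html"
with urllib.request.urlopen(url) as response:
    page = response.read()
print(len(page), "bytes retrieved from", url)
@]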
April 01, 2014, at 09:19 AM by 178.11.216.240 -
Added lines 38-41:


* Task1: participate in the rapid prototyping process, together with members of WP 2 -- portal, WP 3 -- DSM runtime, WP 4 -- annotations, and WP 5 -- transform tools, to get a minimum viable system up and running together. This will create a high degree of communication between partners early in the project, uncover hidden assumptions, develop shared ideas about the approach, define clean interfaces and interactions between partners, workpackages, and deliverable pieces, and increase the smoothness and pace of progress for the rest of the project.
March 28, 2014, at 02:57 PM by 192.16.201.181 -
Deleted line 51:
* Task3: employ the various interfaces
March 28, 2014, at 10:44 AM by 80.114.134.224 -
Changed lines 10-12 from:
In the future, Toxbank will also incorporate iPSC models, which are quickly establishing in pharmacology and drug development and will make their way to toxicology. This will give a new dimension to ToxBank and in general to the OpenTox landscape: A new set of rich data and algorithms, increasing the need for cloud storage and parallel processing.

Once the initial, minimum viable system, is complete, Douglas Connect expects to develop applications that involve NGS platfroms like e.g. Illumina HiSeq. NGS systems perform rapid, high-throughput, high-quality whole-genome sequencing for patients, producing typical records of 100s of giga bytes for a given patient. The purpose of the sequencing is to overcome individual variations in responses to drugs and toxicants, which have been considered to be the major cause of failure in late clinical phases of drug development. The process also involves generating induced pluripotent stem cells (iPSC; www.ebisc.org) from each patient. The NGS data is cross validated with iPSC cell response profiles using multiple functional and data-intensive techniques. End users of these systems will enjoy similar benefits when the applications in this area are adapted to the CloudDSM, which will increase the size of data that can be computed upon and to reduce the time of cross validating with existing models, related disease areas, clinical meta data, and results published in the literature.
to:
In the future, ToxBank will also incorporate iPSC models, which are quickly establishing themselves in pharmacology and drug development and will make their way to toxicology. This will give a new dimension to ToxBank and, more generally, to the OpenTox landscape: a new set of rich data and algorithms, increasing the need for cloud storage and parallel processing.

Once the initial minimum viable system is complete, Douglas Connect expects to develop applications that involve NGS platforms such as Illumina HiSeq. NGS systems perform rapid, high-throughput, high-quality whole-genome sequencing for patients, producing records of typically hundreds of gigabytes per patient. The purpose of the sequencing is to overcome individual variations in responses to drugs and toxicants, which have been considered the major cause of failure in late clinical phases of drug development. The process also involves generating induced pluripotent stem cells (iPSC; www.ebisc.org) from each patient. The NGS data is cross-validated with iPSC cell response profiles using multiple functional and data-intensive techniques. End users of these systems will enjoy benefits when the applications in this area are adapted to the CloudDSM system, which will increase the size of data that can be computed upon and reduce the time of cross-validating against existing models, related disease areas, clinical metadata, and results published in the literature.
March 28, 2014, at 10:38 AM by 80.114.134.224 -
Changed line 8 from:
The application chosen for the initial development is the creation and cross-validation of statistical models of toxicology data. The data is housed within a system created as part of a previous EU project called ToxBank (www.ToxBank.net) and accessed via the OpenTox framework (www.OpenTox.org). It has complex data and suitable algorithms covering structure of chemical compounds and related data from a large number of in vitro and in vivo experiments including large sets of OMICS-data. Each experiment has its own sources of error, and noise within the collected data. The end user wishes to extract some conclusion about a candidate compound. In the process, they use multiple of the experiments, creating a statistical model for each experiment. They then fit, or train, the models from the experiment data. Once complete, they must determine whether the models have derived meaningful conclusions, which is accomplished by a technique called cross-validation, in which results from one model are compared to those from the others. Depending upon the cross-validation outcome, the user then modifies the models, or changes their selection of data and repeats. Algorithms in this area often have a super-linear or even exponential time complexity. However, the users wish to perform this work interactively, so that they can try multiple different statistical approaches within a given session, until they reach a set of data and models that give high confidence. With the complexity of the algorithms and large amounts of data, the time to train and cross-validate the models is much too high for interactive use when only a single machine is available. A federated cloud system will reduce the time to minutes rather than hours, allowing the drug discovery researchers to ask "what if" questions, which improves the liklihood of finding new, useful drugs.
to:
The application chosen for the initial development is the creation and cross-validation of statistical models of toxicology data. The data is housed within a system created as part of a previous EU project called ToxBank (www.ToxBank.net) and is accessed via the OpenTox framework (www.OpenTox.org). It has complex data and suitable algorithms covering the structure of chemical compounds and related data from a large number of in vitro and in vivo experiments, including large sets of OMICS data. Each experiment has its own sources of error and noise within the collected data. The end user wishes to extract some conclusion about a candidate compound. In the process, they use several of the experiments, creating a statistical model for each one. They then fit, or train, the models from the experiment data. Once complete, they must determine whether the models have derived meaningful conclusions, which is accomplished by a technique called cross-validation, in which results from one model are compared to those from the others. Depending upon the cross-validation outcome, the user then modifies the models, or changes their selection of data, and repeats. Algorithms in this area often have a super-linear or even exponential time complexity. However, the users wish to perform this work interactively, so that they can try multiple different statistical approaches within a given session, until they reach a set of data and models that give high confidence. With the complexity of the algorithms and the large amounts of data, the time to train and cross-validate the models is much too high for interactive use when only a single machine is available. A federated cloud system will reduce the time to minutes rather than hours, allowing the drug discovery researchers to ask "what if" questions, which improves the likelihood of finding new, useful drugs.
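As an illustration of the train/cross-validate loop described above, the following sketch uses scikit-learn with synthetic data as a stand-in for the ToxBank/OpenTox models; the data set, model choice, and fold count are assumptions made only for the example.
[@
# Illustrative k-fold cross-validation of a statistical model (not the actual
# ToxBank/OpenTox algorithms). Each fold trains a model and scores it on held-out
# data; in the envisioned CloudDSM setting the folds and candidate models would be
# distributed across nodes so the whole loop fits within an interactive session.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))   # placeholder for compound/experiment descriptors
y = rng.normal(size=500)         # placeholder for a measured toxicity endpoint

model = RandomForestRegressor(n_estimators=200, n_jobs=-1)
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("per-fold scores:", scores)
print("mean score:", scores.mean())
@]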
March 28, 2014, at 10:38 AM by 80.114.134.224 -
Added lines 5-6:

!!!Douglas Connect
March 28, 2014, at 10:37 AM by 80.114.134.224 -
Added lines 14-36:
!!!LarkBio

The second contribution to this work package will be made by LarkBio, who develops a software solution for de-novo genome assembly using De Bruijn graphs. This application will be adapted to the CloudDSM system in order to allow larger data sets to be computed upon than are possible on the machines typically available in a cloud setting.

The application involves metagenomic studies, whose aim is to identify novel enzyme candidates in bacterial communities. Such studies have applications in medicine, biotechnology and agriculture. The sequence data comes from the DNA of different species; the data for each species can reach hundreds of gigabytes, and the collection used in a computation is preferably in the range of terabytes. More data means a higher probability of covering the least abundant of the species. As part of identifying novel enzyme candidates, de-novo assembly algorithms are used to join the short reads so as to build up longer fragments of DNA. Gene prediction algorithms are then run on these fragments. Current de-novo algorithms build up an in-memory graph from all reads, and the size of the available memory determines the number of reads that can be handled this way. The larger the memory, the more enzyme candidates can be discovered in the sample.

The current software tools that perform de novo genome assembly include Velvet and SOAPdenovo, both of which are based on in-memory De Bruijn graphs. Such graphs reach terabyte sizes when assembling large (mammalian) genomes, or when performing optimal-size runs on large metagenomes. The largest practical machines available for performing these computations are limited to about 500GB of RAM, which is insufficient for acceptable coverage of large metagenomes.
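To make the memory argument concrete, the toy sketch below builds a De Bruijn graph from a few short reads; real assemblers such as Velvet and SOAPdenovo use far more compact k-mer encodings, but the node count, and thus the memory footprint, still grows with the number of distinct k-mers in the data set. The reads and the k value are made up for the example.
[@
# Toy De Bruijn graph construction from short reads (illustrative only).
# Every distinct (k-1)-mer becomes a node held in memory, which is why memory
# grows with the volume of sequence data fed to the assembler.
from collections import defaultdict

def de_bruijn_graph(reads, k):
    graph = defaultdict(set)   # (k-1)-mer -> set of successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

reads = ["ACGTACGTGACC", "GTACGTGACCTT", "CGTGACCTTAAG"]   # made-up short reads
graph = de_bruijn_graph(reads, k=5)
print(len(graph), "nodes")
for node, successors in sorted(graph.items()):
    print(node, "->", ", ".join(sorted(successors)))
@]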

* We expect that the CloudDSM-based de-novo solution:
** will be able to handle larger data sets
** will produce more precise genome assemblies on the large sequencing data
** will reveal candidate enzymes which are not visible with smaller amounts of data
** will yield a higher average length of the assembled sequences
** will yield a higher number of enzyme candidates, and hence a higher probability of success for the project.

* The impact of the CloudDSM-based solution on our customers is that:
** they can use the NGS short-read sequencing approach for genome assembly projects where they previously needed traditional and more expensive solutions
** we can feed a higher data volume into the de-novo assembly step of our analysis pipeline
** we will be able to deliver more precise results
** the customer will be able to identify novel enzymes which can be used to catalyze a wide spectrum of industrial biochemical processes, e.g. in photocatalysis, drying & powder technologies, the clean fuel & energy sector, perfume & flavor materials production, or medicinal drug design.

!!!Tasks
Changed lines 52-69 from:
* Task4: Develop a software solution for de-novo genome assembly using De Brujin graphs, which utilizes the CloudDSM system. (LarkBio) \\
Metagenomic studies aim to identify novel enzyme candidates in bacterial communities, and has a variety of applications in medicine, biotechnology and agriculture. The sequence data to process is large, and come from the DNA of different species. More data means higher probability of covering also the least abundant of these species. In order to identify novel enzyme candidates, we have to run de-novo assembly algorithms to join the short reads building up longer fragments of DNA, and run gene predictions algorithms on these fragments. Current de-novo algorithms build up an in-memory graph processing all reads, then analyze this graph. The size of the available memory limits the number of reads that can be handled this way, limiting the enzyme candidates discovered in the sample. This is the problem that we address using the CloudDSM system.

Current applications of de novo genome assembly include the Velvet and SOAPdenovo tools, both are based on in-memory De Bruijn graphs. For this reason their use is prohibitive when assembling large (mammalian) genomes, and suboptimal for large metagenomes. The currently available large memory instances are limited to about 500GB of RAM, which is clearly insufficient for high coverage metagenomes.

* We expect that the CloudDSM based de-novo solution:
* will be able to handle larger data sets
* produces more precise genome assembly on the large sequencing data
* results in candidate enzymes which are not visible in case of lower data amounts
* results in higher average length of the assembled sequences
* higher number of the enzyme candidates, and hence higher probability of success of the project.

* The impact of the CloudDSM based solution on our customers is that:
* they can use the NGS short read sequencing approach to genome assembly projects where they needed to use traditional and more expensive solutions before
* we can feed higher data volume to the de-novo assembly step of our analysis pipeline
* we will be able to deliver more precise results
* the customer will be able to identify novel enzymes which can be used to catalyze a wide spectrum of industrial biochemical processes in e.g. photocatalysis, \\ drying & powder technologies, clean fuel & energy sector, perfumes & flavor materials production or \\ medicinal drug design.
to:
* Task4: Develop a de novo genome assembly application on CloudDSM
Added lines 58-59:
Deleted lines 68-72:


!!information on end-user applications

Douglas Connect:
March 28, 2014, at 10:18 AM by 80.114.134.224 -
Changed lines 4-6 from:
The major goal for this workpackage is to provide commercial end-user applications that employ CloudDSM, which allows them to gain benefit in either response time or in size of problem, where they were previously barred from such benefits due to logistics or cost of machine access, or development cost, or advanced coding skills. The work will begin with one clean, well defined application in the area of drug discovery. This application will be used during the initial phases of the project within a rapid proto-type development process.

The application chosen
is the creation and cross-validation of statistical models of toxicology data. The data is housed within a system created as part of a previous EU project called ToxBank (www.ToxBank.net) and accessed via the OpenTox framework (www.OpenTox.org). It has complex data and suitable algorithms covering structure of chemical compounds and related data from a large number of in vitro and in vivo experiments including large sets of OMICS-data. Each experiment has its own sources of error, and noise within the collected data. The end user wishes to extract some conclusion about a candidate compound. In the process, they use multiple of the experiments, creating a statistical model for each experiment. They then fit, or train, the models from the experiment data. Once complete, they must determine whether the models have derived meaningful conclusions, which is accomplished by a technique called cross-validation, in which results from one model are compared to those from the others. Depending upon the cross-validation outcome, the user then modifies the models, or changes their selection of data and repeats. Algorithms in this area often have a super-linear or even exponential time complexity. However, the users wish to perform this work interactively, so that they can try multiple different statistical approaches within a given session, until they reach a set of data and models that give high confidence. With the complexity of the algorithms and large amounts of data, the time to train and cross-validate the models is much too high for interactive use when only a single machine is available. A federated cloud system will reduce the time to minutes rather than hours, allowing the drug discovery researchers to ask "what if" questions, which improves the liklihood of finding new, useful drugs.
to:
The major goal for this workpackage is to provide commercial end-user applications that employ CloudDSM, which allows them to gain benefit in either response time or in size of problem. Applications are chosen that are otherwise barred from such benefits by the logistics or cost of access to machines of sufficient size or computation power, or by the development cost or advanced coding skills required for distributed-style development. The work will begin with one clean, well-defined application in the area of drug discovery. This application will be used during the initial phases of the project within a rapid prototype development process.

The application chosen for the initial development
is the creation and cross-validation of statistical models of toxicology data. The data is housed within a system created as part of a previous EU project called ToxBank (www.ToxBank.net) and accessed via the OpenTox framework (www.OpenTox.org). It has complex data and suitable algorithms covering structure of chemical compounds and related data from a large number of in vitro and in vivo experiments including large sets of OMICS-data. Each experiment has its own sources of error, and noise within the collected data. The end user wishes to extract some conclusion about a candidate compound. In the process, they use multiple of the experiments, creating a statistical model for each experiment. They then fit, or train, the models from the experiment data. Once complete, they must determine whether the models have derived meaningful conclusions, which is accomplished by a technique called cross-validation, in which results from one model are compared to those from the others. Depending upon the cross-validation outcome, the user then modifies the models, or changes their selection of data and repeats. Algorithms in this area often have a super-linear or even exponential time complexity. However, the users wish to perform this work interactively, so that they can try multiple different statistical approaches within a given session, until they reach a set of data and models that give high confidence. With the complexity of the algorithms and large amounts of data, the time to train and cross-validate the models is much too high for interactive use when only a single machine is available. A federated cloud system will reduce the time to minutes rather than hours, allowing the drug discovery researchers to ask "what if" questions, which improves the liklihood of finding new, useful drugs.
March 28, 2014, at 10:15 AM by 80.114.134.224 -
Changed line 4 from:
The major goal for this workpackage is to provide commercial end-user applications that employ CloudDSM, which allows them to gain benefit in either response time or in size of problem, which they were previously barred from due to logistics or cost of machine access, or development cost, or advanced coding skills. The work will begin with one clean, well defined application in the area of drug discovery. This application will be used during the initial phases of the project within a rapid proto-type development process.
to:
The major goal for this workpackage is to provide commercial end-user applications that employ CloudDSM, which allows them to gain benefit in either response time or in size of problem, where they were previously barred from such benefits due to logistics or cost of machine access, or development cost, or advanced coding skills. The work will begin with one clean, well defined application in the area of drug discovery. This application will be used during the initial phases of the project within a rapid proto-type development process.
March 28, 2014, at 10:14 AM by 80.114.134.224 -
Added lines 3-13:

The major goal for this workpackage is to provide commercial end-user applications that employ CloudDSM, which allows them to gain benefit in either response time or in size of problem, which they were previously barred from due to logistics or cost of machine access, or development cost, or advanced coding skills. The work will begin with one clean, well defined application in the area of drug discovery. This application will be used during the initial phases of the project within a rapid proto-type development process.

The application chosen is the creation and cross-validation of statistical models of toxicology data. The data is housed within a system created as part of a previous EU project called ToxBank (www.ToxBank.net) and accessed via the OpenTox framework (www.OpenTox.org). It has complex data and suitable algorithms covering structure of chemical compounds and related data from a large number of in vitro and in vivo experiments including large sets of OMICS-data. Each experiment has its own sources of error, and noise within the collected data. The end user wishes to extract some conclusion about a candidate compound. In the process, they use multiple of the experiments, creating a statistical model for each experiment. They then fit, or train, the models from the experiment data. Once complete, they must determine whether the models have derived meaningful conclusions, which is accomplished by a technique called cross-validation, in which results from one model are compared to those from the others. Depending upon the cross-validation outcome, the user then modifies the models, or changes their selection of data and repeats. Algorithms in this area often have a super-linear or even exponential time complexity. However, the users wish to perform this work interactively, so that they can try multiple different statistical approaches within a given session, until they reach a set of data and models that give high confidence. With the complexity of the algorithms and large amounts of data, the time to train and cross-validate the models is much too high for interactive use when only a single machine is available. A federated cloud system will reduce the time to minutes rather than hours, allowing the drug discovery researchers to ask "what if" questions, which improves the liklihood of finding new, useful drugs.

In the future, Toxbank will also incorporate iPSC models, which are quickly establishing in pharmacology and drug development and will make their way to toxicology. This will give a new dimension to ToxBank and in general to the OpenTox landscape: A new set of rich data and algorithms, increasing the need for cloud storage and parallel processing.

Once the initial, minimum viable system, is complete, Douglas Connect expects to develop applications that involve NGS platfroms like e.g. Illumina HiSeq. NGS systems perform rapid, high-throughput, high-quality whole-genome sequencing for patients, producing typical records of 100s of giga bytes for a given patient. The purpose of the sequencing is to overcome individual variations in responses to drugs and toxicants, which have been considered to be the major cause of failure in late clinical phases of drug development. The process also involves generating induced pluripotent stem cells (iPSC; www.ebisc.org) from each patient. The NGS data is cross validated with iPSC cell response profiles using multiple functional and data-intensive techniques. End users of these systems will enjoy similar benefits when the applications in this area are adapted to the CloudDSM, which will increase the size of data that can be computed upon and to reduce the time of cross validating with existing models, related disease areas, clinical meta data, and results published in the literature.

Near the end of the project, Douglas Connect would also like to consider applications in the area of personalized medicine and pathway-driven toxicology, in which drug-related efficacy models are integrated with toxicity models, preferentially utilizing individualized human in vitro models such as iPSC. Personalized medicine and pathway-driven toxicology integrate multiple data sets to generate dynamic models of the interaction of gene and protein regulation networks as they behave over time. Simulation of such dynamic models consumes large amounts of computation time. End users who investigate such dynamic systems will benefit from interactive use, because the biological regulation that must be understood involves signals feeding backward and forward, which impels the user to try multiple stimuli and multiple runs. From this, they perform the complex decision making involved in building appropriate models for prediction. Time is a key asset in decision making in the pharmaceutical industry. Speeding up the prediction of clinical and toxicological outcomes during an end user's session will increase the breadth of possibilities considered, thereby increasing the number of innovative medicines discovered, and it will also decrease the time to market for such medicines.
Deleted line 14:
* Milestone M6.: Month 12 --
Deleted line 15:
* Milestone M6.: Month 24 --
Deleted line 16:
* Milestone M6.: Month 36 --
Deleted line 18:
* Milestone M6.: Month 12 --
Deleted line 19:
* Milestone M6.: Month 24 --
Deleted line 20:
* Milestone M6.: Month 36 --
Deleted line 22:
* Milestone M6.: Month 12 --
Deleted line 23:
* Milestone M6.: Month 24 --
Deleted line 24:
* Milestone M6.: Month 36 --
Changed lines 28-29 from:
* Milestone M7.: Month
to:
Changed lines 65-76 from:
Douglas Connect:

a) The CloudDSM project will begin with a clean, well defined application in the area of drug discovery. This application will be used during the initial phases of the project within a rapid proto-type development process. The application is the creation and cross-validation of statistical models of toxicology data. The data is housed within a system created as part of a previous EU project called ToxBank (www.ToxBank.net) and the OpenTox framework (www.OpenTox.org). It has a fast growing wealth of complex data and related algorithms: structural data of chemical compounds and related data from in vitro and in vivo experiments including large sets of OMICS-data that are then integrated for evidence-based prediction models. ToxBank collects the results of a large number of different experiments, each with their own sources of error, and noise within the collected data. The users wish to extract some conclusion about a candidate compound. In the process, they use multiple of the experiments, creating a statistical model for each experiment. They then fit, or train, the models from the experiment data. Once complete, they must determine whether the models have derived meaningful conclusions, which is accomplished by a technique called cross-validation, in which results from one model are compared to those from the others. Algorithms in this area often have a super-linear or even exponential time complexity. However, the users wish to perform this work interactively, so that they can try multiple different statistical approaches within a given setting, and weed out which experiments may be giving unacceptably high error. With the complexity of the algorithms and large amounts of data, the time to train and cross-validate the models is much too high for interactive use. A federated cloud system will reduce the time to minutes rather than hours, allowing the drug discovery researchers to ask "what if" questions, which improves the liklihood of finding new, useful drugs.

A further goal for Toxbank is to incorporate iPSC models, which are quickly establishing in pharmacology and drug development and as a result are also naturally making their way to toxicology. This will give a new dimension to ToxBank and in general to the OpenTox landscape: A new set of rich data and algorithms, increasing the need for cloud storage and parallel processing.

b) The second application that Douglas Connect expects to adapt to the CloudDSM system involves NGS platfroms like e.g. Illumina HiSeq. These enable rapid, high-throughput, high-quality whole-genome sequencing, with the highest sequencing output up to approx. 10 billion reads/1000GB per run of raw data. Even if there are several levels of data compression and filtering, eventually we can expect 10s to 100s of giga bytes per patient. The idea of NGS is to overcome individual variations in responses to drugs and toxicants, which so far have been considered to be the major cause of failure in late clinical phases of drug development. Together with the possibility to generate induced pluripotent stem cells (iPSC; www.ebisc.org) from every patient, the processing and cross validation of individual patients’ NGS data and response profiles based on multiple functional and data-rich read-outs (e.g. high content imaging) from these iPSC, creates a huge need for big data storage and an even bigger need for processing time.
The cross validation with existing models, related disease areas, clinical meta data and literature adds to the need described.

In the medium term, drug-related efficacy models need to be integrated with toxicity models, preferentially from individualized human in vitro models (like iPSC).
The relevance for personalized medicine and pathway-driven toxicology requires integration of gene (and protein) regulation networks over time. Dynamic modeling of multiple complex data sets will require exorbitant computation times in both cases. Since the biological regulation underlying mechanisms of action is redundant and feeding back and forward, we aim at iterative and interactive applications for complex decision making in building appropriate models for prediction involving federation sources and heterogeneous evidence (read-across). (Are there App examples for this?)
In terms of decision making in the pharmaceutical industry, time is a key asset. An acceleration of processing time with regard to these big data sets with the aim of predicting clinical and toxicological outcomes can decide about the fate of innovative medicines pipelines, intellectual property priorities and thus the fate of companies.
to:
Douglas Connect:
March 28, 2014, at 08:06 AM by 80.114.134.224 -
Changed line 75 from:
In terms of decision making in the pharmaceutical industry, time is a key asset. An acceleration of processing time with regard to these big data sets with the aim of predicting clinical and toxicological outcomes can decide about the fate of innovative medicines pipelines, intellectual property priorities and thus the fate of companies.
to:
In terms of decision making in the pharmaceutical industry, time is a key asset. An acceleration of processing time with regard to these big data sets with the aim of predicting clinical and toxicological outcomes can decide about the fate of innovative medicines pipelines, intellectual property priorities and thus the fate of companies.
March 28, 2014, at 07:00 AM by 80.114.134.224 -
Changed line 75 from:
In terms of decision making in the pharmaceutical industry, time is a key asset. An acceleration of processing time with regard to these big data sets with the aim of predicting clinical and toxicological outcomes can decide about the fate of innovative medicines pipelines, intellectual property priorities and thus the fate of companies.
to:
In terms of decision making in the pharmaceutical industry, time is a key asset. An acceleration of processing time with regard to these big data sets with the aim of predicting clinical and toxicological outcomes can decide about the fate of innovative medicines pipelines, intellectual property priorities and thus the fate of companies.
March 28, 2014, at 06:55 AM by 80.114.134.224 -
Changed lines 59-75 from:
* Deliverable D6.: Month 36 -- Documentation and results of the test run on a large metagenomic dataset. Comparison of the results with the partial dataset, analysed with traditional tools.
to:
* Deliverable D6.: Month 36 -- Documentation and results of the test run on a large metagenomic dataset. Comparison of the results with the partial dataset, analysed with traditional tools.


!!information on end-user applications

Douglas Connect:

a) The CloudDSM project will begin with a clean, well defined application in the area of drug discovery. This application will be used during the initial phases of the project within a rapid proto-type development process. The application is the creation and cross-validation of statistical models of toxicology data. The data is housed within a system created as part of a previous EU project called ToxBank (www.ToxBank.net) and the OpenTox framework (www.OpenTox.org). It has a fast growing wealth of complex data and related algorithms: structural data of chemical compounds and related data from in vitro and in vivo experiments including large sets of OMICS-data that are then integrated for evidence-based prediction models. ToxBank collects the results of a large number of different experiments, each with their own sources of error, and noise within the collected data. The users wish to extract some conclusion about a candidate compound. In the process, they use multiple of the experiments, creating a statistical model for each experiment. They then fit, or train, the models from the experiment data. Once complete, they must determine whether the models have derived meaningful conclusions, which is accomplished by a technique called cross-validation, in which results from one model are compared to those from the others. Algorithms in this area often have a super-linear or even exponential time complexity. However, the users wish to perform this work interactively, so that they can try multiple different statistical approaches within a given setting, and weed out which experiments may be giving unacceptably high error. With the complexity of the algorithms and large amounts of data, the time to train and cross-validate the models is much too high for interactive use. A federated cloud system will reduce the time to minutes rather than hours, allowing the drug discovery researchers to ask "what if" questions, which improves the liklihood of finding new, useful drugs.

A further goal for Toxbank is to incorporate iPSC models, which are quickly establishing in pharmacology and drug development and as a result are also naturally making their way to toxicology. This will give a new dimension to ToxBank and in general to the OpenTox landscape: A new set of rich data and algorithms, increasing the need for cloud storage and parallel processing.

b) The second application that Douglas Connect expects to adapt to the CloudDSM system involves NGS platfroms like e.g. Illumina HiSeq. These enable rapid, high-throughput, high-quality whole-genome sequencing, with the highest sequencing output up to approx. 10 billion reads/1000GB per run of raw data. Even if there are several levels of data compression and filtering, eventually we can expect 10s to 100s of giga bytes per patient. The idea of NGS is to overcome individual variations in responses to drugs and toxicants, which so far have been considered to be the major cause of failure in late clinical phases of drug development. Together with the possibility to generate induced pluripotent stem cells (iPSC; www.ebisc.org) from every patient, the processing and cross validation of individual patients’ NGS data and response profiles based on multiple functional and data-rich read-outs (e.g. high content imaging) from these iPSC, creates a huge need for big data storage and an even bigger need for processing time.
The cross validation with existing models, related disease areas, clinical meta data and literature adds to the need described.

In the medium term, drug-related efficacy models need to be integrated with toxicity models, preferentially from individualized human in vitro models (like iPSC).
The relevance for personalized medicine and pathway-driven toxicology requires integration of gene (and protein) regulation networks over time. Dynamic modeling of multiple complex data sets will require exorbitant computation times in both cases. Since the biological regulation underlying mechanisms of action is redundant and feeding back and forward, we aim at iterative and interactive applications for complex decision making in building appropriate models for prediction involving federation sources and heterogeneous evidence (read-across). (Are there App examples for this?)
In terms of decision making in the pharmaceutical industry, time is a key asset. An acceleration of processing time with regard to these big data sets with the aim of predicting clinical and toxicological outcomes can decide about the fate of innovative medicines pipelines, intellectual property priorities and thus the fate of companies.
March 26, 2014, at 07:04 AM by 80.99.199.138 -
Changed lines 44-45 from:
to:
* the customer will be able to identify novel enzymes which can be used to catalyze a wide spectrum of industrial biochemical processes in e.g. photocatalysis, \\ drying & powder technologies, clean fuel & energy sector, perfumes & flavor materials production or \\ medicinal drug design.
Changed line 59 from:
* Deliverable D6.: Month 36 -- Documentation and results of the test run on a large metagenomic dataset. Comparison of the results with the partial dataset, analysed with traditional tools.
to:
* Deliverable D6.: Month 36 -- Documentation and results of the test run on a large metagenomic dataset. Comparison of the results with the partial dataset, analysed with traditional tools.
March 26, 2014, at 06:37 AM by 80.99.199.138 -
Changed line 28 from:
* Task4: Develop a software solution for de-novo genome assembly using De Brujin graphs, which utilizes the CloudDSM system. (LarkBio)
to:
* Task4: Develop a software solution for de-novo genome assembly using De Brujin graphs, which utilizes the CloudDSM system. (LarkBio) \\
Changed lines 46-49 from:
* Deliverable D6.: Month 12 -- Source code of the de-novo assembly tool and the executable file properly instrumented to be able to run on the CloudDSM system
The input of the tool is the NGS sequencing reads in FASTQ format, and the algorithm parameters. The output of the tool is the assembled genome fragments in FASTA format, and the statistics of results (average sequence length, N50 value).

* Task5: Validate the solution on a dataset which can be solved using currently available tools.
(LarkBio)
to:
* Deliverable D6.: Month 12 -- Source code of the de-novo assembly tool and the executable file properly instrumented to be able to run on the CloudDSM system \\
The input of the tool is the NGS sequencing reads in FASTQ format, and the algorithm parameters. \\
The output of
the tool is the assembled genome fragments in FASTA format, and the statistics of results (average sequence length, N50 value).
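For reference, the result statistics named in this deliverable (average sequence length and the N50 value) can be computed from the assembled fragment lengths as sketched below; the example lengths stand in for values that would normally be read from the assembler's FASTA output.
[@
# Compute average contig length and N50 from assembled fragment lengths (illustrative).
# N50: the length L such that contigs of length >= L together contain at least half
# of the total assembled bases.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

contig_lengths = [12000, 8000, 5000, 3000, 1500, 900, 600]   # example values only
print("average length:", sum(contig_lengths) / len(contig_lengths))
print("N50:", n50(contig_lengths))
@]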

* Task5: Validate the solution on a dataset which can be solved using currently available tools.
(LarkBio) \\
Changed line 55 from:
* Task6: Validate the solution on large dataset, which is not feasible without CloudDSM. (LarkBio)
to:
* Task6: Validate the solution on large dataset, which is not feasible without CloudDSM. (LarkBio) \\
March 26, 2014, at 06:33 AM by 80.99.199.138 -
Changed lines 29-38 from:
Metagenomic studies aim to identify novel enzyme candidates in bacterial communities, and has a variety of applications in medicine, biotechnology and agriculture. The sequence data to process is large, and come from the DNA of different species. More data means higher probability of covering also the least abundant of these species. In order to identify novel enzyme candidates, we have to run de-novo assembly algorithms to join the short reads building up longer fragments of DNA, and run gene predictions algorithms on these fragments. Current de-novo algorithms build up an in-memory graph processing all reads, then analyze this graph. The size of the available memory limits the number of reads that can be handled this way, limiting the enzyme candidates discovered in the sample. This is the problem that we address using the CloudDSM system.

Current applications of de novo genome assembly include the Velvet and SOAPdenovo tools, both are based on in-memory De Bruijn graphs. For this reason their use is prohibitive when assembling large (mammalian) genomes, and suboptimal for large metagenomes. The currently available large memory instances are limited to about 500GB of RAM, which is clearly insufficient for high coverage metagenomes.

We expect that the CloudDSM based de-novo solution:
* will be able to handle larger data sets
* produces more precise genome assembly on the large sequencing data
* results
in candidate enzymes which are not visible in case of lower data amounts
* results in higher average length of the assembled sequences
* higher number of the enzyme candidates, and hence higher probability of success of the project.
to:
Metagenomic studies aim to identify novel enzyme candidates in bacterial communities, and has a variety of applications in medicine, biotechnology and agriculture. The sequence data to process is large, and come from the DNA of different species. More data means higher probability of covering also the least abundant of these species. In order to identify novel enzyme candidates, we have to run de-novo assembly algorithms to join the short reads building up longer fragments of DNA, and run gene predictions algorithms on these fragments. Current de-novo algorithms build up an in-memory graph processing all reads, then analyze this graph. The size of the available memory limits the number of reads that can be handled this way, limiting the enzyme candidates discovered in the sample. This is the problem that we address using the CloudDSM system.

Current applications of de novo genome assembly include the Velvet and SOAPdenovo tools, both are based on in-memory De Bruijn graphs. For this reason their use is prohibitive when assembling large (mammalian) genomes, and suboptimal for large metagenomes. The currently available large memory instances are limited to about 500GB of RAM, which is clearly insufficient for high coverage metagenomes.

* We expect that the CloudDSM based de-novo solution:
* will be able to handle larger data sets
* produces more precise genome assembly on the large sequencing data
* results in candidate enzymes which are not visible
in case of lower data amounts
* results in higher average length of the assembled sequences
* higher number of the enzyme candidates, and hence higher probability of success of the project.
Changed lines 40-44 from:
The impact of the CloudDSM based solution on our customers is that:
* they can use the NGS short read sequencing approach to genome assembly projects where they needed to use traditional and more expensive solutions before
* we can feed higher data volume to the de-novo assembly step of our analysis pipeline
* we will be able to deliver more precise results
to:
* The impact of the CloudDSM based solution on our customers is that:
* they can use the NGS short read sequencing approach to genome assembly projects where they needed to use traditional and more expensive solutions before
* we can feed higher data volume to the de-novo assembly step of our analysis pipeline
* we will be able to deliver more precise results
Changed lines 47-48 from:
The input of the tool is the NGS sequencing reads in FASTQ format, and the algorithm parameters. The output of the tool is the assembled genome fragments in FASTA format, and the statistics of results (average sequence length, N50 value).
to:
The input of the tool is the NGS sequencing reads in FASTQ format, and the algorithm parameters. The output of the tool is the assembled genome fragments in FASTA format, and the statistics of results (average sequence length, N50 value).
Changed lines 50-51 from:
As the validation of the software, we run the de-novo assembly on a smaller data set using CloudDSM and plain EC2 instance, and compare the results.
to:
As the validation of the software, we run the de-novo assembly on a smaller data set using CloudDSM and plain EC2 instance, and compare the results.
Changed lines 55-56 from:
The next step of the validation is to run the algorithm on an extended data set, and verify if novel fragments are assembled this way.
to:
The next step of the validation is to run the algorithm on an extended data set, and verify if novel fragments are assembled this way.
March 26, 2014, at 06:24 AM by 80.99.199.138 -
Changed lines 31-38 from:
The deliverables of this task contains the source code of the de-novo assembly tool and the executable file properly instrumented to be able to run on the CloudDSM system. The input of the tool is the NGS sequencing reads in FASTQ format, and the algorithm parameters. The output of the tool is the assembled genome fragments in FASTA format, and the statistics of results (average sequence length, N50 value).
to:
Current applications of de novo genome assembly include the Velvet and SOAPdenovo tools, both are based on in-memory De Bruijn graphs. For this reason their use is prohibitive when assembling large (mammalian) genomes, and suboptimal for large metagenomes. The currently available large memory instances are limited to about 500GB of RAM, which is clearly insufficient for high coverage metagenomes.

We expect that the CloudDSM based de-novo solution:
* will be able to handle larger data sets
* produces more precise genome assembly on the large sequencing data
* results in candidate enzymes which are not visible in case of lower data amounts
* results in higher average length of the assembled sequences
* higher number of the enzyme candidates, and hence higher probability of success of the project
.
Changed lines 40-46 from:
Current applications of de novo genome assembly include the Velvet and SOAPdenovo tools, both are based on in-memory De Bruijn graphs. For this reason their use is prohibitive when assembling large (mammalian) genomes, and suboptimal for metagenomes. The currently available large memory instances are limited to about 500GB of RAM, which is definitely insufficient for high coverage metagenomes.

We expect that the CloudDSM based de-novo solution will be able to handle larger data sets, and produces more precise genome assembly on the large sequencing data, resulting in candidate enzymes which are not visible in case of lower data amounts. This is because the genome sequences
of rare species in the metagenomic sample are only visible in case of high data volume. The average length of the assembled sequences is also expected to be higher, which increases the number of the enzyme candidates, and hence the of success probability of the project.

The impact of the CloudDSM based solution on our customers is that they can use the NGS short read sequencing approach to genome assembly projects where they needed to use traditional and more expensive solutions before. In other cases we will be able to deliver more precise results, because we can feed higher data volume to the de-novo assembly step of our analysis pipeline.
to:
The impact of the CloudDSM based solution on our customers is that:
* they can use the NGS short read sequencing approach to genome assembly projects where they needed to use traditional and more expensive solutions before
* we can feed higher data volume to the de-novo assembly step of our analysis pipeline
* we will be able to deliver more precise results

* Milestone M6.: Month 12 -- Prototype of the CloudDSM based de-novo genome assembler tool
* Deliverable D6.: Month 12 -- Source code
of the de-novo assembly tool and the executable file properly instrumented to be able to run on the CloudDSM system
The input of the tool is
the NGS sequencing reads in FASTQ format, and the algorithm parameters. The output of the tool is the assembled genome fragments in FASTA format, and the statistics of results (average sequence length, N50 value).
Added lines 52-54:
* Milestone M6.: Month 24 -- Validation phase 1
* Deliverable D6.: Month 24 -- Documentation and results of the test run on the smaller dataset. Comparison of the results with current solution.
Changed lines 56-59 from:
The next step of the validation is to run the algorithm on an extended data set, and verify if novel fragments are assembled this way.
to:
The next step of the validation is to run the algorithm on an extended data set, and verify if novel fragments are assembled this way.

* Milestone M6.: Month 36 -- Validation phase 2
* Deliverable D6.: Month 36 -- Documentation and results of the test run on a large metagenomic dataset. Comparison of the results with the partial dataset, analysed with traditional tools
.
March 26, 2014, at 05:58 AM by 80.99.199.138 -
Changed lines 28-29 from:
* Task4: Develop a software solution for de-novo genome assembly using De-Brujin graphs, which utilizes the CloudDSM system. (LarkBio)
Metagenomic studies aim to identify novel enzyme candidates in bacterial communities, and has a variety of applications in medicine, biotechnology and agriculture. The sequence data to process is large, and come from the DNA of different species. More data means higher probability of covering also the least abundant of these species. In order to identify novel enzyme candidates, we have to run de-novo assembly algorithms to join the short reads building up longer fragments of DNA, and run gene predictions algorithms on these fragments. Current de-novo algorithms build up an in-memory graph processing all reads, then analyze this graph. The size of the available memory limits the number of reads that can be handled this way, limiting the enzyme candidates discovered in the sample. This is the problem that we address using the CloudDSM system.
to:
* Task4: Develop a software solution for de-novo genome assembly using De Brujin graphs, which utilizes the CloudDSM system. (LarkBio)
Metagenomic studies aim to identify novel enzyme candidates in bacterial communities, and has a variety of applications in medicine, biotechnology and agriculture. The sequence data to process is large, and come from the DNA of different species. More data means higher probability of covering also the least abundant of these species. In order to identify novel enzyme candidates, we have to run de-novo assembly algorithms to join the short reads building up longer fragments of DNA, and run gene predictions algorithms on these fragments. Current de-novo algorithms build up an in-memory graph processing all reads, then analyze this graph. The size of the available memory limits the number of reads that can be handled this way, limiting the enzyme candidates discovered in the sample. This is the problem that we address using the CloudDSM system.

The deliverables of this task contains the source code of the de-novo assembly tool and the executable file properly instrumented to be able to run on the CloudDSM system. The input of the tool is the NGS sequencing reads in FASTQ format, and the algorithm parameters. The output of the tool is the assembled genome fragments in FASTA format, and the statistics of results (average sequence length, N50 value).

Current applications of de novo genome assembly include the Velvet and SOAPdenovo tools, both are based on in-memory De Bruijn graphs. For this reason their use is prohibitive when assembling large (mammalian) genomes, and suboptimal for metagenomes. The currently available large memory instances are limited to about 500GB of RAM, which is definitely insufficient for high coverage metagenomes.

We expect that the CloudDSM based de-novo solution will be able to handle larger data sets, and produces more precise genome assembly on the large sequencing data, resulting in candidate enzymes which are not visible in case of lower data amounts. This is because the genome sequences of rare species in the metagenomic sample are only visible in case of high data volume. The average length of the assembled sequences is also expected to be higher, which increases the number of the enzyme candidates, and hence the of success probability of the project.

The impact of the CloudDSM based solution on our customers is that they can use the NGS short read sequencing approach to genome assembly projects where they needed to use traditional and more expensive solutions before. In other cases we will be able to deliver more precise results, because we can feed higher data volume to the de-novo assembly step of our analysis pipeline.
March 25, 2014, at 11:51 AM by 80.114.134.224 -
Added lines 1-35:
!! WP 7 End User Applications

* Task1: divide application into user client, computation kernels, and work division
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
* Task2: mock up using annotations
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --
* Task3: employ the various interfaces
* Milestone M6.: Month 12 --
* Deliverable D6.: Month 12 --
* Milestone M6.: Month 24 --
* Deliverable D6.: Month 24 --
* Milestone M6.: Month 36 --
* Deliverable D6.: Month 36 --

* Task3: employ the various interfaces
* Milestone M7.: Month

* Task4: Develop a software solution for de-novo genome assembly using De-Brujin graphs, which utilizes the CloudDSM system. (LarkBio)
Metagenomic studies aim to identify novel enzyme candidates in bacterial communities, and has a variety of applications in medicine, biotechnology and agriculture. The sequence data to process is large, and come from the DNA of different species. More data means higher probability of covering also the least abundant of these species. In order to identify novel enzyme candidates, we have to run de-novo assembly algorithms to join the short reads building up longer fragments of DNA, and run gene predictions algorithms on these fragments. Current de-novo algorithms build up an in-memory graph processing all reads, then analyze this graph. The size of the available memory limits the number of reads that can be handled this way, limiting the enzyme candidates discovered in the sample. This is the problem that we address using the CloudDSM system.

* Task5: Validate the solution on a dataset which can be solved using currently available tools. (LarkBio)
As the validation of the software, we run the de-novo assembly on a smaller data set using CloudDSM and plain EC2 instance, and compare the results.

* Task6: Validate the solution on large dataset, which is not feasible without CloudDSM. (LarkBio)
The next step of the validation is to run the algorithm on an extended data set, and verify if novel fragments are assembled this way.
March 25, 2014, at 11:50 AM by 80.114.134.224 -
Deleted lines 0-16:
!!!Work Package 7 -- Dissemination and Exploitation


* Task1: Make working system available to end customers (Gigas, DC)
* Milestone M7.: Month 3 -- Gigas makes rapid prototype system available to small set of alpha customers
* Milestone M7.: Month 6 -- Gigas makes final prototype system available to small set of beta customers
* Milestone M7.: Month 12 -- DC deploys its first prototype application on the prototype system hosted by Gigas
* Milestone M7.: Month 24 -- DC offers a fully developed application based on CloudDSM to its customers
* Milestone M7.: Month 36 -- DC offers the CloudDSM system for use by its customers and incubated companies to develop their own applications

* Task2: Create Development tools to enhance exploitation by DC customers and incubated companies
* Milestone M6.: Month 24 -- Completion of minimal set of development tools
* Deliverable D6.: Month 24 -- minimum set of tools needed by DC's alpha customers in order to develop CloudDSM applications
* Milestone M6.: Month 36 -- Completion of final set of development tools
* Deliverable D6.: Month 36 -- final set of tools needed by DC's main customers in order to develop CloudDSM applications

In this workpackage, Gigas will make the CloudDSM system available to its customers, and Douglas Connect will deploy a Drug Discovery product on top of CloudDSM hosted by Gigas. Douglas Connect will also direct a sub contractor to create end-user development tools that simplify writing applications that use the CloudDSM system. Douglas Connect will make those development tools available to its customers and the startups that it incubates.
March 19, 2014, at 05:53 AM by 80.114.135.137 -
Changed lines 1-13 from:
* Task1:
* Milestone M7.
: Month 12 --
* Deliverable D7.: Month 12 --

* Milestone M7.: Month 24 --
* Deliverable D7.: Month 24 --

* Milestone M7.: Month 36 --
* Deliverable D7.: Month 36 --

* Gigas will make the results of the project available to its customers
* Douglas Connect will use the results of
the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.

Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts?
to:
!!!Work Package 7 -- Dissemination and Exploitation


* Task1
: Make working system available to end customers (Gigas, DC)
* Milestone
M7.: Month 3 -- Gigas makes rapid prototype system available to small set of alpha customers
* Milestone M7.: Month 6 -- Gigas makes final prototype system available to small set of beta customers
* Milestone M7.: Month 12 -- DC deploys its first prototype application on
the prototype system hosted by Gigas
* Milestone M7.: Month 24 -- DC offers a fully developed application based on CloudDSM to its customers
* Milestone M7.: Month 36 -- DC offers the CloudDSM system for use by its customers and incubated companies to develop their own applications

* Task2: Create Development tools to enhance exploitation by DC customers and incubated companies
* Milestone M6.: Month 24 -- Completion of minimal set of development tools
* Deliverable D6.: Month 24 -- minimum set of tools needed by DC's alpha customers in order to develop CloudDSM applications
* Milestone M6.: Month 36 -- Completion of final set of development tools
* Deliverable D6.: Month 36 -- final set of tools needed by DC's main customers in order to develop CloudDSM applications

In this workpackage, Gigas will make the CloudDSM system available to its customers, and Douglas Connect will deploy a Drug Discovery product on top of CloudDSM hosted by Gigas. Douglas Connect will also direct a sub contractor to create end-user development tools that simplify writing applications that use the CloudDSM system. Douglas Connect will make those development tools available to its customers and the startups that it incubates.
March 19, 2014, at 05:28 AM by 80.114.135.137 -
Added lines 1-13:

* Task1:
* Milestone M7.: Month 12 --
* Deliverable D7.: Month 12 --
* Milestone M7.: Month 24 --
* Deliverable D7.: Month 24 --
* Milestone M7.: Month 36 --
* Deliverable D7.: Month 36 --

* Gigas will make the results of the project available to its customers
* Douglas Connect will use the results of the project within its Drug Discovery product, and make the project technology available to the startups that it incubates.

Question: "Are the 2 Power servers we have in our lab sufficient for the CloudDSM project?" Answer: probably, anyone have thoughts?