ACCELERATING COMPARATIVE GENOMICS WORKFLOWS IN A DISTRIBUTED ENVIRONMENT WITH OPTIMIZED DATA PARTITIONING AND WORKFLOW FUSION.

Detailed bibliography
Title: ACCELERATING COMPARATIVE GENOMICS WORKFLOWS IN A DISTRIBUTED ENVIRONMENT WITH OPTIMIZED DATA PARTITIONING AND WORKFLOW FUSION.
Authors: CHOUDHURY, OLIVIA; HAZEKAMP, NICHOLAS L.; THAIN, DOUGLAS; EMRICH, SCOTT J.
Source: Scalable Computing: Practice & Experience; Mar2015, Vol. 16 Issue 1, p53-69, 17p
Subjects: COMPARATIVE genomics, WORKFLOW, CLOUD computing software, DATA analysis, COST
Abstract: The advent of next-generation sequencing technology has generated massive amounts of biological data at unprecedented rates. Comparative genomics applications often require compute-intensive tools for subsequent analysis of high-throughput data. Although cloud computing infrastructure plays an important role in this respect, the pressure from such computationally expensive tasks can be further alleviated using efficient data partitioning and workflow fusion. Here, we implement a workflow-based model for parallelizing the data-intensive tasks of genome alignment and variant calling with BWA and GATK's HaplotypeCaller. We explore three approaches to partitioning data (granularity-based, individual-based, and alignment-based) and how each affects the run time. We observe granularity-based partitioning for BWA and alignment-based partitioning for HaplotypeCaller to be the optimal choices for the pipeline. We further discuss the methods and impact of workflow fusion on performance by considering different levels of fusion and how each affects our results. We identify the various open problems encountered, such as understanding the extent of parallelism, using heterogeneous environments without a shared file system, and determining the granularity of inputs, and provide insights into addressing them. Finally, we report significant performance improvements, from 12 days to under 2 hours, when running the BWA-GATK pipeline using partitioning and fusion. [ABSTRACT FROM AUTHOR]
Copyright of Scalable Computing: Practice & Experience is the property of Scalable Computing: Practice & Experience and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
Header DbId: edb
DbLabel: Complementary Index
An: 100872901
PubType: Academic Journal
PubTypeId: academicJournal
PLink https://erproxy.cvtisr.sk/sfx/access?url=https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edb&AN=100872901
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.12694/scpe.v16i1.1060
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 17
        StartPage: 53
    Subjects:
      – SubjectFull: COMPARATIVE genomics
        Type: general
      – SubjectFull: WORKFLOW
        Type: general
      – SubjectFull: CLOUD computing software
        Type: general
      – SubjectFull: DATA analysis
        Type: general
      – SubjectFull: COST
        Type: general
    Titles:
      – TitleFull: ACCELERATING COMPARATIVE GENOMICS WORKFLOWS IN A DISTRIBUTED ENVIRONMENT WITH OPTIMIZED DATA PARTITIONING AND WORKFLOW FUSION.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: CHOUDHURY, OLIVIA
      – PersonEntity:
          Name:
            NameFull: HAZEKAMP, NICHOLAS L.
      – PersonEntity:
          Name:
            NameFull: THAIN, DOUGLAS
      – PersonEntity:
          Name:
            NameFull: EMRICH, SCOTT J.
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 03
              Text: Mar2015
              Type: published
              Y: 2015
          Identifiers:
            – Type: issn-print
              Value: 18951767
          Numbering:
            – Type: volume
              Value: 16
            – Type: issue
              Value: 1
          Titles:
            – TitleFull: Scalable Computing: Practice & Experience
              Type: main