pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 23, No. 10, pp. 1923-1933
Main Authors: Wu, Changjun; Kalyanaraman, Ananth; Cannon, William R.
Format: Journal Article
Language: English
Published: IEEE, 1 October 2012
ISSN: 1045-9219
Online Access: Full text
Abstract: Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet there is currently no robust support for parallelizing this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm suited to detecting homology on large data sets using distributed-memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and the producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048-processor distributed-memory cluster for inputs ranging from 20,000 to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.
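To make the abstract's producer-consumer idea concrete, the following is a minimal, single-machine sketch and not the pGraph implementation. It assumes brute-force generation of all sequence pairs as the "work generation" step, a plain Smith-Waterman score with illustrative match/mismatch/gap values instead of a substitution matrix, and Python threads in place of distributed-memory processes. The names `local_alignment_score`, `build_homology_graph`, `threshold`, `n_workers`, and `batch_size` are hypothetical and chosen only for this illustration.

```python
import queue
import threading
from itertools import combinations


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def build_homology_graph(seqs, threshold, n_workers=4, batch_size=8):
    """Return the set of edges (i, j), i < j, whose score meets the threshold."""
    tasks = queue.Queue(maxsize=16)  # bounded buffer between producer and consumers
    edges = set()
    lock = threading.Lock()

    def producer():
        # Assumption: every pair is a candidate (brute-force work generation).
        batch = []
        for pair in combinations(range(len(seqs)), 2):
            batch.append(pair)
            if len(batch) == batch_size:
                tasks.put(batch)
                batch = []
        if batch:
            tasks.put(batch)
        for _ in range(n_workers):
            tasks.put(None)  # one stop signal per consumer

    def consumer():
        while True:
            batch = tasks.get()
            if batch is None:
                break
            found = [
                (i, j)
                for i, j in batch
                if local_alignment_score(seqs[i], seqs[j]) >= threshold
            ]
            with lock:
                edges.update(found)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return edges
```

Calling `build_homology_graph(["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG"], threshold=12)` links only the first two sequences. Batching pairs before queueing them is one simple way to even out the irregular cost of individual alignments; the hierarchical multiple-master layer described in the abstract is not modeled here.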
Authors:
  1. Changjun Wu, Xerox Research Center, Webster, NY, USA (changjun.wu@xerox.com)
  2. Ananth Kalyanaraman, School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA (ananth@eecs.wsu.edu)
  3. William R. Cannon, Pacific Northwest National Laboratory, Richland, WA, USA (william.cannon@pnnl.gov)
DOI: 10.1109/TPDS.2012.19
Subjects: Amino acids; Computational modeling; DNA; Dynamic programming; Hierarchical master-worker paradigm; Image edge detection; Parallel protein sequence homology detection; Parallel sequence graph construction; Producer-consumer model; Protein sequence
URL: https://ieeexplore.ieee.org/document/6127863