pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs
Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences...
| Published in: | IEEE Transactions on Parallel and Distributed Systems, Vol. 23, No. 10, pp. 1923-1933 |
|---|---|
| Main authors: | Changjun Wu, Ananth Kalyanaraman, William R. Cannon |
| Format: | Journal Article |
| Language: | English |
| Published: | IEEE, 1 October 2012 |
| Subjects: | |
| ISSN: | 1045-9219 |
| Online access: | Full text |
| Abstract | Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet, there is currently no robust support to parallelize this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm that is suited for detecting homology on large data sets using distributed memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048 processor distributed memory cluster for a wide range of inputs ranging from as small as 20,000 sequences to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies. |
|---|---|
| Author | Cannon, William R.; Kalyanaraman, Ananth; Wu, Changjun |
| Author_xml | 1. Changjun Wu (changjun.wu@xerox.com), Xerox Res. Center, Webster, NY, USA; 2. Ananth Kalyanaraman (ananth@eecs.wsu.edu), Sch. of Electr. Eng. & Comput. Sci., Washington State Univ., Pullman, WA, USA; 3. William R. Cannon (william.cannon@pnnl.gov), Pacific Northwest Nat. Lab., Richland, WA, USA |
| CODEN | ITDSEO |
| ContentType | Journal Article |
| DOI | 10.1109/TPDS.2012.19 |
| Discipline | Engineering Computer Science |
| EndPage | 1933 |
| ExternalDocumentID | 10_1109_TPDS_2012_19 6127863 |
| Genre | orig-research |
| ISICitedReferencesCount | 5 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 10 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html |
| PageCount | 11 |
| PublicationCentury | 2000 |
| PublicationDate | 2012-10-01 |
| PublicationDecade | 2010 |
| PublicationTitle | IEEE transactions on parallel and distributed systems |
| PublicationTitleAbbrev | TPDS |
| PublicationYear | 2012 |
| Publisher | IEEE |
| StartPage | 1923 |
| SubjectTerms | Amino acids; Computational modeling; DNA; Dynamic programming; hierarchical master-worker paradigm; Image edge detection; Parallel protein sequence homology detection; parallel sequence graph construction; producer-consumer model; Protein sequence |
| Title | pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs |
| URI | https://ieeexplore.ieee.org/document/6127863 |
| Volume | 23 |
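
The abstract describes pGraph as a hybrid of a hierarchical multiple-master/worker model and a producer-consumer model. The record gives no implementation details, so the following is only a minimal single-process sketch of the producer-consumer half of that idea, under stated assumptions: one producer thread generates candidate sequence pairs into a bounded buffer, and several consumer threads score each pair and record an edge when the score passes a threshold. The names `shared_kmers` and `build_homology_graph`, the k-mer overlap score (a toy stand-in for a real alignment), and all parameter defaults are hypothetical and are not taken from the paper, which targets distributed-memory clusters rather than threads.

```python
import queue
import threading
from itertools import combinations


def shared_kmers(a, b, k=3):
    """Toy stand-in for an alignment score: count of distinct shared k-mers."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def build_homology_graph(seqs, n_consumers=4, min_shared=2, buffer_size=8):
    """Return the set of index pairs (i, j), i < j, judged similar.

    Hypothetical illustration only: a producer enqueues every pair,
    consumers dequeue pairs and evaluate them concurrently.
    """
    work = queue.Queue(maxsize=buffer_size)  # bounded buffer between the two roles
    edges = set()
    edges_lock = threading.Lock()

    def producer():
        # Work generation: here simply all pairs; a real system would filter.
        for pair in combinations(range(len(seqs)), 2):
            work.put(pair)  # blocks while the buffer is full
        for _ in range(n_consumers):
            work.put(None)  # one stop marker per consumer

    def consumer():
        while True:
            item = work.get()
            if item is None:
                break
            i, j = item
            if shared_kmers(seqs[i], seqs[j]) >= min_shared:
                with edges_lock:
                    edges.add((i, j))

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return edges
```

The bounded queue is the design point worth noting: it decouples the rate at which pairs are generated from the rate at which they are evaluated, which is one plausible way to absorb the irregular cost of individual comparisons that the abstract mentions.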