Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)
| Published in: | Scientific Reports, Vol. 9, No. 1, p. 3577 |
|---|---|
| Main authors: | Asgari, Ehsaneddin; McHardy, Alice C.; Mofrad, Mohammad R. K. |
| Format: | Journal Article |
| Language: | English |
| Published: | London: Nature Publishing Group UK, 5 March 2019 |
| ISSN: | 2045-2322 |
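The peptide-pair encoding (PPE) summarized in this record's abstract adapts byte-pair encoding (BPE) to amino-acid sequences. What follows is a minimal sketch of the deterministic BPE-style merge learning that this idea builds on, not the authors' implementation. The function names (`learn_merges`, `apply_merge`, `segment`) and the toy sequences are assumptions made for illustration, and the sampling framework that gives PPE multiple segmentations per sequence is not reproduced here.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of merge operations (hypothetical helper).

    Each step counts adjacent token pairs across all sequences and merges
    the most frequent pair, as in plain BPE.
    """
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # nothing left to merge
        best_pair, _count = pair_counts.most_common(1)[0]
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Toy training set (illustrative only, not Swiss-Prot).
    learned = learn_merges(["RGDRGD", "ARGDK"], num_merges=2)
    print(segment("KRGDA", learned))
```

Learning the merges once and replaying them on new input mirrors the abstract's statement that segmentation steps can be learned on one set of sequences and then applied to unseen ones.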
| Abstract | In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning task in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. DiMotif generally obtained high recall scores while achieving an F1 score comparable to other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer-based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features. |
|---|---|
| ArticleNumber | 3577 |
| Author | Asgari, Ehsaneddin; McHardy, Alice C.; Mofrad, Mohammad R. K. |
| Author_xml | 1. Asgari, Ehsaneddin (ORCID: 0000-0002-6518-7238); Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California; Computational Biology of Infection Research, Helmholtz Centre for Infection Research. 2. McHardy, Alice C. (ORCID: 0000-0003-2370-3430); Computational Biology of Infection Research, Helmholtz Centre for Infection Research. 3. Mofrad, Mohammad R. K. (ORCID: 0000-0001-7004-4859; email: mofrad@berkeley.edu); Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California; Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Lab |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/30837494 (view this record in MEDLINE/PubMed); https://www.osti.gov/servlets/purl/1559191 (view this record in Osti.gov) |
| CitedBy_id | crossref_primary_10_1038_s42256_022_00457_9 crossref_primary_10_1007_s00438_019_01570_y crossref_primary_10_1007_s00726_022_03228_3 crossref_primary_10_1016_j_compbiomed_2024_109598 crossref_primary_10_1099_mgen_0_000637 crossref_primary_10_2174_0929867327666200907141016 crossref_primary_10_1186_s12859_019_3220_8 crossref_primary_10_1088_2632_2153_ad3ee4 crossref_primary_10_1371_journal_pone_0216636 crossref_primary_10_1016_j_csbj_2021_05_039 crossref_primary_10_1016_j_plantsci_2020_110527 crossref_primary_10_3389_fchem_2023_1107400 crossref_primary_10_1080_19420862_2023_2285904 crossref_primary_10_1016_j_bbadis_2022_166466 crossref_primary_10_1093_nargab_lqae103 crossref_primary_10_3390_a14010028 crossref_primary_10_3390_app13052858 crossref_primary_10_1186_s13321_024_00884_3 crossref_primary_10_1093_database_baaf027 crossref_primary_10_1109_TCBB_2020_2999262 crossref_primary_10_1016_j_bpj_2024_11_002 crossref_primary_10_1109_TCBB_2019_2911677 crossref_primary_10_1109_TCBB_2021_3137325 crossref_primary_10_1016_j_procs_2024_06_106 crossref_primary_10_3389_fgene_2022_854571 crossref_primary_10_3390_cancers16223768 crossref_primary_10_1109_TPAMI_2021_3095381 crossref_primary_10_1109_JBHI_2024_3400521 crossref_primary_10_3389_fcell_2022_863825 crossref_primary_10_3389_fphys_2019_01501 crossref_primary_10_1016_j_csbj_2021_03_022 crossref_primary_10_1128_mmbr_00022_25 crossref_primary_10_1371_journal_pone_0290899 crossref_primary_10_1093_bib_bbab146 crossref_primary_10_1093_nargab_lqac012 crossref_primary_10_1038_s42256_023_00637_1 crossref_primary_10_1109_TCBB_2020_2973563 crossref_primary_10_1109_TCBB_2021_3108718 crossref_primary_10_1007_s11427_024_2906_3 crossref_primary_10_3390_foods14122014 crossref_primary_10_1093_nar_gkab354 crossref_primary_10_3389_fimmu_2023_1228873 crossref_primary_10_2174_1574893618666230612161210 crossref_primary_10_7717_peerj_8965 |
| Cites_doi | 10.1074/jbc.R000003200 10.3115/v1/P14-1146 10.1093/bioinformatics/bts654 10.1038/nbt.3300 10.1371/journal.pcbi.1000071 10.7717/peerj-cs.90 10.1016/j.bpj.2017.06.064 10.1371/journal.pone.0141287 10.1016/j.jcp.2012.09.010 10.1093/nar/gkp335 10.1038/nbt.1883 10.1093/bioinformatics/bty296 10.1093/nar/gkx1021 10.1002/prot.340190207 10.1073/pnas.78.6.3824 10.1186/gb-2014-15-3-r46 10.1038/nrg861 10.1146/annurev.cellbio.12.1.697 10.1186/s12859-018-2020-x 10.1016/0001-8708(76)90202-4 10.1073/pnas.82.23.8057 10.1093/bioinformatics/btw562 10.1016/0092-8674(90)90715-Q 10.1186/1471-2105-8-385 10.1016/j.toxicon.2004.10.018 10.1038/srep39805 10.1093/nar/gkr1064 10.1021/acs.jcim.7b00616 10.1016/j.biomaterials.2005.12.012 10.1016/j.bpj.2009.08.059 10.1016/0022-2836(82)90515-0 10.1242/jcs.184184 10.1371/journal.pcbi.1002948 10.1038/nprot.2007.131 10.1093/nar/gkr402 10.1093/bioinformatics/btv295 10.1371/journal.pone.0000967 10.1186/s12920-018-0349-7 10.1214/aoms/1177729694 10.1016/j.cell.2012.12.009 10.1371/journal.pone.0106081 10.1016/B978-0-12-386043-9.00006-2 10.1038/nature01255 10.1093/protein/4.2.155 10.18653/v1/P16-1162 10.1101/255505 10.1145/3107411.3107489 10.1093/bioinformatics/bty954 10.1115/1.4038812 10.1039/C5IB00133A 10.1007/978-1-4939-3167-5_2 10.1016/j.bpj.2013.07.055 10.1101/286096 10.1093/nar/gkx810 10.18653/v1/N16-1030 10.1128/jvi.55.3.836-839.1985 10.1093/bioinformatics/btx823 10.18653/v1/W16-1208 10.1162/tacl_a_00051 10.1093/bib/bbx026 |
| ContentType | Journal Article |
| Copyright | The Author(s) 2019. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
| CorporateAuthor | Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States) |
| DOI | 10.1038/s41598-019-38746-w |
| Discipline | Biology |
| EISSN | 2045-2322 |
| ExternalDocumentID | PMC6401088 1559191 30837494 10_1038_s41598_019_38746_w |
| Genre | Journal Article |
| ISICitedReferencesCount | 54 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000460381600150&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 2045-2322 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 1 |
| Language | English |
| License | Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
| Notes | AC02-05CH11231; USDOE Office of Science (SC) |
| ORCID | 0000-0002-6518-7238; 0000-0003-2370-3430; 0000-0001-7004-4859 |
| OpenAccessLink | https://www.proquest.com/docview/2188201737?pq-origsite=%requestingapplication% |
| PMID | 30837494 |
| PQID | 2188201737 |
| PQPubID | 2041939 |
| PublicationCentury | 2000 |
| PublicationDate | 2019-03-05 |
| PublicationDateYYYYMMDD | 2019-03-05 |
| PublicationDecade | 2010 |
| PublicationPlace | London |
| PublicationPlace_xml | London; England; United States |
| PublicationTitle | Scientific reports |
| PublicationTitleAbbrev | Sci Rep |
| PublicationTitleAlternate | Sci Rep |
| PublicationYear | 2019 |
| Publisher | Nature Publishing Group UK; Nature Publishing Group |
| StartPage | 3577 |
| SubjectTerms | 631/114/1305; 631/114/2184; 631/114/2403; 631/114/2410; Algorithms; Amino acid sequence; Amino acids; BASIC BIOLOGICAL SCIENCES; Biofilms; Bioinformatics; Classification; Compression; Embedding; Humanities and Social Sciences; Integrins; Learning algorithms; Localization; Machine learning; multidisciplinary; Proteins; Science; Science (multidisciplinary); Segmentation; Toxins |
| Title | Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX) |
| URI | https://link.springer.com/article/10.1038/s41598-019-38746-w https://www.ncbi.nlm.nih.gov/pubmed/30837494 https://www.proquest.com/docview/2188201737 https://www.proquest.com/docview/2188586238 https://www.osti.gov/servlets/purl/1559191 https://pubmed.ncbi.nlm.nih.gov/PMC6401088 |
| Volume | 9 |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Probabilistic+variable-length+segmentation+of+protein+sequences+for+discriminative+motif+discovery+%28DiMotif%29+and+sequence+embedding+%28ProtVecX%29&rft.jtitle=Scientific+reports&rft.au=Asgari%2C+Ehsaneddin&rft.au=McHardy%2C+Alice+C&rft.au=Mofrad%2C+Mohammad+R+K&rft.date=2019-03-05&rft.issn=2045-2322&rft.eissn=2045-2322&rft.volume=9&rft.issue=1&rft.spage=3577&rft_id=info:doi/10.1038%2Fs41598-019-38746-w&rft.externalDBID=NO_FULL_TEXT |