Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

Bibliographic details
Published in: Scientific Reports, Vol. 9, No. 1, p. 3577
Main authors: Asgari, Ehsaneddin; McHardy, Alice C.; Mofrad, Mohammad R. K.
Format: Journal Article
Language: English
Published: London, Nature Publishing Group UK, 5 March 2019
Nature Publishing Group
ISSN: 2045-2322
Abstract In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or a domain-specific dataset and then applied to a set of unseen sequences. The resulting representation can serve as input to downstream machine learning tasks in protein bioinformatics. Here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it in three settings: (1) a comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based evaluation of the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while achieving F1 scores comparable to those of other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short lists of candidate motifs for further experimental investigation. In the classification-based evaluation, the extracted motifs could reliably detect integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores.
(ii) ProtVecX: we extend the k-mer based protein vector (ProtVec) embedding to a variable-length protein embedding using PPE sub-sequences. We show that the new embedding can marginally outperform ProtVec in enzyme prediction and toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when combined with raw amino acid k-mer features.
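The BPE-inspired idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`learn_merges`, `apply_merge`, `segment`, `sample_segmentations`) and the toy sequences are hypothetical. The abstract does not specify how alternative segmentations are sampled, so drawing a random number of merge steps is an assumption made here purely for illustration.

```python
from collections import Counter
import random


def apply_merge(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn BPE-style merge steps: repeatedly fuse the most frequent adjacent pair."""
    # Start from single residues (one-letter amino-acid codes).
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def sample_segmentations(sequence, merges, n_samples, seed=0):
    """ASSUMPTION: obtain multiple segmentations by applying a random
    number of merge steps (uniform over 0..len(merges)) per sample."""
    rng = random.Random(seed)
    return [
        segment(sequence, merges[: rng.randint(0, len(merges))])
        for _ in range(n_samples)
    ]


# Toy training set (hypothetical sequences sharing the fragment "RGD").
training = ["MKRGDS", "AARGDL", "RGDRGD"]
merges = learn_merges(training, num_merges=2)
print(merges)
print(segment("TTRGDTT", merges))
print(sample_segmentations("TTRGDTT", merges, n_samples=3))
```

Replaying the merge list on an unseen sequence mirrors the learn-then-apply workflow described above. Truncating the merge list yields coarser or finer segmentations of the same sequence, which is one simple way to obtain several segmentations per sequence.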
ArticleNumber 3577
Author Asgari, Ehsaneddin
McHardy, Alice C.
Mofrad, Mohammad R. K.
Author_xml – sequence: 1
  givenname: Ehsaneddin
  orcidid: 0000-0002-6518-7238
  surname: Asgari
  fullname: Asgari, Ehsaneddin
  organization: Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Computational Biology of Infection Research, Helmholtz Centre for Infection Research
– sequence: 2
  givenname: Alice C.
  orcidid: 0000-0003-2370-3430
  surname: McHardy
  fullname: McHardy, Alice C.
  organization: Computational Biology of Infection Research, Helmholtz Centre for Infection Research
– sequence: 3
  givenname: Mohammad R. K.
  orcidid: 0000-0001-7004-4859
  surname: Mofrad
  fullname: Mofrad, Mohammad R. K.
  email: mofrad@berkeley.edu
  organization: Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Lab
BackLink https://www.ncbi.nlm.nih.gov/pubmed/30837494$$D View this record in MEDLINE/PubMed
https://www.osti.gov/servlets/purl/1559191$$D View this record in Osti.gov
ContentType Journal Article
Copyright The Author(s) 2019
This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
CorporateAuthor Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
DOI 10.1038/s41598-019-38746-w
Discipline Biology
EISSN 2045-2322
ExternalDocumentID PMC6401088
1559191
30837494
10_1038_s41598_019_38746_w
Genre Journal Article
ISSN 2045-2322
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
License Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Notes AC02-05CH11231
USDOE Office of Science (SC)
ORCID 0000-0002-6518-7238
0000-0003-2370-3430
0000-0001-7004-4859
OpenAccessLink https://www.proquest.com/docview/2188201737?pq-origsite=%requestingapplication%
PMID 30837494
PublicationCentury 2000
PublicationDate 2019-03-05
PublicationDateYYYYMMDD 2019-03-05
PublicationDate_xml – month: 03
  year: 2019
  text: 2019-03-05
  day: 05
PublicationDecade 2010
PublicationPlace London
PublicationPlace_xml – name: London
– name: England
– name: United States
PublicationTitle Scientific reports
PublicationTitleAbbrev Sci Rep
PublicationTitleAlternate Sci Rep
PublicationYear 2019
Publisher Nature Publishing Group UK
Nature Publishing Group
Publisher_xml – name: Nature Publishing Group UK
– name: Nature Publishing Group
References Shams, Mofrad (CR47) 2017; 113
Kapp (CR71) 2017; 7
Consortium (CR51) 2016; 45
Emini, Hughes, Perlow, Boger (CR63) 1985; 55
CR38
CR37
Bailey (CR25) 2009; 37
Kelil, Dubreuil, Levy, Michnick (CR29) 2014; 9
CR36
Jamali, Jamali, Mofrad (CR46) 2012; 244
CR35
Kim, Lee, Kim, Kang (CR40) 2018; 11
Redhead, Bailey (CR28) 2007; 8
Asgari, Mofrad (CR5) 2015; 10
Guruprasad, Reddy, Pandit (CR62) 1990; 4
Searls (CR3) 1993; 2
Wood, Salzberg (CR14) 2014; 15
Ochsenhirt, Kokkoli, McCarthy, Tirrell (CR72) 2006; 27
Jolma (CR10) 2013; 152
Min, Lee, Yoon (CR39) 2017; 18
CR6
Waterman, Smith, Beyer (CR2) 1976; 20
Alipanahi, Delong, Weirauch, Frey (CR11) 2015; 33
Dinkel (CR21) 2011; 40
Vihinen, Torkkila, Riikonen (CR61) 1994; 19
CR7
CR48
CR45
Davey, Haslam, Shields, Edwards (CR22) 2011; 39
Prytuliak, Pfeiffer, Habermann (CR32) 2018; 19
CR44
Emanuelsson, Brunak, Von Heijne, Nielsen (CR54) 2007; 2
CR43
CR42
Jaeger, Fulle, Turk (CR41) 2018; 58
Guan, Hynes (CR67) 1990; 60
Gacesa, Barlow, Long (CR55) 2016; 2
Gage (CR17) 1994; 12
CR19
Frith, Saunders, Kobe, Bailey (CR24) 2008; 4
CR18
CR16
Li (CR57) 2017; 1
Searls (CR4) 2002; 420
CR58
Jamali, Jamali, Mehrbod, Mofrad (CR53) 2011; 287
Tang (CR34) 2014; 1
Hopp, Woods (CR65) 1981; 78
Plow, Pierschbacher, Ruoslahti, Marguerie, Ginsberg (CR70) 1985; 82
Levenshtein (CR1) 1966; 10
Plow, Haas, Zhang, Loftus, Smith (CR69) 2000; 275
Chen, Kolahi, Mofrad (CR50) 2009; 97
Mehrbod, Mofrad (CR49) 2013; 9
Awazu (CR12) 2016; 33
Jahed, Soheilypour, Peyro, Mofrad (CR52) 2016; 129.17
Giancarlo, Rombo, Utro (CR13) 2015; 31
Yandell, Majoros (CR8) 2002; 3
Edwards, Davey, Shields (CR23) 2007; 2
Bernhofer (CR31) 2017; 46
CR27
CR26
Collobert (CR33) 2011; 12
Mehdi, Sehgal, Kobe, Bailey, Bodén (CR30) 2013; 29
CR66
Grabherr (CR9) 2011; 29
CR20
Kullback, Leibler (CR59) 1951; 22
Kyte, Doolittle (CR64) 1982; 157
Ruoslahti (CR68) 1996; 12
CR60
Jungo, Bairoch (CR56) 2005; 45
Asgari, Garakani, McHardy, Mofrad (CR15) 2018; 34
38746_CR16
M Bernhofer (38746_CR31) 2017; 46
38746_CR58
R Prytuliak (38746_CR32) 2018; 19
P Gage (38746_CR17) 1994; 12
DB Searls (38746_CR4) 2002; 420
R Giancarlo (38746_CR13) 2015; 31
DE Wood (38746_CR14) 2014; 15
Y Li (38746_CR57) 2017; 1
A Kelil (38746_CR29) 2014; 9
H Dinkel (38746_CR21) 2011; 40
EF Plow (38746_CR70) 1985; 82
M Vihinen (38746_CR61) 1994; 19
R Collobert (38746_CR33) 2011; 12
J Kyte (38746_CR64) 1982; 157
K Guruprasad (38746_CR62) 1990; 4
S Jaeger (38746_CR41) 2018; 58
38746_CR48
E Asgari (38746_CR5) 2015; 10
TP Hopp (38746_CR65) 1981; 78
EF Plow (38746_CR69) 2000; 275
MD Yandell (38746_CR8) 2002; 3
B Alipanahi (38746_CR11) 2015; 33
38746_CR44
38746_CR45
AM Mehdi (38746_CR30) 2013; 29
38746_CR42
O Emanuelsson (38746_CR54) 2007; 2
38746_CR43
TL Bailey (38746_CR25) 2009; 37
DB Searls (38746_CR3) 1993; 2
MG Grabherr (38746_CR9) 2011; 29
T Jamali (38746_CR53) 2011; 287
EA Emini (38746_CR63) 1985; 55
38746_CR37
38746_CR38
38746_CR35
38746_CR36
VI Levenshtein (38746_CR1) 1966; 10
J-L Guan (38746_CR67) 1990; 60
TG Kapp (38746_CR71) 2017; 7
U Consortium (38746_CR51) 2016; 45
MS Waterman (38746_CR2) 1976; 20
SE Ochsenhirt (38746_CR72) 2006; 27
H Shams (38746_CR47) 2017; 113
R Gacesa (38746_CR55) 2016; 2
E Redhead (38746_CR28) 2007; 8
A Jolma (38746_CR10) 2013; 152
E Asgari (38746_CR15) 2018; 34
RJ Edwards (38746_CR23) 2007; 2
MC Frith (38746_CR24) 2008; 4
Z Jahed (38746_CR52) 2016; 129.17
S Kullback (38746_CR59) 1951; 22
NE Davey (38746_CR22) 2011; 39
38746_CR26
38746_CR27
D Tang (38746_CR34) 2014; 1
HS Chen (38746_CR50) 2009; 97
F Jungo (38746_CR56) 2005; 45
S Kim (38746_CR40) 2018; 11
38746_CR66
38746_CR7
38746_CR20
38746_CR6
38746_CR60
E Ruoslahti (38746_CR68) 1996; 12
A Awazu (38746_CR12) 2016; 33
S Min (38746_CR39) 2017; 18
Y Jamali (38746_CR46) 2012; 244
38746_CR19
M Mehrbod (38746_CR49) 2013; 9
38746_CR18
References_xml – ident: CR45
– volume: 275
  start-page: 21785
  year: 2000
  end-page: 21788
  ident: CR69
  article-title: Ligand binding to integrins
  publication-title: J. Biol. Chem.
  doi: 10.1074/jbc.R000003200
– volume: 1
  start-page: 1555
  year: 2014
  end-page: 1565
  ident: CR34
  article-title: Learning sentiment-specific word embedding for twitter sentiment classification
  publication-title: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
  doi: 10.3115/v1/P14-1146
– volume: 29
  start-page: 39
  year: 2013
  end-page: 46
  ident: CR30
  article-title: Dlocalmotif: A discriminative approach for discovering local motifs in protein sequences
  publication-title: Bioinforma.
  doi: 10.1093/bioinformatics/bts654
– volume: 33
  start-page: 831
  year: 2015
  end-page: 838
  ident: CR11
  article-title: Predicting the sequence specificities of dna-and rna-binding proteins by deep learning
  publication-title: Nat. Biotechnol.
  doi: 10.1038/nbt.3300
– ident: CR16
– volume: 4
  start-page: e1000071
  year: 2008
  ident: CR24
  article-title: Discovering sequence motifs with arbitrary insertions and deletions
  publication-title: PLoS Compu. Biol.
  doi: 10.1371/journal.pcbi.1000071
– volume: 2
  start-page: e90
  year: 2016
  ident: CR55
  article-title: Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions
  publication-title: PeerJ Comput. Sci.
  doi: 10.7717/peerj-cs.90
– volume: 113
  start-page: 948
  issue: 4
  year: 2017
  end-page: 956
  ident: CR47
  article-title: Interaction with α-actinin induces a structural kink in the transmembrane domain of β3-integrin and impairs signal transduction
  publication-title: Biophysical Journal
  doi: 10.1016/j.bpj.2017.06.064
– volume: 10
  start-page: e0141287
  year: 2015
  ident: CR5
  article-title: Continuous distributed representation of biological sequences for deep proteomics and genomics
  publication-title: PloS One
  doi: 10.1371/journal.pone.0141287
– volume: 244
  start-page: 264
  year: 2012
  end-page: 278
  ident: CR46
  article-title: An Agent Based Model of Integrin Clustering: Exploring the Role of Ligand Clustering, Integrin Homo-Oligomerization, Integrin-Ligand Affinity, Membrane Crowdedness and Ligand Mobility
  publication-title: Journal of Computational Physics
  doi: 10.1016/j.jcp.2012.09.010
– ident: CR35
– volume: 37
  start-page: W202
  year: 2009
  end-page: W208
  ident: CR25
  article-title: Meme suite: Tools for motif discovery and searching
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkp335
– ident: CR58
– volume: 29
  start-page: 644
  year: 2011
  end-page: 652
  ident: CR9
  article-title: Full-length transcriptome assembly from RNA-Seq data without a reference genome
  publication-title: Nat. Biotechnol.
  doi: 10.1038/nbt.1883
– volume: 12
  start-page: 23
  year: 1994
  end-page: 38
  ident: CR17
  article-title: A new algorithm for data compression
  publication-title: The C Users J.
– ident: CR42
– volume: 34
  start-page: i32
  year: 2018
  end-page: i42
  ident: CR15
  article-title: MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples
  publication-title: Bioinforma.
  doi: 10.1093/bioinformatics/bty296
– volume: 46
  start-page: D503
  year: 2017
  end-page: D508
  ident: CR31
  article-title: Nlsdb—major update for database of nuclear localization signals and nuclear export signals
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkx1021
– ident: CR19
– volume: 19
  start-page: 141
  year: 1994
  end-page: 149
  ident: CR61
  article-title: Accuracy of protein flexibility predictions
  publication-title: Proteins
  doi: 10.1002/prot.340190207
– volume: 78
  start-page: 3824
  year: 1981
  end-page: 3828
  ident: CR65
  article-title: Prediction of protein antigenic determinants from amino acid sequences
  publication-title: Proc. Natl. Acad. Sci. USA
  doi: 10.1073/pnas.78.6.3824
– volume: 15
  year: 2014
  ident: CR14
  article-title: Kraken: Ultrafast metagenomic sequence classification using exact alignments
  publication-title: Genome Biol.
  doi: 10.1186/gb-2014-15-3-r46
– volume: 3
  start-page: 601
  year: 2002
  ident: CR8
  article-title: Genomics and natural language processing
  publication-title: Nat. Rev. Genet.
  doi: 10.1038/nrg861
– volume: 12
  start-page: 697
  year: 1996
  end-page: 715
  ident: CR68
  article-title: RGD and other recognition sequences for integrins
  publication-title: Annu. Rev. Cell Dev. Biol.
  doi: 10.1146/annurev.cellbio.12.1.697
– ident: CR60
– ident: CR36
– volume: 19
  year: 2018
  ident: CR32
  article-title: Slalom, a flexible method for the identification and statistical analysis of overlapping continuous sequence elements in sequence-and time-series data
  publication-title: BMC Bioinformatics
  doi: 10.1186/s12859-018-2020-x
– volume: 10
  start-page: 707
  year: 1966
  end-page: 710
  ident: CR1
  article-title: Binary codes capable of correcting deletions, insertions, and reversals
  publication-title: Soviet Physics Doklady
– volume: 20
  start-page: 367
  year: 1976
  end-page: 387
  ident: CR2
  article-title: Some biological sequence metrics
  publication-title: Adv. Math. (NY)
  doi: 10.1016/0001-8708(76)90202-4
– ident: CR26
– volume: 82
  start-page: 8057
  year: 1985
  end-page: 8061
  ident: CR70
  article-title: The effect of Arg-Gly-Asp-containing peptides on fibrinogen and von Willebrand factor binding to platelets
  publication-title: Proc. Natl. Acad. Sci. USA
  doi: 10.1073/pnas.82.23.8057
– volume: 33
  start-page: 42
  year: 2016
  end-page: 48
  ident: CR12
  article-title: Prediction of nucleosome positioning by the incorporation of frequencies and distributions of three different nucleotide segment lengths into a general pseudo k-tuple nucleotide composition
  publication-title: Bioinforma.
  doi: 10.1093/bioinformatics/btw562
– volume: 60
  start-page: 53
  year: 1990
  end-page: 61
  ident: CR67
  article-title: Lymphoid cells recognize an alternatively spliced segment of fibronectin via the integrin receptor α4β1
  publication-title: Cell
  doi: 10.1016/0092-8674(90)90715-Q
– volume: 12
  start-page: 2493
  year: 2011
  end-page: 2537
  ident: CR33
  article-title: Natural language processing (almost) from scratch
  publication-title: J. Mach. Learn. Res.
– volume: 8
  year: 2007
  ident: CR28
  article-title: Discriminative motif discovery in DNA and protein sequences using the DEME algorithm
  publication-title: BMC Bioinforma.
  doi: 10.1186/1471-2105-8-385
– ident: CR18
– ident: CR43
– ident: CR66
– volume: 18
  start-page: 851
  year: 2017
  end-page: 869
  ident: CR39
  article-title: Deep learning in bioinformatics
  publication-title: Brief. Bioinform.
– volume: 2
  start-page: 47
  year: 1993
  end-page: 120
  ident: CR3
  article-title: The computational linguistics of biological sequences
  publication-title: Artificial Intelligence and Molecular Biology
– ident: CR37
– volume: 55
  start-page: 836
  year: 1985
  end-page: 839
  ident: CR63
  article-title: Induction of hepatitis a virus-neutralizing antibody by a virus-specific synthetic peptide
  publication-title: J. Virol.
– volume: 45
  start-page: 293
  year: 2005
  end-page: 301
  ident: CR56
  article-title: Tox-Prot, the toxin protein annotation program of the Swiss-Prot protein knowledgebase
  publication-title: Toxicon
  doi: 10.1016/j.toxicon.2004.10.018
– volume: 7
  year: 2017
  ident: CR71
  article-title: A comprehensive evaluation of the activity and selectivity profile of ligands for RGD-binding integrins
  publication-title: Sci. Rep.
  doi: 10.1038/srep39805
– volume: 40
  start-page: D242
  year: 2011
  end-page: D251
  ident: CR21
  article-title: Elm—the database of eukaryotic linear motifs
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkr1064
– volume: 58
  start-page: 27
  year: 2018
  end-page: 35
  ident: CR41
  article-title: Mol2vec: Unsupervised machine learning approach with chemical intuition
  publication-title: J. Chem. Inf. Model.
  doi: 10.1021/acs.jcim.7b00616
– volume: 27
  start-page: 3863
  year: 2006
  end-page: 3874
  ident: CR72
  article-title: Effect of RGD secondary structure and the synergy site PHSRN on cell adhesion, spreading and specific integrin engagement
  publication-title: Biomaterials
  doi: 10.1016/j.biomaterials.2005.12.012
– ident: CR6
– ident: CR27
– volume: 97
  start-page: 3095
  issue: 12
  year: 2009
  end-page: 104
  ident: CR50
  article-title: Phosphorylation Facilitates the Integrin Binding of Filamin Under Force
  publication-title: Biophysical Journal
  doi: 10.1016/j.bpj.2009.08.059
– ident: CR44
– volume: 157
  start-page: 105
  year: 1982
  end-page: 132
  ident: CR64
  article-title: A simple method for displaying the hydropathic character of a protein
  publication-title: J. Mol. Biol.
  doi: 10.1016/0022-2836(82)90515-0
– ident: CR48
– volume: 129
  start-page: 3219
  issue: 17
  year: 2016
  end-page: 3229
  ident: CR52
  article-title: The LINC and NPC relationship: it’s complicated!
  publication-title: J Cell Sci
  doi: 10.1242/jcs.184184
– volume: 9
  start-page: e1002948
  issue: 3
  year: 2013
  ident: CR49
  article-title: Localized Lipid Packing of Transmembrane Domains Impedes Integrin Clustering
  publication-title: PLoS Computational Biology
  doi: 10.1371/journal.pcbi.1002948
– volume: 2
  start-page: 953
  year: 2007
  end-page: 971
  ident: CR54
  article-title: Locating proteins in the cell using TargetP, SignalP and related tools
  publication-title: Nat. Protoc.
  doi: 10.1038/nprot.2007.131
– volume: 39
  start-page: W56
  year: 2011
  end-page: W60
  ident: CR22
  article-title: SLiMSearch 2.0: biological context for short linear motifs in proteins
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkr402
– volume: 31
  start-page: 2939
  year: 2015
  end-page: 2946
  ident: CR13
  article-title: Epigenomic k-mer dictionaries: shedding light on how sequence composition influences nucleosome positioning
  publication-title: Bioinforma.
  doi: 10.1093/bioinformatics/btv295
– ident: CR38
– volume: 1
  start-page: 760
  year: 2017
  end-page: 769
  ident: CR57
  article-title: DEEPre: sequence-based enzyme EC number prediction by deep learning
  publication-title: Bioinforma.
– volume: 2
  start-page: e967
  year: 2007
  ident: CR23
  article-title: SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins
  publication-title: PLoS One
  doi: 10.1371/journal.pone.0000967
– volume: 11
  year: 2018
  ident: CR40
  article-title: Mut2Vec: distributed representation of cancerous mutations
  publication-title: BMC Med. Genomics
  doi: 10.1186/s12920-018-0349-7
– volume: 22
  start-page: 79
  year: 1951
  end-page: 86
  ident: CR59
  article-title: On information and sufficiency
  publication-title: The Annals of Mathematical Statistics
  doi: 10.1214/aoms/1177729694
– volume: 152
  start-page: 327
  year: 2013
  end-page: 339
  ident: CR10
  article-title: DNA-binding specificities of human transcription factors
  publication-title: Cell
  doi: 10.1016/j.cell.2012.12.009
– ident: CR7
– volume: 9
  start-page: e106081
  year: 2014
  ident: CR29
  article-title: Fast and accurate discovery of degenerate linear motifs in protein sequences
  publication-title: PLoS One
  doi: 10.1371/journal.pone.0106081
– volume: 287
  start-page: 233
  year: 2011
  end-page: 286
  ident: CR53
  article-title: Nuclear Pore Complex: Biochemistry and Biophysics of Nucleocytoplasmic Transport in Health and Disease
  publication-title: International Review of Cell and Molecular Biology
  doi: 10.1016/B978-0-12-386043-9.00006-2
– volume: 420
  start-page: 211
  year: 2002
  ident: CR4
  article-title: The language of genes
  publication-title: Nature
  doi: 10.1038/nature01255
– ident: CR20
– volume: 45
  start-page: D158
  year: 2016
  end-page: D169
  ident: CR51
  article-title: UniProt: the universal protein knowledgebase
  publication-title: Nucleic Acids Res.
– volume: 4
  start-page: 155
  year: 1990
  end-page: 161
  ident: CR62
  article-title: Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting stability of a protein from its primary sequence
  publication-title: Protein Eng. Des. Sel.
  doi: 10.1093/protein/4.2.155
– volume: 129
  start-page: 3219
  issue: 17
  year: 2016
  ident: 38746_CR52
  publication-title: J Cell Sci
  doi: 10.1242/jcs.184184
– volume: 78
  start-page: 3824
  year: 1981
  ident: 38746_CR65
  publication-title: Proc. Natl. Acad. Sci. USA
  doi: 10.1073/pnas.78.6.3824
– volume: 420
  start-page: 211
  year: 2002
  ident: 38746_CR4
  publication-title: Nature
  doi: 10.1038/nature01255
– ident: 38746_CR35
– ident: 38746_CR60
– ident: 38746_CR19
  doi: 10.18653/v1/P16-1162
– volume: 1
  start-page: 1555
  year: 2014
  ident: 38746_CR34
  publication-title: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
  doi: 10.3115/v1/P14-1146
– volume: 12
  start-page: 697
  year: 1996
  ident: 38746_CR68
  publication-title: Annu. Rev. Cell Dev. Biol.
  doi: 10.1146/annurev.cellbio.12.1.697
– volume: 19
  year: 2018
  ident: 38746_CR32
  publication-title: BMC Bioinformatics
  doi: 10.1186/s12859-018-2020-x
– volume: 39
  start-page: W56
  year: 2011
  ident: 38746_CR22
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkr402
– volume: 15
  year: 2014
  ident: 38746_CR14
  publication-title: Genome Biol.
  doi: 10.1186/gb-2014-15-3-r46
– ident: 38746_CR43
  doi: 10.1101/255505
– volume: 287
  start-page: 233
  year: 2011
  ident: 38746_CR53
  publication-title: International Review of Cell and Molecular Biology
  doi: 10.1016/B978-0-12-386043-9.00006-2
– ident: 38746_CR7
  doi: 10.1145/3107411.3107489
– volume: 33
  start-page: 831
  year: 2015
  ident: 38746_CR11
  publication-title: Nat. Biotechnol.
  doi: 10.1038/nbt.3300
– ident: 38746_CR16
  doi: 10.1093/bioinformatics/bty954
– volume: 19
  start-page: 141
  year: 1994
  ident: 38746_CR61
  publication-title: Proteins
  doi: 10.1002/prot.340190207
– volume: 3
  start-page: 601
  year: 2002
  ident: 38746_CR8
  publication-title: Nat. Rev. Genet.
  doi: 10.1038/nrg861
– volume: 31
  start-page: 2939
  year: 2015
  ident: 38746_CR13
  publication-title: Bioinforma.
  doi: 10.1093/bioinformatics/btv295
– volume: 37
  start-page: W202
  year: 2009
  ident: 38746_CR25
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkp335
– ident: 38746_CR44
  doi: 10.1115/1.4038812
– volume: 27
  start-page: 3863
  year: 2006
  ident: 38746_CR72
  publication-title: Biomaterials
  doi: 10.1016/j.biomaterials.2005.12.012
– ident: 38746_CR36
– volume: 10
  start-page: 707
  year: 1966
  ident: 38746_CR1
  publication-title: Soviet Physics Doklady
– ident: 38746_CR48
  doi: 10.1039/C5IB00133A
– volume: 7
  year: 2017
  ident: 38746_CR71
  publication-title: Sci. Rep.
  doi: 10.1038/srep39805
– volume: 152
  start-page: 327
  year: 2013
  ident: 38746_CR10
  publication-title: Cell
  doi: 10.1016/j.cell.2012.12.009
– volume: 275
  start-page: 21785
  year: 2000
  ident: 38746_CR69
  publication-title: J. Biol. Chem.
  doi: 10.1074/jbc.R000003200
– volume: 12
  start-page: 23
  year: 1994
  ident: 38746_CR17
  publication-title: The C Users J.
– ident: 38746_CR58
  doi: 10.1007/978-1-4939-3167-5_2
– volume: 34
  start-page: i32
  year: 2018
  ident: 38746_CR15
  publication-title: Bioinforma.
  doi: 10.1093/bioinformatics/bty296
– volume: 11
  year: 2018
  ident: 38746_CR40
  publication-title: BMC Med. Genomics
  doi: 10.1186/s12920-018-0349-7
– ident: 38746_CR45
  doi: 10.1016/j.bpj.2013.07.055
– volume: 9
  start-page: e106081
  year: 2014
  ident: 38746_CR29
  publication-title: PLoS One
  doi: 10.1371/journal.pone.0106081
– volume: 46
  start-page: D503
  year: 2017
  ident: 38746_CR31
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkx1021
– volume: 8
  year: 2007
  ident: 38746_CR28
  publication-title: BMC Bioinforma.
  doi: 10.1186/1471-2105-8-385
– volume: 4
  start-page: e1000071
  year: 2008
  ident: 38746_CR24
  publication-title: PLoS Comput. Biol.
  doi: 10.1371/journal.pcbi.1000071
– volume: 244
  start-page: 264
  year: 2012
  ident: 38746_CR46
  publication-title: Journal of Computational Physics
  doi: 10.1016/j.jcp.2012.09.010
– volume: 45
  start-page: D158
  year: 2016
  ident: 38746_CR51
  publication-title: Nucleic Acids Res.
– volume: 97
  start-page: 3095
  issue: 12
  year: 2009
  ident: 38746_CR50
  publication-title: Biophysical Journal
  doi: 10.1016/j.bpj.2009.08.059
– volume: 2
  start-page: e90
  year: 2016
  ident: 38746_CR55
  publication-title: PeerJ Comput. Sci.
  doi: 10.7717/peerj-cs.90
– volume: 10
  start-page: e0141287
  year: 2015
  ident: 38746_CR5
  publication-title: PLoS One
  doi: 10.1371/journal.pone.0141287
– ident: 38746_CR42
  doi: 10.1101/286096
– volume: 2
  start-page: 953
  year: 2007
  ident: 38746_CR54
  publication-title: Nat. Protoc.
  doi: 10.1038/nprot.2007.131
– volume: 4
  start-page: 155
  year: 1990
  ident: 38746_CR62
  publication-title: Protein Eng. Des. Sel.
  doi: 10.1093/protein/4.2.155
– ident: 38746_CR18
– volume: 22
  start-page: 79
  year: 1951
  ident: 38746_CR59
  publication-title: The Annals of Mathematical Statistics
  doi: 10.1214/aoms/1177729694
– volume: 12
  start-page: 2493
  year: 2011
  ident: 38746_CR33
  publication-title: J. Mach. Learn. Res.
– volume: 60
  start-page: 53
  year: 1990
  ident: 38746_CR67
  publication-title: Cell
  doi: 10.1016/0092-8674(90)90715-Q
– ident: 38746_CR26
  doi: 10.1093/nar/gkx810
– volume: 157
  start-page: 105
  year: 1982
  ident: 38746_CR64
  publication-title: J. Mol. Biol.
  doi: 10.1016/0022-2836(82)90515-0
– volume: 58
  start-page: 27
  year: 2018
  ident: 38746_CR41
  publication-title: J. Chem. Inf. Model.
  doi: 10.1021/acs.jcim.7b00616
– volume: 82
  start-page: 8057
  year: 1985
  ident: 38746_CR70
  publication-title: Proc. Natl. Acad. Sci. USA
  doi: 10.1073/pnas.82.23.8057
– volume: 29
  start-page: 644
  year: 2011
  ident: 38746_CR9
  publication-title: Nat. Biotechnol.
  doi: 10.1038/nbt.1883
– volume: 40
  start-page: D242
  year: 2011
  ident: 38746_CR21
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkr1064
– volume: 18
  start-page: 851
  year: 2017
  ident: 38746_CR39
  publication-title: Brief. Bioinform.
– ident: 38746_CR6
  doi: 10.18653/v1/N16-1030
– volume: 55
  start-page: 836
  year: 1985
  ident: 38746_CR63
  publication-title: J. Virol.
  doi: 10.1128/jvi.55.3.836-839.1985
– ident: 38746_CR38
  doi: 10.1093/bioinformatics/btx823
– volume: 2
  start-page: 47
  year: 1993
  ident: 38746_CR3
  publication-title: Artificial Intelligence and Molecular Biology
– volume: 45
  start-page: 293
  year: 2005
  ident: 38746_CR56
  publication-title: Toxicon
  doi: 10.1016/j.toxicon.2004.10.018
– volume: 33
  start-page: 42
  year: 2016
  ident: 38746_CR12
  publication-title: Bioinforma.
  doi: 10.1093/bioinformatics/btw562
– volume: 20
  start-page: 367
  year: 1976
  ident: 38746_CR2
  publication-title: Adv. Math. (NY)
  doi: 10.1016/0001-8708(76)90202-4
– ident: 38746_CR20
– ident: 38746_CR37
  doi: 10.18653/v1/W16-1208
– ident: 38746_CR66
  doi: 10.1162/tacl_a_00051
– volume: 2
  start-page: e967
  year: 2007
  ident: 38746_CR23
  publication-title: PLoS One
  doi: 10.1371/journal.pone.0000967
– volume: 29
  start-page: 39
  year: 2013
  ident: 38746_CR30
  publication-title: Bioinforma.
  doi: 10.1093/bioinformatics/bts654
– volume: 1
  start-page: 760
  year: 2017
  ident: 38746_CR57
  publication-title: Bioinforma.
– ident: 38746_CR27
  doi: 10.1093/bib/bbx026
– volume: 9
  start-page: e1002948
  issue: 3
  year: 2013
  ident: 38746_CR49
  publication-title: PLoS Computational Biology
  doi: 10.1371/journal.pcbi.1002948
– volume: 113
  start-page: 948
  issue: 4
  year: 2017
  ident: 38746_CR47
  publication-title: Biophysical Journal
  doi: 10.1016/j.bpj.2017.06.064
SourceID pubmedcentral
osti
proquest
pubmed
crossref
springer
SourceType Open Access Repository
Aggregation Database
Index Database
Enrichment Source
Publisher
StartPage 3577
SubjectTerms 631/114/1305
631/114/2184
631/114/2403
631/114/2410
Algorithms
Amino acid sequence
Amino acids
BASIC BIOLOGICAL SCIENCES
Biofilms
Bioinformatics
Classification
Compression
Embedding
Humanities and Social Sciences
Integrins
Learning algorithms
Localization
Machine learning
multidisciplinary
Proteins
Science
Science (multidisciplinary)
Segmentation
Toxins
Title Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)
URI https://link.springer.com/article/10.1038/s41598-019-38746-w
https://www.ncbi.nlm.nih.gov/pubmed/30837494
https://www.proquest.com/docview/2188201737
https://www.proquest.com/docview/2188586238
https://www.osti.gov/servlets/purl/1559191
https://pubmed.ncbi.nlm.nih.gov/PMC6401088
Volume 9