The language of proteins: NLP, machine learning & protein sequences

Bibliographic details
Published in: Computational and structural biotechnology journal, Vol. 19, pp. 1750-1758
Main authors: Ofer, Dan; Brandes, Nadav; Linial, Michal
Format: Journal Article
Language: English
Published: Netherlands, Elsevier B.V., 01.01.2021
Research Network of Computational and Structural Biotechnology
Elsevier
ISSN: 2001-0370
Online access: Full text
Abstract Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
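As a rough illustration of two of the classic concepts named in the abstract (k-mers/n-grams and bag-of-words), the following Python sketch splits a protein sequence into overlapping k-mers and counts them. The function names `kmers` and `bag_of_kmers`, the choice of k = 3, and the toy sequence are assumptions made for illustration only; they are not taken from the article.

```python
from collections import Counter


def kmers(sequence, k=3):
    """Return all overlapping k-mers of a sequence, in order of appearance."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kmers(sequence, k=3):
    """Count k-mer occurrences, discarding their order (bag-of-words style)."""
    return Counter(kmers(sequence, k))


# Hypothetical toy sequence written in one-letter amino-acid codes.
toy_sequence = "MKVLAAGMKV"
counts = bag_of_kmers(toy_sequence, k=3)
print(counts.most_common(3))
```

The resulting counts can serve as a fixed-length feature vector once a k-mer vocabulary is chosen. Positional information is lost in this representation, which is one of the limitations that the embedding and language-model approaches mentioned in the abstract aim to address.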
Author Brandes, Nadav
Linial, Michal
Ofer, Dan
Author_xml – sequence: 1
  givenname: Dan
  surname: Ofer
  fullname: Ofer, Dan
  organization: Medtronic, Inc, Israel
– sequence: 2
  givenname: Nadav
  surname: Brandes
  fullname: Brandes, Nadav
  email: nadav.brandes@mail.huji.ac.il
  organization: The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
– sequence: 3
  givenname: Michal
  surname: Linial
  fullname: Linial, Michal
  organization: Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
BackLink https://www.ncbi.nlm.nih.gov/pubmed/33897979 (View this record in MEDLINE/PubMed)
ContentType Journal Article
Copyright 2021 The Author(s)
DOI 10.1016/j.csbj.2021.03.022
DatabaseName ScienceDirect Open Access Titles
Elsevier:ScienceDirect:Open Access
CrossRef
PubMed
MEDLINE - Academic
AGRICOLA
AGRICOLA - Academic
PubMed Central (Full Participant titles)
DOAJ Directory of Open Access Journals
Database_xml – sequence: 1
  dbid: DOA
  name: DOAJ Directory of Open Access Journals
  url: https://www.doaj.org/
  sourceTypes: Open Website
– sequence: 2
  dbid: NPM
  name: PubMed
  url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 3
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EISSN 2001-0370
EndPage 1758
ExternalDocumentID oai_doaj_org_article_92521a99034d4afe9fbf0dd2f99b7a0a
PMC8050421
33897979
10_1016_j_csbj_2021_03_022
S2001037021000945
Genre Journal Article
Review
ISICitedReferencesCount 211
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000684934900004&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 2001-0370
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords Deep learning
Contextualized embedding
Word2vec
Bag of words
Transformer
Artificial neural networks
Natural language processing
BERT
Tokenization
Word embedding
Language models
Bioinformatics
Language English
License This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
OpenAccessLink https://doaj.org/article/92521a99034d4afe9fbf0dd2f99b7a0a
PMID 33897979
PQID 2518736718
PQPubID 23479
PageCount 9
PublicationCentury 2000
PublicationDate 2021-01-01
PublicationDateYYYYMMDD 2021-01-01
PublicationDate_xml – month: 01
  year: 2021
  text: 2021-01-01
  day: 01
PublicationDecade 2020
PublicationPlace Netherlands
PublicationPlace_xml – name: Netherlands
PublicationTitle Computational and structural biotechnology journal
PublicationTitleAlternate Comput Struct Biotechnol J
PublicationYear 2021
Publisher Elsevier B.V
Research Network of Computational and Structural Biotechnology
Elsevier
Publisher_xml – name: Elsevier B.V
– name: Research Network of Computational and Structural Biotechnology
– name: Elsevier
References Barla, Jurman, Riccadonna, Merler, Chierici, Furlanello (b0050) 2008; 9
Zhang, Bengio, Hardt, Recht, Vinyals (b0575) 2017
Sunarso, Freddie, Srikumar Venugopal, and Federico Lauro. 2013. “Scalable Protein Sequence Similarity Search Using Locality-Sensitive Hashing and MapReduce.” ArXiv:1310.0883 [Cs], October.
Alley, Khimulya, Biswas, AlQuraishi, Church (b0010) 2019; 16
Dutta, Chen (b9010) 2007; 23
Akhtar, Southey, Andrén, Sweedler, Rodriguez-Zas (b9005) 2012; 11
Asgari, Mofrad (b0040) 2015; 10
Krizhevsky, Sutskever, Hinton (b0215) 2012
Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. “Sequence to Sequence Learning with Neural Networks.” In Advances in Neural Information Processing Systems, 3104–12.
Ofer, Dan, and Michal Linial. 2015. “ProFET: Feature Engineering Captures High-Level Protein Functions.” Bioinformatics (Oxford, England), June.
Raiman, Raiman (b0370) 2018
.
Vig, Jesse, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, and Nazneen Fatema Rajani. 2020. “BERTology Meets Biology: Interpreting Attention in Protein Language Models,” June.
Zaheer, Manzil, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, et al. 2020. “Big Bird: Transformers for Longer Sequences.” ArXiv:2007.14062 [Cs, Stat], July.
Howard, Ruder (b0180) 2018
Jiang, Oron, Clark, Bankapur, D’Andrea, Lepore (b0195) 2016; 17
Papanikolaou, Pavlopoulos, Theodosiou, Iliopoulos (b0320) 2015; 74
Kryshtafovych, Schwede, Topf, Fidelis, Moult (b0220) 2019; 87
Wang, You, Yang, Li, Jiang, Zhou (b0510) 2019; 8
Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. 2016. “A Simple but Tough-to-Beat Baseline for Sentence Embeddings,” November.
Klein, Kim, Deng, Senellart, Rush (b9020) 2017
Leslie, Christina, Eleazar Eskin, and William Stafford Noble. 2002. “The Spectrum Kernel: A String Kernel for SVM Protein Classification.” Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 575 (January): 564–75.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–25. Berlin, Germany: Association for Computational Linguistics. 10.18653/v1/P16-1162.
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. Language Models Are Few-Shot Learners. ArXiv:2005.14165 [Cs], July. http://arxiv.org/abs/2005.14165.
Weathers, Paulaitis, Woolf, Hoh (b0515) 2004; 576
Pe’er, Felder, Man, Silman, Sussman, Beckmann (b0330) 2004; 54
Solan, Horn, Ruppin, Edelman (b0455) 2005
Nematzadeh, Meylan, Griffiths (b0300) 2017
Rao, Roshan M., Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. 2021. “MSA Transformer.” BioRxiv, February, 2021.02.12.430858. 10.1101/2021.02.12.430858.
Rocklin, Chidyausiku, Goreshnik, Ford, Houliston, Lemak (b0395) 2017; 357
Halevy, Norvig, Pereira (b0155) 2009; 24
Singer, Uriel, Kira Radinsky, and Eric Horvitz. 2020. “On Biases of Attention in Scientific Discovery.” Edited by Jonathan Wren. Bioinformatics, December, btaa1036.
Rives, Alexander, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. 2019. “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences.” 10.1101/622803.
Yan, Zhang, Yaning Li, Xia, Zhou (b0535) 2020; 367
Yao, Liang, Chengsheng Mao, and Yuan Luo. 2019. “KG-BERT: BERT for Knowledge Graph Completion.” ArXiv:1909.03193 [Cs], September.
Wu, Yang, Liszka, Lee, Batzilla, Wernick (b0525) 2020; 9
Chen, Ting, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020. “Big Self-Supervised Models Are Strong Semi-Supervised Learners.” Advances in Neural Information Processing Systems 33.
Feng, Zhangyin, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, et al. 2020. “CodeBERT: A Pre-Trained Model for Programming and Natural Languages,” February.
Madani, Ali, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, and Richard Socher. 2020. “ProGen: Language Modeling for Protein Generation.” BioRxiv, January, 2020.03.07.982272. 10.1101/2020.03.07.982272.
Razavian, Azizpour, Sullivan, Carlsson, Royal (b9025) 2014
Kudo, Taku. 2018. “Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.” ArXiv:1804.10959 [Cs], April.
McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher. 2018. “Learned in Translation: Contextualized Word Vectors.” ArXiv:1708.00107 [Cs], June.
Ofer, Linial, Ofer, Linial (b0305) 2014; 30
Almagro Armenteros, José Juan, Casper Kaae Sønderby, Søren Kaae Sønderby, Henrik Nielsen, and Ole Winther. 2017. “DeepLoc: Prediction of Protein Subcellular Localization Using Deep Learning.” Edited by John Hancock. Bioinformatics 33 (21): 3387–95. 10.1093/bioinformatics/btx431.
Bileschi, Belanger, Bryant, Sanderson, Brandon Carter, Sculley (b0070) 2019
Senior, Evans, Jumper, Kirkpatrick, Sifre, Green (b0430) 2020; 577
Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” ArXiv:1912.01703 [Cs, Stat], December.
Strodthoff, Wagner, Wenzel, Samek (b0470) 2020; 36
Hie, Zhong, Berger, Bryson (b0165) 2021; 371
Rao, Roshan, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, and Yun S. Song. 2019. “Evaluating Protein Transfer Learning with TAPE,” June.
Schweiger, Linial (b0425) 2010; 5
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez (b0495) 2017; 30
Yang, Dai, Yang, Carbonell, Salakhutdinov, Quoc (b0545) 2019; 32
Smith, Noah A. 2019. “Contextual Word Representations: A Contextual Introduction,” February.
Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, et al. 2020. “Rethinking Attention with Performers.” ArXiv:2009.14794 [Cs, Stat], September. http://arxiv.org/abs/2009.14794.
Strait, Dewey (b0465) 1996; 71
Peterson, Kondev, Theriot, Phillips (b0345) 2009; 25
Ptitsyn (b0355) 1991; 285
Ruder (b0400) 2018
Ofer, Dan. 2016. “Machine Learning for Protein Function.” ArXiv:1603.02021 [q-Bio], March.
Höglund, Dönnes, Blum, Adolph, Kohlbacher (b0175) 2006; 22
Raffel, Shazeer, Roberts, Lee, Narang, Matena (b0365) 2020; 21
Bojanowski, Grave, Joulin, Mikolov (b0075) 2017; 5
Devlin, J., Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In NAACL-HLT. 10.18653/v1/N19-1423.
Askenazi, Marto, Linial (b0045) 2010; 10
Demis Hassabis. 2020. “High Accuracy Protein Structure Prediction Using Deep Learning.” Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book), December.
Allam, Nagy, Thoma, Krauthammer (b0005) 2019; 9
Shannon (b0440) 1951; 30
Almagro Armenteros, Jose Juan, Alexander Rosenberg Johansen, Ole Winther, and Henrik Nielsen. Language Modelling for Biological Sequences – Curated Datasets and Baselines. BioRxiv 2020. March, 2020.03.09.983585. 10.1101/2020.03.09.983585.
Lan, Chen, Goodman, Gimpel, Sharma, Soricut (b0235) 2020
Varshavsky, Roy, Menachem Fromer, Amit Man, and Michal Linial. 2007. “When Less Is More : Improving Classification of Protein Families with a Minimal Set of Global Features,” 12–24.
Cozzetto, Domenico, Federico Minneci, Hannah Currant, and David T. Jones. 2016. “FFPred 3: Feature-Based Function Prediction for All Gene Ontology Domains.” Sci Rep 6 (August). 10.1038/srep31865.
Wang, Singh, Michael, Hill, Levy, Bowman (b0505) 2018
Schnoes, Ream, Thorman, Babbitt, Friedberg (b0420) 2013; 9
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners,” 24.
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” ArXiv:1907.11692 [Cs], July.
Savojardo, Castrense, Pier Luigi Martelli, Piero Fariselli, and Rita Casadio. 2018. “DeepSig: Deep Learning Improves Signal Peptide Detection in Proteins.” Edited by Alfonso Valencia. Bioinformatics 34 (10): 1690–96.
Koumakis (b0210) 2020; 18
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. “Bag of Tricks for Efficient Text Classification.” ArXiv:1607.01759 [Cs], August.
Yang, Kevin K, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. 2018. “Learned Protein Embeddings for Machine Learning.” Edited by Jonathan Wren. Bioinformatics 34 (15): 2642–48. 10.1093/bioinformatics/bty178.
Janin, Joël, Kim Henrick, John Moult, Lynn Ten Eyck, Michael J. E. Sternberg, Sandor Vajda, Ilya Vakser, and Shoshana J. Wodak. 2003. CAPRI: A Critical Assessment of PRedicted Interactions. Proteins: Struct Funct Bioinformatics 52 (1): 2–9. 10.1002/prot.10381.
Bepler, Tristan, Bonnie Berger. 2019. “Learning Protein Sequence Embeddings Using Information from Structure.” ArXiv:1902.08661 [Cs, q-Bio, Stat], October. http://arxiv.org/abs/1902.08661.
Keskar, Nitish Shirish, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. “CTRL: A Conditional Transformer Language Model for Controllable Generation.” ArXiv:1909.05858 [Cs], September.
Murphy, Wallqvist, Levy (b0290) 2000; 13
Littmann, Maria, Michael Heinzinger, Christian Dallago, Tobias Olenyi, and & Burkhard Rost. 2020. “Embeddings from Deep Learning Transfer GO Annotations beyond Homology.” BioRxiv, September, 2020.09.04.282814. 10.1101/2020.09.04.282814.
Goldberg, Levy (b0150) 2014
Yuille, Alan L., and Chenxi Liu. 2020. “Deep Nets: What Have They Ever Done for Vision?” ArXiv:1805.04025 [Cs], November.
Pierse, Jingwen (b0350) 2020
Mignan, Broccardo (b0275) 2019; 574
Qin, Luo, Deng, Shu, Zhu, Griss (b9000) 2021; 232
Mikolov, Chen, Corrado, Dean (b0280) 2013; 1–9
Gillis, Jesse, Pa
References_xml – volume: 74
  start-page: 47
  year: 2015
  end-page: 53
  ident: b0320
  article-title: Protein–protein interaction predictions using text mining methods
– year: 2018
  ident: b0370
  article-title: DeepType: Multilingual entity linking by neural type system evolution
– volume: 30
  start-page: 5998
  year: 2017
  end-page: 6008
  ident: b0495
  article-title: Attention is all you need
  publication-title: Adv Neural Inf Process Syst
– volume: 11
  start-page: 6044
  year: 2012
  end-page: 6055
  ident: b9005
  article-title: Evaluation of Database Search Programs for Accurate Detection of Neuropeptides in Tandem Mass Spectrometry Experiments
  publication-title: J Proteome Res
– volume: 87
  start-page: 1011
  year: 2019
  end-page: 1020
  ident: b0220
  article-title: Critical assessment of methods of protein structure prediction (casp)—round xiii
  publication-title: Proteins Struct Funct Bioinf
– reference: Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. “Deep Contextualized Word Representations.” ArXiv:1802.05365 [Cs], March.
– reference: Yuille, Alan L., and Chenxi Liu. 2020. “Deep Nets: What Have They Ever Done for Vision?” ArXiv:1805.04025 [Cs], November.
– year: 2018
  ident: b0400
  article-title: NLP’s ImageNet moment has arrived
  publication-title: Gradient.
– reference: Clark, K., Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. “ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators.” ArXiv abs/2003.10555.
– year: 2014
  ident: b0150
  article-title: Word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method
– volume: 9
  year: 2013
  ident: b0420
  article-title: Biases in the experimental annotations of protein function and their effect on our understanding of protein function space
  publication-title: PLoS Comput Biol
– volume: 9
  start-page: 9277
  year: 2019
  ident: b0005
  article-title: Neural networks versus logistic regression for 30 days all-cause readmission prediction
  publication-title: Sci Rep
– volume: 21
  start-page: 1
  year: 2020
  end-page: 67
  ident: b0365
  article-title: Exploring the limits of transfer learning with a unified text-to-text transformer
  publication-title: J Machine Learning Res
– reference: Sunarso, Freddie, Srikumar Venugopal, and Federico Lauro. 2013. “Scalable Protein Sequence Similarity Search Using Locality-Sensitive Hashing and MapReduce.” ArXiv:1310.0883 [Cs], October.
– volume: 160
  start-page: 481
  year: 2009
  end-page: 486
  ident: b0485
  article-title: The origin of the genetic code and of the earliest oligopeptides
  publication-title: Res Microbiol
– volume: 54
  start-page: 20
  year: 2004
  end-page: 40
  ident: b0330
  article-title: Proteomic Signatures: Amino Acid and Oligopeptide Compositions Differentiate among Phyla
  publication-title: Proteins
– reference: Yu, Lijia, Deepak Kumar Tanwar, Emanuel Diego S. Penha, Yuri I. Wolf, Eugene V. Koonin, and Malay Kumar Basu. 2019. “Grammar of Protein Domain Architectures.” Proceedings of the National Academy of Sciences 116 (9): 3636–45. 10.1073/pnas.1814684116.
– year: 2012
  ident: b0215
  article-title: ImageNet classification with deep convolutional neural networks
  publication-title: ImageNet Classification with Deep Convolutional Neural Networks
– reference: Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners,” 24.
– volume: 32
  year: 2019
  ident: b0545
  article-title: XLNet: Generalized autoregressive pretraining for language understanding
  publication-title: Adv Neural Inf Process Syst
– reference: Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–25. Berlin, Germany: Association for Computational Linguistics. 10.18653/v1/P16-1162.
– reference: Almagro Armenteros, Jose Juan, Alexander Rosenberg Johansen, Ole Winther, and Henrik Nielsen. Language Modelling for Biological Sequences – Curated Datasets and Baselines. BioRxiv 2020. March, 2020.03.09.983585. 10.1101/2020.03.09.983585.
– reference: Ji, Yanrong, Zhihan Zhou, Han Liu, and Ramana V Davuluri. 2021. “DNABERT: Pre-Trained Bidirectional Encoder Representations from Transformers Model for DNA-Language in Genome.” Edited by Janet Kelso. Bioinformatics, February, btab083.
– volume: 574
  start-page: E1
  year: 2019
  end-page: E3
  ident: b0275
  article-title: One neuron is more informative than a deep neural network for aftershock pattern forecasting
  publication-title: Nature
– reference: Savojardo, Castrense, Pier Luigi Martelli, Piero Fariselli, and Rita Casadio. 2018. “DeepSig: Deep Learning Improves Signal Peptide Detection in Proteins.” Edited by Alfonso Valencia. Bioinformatics 34 (10): 1690–96.
– volume: 2016
  year: 2016
  ident: b0085
  article-title: ASAP: A machine learning framework for local protein properties
  publication-title: Database
– reference: Ofer, Dan, and Michal Linial. 2015. “ProFET: Feature Engineering Captures High-Level Protein Functions.” Bioinformatics (Oxford, England), June.
– reference: Almagro Armenteros, José Juan, Casper Kaae Sønderby, Søren Kaae Sønderby, Henrik Nielsen, and Ole Winther. 2017. “DeepLoc: Prediction of Protein Subcellular Localization Using Deep Learning.” Edited by John Hancock. Bioinformatics 33 (21): 3387–95. 10.1093/bioinformatics/btx431.
– reference: Yamada, Ikuya, and Hiroyuki Shindo. 2019. “Neural Attentive Bag-of-Entities Model for Text Classification.” In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 563–73. Hong Kong, China: Association for Computational Linguistics. 10.18653/v1/K19-1052.
– volume: 24
  start-page: 8
  year: 2009
  end-page: 12
  ident: b0155
  article-title: The unreasonable effectiveness of data
  publication-title: IEEE Intell Syst
– volume: 30
  start-page: 50
  year: 1951
  end-page: 64
  ident: b0440
  article-title: Prediction and entropy of printed English
  publication-title: Bell Syst Tech J
– volume: 9
  start-page: 173
  year: 2011
  end-page: 175
  ident: b0385
  article-title: HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment
  publication-title: Nat Methods
– volume: 5
  start-page: 6
  year: 2010
  ident: b0425
  article-title: Cooperativity within proximal phosphorylation sites is revealed from large-scale proteomics data
  publication-title: Biology Direct
– reference: Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. 2016. “A Simple but Tough-to-Beat Baseline for Sentence Embeddings,” November.
– reference: Feng, Zhangyin, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, et al. 2020. “CodeBERT: A Pre-Trained Model for Programming and Natural Languages,” February.
– year: 2020
  ident: b0350
  article-title: Aligning the pretraining and finetuning objectives of language models
  publication-title: ArXiv
– reference: Chen, Ting, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020. “Big Self-Supervised Models Are Strong Semi-Supervised Learners.” Advances in Neural Information Processing Systems 33.
– volume: 577
  start-page: 706
  year: 2020
  end-page: 710
  ident: b0430
  article-title: Improved protein structure prediction using potentials from deep learning
  publication-title: Nature
– reference: Min, Seonwoo, Byunghan Lee, and Sungroh Yoon. 2016. “Deep Learning in Bioinformatics.” Briefings Bioinf, July, bbw068. 10.1093/bib/bbw068.
– volume: 576
  start-page: 348
  year: 2004
  end-page: 352
  ident: b0515
  article-title: Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein
  publication-title: FEBS Lett
– reference: Ofer, Dan. 2016. “Machine Learning for Protein Function.” ArXiv:1603.02021 [q-Bio], March.
– reference: Liang, Wang, and Zhao KaiYong. 2015. “Detecting ‘Protein Words’ through Unsupervised Word Segmentation.” ArXiv:1404.6866 [Cs, q-Bio], October.
– reference: Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” ArXiv:1907.11692 [Cs], July.
– volume: 9
  start-page: 119
  year: 2008
  end-page: 128
  ident: b0050
  article-title: Machine learning methods for predictive proteomics
  publication-title: Briefings Bioinf
– volume: 1–9
  year: 2013
  ident: b0280
  article-title: Distributed representations of words and phrases and their compositionality
  publication-title: NIPS
– volume: 285
  start-page: 176
  year: 1991
  end-page: 181
  ident: b0355
  article-title: How does protein synthesis give rise to the 3D-structure?
  publication-title: FEBS Lett
– reference: Budowski-Tal, Inbal, Yuval Nov, and Rachel Kolodny. FragBag, an Accurate Representation of Protein Structure, Retrieves Structural Neighbors from the Entire PDB Quickly and Accurately. Proceedings of the National Academy of Sciences of the United States of America. 2010. 107 (8): 3481–86. 10.1073/pnas.0914097107.
– volume: 20
  year: 2020
  ident: b0520
  article-title: Deep learning in proteomics
  publication-title: Proteomics
– reference: McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher. 2018. “Learned in Translation: Contextualized Word Vectors.” ArXiv:1708.00107 [Cs], June.
– volume: 25
  start-page: 1356
  year: 2009
  end-page: 1362
  ident: b0345
  article-title: Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment
  publication-title: Bioinformatics (Oxford, England)
– start-page: 512
  year: 2014
  end-page: 519
  ident: b9025
  article-title: CNN Features Off-the-Shelf: An Astounding Baseline for Recognition
  publication-title: CVPRW ’14 Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops
– year: 2017
  ident: b0575
  article-title: Understanding deep learning requires rethinking generalization
– volume: 207
  year: 2006
  ident: b0055
  article-title: Protein Sequence Motifs: Highly Predictive Features of Protein Function
  publication-title: Stud Fuzziness Soft Comput
– volume: 71
  start-page: 148
  year: 1996
  end-page: 155
  ident: b0465
  article-title: The Shannon information entropy of protein sequences
  publication-title: Biophys J
– volume: 371
  start-page: 284
  year: 2021
  end-page: 288
  ident: b0165
  article-title: Learning the language of viral evolution and escape
  publication-title: Science
– reference: Madani, Ali, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, and Richard Socher. 2020. “ProGen: Language Modeling for Protein Generation.” BioRxiv, January, 2020.03.07.982272. 10.1101/2020.03.07.982272.
– volume: 28
  start-page: 235
  year: 2000
  end-page: 242
  ident: b0065
  article-title: The protein data bank
  publication-title: Nucleic Acids Res
– volume: 18
  start-page: 1466
  year: 2020
  end-page: 1473
  ident: b0210
  article-title: Deep learning models in genomics; are we there yet?
  publication-title: Comput Struct Biotechnol J
– reference: Chollet, François. 2015. Keras.
– volume: 20
  start-page: 1
  year: 2019
  end-page: 17
  ident: b0160
  article-title: Modeling aspects of the language of life through transfer-learning protein sequences
  publication-title: BMC Bioinf
– reference: Leslie, Christina, Eleazar Eskin, and William Stafford Noble. 2002. “The Spectrum Kernel: A String Kernel for SVM Protein Classification.” Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 575 (January): 564–75.
– reference: Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, et al. 2020. “Rethinking Attention with Performers.” ArXiv:2009.14794 [Cs, Stat], September. http://arxiv.org/abs/2009.14794.
– reference: Lample, Guillaume, and François Charton. 2019. “Deep Learning for Symbolic Mathematics.” ArXiv:1912.01412 [Cs], December.
– year: 2019
  ident: b0070
  article-title: Using deep learning to annotate the protein universe
  publication-title: BioRxiv
– reference: Rao, Roshan M., Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. 2021. “MSA Transformer.” BioRxiv, February, 2021.02.12.430858. 10.1101/2021.02.12.430858.
– volume: 22
  start-page: 1158
  year: 2006
  end-page: 1165
  ident: b0175
  article-title: MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition
  publication-title: Bioinformatics (Oxford, England)
– reference: Janin, Joël, Kim Henrick, John Moult, Lynn Ten Eyck, Michael J. E. Sternberg, Sandor Vajda, Ilya Vakser, and Shoshana J. Wodak. 2003. CAPRI: A Critical Assessment of PRedicted Interactions. Proteins: Struct Funct Bioinformatics 52 (1): 2–9. 10.1002/prot.10381.
– reference: Varshavsky, Roy, Menachem Fromer, Amit Man, and Michal Linial. 2007. “When Less Is More: Improving Classification of Protein Families with a Minimal Set of Global Features,” 12–24.
– reference: Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” ArXiv:1912.01703 [Cs, Stat], December.
– volume: 232
  year: 2021
  ident: b9000
  article-title: Deep Learning Embedder Method and Tool for Mass Spectra Similarity Search
  publication-title: Journal of Proteomics
– reference: Yang, Kevin K, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. 2018. “Learned Protein Embeddings for Machine Learning.” Edited by Jonathan Wren. Bioinformatics 34 (15): 2642–48. 10.1093/bioinformatics/bty178.
– volume: 8
  start-page: 122
  year: 2019
  ident: b0510
  article-title: A high efficient biological language model for predicting protein-protein interactions
  publication-title: Cells
– reference: Kudo, Taku. 2018. “Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.” ArXiv:1804.10959 [Cs], April.
– volume: 37
  year: 2009
  ident: b0295
  article-title: ClanTox: A classifier of short animal toxins
  publication-title: Nucleic Acids Res
– reference: Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. “Sequence to Sequence Learning with Neural Networks.” In Advances in Neural Information Processing Systems, 3104–12.
– year: 2017
  ident: b0300
  article-title: Evaluating vector-space models of word representation, or, the unreasonable effectiveness of counting words near other words
  publication-title: CogSci
– reference: Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “Glove: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43. Doha, Qatar: Association for Computational Linguistics. 10.3115/v1/D14-1162.
– volume: 357
  start-page: 168
  year: 2017
  end-page: 175
  ident: b0395
  article-title: Global analysis of protein folding using massively parallel design, synthesis, and testing
  publication-title: Science
– reference: Cozzetto, Domenico, Federico Minneci, Hannah Currant, and David T. Jones. 2016. “FFPred 3: Feature-Based Function Prediction for All Gene Ontology Domains.” Sci Rep 6 (August). 10.1038/srep31865.
– volume: 406
  start-page: 89
  year: 2007
  end-page: 112
  ident: b0080
  article-title: UniProtKB/Swiss-Prot: The manually annotated section of the UniProt Knowledgebase
  publication-title: Methods Mol Biol
– volume: 5
  start-page: 135
  year: 2017
  end-page: 146
  ident: b0075
  article-title: Enriching word vectors with subword information
  publication-title: Trans Assoc Computat Linguis
– reference: Devlin, J., Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In NAACL-HLT. 10.18653/v1/N19-1423.
– volume: 367
  start-page: 1444
  year: 2020
  end-page: 1448
  ident: b0535
  article-title: Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2
  publication-title: Science
– reference: Littmann, Maria, Michael Heinzinger, Christian Dallago, Tobias Olenyi, and Burkhard Rost. 2020. “Embeddings from Deep Learning Transfer GO Annotations beyond Homology.” BioRxiv, September, 2020.09.04.282814. 10.1101/2020.09.04.282814.
– volume: 30
  start-page: 931
  year: 2014
  end-page: 940
  ident: b0305
  article-title: NeuroPID: A predictor for identifying neuropeptide precursors from metazoan proteomes
  publication-title: Bioinformatics (Oxford, England)
– volume: 12
  start-page: 878
  year: 2016
  ident: b0025
  article-title: Deep learning for computational biology
  publication-title: Mol Syst Biol
– reference: Yao, Liang, Chengsheng Mao, and Yuan Luo. 2019. “KG-BERT: BERT for Knowledge Graph Completion.” ArXiv:1909.03193 [Cs], September.
– reference: Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. “Bag of Tricks for Efficient Text Classification.” ArXiv:1607.01759 [Cs], August.
– start-page: 11629
  year: 2005
  end-page: 11634
  ident: b0455
  publication-title: Proc Natl Acad Sci
– volume: 13
  start-page: 149
  year: 2000
  end-page: 152
  ident: b0290
  article-title: Simplified amino acid alphabets for protein fold recognition and implications for folding
  publication-title: Protein Eng
– reference: Bepler, Tristan, Bonnie Berger. 2019. “Learning Protein Sequence Embeddings Using Information from Structure.” ArXiv:1902.08661 [Cs, q-Bio, Stat], October. http://arxiv.org/abs/1902.08661.
– year: 2018
  ident: b0180
  article-title: Universal language model fine-tuning for text classification
– reference: Demis Hassabis. 2020. “High Accuracy Protein Structure Prediction Using Deep Learning.” Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book), December.
– volume: 35
  start-page: 1026
  year: 2017
  end-page: 1028
  ident: b0460
  article-title: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
  publication-title: Nat Biotechnol
– reference: Vig, Jesse, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, and Nazneen Fatema Rajani. 2020. “BERTology Meets Biology: Interpreting Attention in Protein Language Models,” June.
– volume: 10
  year: 2015
  ident: b0040
  article-title: Continuous Distributed representation of biological sequences for deep proteomics and genomics
  publication-title: PLoS ONE
– volume: 9
  start-page: 1735
  year: 1997
  end-page: 1780
  ident: b0170
  article-title: Long short-term memory
  publication-title: Neural Comput
– volume: 17
  year: 2016
  ident: b0195
  article-title: An expanded evaluation of protein function prediction methods shows an improvement in accuracy
  publication-title: Genome Biol
– year: 2020
  ident: b0235
  article-title: ALBERT: A lite BERT for self-supervised learning of language representations
– reference: Smith, Noah A. 2019. “Contextual Word Representations: A Contextual Introduction,” February.
– year: 2019
  ident: b0035
  article-title: Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)
  publication-title: Sci Rep
– reference: Salton, Gerard, and Michael J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill Computer Science Series. New York: McGraw-Hill.
– volume: 21
  start-page: i378
  year: 2005
  end-page: i386
  ident: b0405
  article-title: Families of membranous proteins can be characterized by the amino acid composition of their transmembrane domains
  publication-title: Bioinformatics
– volume: 9
  start-page: 2154
  year: 2020
  end-page: 2161
  ident: b0525
  article-title: Signal peptides generated by attention-based neural networks
  publication-title: ACS Synth Biol
– year: 2018
  ident: b0505
  article-title: Glue: A multi-task benchmark and analysis platform for natural language understanding
  publication-title: ArXiv Preprint ArXiv:1804.07461.
– reference: Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. Language Models Are Few-Shot Learners. ArXiv:2005.14165 [Cs], July. http://arxiv.org/abs/2005.14165.
– volume: 16
  start-page: 1315
  year: 2019
  end-page: 1322
  ident: b0010
  article-title: Unified rational protein engineering with sequence-based deep representation learning
  publication-title: Nat Methods
– reference: Zaheer, Manzil, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, et al. 2020. “Big Bird: Transformers for Longer Sequences.” ArXiv:2007.14062 [Cs, Stat], July.
– reference: Singer, Uriel, Kira Radinsky, and Eric Horvitz. 2020. “On Biases of Attention in Scientific Discovery.” Edited by Jonathan Wren. Bioinformatics, December, btaa1036.
– volume: 20
  start-page: 467
  year: 2004
  end-page: 476
  ident: b0245
  article-title: Mismatch string kernels for discriminative protein classification
  publication-title: Bioinformatics (Oxford, England)
– volume: 36
  start-page: 2401
  year: 2020
  end-page: 2409
  ident: b0470
  article-title: UDSMProt: universal deep sequence models for protein classification
  publication-title: Bioinformatics
– reference: Rao, Roshan, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, and Yun S. Song. 2019. “Evaluating Protein Transfer Learning with TAPE,” June.
– reference: Keskar, Nitish Shirish, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. “CTRL: A Conditional Transformer Language Model for Controllable Generation.” ArXiv:1909.05858 [Cs], September.
– year: 2017
  ident: b0555
  article-title: Dilated residual networks
– volume: 10
  start-page: 4306
  year: 2010
  end-page: 4310
  ident: b0045
  article-title: The complete peptide dictionary – a meta-proteomics resource
  publication-title: Proteomics
– start-page: 67
  year: 2017
  end-page: 72
  ident: b9020
  publication-title: OpenNMT: Open-Source Toolkit for Neural Machine Translation
– volume: 23
  start-page: 612
  year: 2007
  end-page: 618
  ident: b9010
  article-title: Speeding up Tandem Mass Spectrometry Database Search: Metric Embeddings and Fast near Neighbor Search
  publication-title: Bioinformatics
– reference: Gillis, Jesse, Paul Pavlidis. 2013. “Characterizing the State of the Art in the Computational Assignment of Gene Function: Lessons from the First Critical Assessment of Functional Annotation (CAFA).” BMC Bioinformatics 14 Suppl 3 (January): S15.
– reference: Rives, Alexander, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. 2019. “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences.” 10.1101/622803.
– reference: Elnaggar, Ahmed, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, et al. 2020. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing,” July.
– ident: 10.1016/j.csbj.2021.03.022_b0205
– volume: 10
  start-page: 4306
  issue: 23
  year: 2010
  ident: 10.1016/j.csbj.2021.03.022_b0045
  article-title: The complete peptide dictionary – a meta-proteomics resource
  publication-title: Proteomics
  doi: 10.1002/pmic.201000270
– volume: 87
  start-page: 1011
  issue: 12
  year: 2019
  ident: 10.1016/j.csbj.2021.03.022_b0220
  article-title: Critical assessment of methods of protein structure prediction (casp)—round xiii
  publication-title: Proteins Struct Funct Bioinf
  doi: 10.1002/prot.25823
– year: 2018
  ident: 10.1016/j.csbj.2021.03.022_b0370
  article-title: DeepType: Multilingual entity linking by neural type system evolution
  publication-title: ArXiv
– ident: 10.1016/j.csbj.2021.03.022_b0475
– volume: 16
  start-page: 1315
  issue: 12
  year: 2019
  ident: 10.1016/j.csbj.2021.03.022_b0010
  article-title: Unified rational protein engineering with sequence-based deep representation learning
  publication-title: Nat Methods
  doi: 10.1038/s41592-019-0598-1
– volume: 2016
  year: 2016
  ident: 10.1016/j.csbj.2021.03.022_b0085
  article-title: ASAP: A machine learning framework for local protein properties
  publication-title: Database
  doi: 10.1093/database/baw133
– ident: 10.1016/j.csbj.2021.03.022_b0125
– ident: 10.1016/j.csbj.2021.03.022_b0570
– ident: 10.1016/j.csbj.2021.03.022_b0095
  doi: 10.1073/pnas.0914097107
– ident: 10.1016/j.csbj.2021.03.022_b0240
– year: 2012
  ident: 10.1016/j.csbj.2021.03.022_b0215
  article-title: ImageNet classification with deep convolutional neural networks
  publication-title: ImageNet Classification with Deep Convolutional Neural Networks
– ident: 10.1016/j.csbj.2021.03.022_b0060
– ident: 10.1016/j.csbj.2021.03.022_b0225
  doi: 10.18653/v1/P18-1007
– volume: 17
  issue: 1
  year: 2016
  ident: 10.1016/j.csbj.2021.03.022_b0195
  article-title: An expanded evaluation of protein function prediction methods shows an improvement in accuracy
  publication-title: Genome Biol
  doi: 10.1186/s13059-016-1037-6
– volume: 160
  start-page: 481
  issue: 7
  year: 2009
  ident: 10.1016/j.csbj.2021.03.022_b0485
  article-title: The origin of the genetic code and of the earliest oligopeptides
  publication-title: Res Microbiol
  doi: 10.1016/j.resmic.2009.05.004
– volume: 30
  start-page: 5998
  year: 2017
  ident: 10.1016/j.csbj.2021.03.022_b0495
  article-title: Attention is all you need
  publication-title: Adv Neural Inf Process Syst
– year: 2019
  ident: 10.1016/j.csbj.2021.03.022_b0035
  article-title: Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)
  publication-title: Sci Rep
  doi: 10.1038/s41598-019-38746-w
– year: 2014
  ident: 10.1016/j.csbj.2021.03.022_b0150
  article-title: Word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method
  publication-title: ArXiv:1402.3722 [Cs, Stat]
– ident: 10.1016/j.csbj.2021.03.022_b0340
– ident: 10.1016/j.csbj.2021.03.022_b0015
  doi: 10.1101/2020.03.09.983585
– volume: 9
  start-page: 119
  issue: 2
  year: 2008
  ident: 10.1016/j.csbj.2021.03.022_b0050
  article-title: Machine learning methods for predictive proteomics
  publication-title: Briefings Bioinf
  doi: 10.1093/bib/bbn008
– ident: 10.1016/j.csbj.2021.03.022_b0200
– ident: 10.1016/j.csbj.2021.03.022_b0105
– volume: 9
  start-page: 173
  issue: 2
  year: 2011
  ident: 10.1016/j.csbj.2021.03.022_b0385
  article-title: HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment
  publication-title: Nat Methods
  doi: 10.1038/nmeth.1818
– volume: 20
  start-page: 1
  issue: 1
  year: 2019
  ident: 10.1016/j.csbj.2021.03.022_b0160
  article-title: Modeling aspects of the language of life through transfer-learning protein sequences
  publication-title: BMC Bioinf
  doi: 10.1186/s12859-019-3220-8
– volume: 232
  issue: February
  year: 2021
  ident: 10.1016/j.csbj.2021.03.022_b9000
  article-title: Deep Learning Embedder Method and Tool for Mass Spectra Similarity Search
  publication-title: Journal of Proteomics
– volume: 9
  issue: 5
  year: 2013
  ident: 10.1016/j.csbj.2021.03.022_b0420
  article-title: Biases in the experimental annotations of protein function and their effect on our understanding of protein function space
  publication-title: PLoS Comput Biol
  doi: 10.1371/journal.pcbi.1003063
– volume: 13
  start-page: 149
  issue: 3
  year: 2000
  ident: 10.1016/j.csbj.2021.03.022_b0290
  article-title: Simplified amino acid alphabets for protein fold recognition and implications for folding
  publication-title: Protein Eng
  doi: 10.1093/protein/13.3.149
– volume: 357
  start-page: 168
  issue: 6347
  year: 2017
  ident: 10.1016/j.csbj.2021.03.022_b0395
  article-title: Global analysis of protein folding using massively parallel design, synthesis, and testing
  publication-title: Science
  doi: 10.1126/science.aan0693
– volume: 74
  start-page: 47
  year: 2015
  ident: 10.1016/j.csbj.2021.03.022_b0320
  article-title: Protein–protein interaction predictions using text mining methods
  publication-title: Methods
  doi: 10.1016/j.ymeth.2014.10.026
– ident: 10.1016/j.csbj.2021.03.022_b0190
  doi: 10.1101/2020.09.17.301879
– volume: 9
  start-page: 2154
  issue: 8
  year: 2020
  ident: 10.1016/j.csbj.2021.03.022_b0525
  article-title: Signal peptides generated by attention-based neural networks
  publication-title: ACS Synth Biol
  doi: 10.1021/acssynbio.0c00219
– volume: 28
  start-page: 235
  issue: 1
  year: 2000
  ident: 10.1016/j.csbj.2021.03.022_b0065
  article-title: The protein data bank
  publication-title: Nucleic Acids Res
  doi: 10.1093/nar/28.1.235
– start-page: 512
  year: 2014
  ident: 10.1016/j.csbj.2021.03.022_b9025
  article-title: CNN Features Off-the-Shelf: An Astounding Baseline for Recognition
– volume: 24
  start-page: 8
  issue: 2
  year: 2009
  ident: 10.1016/j.csbj.2021.03.022_b0155
  article-title: The unreasonable effectiveness of data
  publication-title: IEEE Intell Syst
  doi: 10.1109/MIS.2009.36
– volume: 30
  start-page: 50
  issue: 1
  year: 1951
  ident: 10.1016/j.csbj.2021.03.022_b0440
  article-title: Prediction and entropy of printed English
  publication-title: Bell Syst Tech J
  doi: 10.1002/j.1538-7305.1951.tb01366.x
– ident: 10.1016/j.csbj.2021.03.022_b0110
– volume: 367
  start-page: 1444
  issue: 6485
  year: 2020
  ident: 10.1016/j.csbj.2021.03.022_b0535
  article-title: Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2
  publication-title: Science
  doi: 10.1126/science.abb2762
– ident: 10.1016/j.csbj.2021.03.022_b0285
  doi: 10.1093/bib/bbw068
– volume: 32
  issue: June
  year: 2019
  ident: 10.1016/j.csbj.2021.03.022_b0545
  article-title: XLNet: Generalized autoregressive pretraining for language understanding
  publication-title: Advanc Neural Inform Process Sys
– volume: 9
  start-page: 1735
  issue: 8
  year: 1997
  ident: 10.1016/j.csbj.2021.03.022_b0170
  article-title: Long short-term memory
  publication-title: Neural Comput
  doi: 10.1162/neco.1997.9.8.1735
– ident: 10.1016/j.csbj.2021.03.022_b0450
– year: 2018
  ident: 10.1016/j.csbj.2021.03.022_b0505
  article-title: GLUE: A multi-task benchmark and analysis platform for natural language understanding
  publication-title: arXiv preprint arXiv:1804.07461
– volume: 21
  start-page: 1
  issue: 140
  year: 2020
  ident: 10.1016/j.csbj.2021.03.022_b0365
  article-title: Exploring the limits of transfer learning with a unified text-to-text transformer
  publication-title: J Machine Learning Res
– ident: 10.1016/j.csbj.2021.03.022_b0140
  doi: 10.18653/v1/2020.findings-emnlp.139
– volume: 8
  start-page: 122
  issue: 2
  year: 2019
  ident: 10.1016/j.csbj.2021.03.022_b0510
  article-title: A high efficient biological language model for predicting protein-protein interactions
  publication-title: Cells
  doi: 10.3390/cells8020122
– year: 2018
  ident: 10.1016/j.csbj.2021.03.022_b0180
  article-title: Universal language model fine-tuning for text classification
  publication-title: ArXiv
– volume: 36
  start-page: 2401
  issue: 8
  year: 2020
  ident: 10.1016/j.csbj.2021.03.022_b0470
  article-title: UDSMProt: universal deep sequence models for protein classification
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/btaa003
– ident: 10.1016/j.csbj.2021.03.022_b0090
– volume: 12
  start-page: 878
  issue: 7
  year: 2016
  ident: 10.1016/j.csbj.2021.03.022_b0025
  article-title: Deep learning for computational biology
  publication-title: Mol Syst Biol
  doi: 10.15252/msb.20156651
– ident: 10.1016/j.csbj.2021.03.022_b0540
  doi: 10.1093/bioinformatics/bty178
– ident: 10.1016/j.csbj.2021.03.022_b0325
– ident: 10.1016/j.csbj.2021.03.022_b0500
  doi: 10.1101/2020.06.26.174417
– ident: 10.1016/j.csbj.2021.03.022_b0120
  doi: 10.1038/srep31865
– volume: 20
  issue: 21–22
  year: 2020
  ident: 10.1016/j.csbj.2021.03.022_b0520
  article-title: Deep learning in proteomics
  publication-title: Proteomics
– ident: 10.1016/j.csbj.2021.03.022_b0185
  doi: 10.1002/prot.10381
– ident: 10.1016/j.csbj.2021.03.022_b0130
– ident: 10.1016/j.csbj.2021.03.022_b0255
  doi: 10.1101/2020.09.04.282814
– ident: 10.1016/j.csbj.2021.03.022_b0260
– volume: 574
  start-page: E1
  issue: 7776
  year: 2019
  ident: 10.1016/j.csbj.2021.03.022_b0275
  article-title: One neuron is more informative than a deep neural network for aftershock pattern forecasting
  publication-title: Nature
  doi: 10.1038/s41586-019-1582-8
– volume: 18
  start-page: 1466
  year: 2020
  ident: 10.1016/j.csbj.2021.03.022_b0210
  article-title: Deep learning models in genomics; are we there yet?
  publication-title: Comput Struct Biotechnol J
  doi: 10.1016/j.csbj.2020.06.017
– ident: 10.1016/j.csbj.2021.03.022_b0230
– volume: 35
  start-page: 1026
  issue: 11
  year: 2017
  ident: 10.1016/j.csbj.2021.03.022_b0460
  article-title: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
  publication-title: Nat Biotechnol
  doi: 10.1038/nbt.3988
– volume: 54
  start-page: 20
  issue: 1
  year: 2004
  ident: 10.1016/j.csbj.2021.03.022_b0330
  article-title: Proteomic Signatures: Amino Acid and Oligopeptide Compositions Differentiate among Phyla
  publication-title: Proteins
  doi: 10.1002/prot.10559
– start-page: 11629
  year: 2005
  ident: 10.1016/j.csbj.2021.03.022_b0455
  publication-title: Proc Natl Acad Sci
  doi: 10.1073/pnas.0409746102
– volume: 20
  start-page: 467
  issue: 4
  year: 2004
  ident: 10.1016/j.csbj.2021.03.022_b0245
  article-title: Mismatch string kernels for discriminative protein classification
  publication-title: Bioinformatics (Oxford, England)
– ident: 10.1016/j.csbj.2021.03.022_b0390
  doi: 10.1101/622803
– volume: 37
  issue: Suppl. 2
  year: 2009
  ident: 10.1016/j.csbj.2021.03.022_b0295
  article-title: ClanTox: A classifier of short animal toxins
  publication-title: Nucleic Acids Res
– ident: 10.1016/j.csbj.2021.03.022_b0310
– ident: 10.1016/j.csbj.2021.03.022_b0145
  doi: 10.1186/1471-2105-14-S3-S15
– ident: 10.1016/j.csbj.2021.03.022_b0565
– year: 2017
  ident: 10.1016/j.csbj.2021.03.022_b0555
  article-title: Dilated residual networks
  publication-title: ArXiv
– volume: 577
  start-page: 706
  issue: 7792
  year: 2020
  ident: 10.1016/j.csbj.2021.03.022_b0430
  article-title: Improved protein structure prediction using potentials from deep learning
  publication-title: Nature
  doi: 10.1038/s41586-019-1923-7
– volume: 30
  start-page: 931
  issue: 7
  year: 2014
  ident: 10.1016/j.csbj.2021.03.022_b0305
  article-title: NeuroPID: A predictor for identifying neuropeptide precursors from metazoan proteomes
  publication-title: Bioinformatics (Oxford, England)
– volume: 25
  start-page: 1356
  issue: 11
  year: 2009
  ident: 10.1016/j.csbj.2021.03.022_b0345
  article-title: Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment
  publication-title: Bioinformatics (Oxford, England)
– volume: 22
  start-page: 1158
  issue: 10
  year: 2006
  ident: 10.1016/j.csbj.2021.03.022_b0175
  article-title: MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition
  publication-title: Bioinformatics (Oxford, England)
– ident: 10.1016/j.csbj.2021.03.022_b0410
– ident: 10.1016/j.csbj.2021.03.022_b0435
  doi: 10.18653/v1/P16-1162
– ident: 10.1016/j.csbj.2021.03.022_b0490
  doi: 10.1007/978-3-540-74126-8_3
– ident: 10.1016/j.csbj.2021.03.022_b0560
  doi: 10.1073/pnas.1814684116
– volume: 1–9
  year: 2013
  ident: 10.1016/j.csbj.2021.03.022_b0280
  article-title: Distributed representations of words and phrases and their compositionality
  publication-title: NIPS
– volume: 285
  start-page: 176
  issue: 2
  year: 1991
  ident: 10.1016/j.csbj.2021.03.022_b0355
  article-title: How does protein synthesis give rise to the 3D-structure?
  publication-title: FEBS Lett
  doi: 10.1016/0014-5793(91)80799-9
– volume: 11
  start-page: 6044
  issue: 12
  year: 2012
  ident: 10.1016/j.csbj.2021.03.022_b9005
  article-title: Evaluation of Database Search Programs for Accurate Detection of Neuropeptides in Tandem Mass Spectrometry Experiments
  publication-title: J Proteome Res
  doi: 10.1021/pr3007123
– volume: 9
  start-page: 9277
  issue: 1
  year: 2019
  ident: 10.1016/j.csbj.2021.03.022_b0005
  article-title: Neural networks versus logistic regression for 30 days all-cause readmission prediction
  publication-title: Sci Rep
  doi: 10.1038/s41598-019-45685-z
– ident: 10.1016/j.csbj.2021.03.022_b0030
– start-page: 67
  year: 2017
  ident: 10.1016/j.csbj.2021.03.022_b9020
– volume: 71
  start-page: 148
  issue: 1
  year: 1996
  ident: 10.1016/j.csbj.2021.03.022_b0465
  article-title: The Shannon information entropy of protein sequences
  publication-title: Biophys J
  doi: 10.1016/S0006-3495(96)79210-X
– ident: 10.1016/j.csbj.2021.03.022_b0020
  doi: 10.1093/bioinformatics/btx431
– ident: 10.1016/j.csbj.2021.03.022_b0265
  doi: 10.1101/2020.03.07.982272
– ident: 10.1016/j.csbj.2021.03.022_b0480
– ident: 10.1016/j.csbj.2021.03.022_b0335
  doi: 10.3115/v1/D14-1162
– volume: 371
  start-page: 284
  issue: 6526
  year: 2021
  ident: 10.1016/j.csbj.2021.03.022_b0165
  article-title: Learning the language of viral evolution and escape
  publication-title: Science
  doi: 10.1126/science.abd7331
– ident: 10.1016/j.csbj.2021.03.022_b0115
– ident: 10.1016/j.csbj.2021.03.022_b0100
– year: 2020
  ident: 10.1016/j.csbj.2021.03.022_b0235
  article-title: ALBERT: A lite BERT for self-supervised learning of language representations
  publication-title: ArXiv
– ident: 10.1016/j.csbj.2021.03.022_b0360
– volume: 21
  start-page: i378
  issue: 1
  year: 2005
  ident: 10.1016/j.csbj.2021.03.022_b0405
  article-title: Families of membranous proteins can be characterized by the amino acid composition of their transmembrane domains
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/bti1035
– year: 2017
  ident: 10.1016/j.csbj.2021.03.022_b0300
  article-title: Evaluating vector-space models of word representation, or, the unreasonable effectiveness of counting words near other words
  publication-title: CogSci
– year: 2020
  ident: 10.1016/j.csbj.2021.03.022_b0350
  article-title: Aligning the pretraining and finetuning objectives of language models
  publication-title: ArXiv
– ident: 10.1016/j.csbj.2021.03.022_b0270
– ident: 10.1016/j.csbj.2021.03.022_b0445
  doi: 10.1093/bioinformatics/btaa1036
– ident: 10.1016/j.csbj.2021.03.022_b0135
  doi: 10.1101/2020.07.12.199554
– year: 2019
  ident: 10.1016/j.csbj.2021.03.022_b0070
  article-title: Using deep learning to annotate the protein universe
  publication-title: BioRxiv
– ident: 10.1016/j.csbj.2021.03.022_b0415
  doi: 10.1093/bioinformatics/btx818
– ident: 10.1016/j.csbj.2021.03.022_b0530
  doi: 10.18653/v1/K19-1052
– year: 2018
  ident: 10.1016/j.csbj.2021.03.022_b0400
  article-title: NLP’s ImageNet moment has arrived
  publication-title: The Gradient
– volume: 5
  start-page: 6
  issue: January
  year: 2010
  ident: 10.1016/j.csbj.2021.03.022_b0425
  article-title: Cooperativity within proximal phosphorylation sites is revealed from large-scale proteomics data
  publication-title: Biology Direct
  doi: 10.1186/1745-6150-5-6
– volume: 207
  year: 2006
  ident: 10.1016/j.csbj.2021.03.022_b0055
  article-title: Protein Sequence Motifs: Highly Predictive Features of Protein Function
  publication-title: Stud Fuzziness Soft Comput
  doi: 10.1007/978-3-540-35488-8_32
– volume: 10
  issue: 11
  year: 2015
  ident: 10.1016/j.csbj.2021.03.022_b0040
  article-title: Continuous distributed representation of biological sequences for deep proteomics and genomics
  publication-title: PLoS ONE
  doi: 10.1371/journal.pone.0141287
– volume: 5
  start-page: 135
  issue: December
  year: 2017
  ident: 10.1016/j.csbj.2021.03.022_b0075
  article-title: Enriching word vectors with subword information
  publication-title: Trans Assoc Computat Linguis
  doi: 10.1162/tacl_a_00051
– volume: 23
  start-page: 612
  issue: 5
  year: 2007
  ident: 10.1016/j.csbj.2021.03.022_b9010
  article-title: Speeding up Tandem Mass Spectrometry Database Search: Metric Embeddings and Fast near Neighbor Search
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/btl645
– ident: 10.1016/j.csbj.2021.03.022_b0250
– ident: 10.1016/j.csbj.2021.03.022_b0380
  doi: 10.1101/2021.02.12.430858
– ident: 10.1016/j.csbj.2021.03.022_b0550
– year: 2017
  ident: 10.1016/j.csbj.2021.03.022_b0575
  article-title: Understanding deep learning requires rethinking generalization
  publication-title: ArXiv
– ident: 10.1016/j.csbj.2021.03.022_b0315
  doi: 10.1093/bioinformatics/btv345
– volume: 576
  start-page: 348
  issue: 3
  year: 2004
  ident: 10.1016/j.csbj.2021.03.022_b0515
  article-title: Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein
  publication-title: FEBS Lett
  doi: 10.1016/j.febslet.2004.09.036
– ident: 10.1016/j.csbj.2021.03.022_b0375
  doi: 10.1101/676825
– volume: 406
  start-page: 89
  year: 2007
  ident: 10.1016/j.csbj.2021.03.022_b0080
  article-title: UniProtKB/Swiss-Prot: The manually annotated section of the UniProt KnowledgeBase
  publication-title: Methods Mol Biol
SSID ssj0000816930
Score 2.6174169
SecondaryResourceType review_article
SourceID doaj
pubmedcentral
proquest
pubmed
crossref
elsevier
SourceType Open Website
Open Access Repository
Aggregation Database
Index Database
Enrichment Source
Publisher
StartPage 1750
SubjectTerms amino acids
Artificial neural networks
automation
Bag of words
BERT
Bioinformatics
biotechnology
computer science
Contextualized embedding
Deep learning
Language models
Natural language processing
Review
Tokenization
Transformer
Word embedding
Word2vec
Title The language of proteins: NLP, machine learning & protein sequences
URI https://dx.doi.org/10.1016/j.csbj.2021.03.022
https://www.ncbi.nlm.nih.gov/pubmed/33897979
https://www.proquest.com/docview/2518736718
https://www.proquest.com/docview/2574317704
https://pubmed.ncbi.nlm.nih.gov/PMC8050421
https://doaj.org/article/92521a99034d4afe9fbf0dd2f99b7a0a
Volume 19
WOSCitedRecordID wos000684934900004
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVAON
  databaseName: DOAJ Directory of Open Access Journals
  customDbUrl:
  eissn: 2001-0370
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0000816930
  issn: 2001-0370
  databaseCode: DOA
  dateStart: 20120101
  isFulltext: true
  titleUrlDefault: https://www.doaj.org/
  providerName: Directory of Open Access Journals
– providerCode: PRVHPJ
  databaseName: ROAD: Directory of Open Access Scholarly Resources
  customDbUrl:
  eissn: 2001-0370
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0000816930
  issn: 2001-0370
  databaseCode: M~E
  dateStart: 20110101
  isFulltext: true
  titleUrlDefault: https://road.issn.org
  providerName: ISSN International Centre
linkProvider Directory of Open Access Journals