The language of proteins: NLP, machine learning & protein sequences
Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of...
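As a hedged illustration of what treating a protein as text can look like, the sketch below splits an amino-acid string into overlapping k-mers (the protein analogue of n-grams) and counts them as a bag of words, two of the classic concepts the review covers. The function names and the toy sequence are hypothetical, chosen only for this example; they are not taken from the article.

```python
from collections import Counter
from typing import List


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers (n-grams over amino-acid letters)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kmers(sequence: str, k: int = 3) -> Counter:
    """Count k-mer occurrences while ignoring their positions (a bag-of-words view)."""
    return Counter(kmers(sequence, k))


# Hypothetical toy sequence, used only for illustration.
toy_sequence = "MKTAYIAKQR"
print(kmers(toy_sequence, 3))
print(bag_of_kmers(toy_sequence, 3).most_common(3))
```

Counts of this kind give a fixed vocabulary of features that a conventional classifier can consume; they discard the order of k-mers, which is one of the limitations that contextualized embeddings aim to address.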
| Published in: | Computational and Structural Biotechnology Journal, Vol. 19, pp. 1750-1758 |
|---|---|
| Main authors: | Ofer, Dan; Brandes, Nadav; Linial, Michal |
| Format: | Journal Article |
| Language: | English |
| Published: | Netherlands: Elsevier B.V; Research Network of Computational and Structural Biotechnology; Elsevier; 1 January 2021 |
| Keywords: | |
| ISSN: | 2001-0370 |
| Online access: | Full text |
| Abstract | Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research. |
|---|---|
| Author | Brandes, Nadav; Linial, Michal; Ofer, Dan |
| Author_xml | 1. Ofer, Dan (Medtronic, Inc, Israel); 2. Brandes, Nadav (email: nadav.brandes@mail.huji.ac.il; The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel); 3. Linial, Michal (Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel) |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/33897979$$D View this record in MEDLINE/PubMed |
| CitedBy_id | crossref_primary_10_1016_j_compbiomed_2024_108385 crossref_primary_10_2196_37213 crossref_primary_10_3390_electronics14030496 crossref_primary_10_1039_D5SC04513D crossref_primary_10_1038_s42003_024_07262_7 crossref_primary_10_1016_j_heliyon_2023_e23781 crossref_primary_10_1093_nargab_lqac043 crossref_primary_10_1016_j_csbj_2025_03_037 crossref_primary_10_3389_fimmu_2024_1463931 crossref_primary_10_1111_imr_13309 crossref_primary_10_1039_D3BM00412K crossref_primary_10_1371_journal_pone_0325531 crossref_primary_10_1016_j_sbi_2023_102641 crossref_primary_10_3390_sym16040464 crossref_primary_10_1093_nsr_nwaf056 crossref_primary_10_1007_s10489_022_04052_8 crossref_primary_10_1016_j_cels_2023_12_003 crossref_primary_10_1128_msystems_01004_23 crossref_primary_10_1093_bib_bbae077 crossref_primary_10_1371_journal_pone_0296737 crossref_primary_10_1007_s10489_024_06223_1 crossref_primary_10_1016_j_comnet_2025_111181 crossref_primary_10_1016_j_measurement_2022_111588 crossref_primary_10_1186_s12859_022_04604_2 crossref_primary_10_1016_j_artmed_2024_102900 crossref_primary_10_1002_pro_4524 crossref_primary_10_1038_s41598_025_03275_2 crossref_primary_10_3389_fbioe_2025_1506508 crossref_primary_10_1038_s42256_023_00637_1 crossref_primary_10_3389_fmed_2025_1594442 crossref_primary_10_1016_j_jmb_2025_169236 crossref_primary_10_1109_ACCESS_2024_3368382 crossref_primary_10_3390_app14031265 crossref_primary_10_3390_antibiotics11101451 crossref_primary_10_1155_2022_9015123 crossref_primary_10_1016_j_csbj_2025_04_005 crossref_primary_10_3389_fcomp_2025_1464122 crossref_primary_10_3390_math10030467 crossref_primary_10_1016_j_csbj_2022_11_014 crossref_primary_10_1016_j_tibs_2022_11_001 crossref_primary_10_3389_fchem_2025_1545136 crossref_primary_10_1016_j_ijfoodmicro_2024_110691 crossref_primary_10_1016_j_csbj_2024_01_009 crossref_primary_10_1038_s41467_025_58038_4 crossref_primary_10_1021_acs_molpharmaceut_5c00523 crossref_primary_10_1093_nar_gkad1031 
crossref_primary_10_1109_TIV_2023_3245615 crossref_primary_10_1093_bib_bbac599 crossref_primary_10_1007_s12539_025_00730_6 crossref_primary_10_1093_bioadv_vbac094 crossref_primary_10_1177_10943420231188077 crossref_primary_10_1016_j_csbj_2024_12_029 crossref_primary_10_3390_life12020307 crossref_primary_10_3390_pharmaceutics15051337 crossref_primary_10_3390_ijms222111741 crossref_primary_10_1038_s41587_025_02761_2 crossref_primary_10_1109_JBHI_2022_3221988 crossref_primary_10_1007_s10930_023_10168_8 crossref_primary_10_1136_bmjhci_2022_100643 crossref_primary_10_1016_j_bpj_2024_01_026 crossref_primary_10_1093_bib_bbaf182 crossref_primary_10_1016_j_jtbi_2024_111878 crossref_primary_10_1038_s41467_024_48675_6 crossref_primary_10_32604_cmes_2023_043921 crossref_primary_10_3390_a18080465 crossref_primary_10_3389_fgene_2021_807825 crossref_primary_10_1016_j_imu_2024_101533 crossref_primary_10_1093_bioinformatics_btaf284 crossref_primary_10_1016_j_biotechadv_2024_108399 crossref_primary_10_1093_gigascience_giad036 crossref_primary_10_3390_biomedicines11051323 crossref_primary_10_1146_annurev_genom_021623_083207 crossref_primary_10_1002_ggn2_202100038 crossref_primary_10_1093_bioadv_vbae163 crossref_primary_10_1007_s42979_023_01980_1 crossref_primary_10_3390_ijms242216496 crossref_primary_10_1186_s40163_024_00212_y crossref_primary_10_1016_j_envint_2024_108574 crossref_primary_10_1038_s41392_024_02066_x crossref_primary_10_1093_bib_bbaf016 crossref_primary_10_1002_advs_202509501 crossref_primary_10_1016_j_ijbiomac_2024_138272 crossref_primary_10_1093_nar_gkac1247 crossref_primary_10_7554_eLife_82819 crossref_primary_10_3389_fgene_2022_1007618 crossref_primary_10_3390_ijms242116000 crossref_primary_10_3390_math11020279 crossref_primary_10_1016_j_ygeno_2025_111070 crossref_primary_10_1093_bib_bbac142 crossref_primary_10_1371_journal_pone_0289030 crossref_primary_10_3390_microorganisms13071635 crossref_primary_10_48130_tp_0025_0008 crossref_primary_10_1093_bioadv_vbad001 
crossref_primary_10_1093_bib_bbae319 crossref_primary_10_1016_j_fbio_2025_106934 crossref_primary_10_1038_s41588_023_01465_0 crossref_primary_10_1016_j_cels_2025_101387 crossref_primary_10_3390_biotech14030058 crossref_primary_10_1016_j_procs_2024_06_106 crossref_primary_10_1186_s12859_023_05549_w crossref_primary_10_1055_a_2424_1989 crossref_primary_10_1021_acs_jcim_4c02216 crossref_primary_10_1038_s41592_024_02362_y crossref_primary_10_1186_s13321_022_00608_5 crossref_primary_10_1016_j_compbiomed_2024_109048 crossref_primary_10_1093_bib_bbab200 crossref_primary_10_1093_biomethods_bpae055 crossref_primary_10_1186_s12859_024_05699_5 crossref_primary_10_1002_advs_202404212 crossref_primary_10_1016_j_ymeth_2024_10_006 crossref_primary_10_1016_j_csbj_2022_12_044 crossref_primary_10_1109_ACCESS_2024_3481049 crossref_primary_10_3390_genes15010025 crossref_primary_10_1038_s42003_024_06561_3 crossref_primary_10_1093_nargab_lqae021 crossref_primary_10_1109_JBHI_2024_3413146 crossref_primary_10_1016_j_hlife_2023_06_001 crossref_primary_10_1007_s00521_022_07734_z crossref_primary_10_1093_bib_bbae307 crossref_primary_10_1002_pmic_202300011 crossref_primary_10_12688_f1000research_129064_1 crossref_primary_10_1093_bib_bbae583 crossref_primary_10_12688_f1000research_129064_2 crossref_primary_10_1186_s12859_024_05766_x crossref_primary_10_12688_f1000research_129064_3 crossref_primary_10_1016_j_jbi_2024_104650 crossref_primary_10_3389_fgene_2024_1376486 crossref_primary_10_1109_TCBB_2024_3381825 crossref_primary_10_1186_s12859_025_06062_y crossref_primary_10_3390_axioms11090469 crossref_primary_10_1007_s44163_023_00065_5 crossref_primary_10_1016_j_sbi_2025_102986 crossref_primary_10_3389_fmolb_2022_916639 crossref_primary_10_1002_pld3_554 crossref_primary_10_7717_peerj_cs_2149 crossref_primary_10_1109_TCBB_2022_3173789 crossref_primary_10_1016_j_csbj_2025_02_042 crossref_primary_10_3390_ijms24043775 crossref_primary_10_1002_prot_26686 crossref_primary_10_1016_j_procs_2023_10_500 
crossref_primary_10_1002_prot_26322 crossref_primary_10_1016_j_ijbiomac_2025_147637 crossref_primary_10_1016_j_jer_2024_08_001 crossref_primary_10_1038_s41598_022_20000_5 crossref_primary_10_1016_j_alit_2025_08_004 crossref_primary_10_1007_s00439_021_02411_y crossref_primary_10_1016_j_jksuci_2024_101961 crossref_primary_10_1016_j_jcmds_2022_100044 crossref_primary_10_3390_v17091199 crossref_primary_10_1186_s12911_024_02531_1 crossref_primary_10_3390_biom14040409 crossref_primary_10_1186_s12864_022_08772_6 crossref_primary_10_3390_biom15010049 crossref_primary_10_1186_s12859_022_04873_x crossref_primary_10_1007_s11760_022_02419_5 crossref_primary_10_1038_s41598_025_98979_w crossref_primary_10_1093_femsre_fuad003 crossref_primary_10_1016_j_compbiomed_2024_108815 crossref_primary_10_3390_s23187722 crossref_primary_10_1016_j_sbi_2025_102997 crossref_primary_10_1093_bioinformatics_btaf360 crossref_primary_10_1016_j_ijbiomac_2024_137668 crossref_primary_10_1186_s12859_022_04623_z crossref_primary_10_1093_bib_bbad358 crossref_primary_10_1099_jgv_0_002067 crossref_primary_10_3390_sym14112274 crossref_primary_10_1002_prot_26452 crossref_primary_10_1016_j_str_2022_05_001 crossref_primary_10_1093_nar_gkab1016 crossref_primary_10_12688_f1000research_130443_1 crossref_primary_10_3390_agriculture13010110 |
| Cites_doi | 10.1002/pmic.201000270 10.1002/prot.25823 10.1038/s41592-019-0598-1 10.1093/database/baw133 10.1073/pnas.0914097107 10.18653/v1/P18-1007 10.1186/s13059-016-1037-6 10.1016/j.resmic.2009.05.004 10.1038/s41598-019-38746-w 10.1101/2020.03.09.983585 10.1093/bib/bbn008 10.1038/nmeth.1818 10.1186/s12859-019-3220-8 10.1371/journal.pcbi.1003063 10.1093/protein/13.3.149 10.1126/science.aan0693 10.1016/j.ymeth.2014.10.026 10.1101/2020.09.17.301879 10.1021/acssynbio.0c00219 10.1093/nar/28.1.235 10.1109/MIS.2009.36 10.1002/j.1538-7305.1951.tb01366.x 10.1126/science.abb2762 10.1093/bib/bbw068 10.1162/neco.1997.9.8.1735 10.18653/v1/2020.findings-emnlp.139 10.3390/cells8020122 10.1093/bioinformatics/btaa003 10.15252/msb.20156651 10.1093/bioinformatics/bty178 10.1101/2020.06.26.174417 10.1038/srep31865 10.1002/prot.10381 10.1101/2020.09.04.282814 10.1038/s41586-019-1582-8 10.1016/j.csbj.2020.06.017 10.1038/nbt.3988 10.1002/prot.10559 10.1073/pnas.0409746102 10.1101/622803 10.1186/1471-2105-14-S3-S15 10.1038/s41586-019-1923-7 10.18653/v1/P16-1162 10.1007/978-3-540-74126-8_3 10.1073/pnas.1814684116 10.1016/0014-5793(91)80799-9 10.1021/pr3007123 10.1038/s41598-019-45685-z 10.1016/S0006-3495(96)79210-X 10.1093/bioinformatics/btx431 10.1101/2020.03.07.982272 10.3115/v1/D14-1162 10.1126/science.abd7331 10.1093/bioinformatics/bti1035 10.1093/bioinformatics/btaa1036 10.1101/2020.07.12.199554 10.1093/bioinformatics/btx818 10.18653/v1/K19-1052 10.1186/1745-6150-5-6 10.1007/978-3-540-35488-8_32 10.1371/journal.pone.0141287 10.1162/tacl_a_00051 10.1093/bioinformatics/btl645 10.1101/2021.02.12.430858 10.1093/bioinformatics/btv345 10.1016/j.febslet.2004.09.036 10.1101/676825 |
| ContentType | Journal Article |
| Copyright | 2021 The Author(s) |
| Copyright_xml | – notice: 2021 The Author(s) – notice: 2021 The Author(s). – notice: 2021 The Author(s) 2021 |
| DBID | 6I. AAFTH AAYXX CITATION NPM 7X8 7S9 L.6 5PM DOA |
| DOI | 10.1016/j.csbj.2021.03.022 |
| DatabaseName | ScienceDirect Open Access Titles; Elsevier:ScienceDirect:Open Access; CrossRef; PubMed; MEDLINE - Academic; AGRICOLA; AGRICOLA - Academic; PubMed Central (Full Participant titles); DOAJ Directory of Open Access Journals |
| DatabaseTitle | CrossRef; PubMed; MEDLINE - Academic; AGRICOLA; AGRICOLA - Academic |
| DatabaseTitleList | MEDLINE - Academic; AGRICOLA; PubMed |
| Database_xml | – sequence: 1 dbid: DOA name: DOAJ Directory of Open Access Journals url: https://www.doaj.org/ sourceTypes: Open Website – sequence: 2 dbid: NPM name: PubMed url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 3 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Engineering Computer Science |
| EISSN | 2001-0370 |
| EndPage | 1758 |
| ExternalDocumentID | oai_doaj_org_article_92521a99034d4afe9fbf0dd2f99b7a0a PMC8050421 33897979 10_1016_j_csbj_2021_03_022 S2001037021000945 |
| Genre | Journal Article Review |
| GroupedDBID | 0R~ 0SF 457 53G 5VS 6I. AACTN AAEDT AAEDW AAFTH AAHBH AAIKJ AALRI AAXUO ABMAC ACGFS ADBBV ADEZE ADRAZ ADVLN AEXQZ AFTJW AGHFR AITUG AKRWK ALMA_UNASSIGNED_HOLDINGS AMRAJ AOIJS BAWUL BCNDV DIK EBS EJD FDB GROUPED_DOAJ HYE IPNFZ KQ8 M41 M48 M~E NCXOZ O9- OK1 RIG ROL RPM SSZ AAYWO AAYXX ACVFH ADCNI AEUPX AFPUW AIGII AKBMS AKYEP CITATION NPM 7X8 7S9 L.6 5PM |
| ID | FETCH-LOGICAL-c554t-2f3987d4a9b4bd86ae9f0d9619bc0e51bfac6d6501ec26bede1e3bc930f632ee3 |
| IEDL.DBID | DOA |
| ISICitedReferencesCount | 211 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000684934900004&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 2001-0370 |
| IngestDate | Fri Oct 03 12:51:26 EDT 2025 Tue Nov 04 01:57:40 EST 2025 Fri Jul 11 12:18:20 EDT 2025 Fri Jul 11 10:12:16 EDT 2025 Thu Jan 02 22:55:56 EST 2025 Sat Nov 29 05:55:07 EST 2025 Tue Nov 18 21:58:09 EST 2025 Sat Aug 31 16:01:00 EDT 2024 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Deep learning; Contextualized embedding; Word2vec; Bag of words; Transformer; Artificial neural networks; Natural language processing; BERT; Tokenization; Word embedding; Language models; Bioinformatics |
| Language | English |
| License | 2021 The Author(s). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-c554t-2f3987d4a9b4bd86ae9f0d9619bc0e51bfac6d6501ec26bede1e3bc930f632ee3 |
| Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 ObjectType-Review-3 content type line 23 |
| OpenAccessLink | https://doaj.org/article/92521a99034d4afe9fbf0dd2f99b7a0a |
| PMID | 33897979 |
| PQID | 2518736718 |
| PQPubID | 23479 |
| PageCount | 9 |
| ParticipantIDs | doaj_primary_oai_doaj_org_article_92521a99034d4afe9fbf0dd2f99b7a0a pubmedcentral_primary_oai_pubmedcentral_nih_gov_8050421 proquest_miscellaneous_2574317704 proquest_miscellaneous_2518736718 pubmed_primary_33897979 crossref_citationtrail_10_1016_j_csbj_2021_03_022 crossref_primary_10_1016_j_csbj_2021_03_022 elsevier_sciencedirect_doi_10_1016_j_csbj_2021_03_022 |
| PublicationCentury | 2000 |
| PublicationDate | 2021-01-01 |
| PublicationDateYYYYMMDD | 2021-01-01 |
| PublicationDate_xml | – month: 01 year: 2021 text: 2021-01-01 day: 01 |
| PublicationDecade | 2020 |
| PublicationPlace | Netherlands |
| PublicationPlace_xml | – name: Netherlands |
| PublicationTitle | Computational and structural biotechnology journal |
| PublicationTitleAlternate | Comput Struct Biotechnol J |
| PublicationYear | 2021 |
| Publisher | Elsevier B.V; Research Network of Computational and Structural Biotechnology; Elsevier |
| Publisher_xml | – name: Elsevier B.V – name: Research Network of Computational and Structural Biotechnology – name: Elsevier |
| References | Barla, Jurman, Riccadonna, Merler, Chierici, Furlanello (b0050) 2008; 9 Zhang, Bengio, Hardt, Recht, Vinyals (b0575) 2017 Sunarso, Freddie, Srikumar Venugopal, and Federico Lauro. 2013. “Scalable Protein Sequence Similarity Search Using Locality-Sensitive Hashing and MapReduce.” ArXiv:1310.0883 [Cs], October. Alley, Khimulya, Biswas, AlQuraishi, Church (b0010) 2019; 16 Dutta, Chen (b9010) 2007; 23 Akhtar, Southey, Andrén, Sweedler, Rodriguez-Zas (b9005) 2012; 11 Asgari, Mofrad (b0040) 2015; 10 Krizhevsky, Sutskever, Hinton (b0215) 2012 Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. “Sequence to Sequence Learning with Neural Networks.” In Advances in Neural Information Processing Systems, 3104–12. Ofer, Dan, and Michal Linial. 2015. “ProFET: Feature Engineering Captures High-Level Protein Functions.” Bioinformatics (Oxford, England), June. Raiman, Raiman (b0370) 2018 . Vig, Jesse, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, and Nazneen Fatema Rajani. 2020. “BERTology Meets Biology: Interpreting Attention in Protein Language Models,” June. Zaheer, Manzil, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, et al. 2020. “Big Bird: Transformers for Longer Sequences.” ArXiv:2007.14062 [Cs, Stat], July. Howard, Ruder (b0180) 2018 Jiang, Oron, Clark, Bankapur, D’Andrea, Lepore (b0195) 2016; 17 Papanikolaou, Pavlopoulos, Theodosiou, Iliopoulos (b0320) 2015; 74 Kryshtafovych, Schwede, Topf, Fidelis, Moult (b0220) 2019; 87 Wang, You, Yang, Li, Jiang, Zhou (b0510) 2019; 8 Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. 2016. “A Simple but Tough-to-Beat Baseline for Sentence Embeddings,” November. Klein, Kim, Deng, Senellart, Rush (b9020) 2017 Leslie, Christina, Eleazar Eskin, and William Stafford Noble. 2002. “The Spectrum Kernel: A String Kernel for SVM Protein Classification.” Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 575 (January): 564–75. 
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–25. Berlin, Germany: Association for Computational Linguistics. 10.18653/v1/P16-1162. Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. Language Models Are Few-Shot Learners. ArXiv:2005.14165 [Cs], July. http://arxiv.org/abs/2005.14165. Weathers, Paulaitis, Woolf, Hoh (b0515) 2004; 576 Pe’er, Felder, Man, Silman, Sussman, Beckmann (b0330) 2004; 54 Solan, Horn, Ruppin, Edelman (b0455) 2005 Nematzadeh, Meylan, Griffiths (b0300) 2017 Rao, Roshan M., Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. 2021. “MSA Transformer.” BioRxiv, February, 2021.02.12.430858. 10.1101/2021.02.12.430858. Rocklin, Chidyausiku, Goreshnik, Ford, Houliston, Lemak (b0395) 2017; 357 Halevy, Norvig, Pereira (b0155) 2009; 24 Singer, Uriel, Kira Radinsky, and Eric Horvitz. 2020. “On Biases of Attention in Scientific Discovery.” Edited by Jonathan Wren. Bioinformatics, December, btaa1036. Rives, Alexander, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. 2019. “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences.” 10.1101/622803. Yan, Zhang, Yaning Li, Xia, Zhou (b0535) 2020; 367 Yao, Liang, Chengsheng Mao, and Yuan Luo. 2019. “KG-BERT: BERT for Knowledge Graph Completion.” ArXiv:1909.03193 [Cs], September. Wu, Yang, Liszka, Lee, Batzilla, Wernick (b0525) 2020; 9 Chen, Ting, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020. “Big Self-Supervised Models Are Strong Semi-Supervised Learners.” Advances in Neural Information Processing Systems 33. 
Feng, Zhangyin, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, et al. 2020. “CodeBERT: A Pre-Trained Model for Programming and Natural Languages,” February. Madani, Ali, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, and Richard Socher. 2020. “ProGen: Language Modeling for Protein Generation.” BioRxiv, January, 2020.03.07.982272. 10.1101/2020.03.07.982272. Razavian, Azizpour, Sullivan, Carlsson, Royal (b9025) 2014 Kudo, Taku. 2018. “Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.” ArXiv:1804.10959 [Cs], April. McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher. 2018. “Learned in Translation: Contextualized Word Vectors.” ArXiv:1708.00107 [Cs], June. Ofer, Linial, Ofer, Linial (b0305) 2014; 30 Almagro Armenteros, José Juan, Casper Kaae Sønderby, Søren Kaae Sønderby, Henrik Nielsen, and Ole Winther. 2017. “DeepLoc: Prediction of Protein Subcellular Localization Using Deep Learning.” Edited by John Hancock. Bioinformatics 33 (21): 3387–95. 10.1093/bioinformatics/btx431. Bileschi, Belanger, Bryant, Sanderson, Brandon Carter, Sculley (b0070) 2019 Senior, Evans, Jumper, Kirkpatrick, Sifre, Green (b0430) 2020; 577 Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” ArXiv:1912.01703 [Cs, Stat], December. Strodthoff, Wagner, Wenzel, Samek (b0470) 2020; 36 Hie, Zhong, Berger, Bryson (b0165) 2021; 371 Rao, Roshan, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, and Yun S. Song. 2019. “Evaluating Protein Transfer Learning with TAPE,” June. Schweiger, Linial (b0425) 2010; 5 Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez (b0495) 2017; 30 Yang, Dai, Yang, Carbonell, Salakhutdinov, Quoc (b0545) 2019; 32 Smith, Noah A. 2019. 
“Contextual Word Representations: A Contextual Introduction,” February. Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, et al. 2020. “Rethinking Attention with Performers.” ArXiv:2009.14794 [Cs, Stat], September. http://arxiv.org/abs/2009.14794. Strait, Dewey (b0465) 1996; 71 Peterson, Kondev, Theriot, Phillips (b0345) 2009; 25 Ptitsyn (b0355) 1991; 285 Ruder (b0400) 2018 Ofer, Dan. 2016. “Machine Learning for Protein Function.” ArXiv:1603.02021 [q-Bio], March. Höglund, Dönnes, Blum, Adolph, Kohlbacher (b0175) 2006; 22 Raffel, Shazeer, Roberts, Lee, Narang, Matena (b0365) 2020; 21 Bojanowski, Grave, Joulin, Mikolov (b0075) 2017; 5 Devlin, J., Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In NAACL-HLT. 10.18653/v1/N19-1423. Askenazi, Marto, Linial (b0045) 2010; 10 Demis Hassabis. 2020. “High Accuracy Protein Structure Prediction Using Deep Learning.” Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book), December. Allam, Nagy, Thoma, Krauthammer (b0005) 2019; 9 Shannon (b0440) 1951; 30 Almagro Armenteros, Jose Juan, Alexander Rosenberg Johansen, Ole Winther, and Henrik Nielsen. Language Modelling for Biological Sequences – Curated Datasets and Baselines. BioRxiv 2020. March, 2020.03.09.983585. 10.1101/2020.03.09.983585. Lan, Chen, Goodman, Gimpel, Sharma, Soricut (b0235) 2020 Varshavsky, Roy, Menachem Fromer, Amit Man, and Michal Linial. 2007. “When Less Is More : Improving Classification of Protein Families with a Minimal Set of Global Features,” 12–24. Cozzetto, Domenico, Federico Minneci, Hannah Currant, and David T. Jones. 2016. “FFPred 3: Feature-Based Function Prediction for All Gene Ontology Domains.” Sci Rep 6 (August). 10.1038/srep31865. 
Wang, Singh, Michael, Hill, Levy, Bowman (b0505) 2018 Schnoes, Ream, Thorman, Babbitt, Friedberg (b0420) 2013; 9 Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners,” 24. Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” ArXiv:1907.11692 [Cs], July. Savojardo, Castrense, Pier Luigi Martelli, Piero Fariselli, and Rita Casadio. 2018. “DeepSig: Deep Learning Improves Signal Peptide Detection in Proteins.” Edited by Alfonso Valencia. Bioinformatics 34 (10): 1690–96. Koumakis (b0210) 2020; 18 Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. “Bag of Tricks for Efficient Text Classification.” ArXiv:1607.01759 [Cs], August. Yang, Kevin K, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. 2018. “Learned Protein Embeddings for Machine Learning.” Edited by Jonathan Wren. Bioinformatics 34 (15): 2642–48. 10.1093/bioinformatics/bty178. Janin, Joël, Kim Henrick, John Moult, Lynn Ten Eyck, Michael J. E. Sternberg, Sandor Vajda, Ilya Vakser, and Shoshana J. Wodak. 2003. CAPRI: A Critical Assessment of PRedicted Interactions. Proteins: Struct Funct Bioinformatics 52 (1): 2–9. 10.1002/prot.10381. Bepler, Tristan, Bonnie Berger. 2019. “Learning Protein Sequence Embeddings Using Information from Structure.” ArXiv:1902.08661 [Cs, q-Bio, Stat], October. http://arxiv.org/abs/1902.08661. Keskar, Nitish Shirish, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. “CTRL: A Conditional Transformer Language Model for Controllable Generation.” ArXiv:1909.05858 [Cs], September. Murphy, Wallqvist, Levy (b0290) 2000; 13 Littmann, Maria, Michael Heinzinger, Christian Dallago, Tobias Olenyi, and & Burkhard Rost. 2020. 
|
| References_xml |
How does protein synthesis give rise to the 3D-structure? publication-title: FEBS Lett doi: 10.1016/0014-5793(91)80799-9 – volume: 11 start-page: 6044 issue: 12 year: 2012 ident: 10.1016/j.csbj.2021.03.022_b9005 article-title: Evaluation of Database Search Programs for Accurate Detection of Neuropeptides in Tandem Mass Spectrometry Experiments publication-title: J Proteome Res doi: 10.1021/pr3007123 – volume: 9 start-page: 9277 issue: 1 year: 2019 ident: 10.1016/j.csbj.2021.03.022_b0005 article-title: Neural networks versus logistic regression for 30 days all-cause readmission prediction publication-title: Sci Rep doi: 10.1038/s41598-019-45685-z – ident: 10.1016/j.csbj.2021.03.022_b0030 – start-page: 67 year: 2017 ident: 10.1016/j.csbj.2021.03.022_b9020 – volume: 71 start-page: 148 issue: 1 year: 1996 ident: 10.1016/j.csbj.2021.03.022_b0465 article-title: The shannon information entropy of protein sequences publication-title: Biophys J doi: 10.1016/S0006-3495(96)79210-X – ident: 10.1016/j.csbj.2021.03.022_b0020 doi: 10.1093/bioinformatics/btx431 – ident: 10.1016/j.csbj.2021.03.022_b0265 doi: 10.1101/2020.03.07.982272 – ident: 10.1016/j.csbj.2021.03.022_b0480 – ident: 10.1016/j.csbj.2021.03.022_b0335 doi: 10.3115/v1/D14-1162 – volume: 371 start-page: 284 issue: 6526 year: 2021 ident: 10.1016/j.csbj.2021.03.022_b0165 article-title: Learning the language of viral evolution and escape publication-title: Science doi: 10.1126/science.abd7331 – ident: 10.1016/j.csbj.2021.03.022_b0115 – ident: 10.1016/j.csbj.2021.03.022_b0100 – year: 2020 ident: 10.1016/j.csbj.2021.03.022_b0235 article-title: ALBERT: A lite BERT for self-supervised learning of language representations publication-title: ArXiv – ident: 10.1016/j.csbj.2021.03.022_b0360 – volume: 21 start-page: i378 issue: 1 year: 2005 ident: 10.1016/j.csbj.2021.03.022_b0405 article-title: Families of membranous proteins can be characterized by the amino acid composition of their transmembrane domains publication-title: 
Bioinformatics doi: 10.1093/bioinformatics/bti1035 – year: 2017 ident: 10.1016/j.csbj.2021.03.022_b0300 article-title: Evaluating vector-space models of word representation, or, the unreasonable effectiveness of counting words near other words publication-title: CogSci – year: 2020 ident: 10.1016/j.csbj.2021.03.022_b0350 article-title: Aligning the pretraining and finetuning objectives of language models publication-title: ArXiv – ident: 10.1016/j.csbj.2021.03.022_b0270 – ident: 10.1016/j.csbj.2021.03.022_b0445 doi: 10.1093/bioinformatics/btaa1036 – ident: 10.1016/j.csbj.2021.03.022_b0135 doi: 10.1101/2020.07.12.199554 – year: 2019 ident: 10.1016/j.csbj.2021.03.022_b0070 article-title: Using deep learning to annotate the protein universe publication-title: BioRxiv – ident: 10.1016/j.csbj.2021.03.022_b0415 doi: 10.1093/bioinformatics/btx818 – ident: 10.1016/j.csbj.2021.03.022_b0530 doi: 10.18653/v1/K19-1052 – year: 2018 ident: 10.1016/j.csbj.2021.03.022_b0400 article-title: NLP’s imagenet moment has arrived publication-title: Gradient. 
– volume: 5 start-page: 6 issue: January year: 2010 ident: 10.1016/j.csbj.2021.03.022_b0425 article-title: Cooperativity within proximal phosphorylation sites is revealed from large-scale proteomics data publication-title: Biology Direct doi: 10.1186/1745-6150-5-6 – volume: 207 year: 2006 ident: 10.1016/j.csbj.2021.03.022_b0055 article-title: Protein Sequence Motifs: Highly Predictive Features of Protein Function publication-title: Stud Fuzziness Soft Comput doi: 10.1007/978-3-540-35488-8_32 – volume: 10 issue: 11 year: 2015 ident: 10.1016/j.csbj.2021.03.022_b0040 article-title: Continuous Distributed representation of biological sequences for deep proteomics and genomics publication-title: PLoS ONE doi: 10.1371/journal.pone.0141287 – volume: 5 start-page: 135 issue: December year: 2017 ident: 10.1016/j.csbj.2021.03.022_b0075 article-title: Enriching word vectors with subword information publication-title: Trans Assoc Computat Linguis doi: 10.1162/tacl_a_00051 – volume: 23 start-page: 612 issue: 5 year: 2007 ident: 10.1016/j.csbj.2021.03.022_b9010 article-title: Speeding up Tandem Mass Spectrometry Database Search: Metric Embeddings and Fast near Neighbor Search publication-title: Bioinformatics doi: 10.1093/bioinformatics/btl645 – ident: 10.1016/j.csbj.2021.03.022_b0250 – ident: 10.1016/j.csbj.2021.03.022_b0380 doi: 10.1101/2021.02.12.430858 – ident: 10.1016/j.csbj.2021.03.022_b0550 – year: 2017 ident: 10.1016/j.csbj.2021.03.022_b0575 article-title: Understanding deep learning requires rethinking generalization publication-title: ArXiv – ident: 10.1016/j.csbj.2021.03.022_b0315 doi: 10.1093/bioinformatics/btv345 – volume: 576 start-page: 348 issue: 3 year: 2004 ident: 10.1016/j.csbj.2021.03.022_b0515 article-title: Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein publication-title: FEBS Lett doi: 10.1016/j.febslet.2004.09.036 – ident: 10.1016/j.csbj.2021.03.022_b0375 doi: 10.1101/676825 – volume: 406 start-page: 89 
year: 2007 ident: 10.1016/j.csbj.2021.03.022_b0080 article-title: UniProtKB/Swiss-Prot: The manually annotated section of the uniprot knowledgebase publication-title: Methods Mol Biol |
| SecondaryResourceType | review_article |
| SourceID | doaj; pubmedcentral; proquest; pubmed; crossref; elsevier |
| SourceType | Open Website; Open Access Repository; Aggregation Database; Index Database; Enrichment Source; Publisher |
| StartPage | 1750 |
| SubjectTerms | amino acids; Artificial neural networks; automation; Bag of words; BERT; Bioinformatics; biotechnology; computer science; Contextualized embedding; Deep learning; Language models; Natural language processing; Review; Tokenization; Transformer; Word embedding; Word2vec |
| Title | The language of proteins: NLP, machine learning & protein sequences |
| URI | https://dx.doi.org/10.1016/j.csbj.2021.03.022 https://www.ncbi.nlm.nih.gov/pubmed/33897979 https://www.proquest.com/docview/2518736718 https://www.proquest.com/docview/2574317704 https://pubmed.ncbi.nlm.nih.gov/PMC8050421 https://doaj.org/article/92521a99034d4afe9fbf0dd2f99b7a0a |
| Volume | 19 |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=The+language+of+proteins%3A+NLP%2C+machine+learning+%26+protein+sequences&rft.jtitle=Computational+and+structural+biotechnology+journal&rft.au=Ofer%2C+Dan&rft.au=Brandes%2C+Nadav&rft.au=Linial%2C+Michal&rft.date=2021-01-01&rft.issn=2001-0370&rft.eissn=2001-0370&rft.volume=19&rft.spage=1750&rft.epage=1758&rft_id=info:doi/10.1016%2Fj.csbj.2021.03.022&rft.externalDBID=n%2Fa&rft.externalDocID=10_1016_j_csbj_2021_03_022 |