Search algorithms of verbal identity markers in modern scientific discourse
| Published in: | Aktualʹnye problemy filologii i pedagogičeskoj lingvistiki, no. 2, pp. 18-29 |
|---|---|
| Main authors: | Goncharova, Oksana V.; Zavrumov, Zaur A.; Khaleeva, Svetlana |
| Format: | Journal Article |
| Language: | English, German |
| Published: | Publishing and Printing Center NOSU, 25 June 2024 |
| ISSN: | 2079-6021, 2619-029X |
| Abstract | This article studies how identity is verbalized, using data mining. The research material consists of English-language texts on various concepts of youth identity, drawn from online scientific repositories and e-libraries. A methodology based on modern natural language processing and machine learning tools was developed and tested as part of the research. The Natural Language Toolkit (NLTK) library was used for tokenization and POS tagging, and for calculating the frequency of tokens that occur in the context of the word "identity". Word embeddings from a pre-trained Word2Vec model, together with the K-means algorithm, were then used to analyze and cluster words by semantic proximity; the Gensim and Scikit-learn libraries were used to work with the Word2Vec model. The study established that in English scientific discourse a young person's identity is verbalized within nine semantic categories: behavior, communities, communication, education, identity, language, practice, complexity, and science. The most common are education (33%), language (21%), and communities (18%). N-gram analysis made it possible to identify semantic fields, establish their attributes, and evaluate the similarity of texts, which provided the most accurate vector-space search for semantically close n-grams. Optimization made it possible to establish a similarity measure for ranking phrases against a query and to assign each n-gram a ranking weight. Further improvements could be achieved by adding statistical word weighting such as TF-IDF. The proposed system can search a large text collection for related phrases with similar meaning. |
|---|---|
| Author | Goncharova, Oksana V.; Zavrumov, Zaur A.; Khaleeva, Svetlana |
| ContentType | Journal Article |
| DOI | 10.29025/2079-6021-2024-2-18-29 |
| DatabaseName | CrossRef; DOAJ Directory of Open Access Journals |
| Discipline | Languages & Literatures |
| EISSN | 2619-029X |
| EndPage | 29 |
| ISSN | 2079-6021 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 2 |
| Language | English, German |
| OpenAccessLink | https://doaj.org/article/3ce1e49a0bc449eda4ae80c6dc7f8adb |
| PageCount | 12 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-06-25 |
| PublicationDateYYYYMMDD | 2024-06-25 |
| PublicationDecade | 2020 |
| PublicationTitle | Aktualʹnye problemy filologii i pedagogičeskoj lingvistiki |
| PublicationYear | 2024 |
| Publisher | Publishing and Printing Center NOSU |
| StartPage | 18 |
| SubjectTerms | data mining; identity verbalization; internet scientific repositories; Python; scientific discourse; semantic category; youth identity |
| Title | Search algorithms of verbal identity markers in modern scientific discourse |
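
The abstract describes counting tokens that occur near the word "identity" and ranking n-grams by similarity to a query, with TF-IDF suggested as a possible weighting. The sketch below illustrates both ideas under stated assumptions and is not the authors' implementation: it uses only the Python standard library, a regular-expression tokenizer stands in for NLTK, and sparse TF-IDF vectors with cosine similarity stand in for Word2Vec embeddings. All function names and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (a regex stand-in for NLTK tokenization)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def context_counts(tokens, target="identity", window=2):
    """Count tokens found within `window` positions of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def ngrams(tokens, n=2):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_weights(phrases):
    """Smoothed inverse document frequency, treating each phrase as a document."""
    n_docs = len(phrases)
    df = Counter(word for phrase in phrases for word in set(phrase))
    return {w: math.log((1 + n_docs) / (1 + d)) + 1.0 for w, d in df.items()}


def vectorize(phrase, idf):
    """Sparse TF-IDF vector; words unseen in the phrase collection get weight 0."""
    return {w: c * idf.get(w, 0.0) for w, c in Counter(phrase).items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, phrases):
    """Return (score, phrase) pairs sorted by similarity to `query`, best first."""
    idf = idf_weights(phrases)
    q = vectorize(query, idf)
    scored = [(cosine(q, vectorize(p, idf)), p) for p in phrases]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    sample = "Youth identity is shaped by language. Language practices shape identity in education."
    toks = tokenize(sample)
    print(context_counts(toks).most_common(3))
    for score, phrase in rank(("youth", "identity"), ngrams(toks))[:3]:
        print(round(score, 3), " ".join(phrase))
```

In this sketch the ranking weight of an n-gram is simply its cosine score against the query. Swapping `vectorize` for averaged Word2Vec vectors would bring it closer to the embedding-based search the abstract describes, at the cost of requiring a pre-trained model.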