Search algorithms of verbal identity markers in modern scientific discourse

Detailed bibliography
Published in: Aktualʹnye problemy filologii i pedagogičeskoj lingvistiki, No. 2, pp. 18-29
Main authors: Goncharova, Oksana V., Zavrumov, Zaur A., Khaleeva, Svetlana
Format: Journal Article
Language: English, German
Published: Publishing and Printing Center NOSU, 25.06.2024
ISSN: 2079-6021, 2619-029X
Abstract This article examines how identity is verbalized, using data mining methods. The research material consists of English-language texts on various concepts of youth identity, drawn from online scientific repositories and e-libraries. A methodology based on modern natural language processing and machine learning tools was developed and tested as part of the research. The Natural Language Toolkit (NLTK) library was used for tokenization and POS tagging, and for calculating the frequency of tokens that occur in the context of the word «identity». Word embeddings, a pre-trained Word2Vec model, and the K-means algorithm were then used to analyze and cluster words by semantic proximity. The Gensim and Scikit-learn libraries were used to work with the Word2Vec model. The results indicate that, in English scientific discourse, a young person's identity is verbalized within 9 semantic categories: behavior, communities, communication, education, identity, language, practice, complexity, and science. The most common are education (33%), language (21%), and communities (18%). N-gram analysis made it possible to identify semantic fields, establish their attributes, and evaluate the similarity of texts, which yielded the most accurate vector space search for semantically close n-grams. Optimization made it possible to establish a similarity measure for ranking phrases against a query and to assign each n-gram a ranking weight. Further improvements could be achieved by adding statistical word weighting, such as TF-IDF. The proposed system can search a large text collection for related phrases with similar meaning.
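The phrase-ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names NLTK, Gensim, and Scikit-learn with a pre-trained Word2Vec model, whereas this sketch uses only the Python standard library and substitutes TF-IDF-weighted bag-of-words vectors compared by cosine similarity. The function names (tokenize, ngrams, idf_weights, tfidf_vector, cosine, rank_phrases), the smoothed IDF formula, and the sample sentence are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (a crude stand-in for NLTK tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_weights(phrases):
    """Smoothed inverse document frequency, treating each phrase as one document."""
    doc_freq = Counter()
    for phrase in phrases:
        doc_freq.update(set(tokenize(phrase)))
    total = len(phrases)
    return {w: math.log((1 + total) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; words absent from the phrase set are ignored."""
    counts = Counter(tokenize(text))
    return {w: c * idf[w] for w, c in counts.items() if w in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_phrases(query, phrases):
    """Return (score, phrase) pairs sorted from most to least similar to the query."""
    idf = idf_weights(phrases)
    query_vec = tfidf_vector(query, idf)
    scored = [(cosine(query_vec, tfidf_vector(p, idf)), p) for p in phrases]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample sentence, not taken from the study's corpus.
sample = ("youth identity is shaped by language practice "
          "and by education in online communities")
bigrams = [" ".join(g) for g in ngrams(tokenize(sample), 2)]
for score, phrase in rank_phrases("language education", bigrams)[:4]:
    print(f"{score:.3f}  {phrase}")
```

Because frequent words such as "by" receive a lower IDF weight, a phrase that pairs a query word with a common word scores higher than one that pairs it with a rarer non-query word. A Word2Vec-based variant, as named in the abstract, would replace tfidf_vector with averaged word embeddings and could also match phrases that share no surface words with the query.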
Author Zavrumov, Zaur A.
Goncharova, Oksana V.
Khaleeva, Svetlana
Author_xml – sequence: 1
  givenname: Oksana V.
  surname: Goncharova
  fullname: Goncharova, Oksana V.
– sequence: 2
  givenname: Zaur A.
  surname: Zavrumov
  fullname: Zavrumov, Zaur A.
– sequence: 3
  givenname: Svetlana
  surname: Khaleeva
  fullname: Khaleeva, Svetlana
ContentType Journal Article
DBID AAYXX
CITATION
DOA
DOI 10.29025/2079-6021-2024-2-18-29
DatabaseName CrossRef
DOAJ Directory of Open Access Journals
DatabaseTitle CrossRef
DatabaseTitleList CrossRef

Database_xml – sequence: 1
  dbid: DOA
  name: DOAJ Directory of Open Access Journals
  url: https://www.doaj.org/
  sourceTypes: Open Website
DeliveryMethod fulltext_linktorsrc
Discipline Languages & Literatures
EISSN 2619-029X
EndPage 29
ExternalDocumentID oai_doaj_org_article_3ce1e49a0bc449eda4ae80c6dc7f8adb
10_29025_2079_6021_2024_2_18_29
GroupedDBID 642
AAYXX
ALMA_UNASSIGNED_HOLDINGS
CITATION
GROUPED_DOAJ
ID FETCH-LOGICAL-c1639-a404264e609d1582df32e12deccebf6dae3ec26741440e29b84d30db64868a033
IEDL.DBID DOA
ISSN 2079-6021
IngestDate Fri Oct 03 12:50:00 EDT 2025
Sat Nov 29 02:24:21 EST 2025
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 2
Language English
German
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c1639-a404264e609d1582df32e12deccebf6dae3ec26741440e29b84d30db64868a033
OpenAccessLink https://doaj.org/article/3ce1e49a0bc449eda4ae80c6dc7f8adb
PageCount 12
ParticipantIDs doaj_primary_oai_doaj_org_article_3ce1e49a0bc449eda4ae80c6dc7f8adb
crossref_primary_10_29025_2079_6021_2024_2_18_29
PublicationCentury 2000
PublicationDate 2024-06-25
PublicationDateYYYYMMDD 2024-06-25
PublicationDate_xml – month: 06
  year: 2024
  text: 2024-06-25
  day: 25
PublicationDecade 2020
PublicationTitle Aktualʹnye problemy filologii i pedagogičeskoj lingvistiki
PublicationYear 2024
Publisher Publishing and Printing Center NOSU
Publisher_xml – name: Publishing and Printing Center NOSU
SSID ssj0002891541
Score 2.2600207
SourceID doaj
crossref
SourceType Open Website
Index Database
StartPage 18
SubjectTerms data mining
identity verbalization
internet scientific repositories
python
scientific discourse
semantic category
youth identity
Title Search algorithms of verbal identity markers in modern scientific discourse
URI https://doaj.org/article/3ce1e49a0bc449eda4ae80c6dc7f8adb
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVAON
  databaseName: DOAJ Directory of Open Access Journals
  customDbUrl:
  eissn: 2619-029X
  dateEnd: 20241231
  omitProxy: false
  ssIdentifier: ssj0002891541
  issn: 2079-6021
  databaseCode: DOA
  dateStart: 20170101
  isFulltext: true
  titleUrlDefault: https://www.doaj.org/
  providerName: Directory of Open Access Journals
linkProvider Directory of Open Access Journals