Search algorithms of verbal identity markers in modern scientific discourse

Published in:	Aktualʹnye problemy filologii i pedagogičeskoj lingvistiki, no. 2, pp. 18-29
Main authors:	Goncharova, Oksana V., Zavrumov, Zaur A., Khaleeva, Svetlana
Format:	Journal Article
Language:	English, German
Published:	Publishing and Printing Center NOSU, 25.06.2024
Subjects:	data mining; identity verbalization; internet scientific repositories; python; scientific discourse; semantic category; youth identity
ISSN:	2079-6021, 2619-029X

Abstract	The article examines how identity is verbalized, using data mining methods. The research material consists of English-language texts on various concepts of youth identity, drawn from online scientific repositories and e-libraries. A methodology based on modern natural language processing and machine learning tools was developed and tested as part of the research. The Natural Language Toolkit (NLTK) library was used for tokenization and POS tagging in order to calculate the frequency of tokens occurring in the context of the word «identity». Word embeddings, a pre-trained Word2Vec model, and the K-means algorithm were then used to analyze and cluster words by semantic proximity. The Gensim and Scikit-learn libraries were used to work with the Word2Vec model. The results indicate that in English scientific discourse a young person’s identity is verbalized within nine semantic categories: behavior, communities, communication, education, identity, language, practice, complexity, and science, of which the most common are education (33%), language (21%), and communities (18%). N-gram analysis made it possible to identify semantic fields, establish their attributes, and evaluate text similarity, which provided the most accurate vector-space search for semantically close n-grams. Optimization made it possible to establish a similarity measure that ranks phrases against a query and to assign each n-gram a ranking weight. Further improvements could be achieved by adding statistical word weighting, such as TF-IDF. The proposed system can search a large body of text for related phrases with similar meaning.
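The two steps the abstract describes, counting tokens around «identity» and ranking phrases against a query by a similarity measure, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain Python in place of NLTK, and it swaps the Word2Vec embeddings for sparse TF-IDF bag-of-words vectors so that it runs without external libraries. The function names, the window size, and the sample phrases are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (a stand-in for NLTK tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def context_counts(tokens, target="identity", window=2):
    """Count tokens found within `window` positions of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def tfidf_vectors(phrases):
    """Return one sparse TF-IDF vector (dict) per phrase, plus the IDF table."""
    docs = [Counter(tokenize(p)) for p in phrases]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}  # smoothed IDF
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, phrases):
    """Sort phrases by cosine similarity to the query, most similar first."""
    vectors, idf = tfidf_vectors(phrases)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, v), p) for v, p in zip(vectors, phrases)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    sample = "Youth identity is shaped by language; identity markers appear in education."
    print(context_counts(tokenize(sample)).most_common(3))
    phrases = [
        "youth identity formation",
        "language learning practice",
        "national identity and language",
    ]
    for score, phrase in rank("youth identity", phrases):
        print(f"{score:.3f}  {phrase}")
```

With dense embeddings, as in the study, `cosine` would operate on Word2Vec vectors instead, and the TF-IDF weights could be applied when averaging word vectors into a phrase vector. That combination is an assumption here, not something the abstract specifies.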
DOI	10.29025/2079-6021-2024-2-18-29
Discipline	Languages & Literatures
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
OpenAccessLink	https://doaj.org/article/3ce1e49a0bc449eda4ae80c6dc7f8adb
PageCount	12