Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders


Bibliographic Details
Published in: arXiv.org
Main Authors: Ribadas-Pena, Francisco J; Cao, Shuyuan; Víctor M Darriba Bilbao
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 03.02.2024
ISSN: 2331-8422
Description
Summary: In this paper, we introduce a multi-label lazy learning approach to automatic semantic indexing in large document collections whose label vocabularies are complex, structured, and highly inter-correlated. The proposed method extends the traditional k-Nearest Neighbors algorithm with a large autoencoder trained to map the large label space to a reduced-size latent space and to regenerate the predicted labels from that latent space. We evaluated our proposal on a large portion of the MEDLINE biomedical document collection, which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments, we propose and evaluate several document representation approaches and different label autoencoder configurations.
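
The summary describes the method only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the latent label vectors of the k most similar training documents are averaged and then decoded back into label scores. The names `knn_latent_predict`, `encode`, `decode`, `DOCS`, and `LABELS` are hypothetical, and the hand-written `encode`/`decode` pair is a toy stand-in for a trained label autoencoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_latent_predict(query, doc_vectors, label_vectors, encode, decode,
                       k=3, threshold=0.5):
    """Predict label indices for `query` from its k nearest training documents.

    Assumption (not stated in the summary): neighbor labels are combined by
    averaging their latent codes before decoding.
    """
    # Rank training documents by similarity to the query and keep the top k.
    ranked = sorted(range(len(doc_vectors)),
                    key=lambda i: cosine(query, doc_vectors[i]),
                    reverse=True)[:k]
    # Map each neighbor's label vector into the reduced latent space.
    latents = [encode(label_vectors[i]) for i in ranked]
    dim = len(latents[0])
    mean_latent = [sum(z[d] for z in latents) / len(latents) for d in range(dim)]
    # Regenerate label scores from the latent space and threshold them.
    scores = decode(mean_latent)
    return [j for j, score in enumerate(scores) if score >= threshold]


# Toy stand-in for a trained autoencoder: four labels forming two perfectly
# correlated pairs, compressed to a two-dimensional latent space.
def encode(y):
    return [(y[0] + y[1]) / 2, (y[2] + y[3]) / 2]


def decode(z):
    return [z[0], z[0], z[1], z[1]]


# Hypothetical training data: two-dimensional document vectors and their labels.
DOCS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
LABELS = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]

print(knn_latent_predict([1.0, 0.05], DOCS, LABELS, encode, decode, k=2))  # [0, 1]
```

In a realistic setting the stand-in functions would be replaced by the encoder and decoder halves of a trained neural autoencoder, and the document vectors by one of the document representations the paper evaluates.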
Bibliography: SourceType-Working Papers-1
ObjectType-Working Paper/Pre-Print-1
DOI: 10.48550/arxiv.2402.01963