An index-based algorithm for fast on-line query processing of latent semantic analysis

Bibliographic Details
Published in:	PLoS ONE Vol. 12; no. 5; p. e0177523
Main Authors:	Zhang, Mingxi; Li, Pohan; Wang, Wei
Format:	Journal Article
Language:	English
Published:	United States: Public Library of Science (PLoS), 16.05.2017
Subjects:	Accuracy; Agglomeration; Algorithms; Analogies; Approximation; Arrhythmia; Asymmetry; Attention; Biology and Life Sciences; CAD; Cardiovascular diseases; Classification; Cognitive ability; Collaboration; Computer aided design; Computer science; Data analysis; Data management; Data processing; Data transmission; Datasets; Decomposition; Drugs; Efficiency; Electronic records; Factorization; Filtration; Forecasting; Gene expression; Heart diseases; Humans; Image processing; Immunoglobulin M; Information processing; Information systems; International conferences; Keywords; Language; Learning algorithms; Machine learning; Mathematical models; Matrices (mathematics); Medicine and Health Sciences; Mining; Models, Theoretical; Neurocomputing; Optimization; Optimization techniques; Outsourcing; Physical Sciences; Probability theory; Propagation; Query expansion; Remote sensing; Research and Analysis Methods; Sampling; Scoring; Semantic analysis; Semantics; Sign language; Social Sciences; Splitting; Text categorization; Valleys; Video data
ISSN:	1932-6203

Description
Summary:	Latent Semantic Analysis (LSA) is widely used to find documents that are semantically similar to a keyword query. Although LSA yields promising results, existing LSA algorithms perform many unnecessary operations in similarity computation and candidate checking during on-line query processing. This is expensive in time and cannot respond to query requests efficiently, especially when the dataset is large. In this paper, we study the efficiency of on-line query processing for LSA, with the goal of efficiently finding the documents similar to a given query. We rewrite the LSA similarity equation in terms of an intermediate value called partial similarity, which is stored in a purpose-built index called the partial index. To reduce the search space, we give an approximate form of the similarity equation and then develop an efficient algorithm for building the partial index that skips partial similarities lower than a given threshold θ. Based on the partial index, we develop an efficient algorithm called ILSA to support fast on-line query processing. The given query is transformed into a pseudo document vector, and the similarities between the query and the candidate documents are computed by accumulating the partial similarities obtained from the index nodes that correspond to non-zero entries in the pseudo document vector. Compared to the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning unpromising candidate documents and skipping operations that contribute little to the similarity scores. Extensive experiments comparing ILSA with LSA demonstrate the efficiency and effectiveness of the proposed algorithm.
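The summary's idea of a thresholded partial index and score accumulation can be illustrated with a minimal Python sketch. This is not the authors' ILSA implementation: the record does not give the paper's similarity equation, so the sketch assumes a plain dot-product decomposition in which the partial similarity of entry i and document d is the dot product of two latent vectors. The names `build_partial_index`, `ilsa_query`, `entry_vectors`, and `doc_vectors`, as well as the sample numbers, are hypothetical, and the transformation of the query into its non-zero entries is taken as given.

```python
from collections import defaultdict

def build_partial_index(entry_vectors, doc_vectors, theta):
    """Off-line step (assumed form): one index node per entry i, holding
    (doc_id, partial_similarity) pairs. Pairs whose partial similarity is
    lower than theta are skipped, so they are never visited at query time."""
    index = {}
    for i, ev in enumerate(entry_vectors):
        node = []
        for d, dv in enumerate(doc_vectors):
            partial = sum(a * b for a, b in zip(ev, dv))  # assumed dot product
            if partial >= theta:
                node.append((d, partial))
        index[i] = node
    return index

def ilsa_query(index, query_entries, top_n):
    """On-line step: visit only the index nodes of the non-zero query entries
    and accumulate weighted partial similarities per candidate document."""
    scores = defaultdict(float)
    for i, weight in query_entries.items():
        if weight == 0:
            continue
        for d, partial in index.get(i, []):
            scores[d] += weight * partial
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical toy data: 3 entries and 3 documents in a 2-dimensional latent space.
entry_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_vectors = [[0.75, 0.125], [0.125, 0.75], [0.5, 0.5]]

index = build_partial_index(entry_vectors, doc_vectors, theta=0.25)
ranking = ilsa_query(index, {0: 1.0, 2: 1.0}, top_n=2)
```

In this sketch the threshold trades accuracy for speed: a larger `theta` leaves fewer pairs in each index node, so a query touches fewer candidates, but the accumulated scores become approximations of the full similarity.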
Bibliography:	Conceptualization: MZ PL. Data curation: MZ PL. Formal analysis: MZ PL. Funding acquisition: MZ WW. Investigation: MZ PL. Methodology: MZ PL WW. Project administration: MZ PL WW. Resources: MZ PL WW. Software: MZ PL. Supervision: WW. Visualization: MZ. Writing – original draft: MZ PL WW. Competing Interests: The authors have declared that no competing interests exist.
DOI:	10.1371/journal.pone.0177523