Learning Question Similarity with Recurrent Neural Networks

Bibliographic details
Published in: 2017 IEEE International Conference on Big Knowledge (ICBK), pp. 111-118
Main authors: Borui Ye, Guangyu Feng, Anqi Cui, Ming Li
Format: Conference paper
Language: English
Published: IEEE, 1 August 2017
Description
Summary: The measurement of semantic similarity is a fundamental task in natural language processing. In the setting of a community question answering (cQA) system, it is essentially a classification problem: given a pair of questions, label it similar, relevant, or irrelevant. Traditional methods, either those at word level or at sentence level, typically require many lexical and syntactic resources, which are not available in languages other than English. In addition, there does not exist a finely annotated dataset for our purpose. In this paper, we constructed a dataset containing 4,322 labelled question pairs in Chinese, which is, to the best of our knowledge, the first open Chinese dataset for question similarity classification. We propose a novel framework for measuring the semantic similarity between sentences based on the architecture of a recurrent neural network (RNN) encoder-decoder, which does not require lexical or syntactic resources. We solve the problem of lacking labelled data by first training the RNN using a larger dataset of question pairs that are automatically labelled with heuristic scores, and then fine-tuning it with our smaller, manually labelled dataset. The two-step training scheme improves the accuracy of classification compared to single-step training, and also outperforms other traditional models. The proposed model is capable of both classification and candidate ranking.
DOI: 10.1109/ICBK.2017.46
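
The two-step training scheme described in the summary (pretraining on automatically scored pairs, then fine-tuning on a small manually labelled set) can be illustrated with a minimal sketch. This is not the authors' model: a tiny logistic regression over hand-made features stands in for the RNN encoder-decoder, the task is reduced to binary similar/not-similar rather than the three classes in the paper, and the heuristic score (character-set Jaccard overlap), the features, the toy question pairs, and all function names are assumptions made purely for illustration.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def heuristic_score(q1: str, q2: str) -> float:
    """Hypothetical automatic label: overlap of the character sets."""
    return jaccard(set(q1), set(q2))


def bigrams(q: str) -> set:
    """Set of character bigrams of a string."""
    return {q[i:i + 2] for i in range(len(q) - 1)}


def features(q1: str, q2: str) -> list:
    """Stand-in features: bias, bigram overlap, relative length difference."""
    longest = max(len(q1), len(q2), 1)
    return [1.0, jaccard(bigrams(q1), bigrams(q2)), abs(len(q1) - len(q2)) / longest]


def predict(w: list, x: list) -> float:
    """Sigmoid of the weighted feature sum, read as a similarity score."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(w: list, pairs: list, targets: list, lr: float, epochs: int) -> list:
    """Plain SGD on cross-entropy loss; returns a new weight list."""
    w = list(w)  # copy so the caller's weights are left untouched
    for _ in range(epochs):
        for (q1, q2), y in zip(pairs, targets):
            x = features(q1, q2)
            err = predict(w, x) - y  # gradient of the loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


# Step 1: larger set of pairs, labelled automatically with soft heuristic scores.
auto_pairs = [
    ("how to reset my password", "how do i reset a password"),
    ("how to reset my password", "best noodle shop nearby"),
    ("cheap flights to beijing", "cheap flight tickets to beijing"),
    ("cheap flights to beijing", "why is the sky blue"),
]
auto_targets = [heuristic_score(a, b) for a, b in auto_pairs]
w_init = [0.0, 0.0, 0.0]
w_pre = train(w_init, auto_pairs, auto_targets, lr=0.5, epochs=200)

# Step 2: fine-tune the same weights on a small manually labelled set,
# using a smaller learning rate so the pretrained weights are only adjusted.
manual_pairs = [
    ("how to reset my password", "forgot password how to reset"),
    ("how to reset my password", "how to cook rice"),
]
manual_labels = [1.0, 0.0]
w_fine = train(w_pre, manual_pairs, manual_labels, lr=0.1, epochs=50)

score = predict(w_fine, features("reset password", "how to reset password"))
print(f"similarity score: {score:.3f}")
```

Because the model outputs a continuous score, the same function can in principle be thresholded for classification or used to sort candidate questions for ranking, which mirrors the dual use mentioned in the summary.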