Learning Question Similarity with Recurrent Neural Networks

The measurement of semantic similarity is a fundamental task in natural language processing. In the settings of a community question answering (cQA) system, it is essentially a classification problem: given a pair of questions, label it similar, relevant, or irrelevant. Traditional methods, either t...

Full description

Saved in:
Bibliographic Details
Published in:2017 IEEE International Conference on Big Knowledge (ICBK) pp. 111 - 118
Main Authors: Borui Ye, Guangyu Feng, Anqi Cui, Ming Li
Format: Conference Proceeding
Language:English
Published: IEEE 01.08.2017
Subjects:
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The measurement of semantic similarity is a fundamental task in natural language processing. In the settings of a community question answering (cQA) system, it is essentially a classification problem: given a pair of questions, label it similar, relevant, or irrelevant. Traditional methods, either those at word level or at sentence level, typically require many lexical and syntactic resources, which are not available in languages other than English. In addition, there does not exist a finely annotated dataset for our purpose. In this paper, we constructed a dataset containing 4,322 labelled question pairs in Chinese, which is, to the best of our knowledge, the first open Chinese dataset for question similarity classification. We propose a novel framework for measuring the semantic similarity between sentences based on the architecture of a recurrent neural network (RNN) encoderdecoder, which does not require lexical or syntactic resources. We solve the problem of lacking labelled data by first training the RNN using a larger dataset of question pairs that are automatically labelled with heuristic scores, and then fine-tuning it with our smaller, manually labelled dataset. The two-step training scheme improves the accuracy of classification compared to single-step training, and also outperforms other traditional models. The proposed model is capable of both classification and candidate ranking.
DOI:10.1109/ICBK.2017.46