Hashcode representations of natural language for relation extraction

Detailed bibliography
Title: Hashcode representations of natural language for relation extraction
Author: Garg, Sahil (author)
Publisher information: University of Southern California Digital Library (USC.DL), 2019.
Year of publication: 2019
Subjects: Viterbi School of Engineering (school), Computer Science (degree program), Doctor of Philosophy (degree)
Description: This thesis studies the problem of identifying and extracting relationships between biological entities from the text of scientific papers. For the relation extraction task, state-of-the-art performance has been achieved by classification methods based on convolution kernels, which facilitate sophisticated reasoning on natural language text using structural similarities between sentences and/or their parse trees. Despite their success, however, kernel-based methods are difficult to customize and computationally expensive to scale to large datasets. In this thesis, the first problem is addressed by proposing a nonstationary extension to the conventional convolution kernels for improved expressiveness and flexibility. For scalability, I propose to employ kernelized locality-sensitive hashcodes as explicit representations of natural language structures, which can be used as feature-vector inputs to arbitrary classification methods. For optimizing the representations, a theoretically justified method is proposed that is based on approximate and efficient maximization of the mutual information between the hashcodes and the class labels. I have evaluated the proposed approach on multiple biomedical relation extraction datasets, and have observed significant and robust improvements in accuracy over state-of-the-art classifiers, along with a drastic, orders-of-magnitude speedup compared to conventional kernel methods. Finally, in this thesis, a nearly-unsupervised framework is introduced for learning kernel- or neural-hashcode representations. In this framework, an information-theoretic objective is defined which leverages both labeled and unlabeled data points for fine-grained optimization of each hash function, and a greedy algorithm is proposed for maximizing that objective. This novel learning paradigm is beneficial for building hashcode representations that generalize from a training set to a test set. I have conducted a thorough experimental evaluation on the relation extraction datasets, and demonstrated that the proposed extension leads to superior accuracies with respect to state-of-the-art supervised and semi-supervised approaches, such as variational autoencoders and adversarial neural networks. An added benefit of the proposed representation learning technique is that it is easily parallelizable, interpretable, and, owing to its generality, applicable to a wide range of NLP problems. (An illustrative sketch of the hashcode-representation idea follows this record.)
Document type: Doctoral thesis
Language: English
DOI: 10.25549/usctheses-c89-204564
Accession number: edsair.doi...........db55a71a94da24fdfedc93ca209d07e8
Database: OpenAIRE
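
The description above outlines two core ideas: kernelized locality-sensitive hashcodes as explicit binary feature vectors for arbitrary classifiers, and a greedy, mutual-information-driven procedure for choosing hash functions. Below is a minimal Python sketch of those ideas, not the thesis's implementation: it assumes an RBF kernel over numeric vectors in place of a convolution kernel over parse structures, uses marginal per-bit mutual information as a crude stand-in for the approximate MI-maximization objective, and all function and parameter names are illustrative.

```python
# Minimal, illustrative sketch of kernelized locality-sensitive hashing with
# greedy, mutual-information-based selection of hash functions. NOT the thesis
# implementation: the RBF kernel stands in for a convolution kernel over parse
# structures, and marginal per-bit MI approximates the joint objective.
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.metrics.pairwise import rbf_kernel


def apply_hash(K, w):
    """One binary bit per example: sign of the projection of its kernel
    similarities (to the reference set) onto a random hyperplane w."""
    return (K @ w > 0).astype(int)


def greedy_select_hashes(X_labeled, y, X_ref, n_bits=16, n_candidates=200, seed=0):
    """Sample candidate hash functions (random hyperplanes in kernel-similarity
    space) and greedily keep those whose bits carry the most mutual information
    about the class labels."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(X_labeled, X_ref)  # kernel similarities to the reference set
    candidates = [rng.standard_normal(K.shape[1]) for _ in range(n_candidates)]
    candidates.sort(key=lambda w: mutual_info_score(y, apply_hash(K, w)),
                    reverse=True)
    return candidates[:n_bits]


def hashcodes(X, X_ref, hash_funcs):
    """Explicit binary feature vectors, usable as input to any classifier."""
    K = rbf_kernel(X, X_ref)
    return np.column_stack([apply_hash(K, w) for w in hash_funcs])


if __name__ == "__main__":
    # Synthetic demo data; in the thesis setting the inputs would be parsed
    # sentences and the kernel a convolution kernel over their structures.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 10))
    y = rng.integers(0, 2, 200)
    X_ref = X[rng.choice(200, size=20, replace=False)]  # small reference set
    H = hashcodes(X, X_ref, greedy_select_hashes(X, y, X_ref))
    print(H.shape)  # (200, 16) binary feature matrix
```

The thesis's semi-supervised, per-hash-function objective over labeled and unlabeled data, as well as its neural variants, go well beyond this marginal-MI heuristic; the sketch only conveys the overall flow from kernel similarities to binary hashcode feature vectors.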