A Study of Two Sampling Methods for Analyzing Large Datasets with ILP

This paper is concerned with problems that arise when submitting large quantities of data to analysis by an Inductive Logic Programming (ILP) system. Complexity arguments usually make it prohibitive to analyse such datasets in their entirety. We examine two schemes that allow an ILP system to constr...

Full description

Saved in:
Bibliographic Details
Published in:Data mining and knowledge discovery Vol. 3; no. 1; pp. 95 - 123
Main Author: Srinivasan, Ashwin
Format: Journal Article
Language:English
Published: New York Springer Nature B.V 01.03.1999
Subjects:
ISSN:1384-5810, 1573-756X
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:This paper is concerned with problems that arise when submitting large quantities of data to analysis by an Inductive Logic Programming (ILP) system. Complexity arguments usually make it prohibitive to analyse such datasets in their entirety. We examine two schemes that allow an ILP system to construct theories by sampling from this large pool of data. The first, "subsampling", is a single-sample design in which the utility of a potential rule is evaluated on a randomly selected sub-sample of the data. The second, "logical windowing", is multiple-sample design that tests and sequentially includes errors made by a partially correct theory. Both schemes are derived from techniques developed to enable propositional learning methods (like decision trees) to cope with large datasets. The ILP system CProgol, equipped with each of these methods, is used to construct theories for two datasets--one artificial (a chess endgame) and the other naturally occurring (a language tagging problem). In each case, we ask the following questions of CProgol equipped with sampling: (1) Is its theory comparable in predictive accuracy to that obtained if all the data were used (that is, no sampling was employed)?; and (2) Is its theory constructed in less time than the one obtained with all the data? For the problems considered, the answers to these questions is "yes". This suggests that an ILP program equipped with an appropriate sampling method could begin to address problems satisfactorily that have hitherto been inaccessible simply due to data extent.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
content type line 14
ISSN:1384-5810
1573-756X
DOI:10.1023/A:1009824123462