Active Learning Genetic Programming for Record Deduplication

Bibliographic details
Published in: IEEE Congress on Evolutionary Computation, pp. 1-8
Main authors: de Freitas, Junio; Pappa, Gisele L.; da Silva, Altigran S.; Gonçalves, Marcos A.; Moura, Edleno; Veloso, Adriano; Laender, Alberto H.F.; de Carvalho, Moises G.
Format: Conference paper
Language: English
Published: IEEE, 1 July 2010
ISBN:1424469090, 9781424469093
ISSN:1089-778X
Description
Summary: The great majority of genetic programming (GP) algorithms that deal with the classification problem follow a supervised approach, i.e., they consider that all fitness cases available to evaluate their models are labeled. However, in certain application domains, a lot of human effort is required to label training data, and methods following a semi-supervised approach might be more appropriate. This is because they significantly reduce the time required for data labeling while maintaining acceptable accuracy rates. This paper presents the Active Learning GP (AGP), a semi-supervised GP, and instantiates it for the data deduplication problem. AGP uses an active learning approach in which a committee of multi-attribute functions votes for classifying record pairs as duplicates or not. When the committee majority voting is not enough to predict the class of the data pairs, a user is called to solve the conflict. The method was applied to three datasets and compared to two other deduplication methods. Results show that AGP guarantees the quality of the deduplication while reducing the number of labeled examples needed.
DOI:10.1109/CEC.2010.5586104
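
Illustrative sketch: the committee-voting step described in the summary can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the paper's implementation. The names `classify_pair`, `field_member`, `similarity`, and the `min_agreement` threshold are hypothetical, and the hand-written members stand in for the multi-attribute functions that AGP would evolve with genetic programming.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Sequence, Tuple

Record = Dict[str, str]
Member = Callable[[Record, Record], bool]  # True means "duplicate"


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an assumed evidence measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def field_member(field: str, cutoff: float) -> Member:
    """Hypothetical committee member: votes 'duplicate' when one field is similar enough."""
    def vote(r1: Record, r2: Record) -> bool:
        return similarity(r1[field], r2[field]) >= cutoff
    return vote


def classify_pair(
    r1: Record,
    r2: Record,
    committee: Sequence[Member],
    ask_user: Callable[[Record, Record], bool],
    min_agreement: float = 0.75,  # assumed threshold; the summary does not state one
) -> Tuple[bool, bool]:
    """Return (is_duplicate, user_was_asked) for one record pair."""
    if not committee:
        raise ValueError("committee must contain at least one member")
    votes = [member(r1, r2) for member in committee]
    yes = sum(votes)
    no = len(votes) - yes
    # Accept the majority label only when enough members agree.
    if max(yes, no) / len(votes) >= min_agreement:
        return yes > no, False
    # Otherwise the committee is in conflict, so a human labels the pair.
    return ask_user(r1, r2), True


if __name__ == "__main__":
    committee = [
        field_member("name", 0.8),
        field_member("city", 0.8),
        field_member("name", 0.95),
    ]
    a = {"name": "John Smith", "city": "Manaus"}
    b = {"name": "Jon Smith", "city": "Manaus"}
    label, asked = classify_pair(a, b, committee, ask_user=lambda r1, r2: True)
    print(label, asked)
```

In this sketch, only the pairs on which the committee disagrees reach `ask_user`, which is how an active learning approach of this kind would reduce the number of labeled examples a person has to provide.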