Skyline Operators for Document Spanners

Detailed bibliography
Title: Skyline Operators for Document Spanners
Authors: Amarilli, Antoine; Kimelfeld, Benny; Labbé, Sébastien; Mengel, Stefan
Contributors: Data, Intelligence and Graphs (DIG), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom Paris (IMT)-Télécom Paris, Institut Mines-Télécom Paris (IMT)-Institut Polytechnique de Paris (IP Paris)-Institut Polytechnique de Paris (IP Paris)-Institut Mines-Télécom Paris (IMT)-Télécom Paris, Institut Mines-Télécom Paris (IMT)-Institut Polytechnique de Paris (IP Paris)-Institut Polytechnique de Paris (IP Paris), Département Informatique et Réseaux (INFRES), Télécom ParisTech, Technion - Israel Institute of Technology Haifa, École normale supérieure - Paris (ENS-PSL), Université Paris Sciences et Lettres (PSL), Centre de Recherche en Informatique de Lens (CRIL), Université d'Artois (UA)-Centre National de la Recherche Scientifique (CNRS), ANR-19-CE48-0019,EQUUS,Réponse efficace aux requêtes sous mises à jour(2019)
Source: Leibniz International Proceedings in Informatics (LIPIcs) ; 27th International Conference on Database Theory (ICDT 2024) ; https://hal.science/hal-04778343 ; 27th International Conference on Database Theory (ICDT 2024), Mar 2024, Paestum, Italy. pp.7:1-7:18, ⟨10.4230/LIPICS.ICDT.2024.7⟩
Publisher information: CCSD
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Publication year: 2024
Collection: Université d'Artois: HAL
Subjects: Information Extraction, Document Spanners, Query Evaluation, Information systems → Information extraction, Theory of computation → Database query processing and optimization (theory), [INFO]Computer Science [cs]
Geographic subject: Paestum, Italy
Description: When extracting a relation of spans (intervals) from a text document, a common practice is to filter out tuples of the relation that are deemed dominated by others. The domination rule is defined as a partial order that varies across systems and tasks. For example, we may state that a tuple is dominated by tuples that extend it by assigning additional attributes, or assigning larger intervals. The result of filtering the relation would then be the skyline according to this partial order. As this filtering may remove most of the extracted tuples, we study whether we can improve the performance of the extraction by compiling the domination rule into the extractor. To this aim, we introduce the skyline operator for declarative information extraction tasks expressed as document spanners. We show that this operator can be expressed via regular operations when the domination partial order can itself be expressed as a regular spanner, which covers several natural domination rules. Yet, we show that the skyline operator incurs a computational cost (under combined complexity). First, there are cases where the operator requires an exponential blowup in the number of states needed to represent the spanner as a sequential variable-set automaton. Second, the evaluation may become computationally hard. Our analysis more precisely identifies classes of domination rules for which the combined complexity is tractable or intractable.
Document type: conference object
Language: English
Relation: info:eu-repo/semantics/altIdentifier/arxiv/2304.06155; ARXIV: 2304.06155
DOI: 10.4230/LIPICS.ICDT.2024.7
Availability: https://hal.science/hal-04778343
https://doi.org/10.4230/LIPICS.ICDT.2024.7
Rights: http://creativecommons.org/licenses/by/
Accession number: edsbas.1BA64642
Database: BASE
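
To make the notion of a skyline in the abstract concrete, the following is a minimal Python sketch of the naive approach that the paper takes as its starting point: extract all tuples first, then filter out the dominated ones. It is not the construction from the paper, which compiles the domination rule into the extractor itself. The representation of a tuple as a dictionary from variable names to spans, and the function names `extends`, `contains`, and `skyline`, are assumptions made for this illustration only.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]     # half-open interval [start, end) of positions in the document
Mapping = Dict[str, Span]  # one extracted tuple: variable name -> span (hypothetical encoding)


def extends(t: Mapping, u: Mapping) -> bool:
    """True if u strictly dominates t by assigning additional attributes."""
    return len(u) > len(t) and all(v in u and u[v] == s for v, s in t.items())


def contains(t: Mapping, u: Mapping) -> bool:
    """True if u strictly dominates t by assigning larger intervals."""
    if t.keys() != u.keys() or t == u:
        return False
    return all(u[v][0] <= s[0] and s[1] <= u[v][1] for v, s in t.items())


def skyline(relation: List[Mapping],
            dominated_by: Callable[[Mapping, Mapping], bool]) -> List[Mapping]:
    """Keep the tuples that no other tuple dominates (quadratic post-filter)."""
    return [t for t in relation if not any(dominated_by(t, u) for u in relation)]


# Toy relation of three extracted tuples.
relation = [
    {"x": (0, 3)},
    {"x": (0, 3), "y": (4, 6)},
    {"x": (1, 2)},
]

by_attributes = skyline(relation, extends)
by_intervals = skyline(relation, contains)
```

Under the first rule, the tuple assigning only `x` to `(0, 3)` is dropped because the second tuple extends it; under the second rule, the tuple assigning `x` to `(1, 2)` is dropped because `(0, 3)` contains it. This post-filter compares every pair of extracted tuples, which is the cost the paper asks whether one can avoid.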