mwetoolkit-lib: Adaptation of the mwetoolkit as a Python Library and an Application to MWE-based Document Clustering
| Title: | mwetoolkit-lib: Adaptation of the mwetoolkit as a Python Library and an Application to MWE-based Document Clustering |
|---|---|
| Authors: | Zagatti, Fernando; de Lima Medeiros, Paulo Augusto; da Cunha Soares, Esther; dos Santos Silva, Lucas Nildaimon; Ramisch, Carlos; Real, Livy |
| Other contributors: | Federal University of São Carlos = Universidade Federal de São Carlos (UFSCar), americanas s.a., Traitement Automatique du Langage Ecrit et Parlé (TALEP), Laboratoire d'Informatique et des Systèmes (LIS) (Marseille, Toulon), Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS), ELRA, ANR-21-CE23-0033, SELEXINI, Induction de lexiques sémantiques pour l'interprétabilité et la diversité en traitement de textes (2021), European Project: COST CA21167, UniDive |
| Source: | Proceedings of the 18th Workshop on Multiword Expressions @LREC2022, ELRA, 2022, Marseille, France ; https://hal.science/hal-05380115 |
| Publisher information: | CCSD |
| Publication year: | 2022 |
| Collection: | Aix-Marseille Université: HAL |
| Keywords: | Python library, Clustering, k-means, Multiword expressions, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] |
| Geographic keyword: | Marseille, France |
| Description: | International audience ; This paper introduces the mwetoolkit-lib, an adaptation of the mwetoolkit as a Python library. The original toolkit performs the extraction and identification of multiword expressions (MWEs) in large text bases through the command line. One of the contributions of our work is the adaptation of the MWE extraction pipeline from the mwetoolkit, allowing its usage in Python development environments and integration in larger pipelines. The other contribution is the execution of a pilot experiment aiming to show the impact of MWE discovery in data professionals' work. Thus, we propose a textual clustering experiment in which we compare using single-word and MWE features. This experiment found that the addition of MWE knowledge to the Term Frequency-Inverse Document Frequency (TF-IDF) vectorization altered the word relevance order, improving the linguistic quality of the clusters returned by k-means. |
| Publication type: | conference object |
| Language: | English |
| Relation: | info:eu-repo/grantAgreement//COST CA21167/EU/Universality, diversity and idiosyncrasy in language technology/UniDive |
| Availability: | https://hal.science/hal-05380115 https://hal.science/hal-05380115v1/document https://hal.science/hal-05380115v1/file/2022.mwe-1.16.pdf |
| Rights: | https://creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/OpenAccess |
| Document code: | edsbas.324EFDC8 |
| Database: | BASE |