FINK NLP: A Natural Language Processing Toolkit for Structured Analysis of Multilingual Interview Data

Bibliographic details
Title: FINK NLP: A Natural Language Processing Toolkit for Structured Analysis of Multilingual Interview Data
Authors: Spitale, Giovanni, orcid:0000-0002-6812-
Other authors: Germani, Federico
Publisher information: Zenodo
Publication year: 2025
Collection: Zenodo
Keywords: nlp, Natural Language Processing
Description: FINK NLP is a modular Jupyter-based pipeline designed for the structured extraction, organization, and analysis of multilingual interview transcripts stored as .docx files. It performs metadata parsing from filenames, text ingestion using textract, and corpus structuring into a DataFrame. The notebook supports selective subsetting by language, module, category, or expression. It integrates spaCy for lemmatization, gensim for topic modeling (LDA), and multiple Python visualization libraries (matplotlib, seaborn, wordcloud, pyLDAvis) to facilitate qualitative and quantitative content analysis. This repository includes the output tabular data (redacted for data protection) and the visualization outputs.
Publication type: other/unknown material
Language: unknown
Relation: https://zenodo.org/records/15394889; oai:zenodo.org:15394889; https://doi.org/10.5281/zenodo.15394889
DOI: 10.5281/zenodo.15394889
Availability: https://doi.org/10.5281/zenodo.15394889
https://zenodo.org/records/15394889
Rights: Creative Commons Attribution 4.0 International ; cc-by-4.0 ; https://creativecommons.org/licenses/by/4.0/legalcode
Document code: edsbas.BC14446F
Database: BASE
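The description mentions metadata parsing from filenames and subsetting by language, module, category, or expression. The following is a minimal sketch of how such a step could look, not the notebook's actual code. The filename convention, the field names, the `Transcript` class, the `parse_filename` and `subset` helpers, and the sample texts are all hypothetical; only the standard library is used, so .docx reading (done with textract in the described pipeline) is replaced by placeholder strings.

```python
import re
from dataclasses import dataclass

# Hypothetical filename convention (an assumption, not taken from FINK NLP):
#   <language>_<module>_<category>_<interview id>.docx
FILENAME_PATTERN = re.compile(
    r"^(?P<language>[a-z]{2})_(?P<module>[^_]+)_(?P<category>[^_]+)"
    r"_(?P<interview_id>[^_]+)\.docx$"
)


@dataclass
class Transcript:
    language: str
    module: str
    category: str
    interview_id: str
    text: str


def parse_filename(filename: str) -> dict:
    """Return the metadata fields encoded in a transcript filename."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected filename: {filename!r}")
    return match.groupdict()


def subset(corpus, language=None, module=None, category=None, expression=None):
    """Keep the transcripts that satisfy every filter that is not None."""
    selected = []
    for doc in corpus:
        if language is not None and doc.language != language:
            continue
        if module is not None and doc.module != module:
            continue
        if category is not None and doc.category != category:
            continue
        # The expression filter is a case-insensitive regex search on the text.
        if expression is not None and not re.search(expression, doc.text, re.IGNORECASE):
            continue
        selected.append(doc)
    return selected


# Placeholder texts stand in for what a .docx reader would return.
raw = {
    "de_m1_catA_001.docx": "Placeholder text about trust.",
    "it_m1_catB_002.docx": "Placeholder text about access.",
    "de_m2_catA_003.docx": "Placeholder text about access and trust.",
}
corpus = [Transcript(text=text, **parse_filename(name)) for name, text in raw.items()]
german_trust = subset(corpus, language="de", expression=r"\btrust\b")
```

In this sketch each filter is optional, so the same helper covers subsetting by a single field or by any combination of fields. A DataFrame-based version would express the same filters as boolean masks on columns.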