How can automated data extraction and structuring enhance the retrieval of project-related data : A machine learning method for data handling ; Hur kan automatiserad dataextraktion och strukturering förbättra hämtningen av projektrelaterad data : En maskininlärningsbaserad metod för datahantering

Bibliographic details
Title: How can automated data extraction and structuring enhance the retrieval of project-related data : A machine learning method for data handling ; Hur kan automatiserad dataextraktion och strukturering förbättra hämtningen av projektrelaterad data : En maskininlärningsbaserad metod för datahantering
Authors: Ohlsson, Alexander; Kazemi, Parham
Publisher information: KTH, Skolan för elektroteknik och datavetenskap (EECS)
Publication year: 2025
Collection: Royal Inst. of Technology, Stockholm (KTH): Publication Database DiVA
Keywords: Machine Learning, Data extraction, Nordiska Brand, Data processing, Natural Language Processing, Relational database, Structured Query Language (SQL), Python, Hugging Face, Pathlib library, Textract library, Named entity recognition, MySQL, User Interface, React framework, Model training, F1 score, Dataset, Validation set
Description: Nordiska Brand is a sprinkler construction company whose internal archives consist of text documents detailing past projects. Locating specific information, such as which pipes a given project used or which regulatory standard applies, requires manual searching across multiple files. This manual process is time-consuming and outdated. The problem is how to transform these documents into a searchable database that allows fast retrieval of key information. The problem is significant because of its practical implications for real-world challenges. The task is complex because the project-related data is highly unstructured and varies in format, which poses challenges for model training, data structuring, and the creation of the user interface. The method involved investigating and developing a system to automate the extraction, structuring, and retrieval of key information. An end-to-end pipeline was implemented that converts unstructured text into a structured relational database and provides a user interface for efficient data access. The main component was a Named Entity Recognition (NER) model that was retrained on a dataset tailored to Nordiska Brand's information. The NER model was applied to hundreds of projects containing thousands of project documents and returned company-specific entities that were stored in a relational database. The implemented system successfully automated the extraction and structuring of project-related data from over 200 projects. The system includes a local user interface that significantly reduces the time needed to locate sought-after project information.
Publication type: Bachelor thesis
File description: application/pdf
Language: English
Relation: TRITA-EECS-EX; 2025:494
Availability: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-368135
Rights: info:eu-repo/semantics/openAccess
Document code: edsbas.B7A88D8D
Database: BASE
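
Illustration: The abstract describes storing entities returned by a NER model in a relational database and retrieving them through a user interface. The following is only a minimal sketch of that storage and lookup step, not the thesis implementation. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, and it takes the NER output as ready-made (label, value) pairs instead of running a model. The table names, the entity labels PIPE and STANDARD, and all sample values are hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per project, one row per extracted entity.
SCHEMA = """
CREATE TABLE project (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE entity (
    id         INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(id),
    label      TEXT NOT NULL,  -- entity type, e.g. PIPE or STANDARD (assumed labels)
    value      TEXT NOT NULL   -- text span returned by the NER step
);
"""


def create_database():
    """Create an in-memory SQLite database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def store_entities(conn, project_name, entities):
    """Insert one project and its (label, value) entity pairs."""
    cur = conn.execute("INSERT INTO project (name) VALUES (?)", (project_name,))
    project_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO entity (project_id, label, value) VALUES (?, ?, ?)",
        [(project_id, label, value) for label, value in entities],
    )
    conn.commit()
    return project_id


def find_projects(conn, label, value):
    """Return the names of projects that contain the given entity, sorted by name."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.name
        FROM project AS p
        JOIN entity AS e ON e.project_id = p.id
        WHERE e.label = ? AND e.value = ?
        ORDER BY p.name
        """,
        (label, value),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = create_database()
    # Invented sample data standing in for NER output.
    store_entities(conn, "project-a", [("PIPE", "pipe-type-1"), ("STANDARD", "standard-x")])
    store_entities(conn, "project-b", [("PIPE", "pipe-type-2"), ("STANDARD", "standard-x")])
    print(find_projects(conn, "STANDARD", "standard-x"))
```

Keeping entities in a separate table keyed by project lets a single query answer questions such as "which projects used this pipe type" without opening the source documents, which is the kind of retrieval the abstract describes.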