Comparative Evaluation of GPT-4o, GPT-OSS-120B and Llama-3.1-8B-Instruct Language Models in a Reproducible CV-to-JSON Extraction Pipeline.
| Title: | Comparative Evaluation of GPT-4o, GPT-OSS-120B and Llama-3.1-8B-Instruct Language Models in a Reproducible CV-to-JSON Extraction Pipeline. |
|---|---|
| Authors: | Nawalny, Marcin, Łępicki, Mateusz, Latkowski, Tomasz, Bujak, Sebastian, Bukowski, Michał, Świderski, Bartosz, Baranik, Grzegorz, Nowak, Bogusz, Zakowicz, Robert, Dobrakowski, Łukasz, Oczeretko, Agnieszka, Sadowski, Piotr, Szlaga, Konrad, Kubica, Bartłomiej, Kurek, Jarosław |
| Source: | Applied Sciences (2076-3417); Jan2026, Vol. 16 Issue 1, p217, 36p |
| Subjects: | LANGUAGE models, EMPLOYEE recruitment, JOB resumes, ANONYMITY, GENERATIVE pre-trained transformers, DATA protection laws |
| Abstract: | Recruitment automation increasingly relies on Large Language Models (LLMs) for extracting structured information from unstructured CVs and job postings. However, production data often arrive as heterogeneous, privacy-sensitive PDFs, limiting reproducibility and compliance. This study introduces a deterministic, GDPR-aligned pipeline that converts recruitment documents into structured, anonymized Markdown and subsequently into validated JSON ready for downstream AI processing. The workflow combines the Docling PDF-to-Markdown converter with a two-pass anonymization protocol and evaluates three LLM back-ends, namely GPT-4o (Azure, frozen proprietary), GPT-OSS-120B and Llama-3.1-8B-Instruct, using identical prompts and schema constraints under near-zero-temperature decoding. Each model's output was assessed across 2280 multilingual CVs using two complementary metrics: reference-based completeness and content similarity. The proprietary GPT-4o achieved perfect schema coverage and served as the reproducibility baseline, while the open-weight models reached 73–79% completeness and 59–72% content similarity depending on section complexity. Llama-3.1-8B-Instruct performed strongly on standardized sections such as contact and legal, whereas GPT-OSS-120B handled less frequent narrative fields better. The results demonstrate that fully deterministic, auditable document extraction is achievable with both proprietary and open LLMs when guided by strong schema validation and anonymization. The proposed pipeline bridges the gap between document ingestion and reliable, bias-aware data preparation for AI-driven recruitment systems. [ABSTRACT FROM AUTHOR] |
| Copyright: | Copyright of Applied Sciences (2076-3417) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) |
| Database: | Complementary Index |
| DOI: | 10.3390/app16010217 |
| Language: | English |
| Publication type: | Academic Journal |
| Accession number: | 190819538 |
| Permalink: | https://erproxy.cvtisr.sk/sfx/access?url=https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edb&AN=190819538 |