Comparative Evaluation of GPT-4o, GPT-OSS-120B and Llama-3.1-8B-Instruct Language Models in a Reproducible CV-to-JSON Extraction Pipeline.

Detailed bibliography
Title: Comparative Evaluation of GPT-4o, GPT-OSS-120B and Llama-3.1-8B-Instruct Language Models in a Reproducible CV-to-JSON Extraction Pipeline.
Authors: Nawalny, Marcin; Łępicki, Mateusz; Latkowski, Tomasz; Bujak, Sebastian; Bukowski, Michał; Świderski, Bartosz; Baranik, Grzegorz; Nowak, Bogusz; Zakowicz, Robert; Dobrakowski, Łukasz; Oczeretko, Agnieszka; Sadowski, Piotr; Szlaga, Konrad; Kubica, Bartłomiej; Kurek, Jarosław
Source: Applied Sciences (2076-3417); Jan2026, Vol. 16 Issue 1, p217, 36p
Subjects: LANGUAGE models, EMPLOYEE recruitment, JOB resumes, ANONYMITY, GENERATIVE pre-trained transformers, DATA protection laws
Abstract: Recruitment automation increasingly relies on Large Language Models (LLMs) for extracting structured information from unstructured CVs and job postings. However, production data often arrive as heterogeneous, privacy-sensitive PDFs, limiting reproducibility and compliance. This study introduces a deterministic, GDPR-aligned pipeline that converts recruitment documents into structured, anonymized Markdown and subsequently into validated JSON ready for downstream AI processing. The workflow combines the Docling PDF-to-Markdown converter with a two-pass anonymization protocol and evaluates three LLM back-ends, namely GPT-4o (Azure, frozen proprietary), GPT-OSS-120B and Llama-3.1-8B-Instruct, using identical prompts and schema constraints under near-zero-temperature decoding. Each model's output was assessed across 2280 multilingual CVs using two complementary metrics: reference-based completeness and content similarity. The proprietary GPT-4o achieved perfect schema coverage and served as the reproducibility baseline, while the open-weight models reached 73–79% completeness and 59–72% content similarity depending on section complexity. Llama-3.1-8B-Instruct performed strongly on standardized sections such as contact and legal, whereas GPT-OSS-120B handled less frequent narrative fields better. The results demonstrate that fully deterministic, auditable document extraction is achievable with both proprietary and open LLMs when guided by strong schema validation and anonymization. The proposed pipeline bridges the gap between document ingestion and reliable, bias-aware data preparation for AI-driven recruitment systems. [ABSTRACT FROM AUTHOR]
Copyright of Applied Sciences (2076-3417) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
FullText Text:
  Availability: 0
CustomLinks:
  – Url: https://resolver.ebscohost.com/openurl?sid=EBSCO:edb&genre=article&issn=20763417&ISBN=&volume=16&issue=1&date=20260101&spage=217&pages=217-252&title=Applied Sciences (2076-3417)&atitle=Comparative%20Evaluation%20of%20GPT-4o%2C%20GPT-OSS-120B%20and%20Llama-3.1-8B-Instruct%20Language%20Models%20in%20a%20Reproducible%20CV-to-JSON%20Extraction%20Pipeline.&aulast=Nawalny%2C%20Marcin&id=DOI:10.3390/app16010217
    Name: Full Text Finder
    Category: fullText
    Text: Full Text Finder
    Icon: https://imageserver.ebscohost.com/branding/images/FTF.gif
    MouseOverText: Full Text Finder
  – Url: https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=EBSCO&SrcAuth=EBSCO&DestApp=WOS&ServiceName=TransferToWoS&DestLinkType=GeneralSearchSummary&Func=Links&author=Nawalny%20M
    Name: ISI
    Category: fullText
    Text: Find this article in Web of Science
    Icon: https://imagesrvr.epnet.com/ls/20docs.gif
    MouseOverText: Find this article in Web of Science
Header DbId: edb
DbLabel: Complementary Index
An: 190819538
PubType: Academic Journal
PubTypeId: academicJournal
Items – Name: Title
  Label: Title
  Group: Ti
  Data: Comparative Evaluation of GPT-4o, GPT-OSS-120B and Llama-3.1-8B-Instruct Language Models in a Reproducible CV-to-JSON Extraction Pipeline.
– Name: Author
  Label: Authors
  Group: Au
  Data: Nawalny, Marcin; Łępicki, Mateusz; Latkowski, Tomasz; Bujak, Sebastian; Bukowski, Michał; Świderski, Bartosz; Baranik, Grzegorz; Nowak, Bogusz; Zakowicz, Robert; Dobrakowski, Łukasz; Oczeretko, Agnieszka; Sadowski, Piotr; Szlaga, Konrad; Kubica, Bartłomiej; Kurek, Jarosław
– Name: TitleSource
  Label: Source
  Group: Src
  Data: Applied Sciences (2076-3417); Jan2026, Vol. 16 Issue 1, p217, 36p
– Name: Subject
  Label: Subject Terms
  Group: Su
  Data: LANGUAGE models; EMPLOYEE recruitment; JOB resumes; ANONYMITY; GENERATIVE pre-trained transformers; DATA protection laws
PLink https://erproxy.cvtisr.sk/sfx/access?url=https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edb&AN=190819538
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.3390/app16010217
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 36
        StartPage: 217
    Subjects:
      – SubjectFull: LANGUAGE models
        Type: general
      – SubjectFull: EMPLOYEE recruitment
        Type: general
      – SubjectFull: JOB resumes
        Type: general
      – SubjectFull: ANONYMITY
        Type: general
      – SubjectFull: GENERATIVE pre-trained transformers
        Type: general
      – SubjectFull: DATA protection laws
        Type: general
    Titles:
      – TitleFull: Comparative Evaluation of GPT-4o, GPT-OSS-120B and Llama-3.1-8B-Instruct Language Models in a Reproducible CV-to-JSON Extraction Pipeline.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Nawalny, Marcin
      – PersonEntity:
          Name:
            NameFull: Łępicki, Mateusz
      – PersonEntity:
          Name:
            NameFull: Latkowski, Tomasz
      – PersonEntity:
          Name:
            NameFull: Bujak, Sebastian
      – PersonEntity:
          Name:
            NameFull: Bukowski, Michał
      – PersonEntity:
          Name:
            NameFull: Świderski, Bartosz
      – PersonEntity:
          Name:
            NameFull: Baranik, Grzegorz
      – PersonEntity:
          Name:
            NameFull: Nowak, Bogusz
      – PersonEntity:
          Name:
            NameFull: Zakowicz, Robert
      – PersonEntity:
          Name:
            NameFull: Dobrakowski, Łukasz
      – PersonEntity:
          Name:
            NameFull: Oczeretko, Agnieszka
      – PersonEntity:
          Name:
            NameFull: Sadowski, Piotr
      – PersonEntity:
          Name:
            NameFull: Szlaga, Konrad
      – PersonEntity:
          Name:
            NameFull: Kubica, Bartłomiej
      – PersonEntity:
          Name:
            NameFull: Kurek, Jarosław
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 01
              Text: Jan2026
              Type: published
              Y: 2026
          Identifiers:
            – Type: issn-print
              Value: 20763417
          Numbering:
            – Type: volume
              Value: 16
            – Type: issue
              Value: 1
          Titles:
            – TitleFull: Applied Sciences (2076-3417)
              Type: main