DocSpider: a dataset of cross-domain natural language querying for MongoDB.

Detailed bibliography
Title: DocSpider: a dataset of cross-domain natural language querying for MongoDB.
Authors: Özer, Arif Görkem; Cekinel, Recep Firat; Toroslu, Ismail Hakki; Karagoz, Pinar
Source: Natural Language Processing (29770424); Nov2025, Vol. 31 Issue 6, p1367-1398, 32p
Subjects: LANGUAGE models, NONRELATIONAL databases, DATABASES, NATURAL languages, BIG data
Abstract: Natural language querying allows users to formulate questions in a natural language without requiring specific knowledge of the database query language. Large language models have been very successful in addressing the text-to-SQL problem, which is about translating given questions in textual form into SQL statements. Document-oriented NoSQL databases are gaining popularity in the era of big data due to their ability to handle vast amounts of semi-structured data and provide advanced querying functionalities. However, studies on text-to-NoSQL systems, particularly on systems targeting document databases, are very scarce. In this study, we utilize large language models to create a cross-domain natural language to document database query dataset, DocSpider, leveraging the well-known text-to-SQL challenge dataset Spider. As a document database, we use MongoDB. Furthermore, we conduct experiments to assess the effectiveness of the DocSpider dataset to fine-tune a text-to-NoSQL model against a cross-language transfer learning approach, SQL-to-NoSQL, and zero-shot instruction prompting. The experimental results reveal a significant improvement in the execution accuracy of fine-tuned language models when utilizing the DocSpider dataset. [ABSTRACT FROM AUTHOR]
Copyright of Natural Language Processing (29770424) is the property of Cambridge University Press and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
Accession number: 188601269
Publication type: Academic Journal
PLink https://erproxy.cvtisr.sk/sfx/access?url=https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edb&AN=188601269
DOI: 10.1017/nlp.2024.63
Language: English
Publication date: 1 Nov 2025
ISSN (print): 29770424
Volume: 31
Issue: 6
Start page: 1367
Page count: 32