Leveraging Large Language Models for advanced analysis of crash narratives in traffic safety research
Saved in:
| Title: | Leveraging Large Language Models for advanced analysis of crash narratives in traffic safety research |
|---|---|
| Authors: | Carlino, Mattia; Wang, Xixi |
| Source: | Connected Transport Data (TREND) |
| Subject terms: | BERT family (BERT, RoBERTa, SciBERT), open-source models, low-rank adaptation (LoRA), large language models (LLM), prompt engineering, fine-tuning, crash narratives, traffic safety, information extraction, CISS dataset |
| Description: | Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety. However, they remain challenging to analyze at scale due to unstructured writing, heterogeneous terminology, and uneven detail. The development of Large Language Models (LLMs) offers a promising way to automatically extract information from narratives through question answering. However, crash narratives remain hard for LLMs to analyze because general-purpose models lack traffic safety domain knowledge. Moreover, relying on closed-source LLMs through external APIs poses privacy risks for crash data and often underperforms due to limited traffic knowledge. Motivated by these concerns, we study whether smaller open-source LLMs can support reasoning-intensive extraction from crash narratives, targeting three challenging objectives: extracting the travel direction of the vehicles involved in the crash, identifying the manner of collision, and classifying the crash type in multivehicle scenarios that require accurate per-vehicle prediction. In the first phase of the experiments, we focused on extracting vehicle travel directions, comparing small LLMs with 8 billion parameters (Mistral, DeepSeek, and Qwen) under different prompting strategies against fine-tuned transformers (BERT, RoBERTa, and SciBERT) on a manually labeled subset of the Crash Investigation Sampling System (CISS) dataset. The goal was to assess whether models trained on a generic corpus could approach or surpass the performance of domain-adapted baselines. Results confirmed that the fine-tuned transformers achieved the best accuracy; however, advanced prompting strategies, particularly Chain of Thought, enabled some LLMs to reach about 90% accuracy, showing that they can serve as competitive alternatives. For the second and third tasks, to bridge the domain gap, we apply Low-Rank Adaptation (LoRA) fine-tuning to inject traffic-specific knowledge. Experiments on the CISS dataset show that our fine-tuned 3B models can outperform GPT-4o while requiring minimal training resources. Further analysis of LLM-annotated data shows that LLMs can both compensate for and correct limitations in manual annotations while preserving key distributional characteristics. The results indicate that advanced prompting techniques and fine-tuned open-source models are effective in large-scale traffic safety studies. |
| File description: | electronic |
| Access URL: | https://research.chalmers.se/publication/548572 https://research.chalmers.se/publication/548572/file/548572_Fulltext.pdf |
| Database: | SwePub |
| Language: | English |
| Published: | 2025 |