Open‐source large language models in action: A bioinformatics chatbot for PRIDE database

Bibliographic Details
Published in:	Proteomics (Weinheim) Vol. 24; no. 21-22; article e2400005
Main Authors:	Bai, Jingwen; Kamatchinathan, Selvakumar; Kundu, Deepti J.; Bandla, Chakradhar; Vizcaíno, Juan Antonio; Perez‐Riverol, Yasset
Format:	Journal Article
Language:	English
Published:	Germany: Wiley Subscription Services, Inc, November 2024
Subjects:	Application programming interface; Bioinformatics; Chatbots; Computational Biology - methods; Data base management systems; Databases, Protein; Dataset discoverability; Datasets; Documentation; Humans; Infrastructure; Internet; Large language models; Programming Languages; Proteomics; Proteomics - methods; Public data; Recommender systems; Software; Software architectures; Training; User experience; User interfaces; User-Computer Interface; Web services
ISSN:	1615-9853, 1615-9861

Description
Summary:	We present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), a web interface, and components for indexing and managing vector databases. The framework also includes a benchmark component based on an Elo‐ranking system, which allows for evaluating the performance of each LLM and for improving PRIDE documentation. The chatbot not only allows users to interact with PRIDE documentation but can also be used to search and find PRIDE datasets using an LLM‐based recommendation system, enabling dataset discoverability. Importantly, while our infrastructure is exemplified through its application in the PRIDE database context, the modular and adaptable nature of our approach positions it as a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, innovative vector‐based construction, the benchmarking framework, and optimized documentation collectively forms a robust and transferable chatbot assistant infrastructure. The framework is open‐source (https://github.com/PRIDE-Archive/pride-chatbot).
Bibliography:	Jingwen Bai and Selvakumar Kamatchinathan contributed equally to this work.
DOI:	10.1002/pmic.202400005
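
Note:	The summary mentions an Elo‐ranking benchmark for comparing the LLMs but does not describe how it is computed. The following is a minimal sketch of the standard Elo update applied to pairwise comparisons of model answers, not the authors' implementation. The starting rating of 1000, the K-factor of 32, and the example votes are all assumptions chosen for illustration.

```python
# Illustrative sketch of a standard Elo update for pairwise model comparisons.
# Starting rating, K-factor, and the votes below are hypothetical, not taken
# from the PRIDE chatbot benchmark.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> dict:
    """Update ratings in place after one comparison where `winner` was preferred."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta  # winner gains what it was not "expected" to score
    ratings[loser] -= delta   # loser gives up the same amount (zero-sum)
    return ratings


# Model names come from the summary; every model starts at the same rating.
ratings = {name: 1000.0 for name in ("llama2", "chatglm", "mixtral", "openhermes")}

# Hypothetical (winner, loser) votes on which model gave the better answer.
votes = [("mixtral", "llama2"), ("openhermes", "chatglm"), ("mixtral", "openhermes")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Print a leaderboard, highest rating first.
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update is zero-sum, the total rating across models stays constant, so ratings only express relative preference among the compared models.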