A multimodal machine learning model for bipolar disorder mania classification: Insights from acoustic, linguistic, and visual cues
| Published in: | Intelligence-based medicine, Vol. 11, p. 100223 |
|---|---|
| Main Authors: | , , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 2025 |
| Subjects: | |
| ISSN: | 2666-5212 |
| Summary: | Mood fluctuations that can range from manic to depressive states are a symptom of bipolar disorder, a condition that affects mental health. Interviewing patients and gathering information from their families are essential steps in diagnosing bipolar disorder, and automated approaches to managing the condition are also being explored. In mental health prevention and care, machine learning (ML) techniques are increasingly used to detect and treat diseases: with frequently analyzed human behaviour patterns, identified symptoms, and risk factors as parameters of the dataset, predictions can be made that improve on traditional diagnostic methods. In this study, a multimodal fusion system was developed that takes auditory, linguistic, and visual patient recordings as input to a three-stage mania classification decision system. Deep Denoising Autoencoders (DDAEs) are introduced to learn common representations across five modalities: acoustic characteristics, eye gaze, facial landmarks, head posture, and Facial Action Units (FAUs); this is done in particular for the audio-visual modality. The distributed representations and the transient information from each recording session are then encoded into Fisher Vectors (FVs). Once the Fisher Vectors and document embeddings are integrated, a Multi-Task Neural Network performs the classification task while mitigating the overfitting caused by the limited size of the bipolar disorder dataset. The study introduces DDAEs for cross-modal representation learning and combines Fisher Vectors with Multi-Task Neural Networks, enhancing diagnostic accuracy while highlighting the benefits of multimodal fusion for mental health diagnostics.
Achieving an unweighted average recall of 64.8 %, the highest AUC-ROC of 0.85, and a low inference time of 6.5 ms per sample underscores the effectiveness of integrating multiple modalities in improving system performance and advancing feature representation and model interpretability.
•In this study, we develop a multimodal decision system for three-stage mania classification based on auditory, linguistic, and visual patient recordings, learning common representations across a total of five modalities. •This is done in particular for the audio-visual modality, four of the five modalities being visual features. The distributed representations and the transient information from each recording session are then encoded into Fisher Vectors (FVs). •Once the FVs and document embeddings are integrated, a Multi-Task Neural Network performs the classification task while mitigating the overfitting caused by the limited size of the BD dataset. •A comprehensive examination of fusion techniques and of unimodal and multimodal systems is provided. Mixing auditory, linguistic, and perceptual information in the multimodal fusion system improved performance on the dataset, yielding an unweighted average recall of 64.8 percent. |
|---|---|
| ISSN: | 2666-5212 |
| DOI: | 10.1016/j.ibmed.2025.100223 |
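The record does not include the authors' implementation, but the denoising-autoencoder idea from the summary — corrupt the input features, reconstruct the clean ones, and use the hidden layer as a shared representation — can be sketched minimally. The layer sizes, tied-weight choice, and toy data below are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class DenoisingAutoencoder:
    """Minimal single-layer denoising autoencoder: corrupts the input with
    Gaussian noise and learns to reconstruct the *clean* features. The hidden
    activations then serve as a compact shared representation (illustrative
    stand-in for the paper's deeper DDAE stack)."""

    def __init__(self, n_in, n_hidden, lr=0.1, noise=0.1):
        self.W = rng.normal(0, 0.1, (n_in, n_hidden))  # tied encoder/decoder weights
        self.b = np.zeros(n_hidden)                    # encoder bias
        self.c = np.zeros(n_in)                        # decoder bias
        self.lr, self.noise = lr, noise

    def encode(self, x):
        return np.tanh(x @ self.W + self.b)

    def step(self, x):
        n = len(x)
        x_noisy = x + rng.normal(0, self.noise, x.shape)  # corrupt the input
        h = self.encode(x_noisy)
        x_rec = h @ self.W.T + self.c                      # tied-weight decoder
        err = x_rec - x                                    # reconstruct clean x
        dpre = (err @ self.W) * (1 - h ** 2)               # backprop through tanh
        self.W -= self.lr / n * (x_noisy.T @ dpre + err.T @ h)
        self.b -= self.lr / n * dpre.sum(0)
        self.c -= self.lr / n * err.sum(0)
        return float((err ** 2).mean())                    # squared-error loss

# Hypothetical session features (e.g. concatenated per-frame descriptors)
X = rng.normal(size=(64, 40))
dae = DenoisingAutoencoder(n_in=40, n_hidden=16)
losses = [dae.step(X) for _ in range(300)]
Z = dae.encode(X)  # shared representation, shape (64, 16)
```

In the paper's pipeline such hidden representations would then be pooled per recording session before classification.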
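The Fisher Vector step mentioned in the summary pools a variable-length sequence of per-frame descriptors into one fixed-size vector of gradients under a Gaussian mixture model. A rough sketch of the standard first-order (mean-gradient) FV encoding follows; the GMM parameters here are fixed at random for illustration, whereas in practice they would be fit on training descriptors, and the full encoding also includes weight and variance terms:

```python
import numpy as np

rng = np.random.default_rng(1)

def fisher_vector(X, weights, means, sigmas):
    """First-order Fisher Vector of descriptors X (N, D) under a diagonal
    GMM with K components: posterior-weighted, variance-normalized
    deviations from each component mean, flattened to length K * D."""
    N, D = X.shape
    # log N(x | mu_k, diag(sigma_k^2)) for every descriptor/component pair
    diff = (X[:, None, :] - means[None]) / sigmas[None]   # (N, K, D)
    log_p = (-0.5 * (diff ** 2).sum(-1)
             - np.log(sigmas).sum(-1)
             - 0.5 * D * np.log(2 * np.pi)
             + np.log(weights))                           # (N, K)
    g = np.exp(log_p - log_p.max(1, keepdims=True))
    gamma = g / g.sum(1, keepdims=True)                   # soft assignments
    # gradient w.r.t. the means, with the usual 1/(N * sqrt(w_k)) scaling
    fv = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(weights)[:, None])
    return fv.ravel()                                     # (K * D,)

# Hypothetical GMM (K=4 components) over D=8-dimensional frame features
K, D = 4, 8
weights = np.full(K, 1 / K)
means = rng.normal(size=(K, D))
sigmas = np.ones((K, D))
X = rng.normal(size=(120, D))  # one recording session's frames
fv = fisher_vector(X, weights, means, sigmas)
```

The resulting fixed-length vector is what gets concatenated with document embeddings before the classifier in a pipeline like the one described.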
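Unweighted average recall (UAR), the headline metric in the record, is the mean of per-class recalls, so each mania class counts equally regardless of how many samples it has. A small self-contained sketch (the three-class labels are hypothetical):

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """UAR / macro-averaged recall: average the recall of each class,
    giving minority classes the same weight as the majority class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical labels (0 = remission, 1 = hypomania, 2 = mania)
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 1]
uar = unweighted_average_recall(y_true, y_pred)
# per-class recalls: 3/4, 1/2, 1/2 -> UAR = 7/12 ≈ 0.583
```

This is why UAR is preferred over plain accuracy for imbalanced clinical datasets like the bipolar disorder corpus described here.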