CredoScientia - Classifying the Information Needs of Survivors of Domestic Violence in Online Health Communities Using Large Language Models: Prediction Model Development and Evaluation Study.

Résumé

BACKGROUND: Domestic violence (DV) is a significant public health concern affecting the physical and mental well-being of numerous women, imposing a substantial health care burden. However, women facing DV often encounter barriers to seeking in-person help due to stigma, shame, and embarrassment. As a result, many survivors of DV turn to online health communities as a safe and anonymous space to share their experiences and seek support. Understanding the information needs of survivors of DV in online health communities through multiclass classification is crucial for providing timely and appropriate support.OBJECTIVE: The objective was to develop a fine-tuned large language model (LLM) that can provide fast and accurate predictions of the information needs of survivors of DV from their online posts, enabling health care professionals to offer timely and personalized assistance.METHODS: We collected 294 posts from Reddit subcommunities focused on DV shared by women aged ≥18 years who self-identified as experiencing intimate partner violence. We identified 8 types of information needs: shelters/DV centers/agencies; legal; childbearing; police; DV report procedure/documentation; safety planning; DV knowledge; and communication. Data augmentation was applied using GPT-3.5 to expand our dataset to 2216 samples by generating 1922 additional posts that imitated the existing data. We adopted a progressive training strategy to fine-tune GPT-3.5 for multiclass text classification using 2032 posts. We trained the model on 1 class at a time, monitoring performance closely. When suboptimal results were observed, we generated additional samples of the misclassified ones to give them more attention. We reserved 184 posts for internal testing and 74 for external validation. Model performance was evaluated using accuracy, recall, precision, and F-score, along with CIs for each metric.RESULTS: Using 40 real posts and 144 artificial intelligence-generated posts as the test dataset, our model achieved an F-score of 70.49% (95% CI 60.63%-80.35%) for real posts, outperforming the original GPT-3.5 and GPT-4, fine-tuned Llama 2-7B and Llama 3-8B, and long short-term memory. On artificial intelligence-generated posts, our model attained an F-score of 84.58% (95% CI 80.38%-88.78%), surpassing all baselines. When tested on an external validation dataset (n=74), the model achieved an F-score of 59.67% (95% CI 51.86%-67.49%), outperforming other models. Statistical analysis revealed that our model significantly outperformed the others in F-score (P=.047 for real posts; P