CredoScientia - Evaluation of ChatGPT and Gemini in Answering Patient Questions after Gynecologic Surgery.

Abstract

This study aimed to explore the performance of two large language models (LLMs), ChatGPT version 4.0 (GPT-4) and Gemini Advanced (Gemini), in addressing common patient questions after gynecologic surgery with regard to accuracy, relevance, helpfulness to the average patient, and readability.

In this cross-sectional study, the two LLMs were prompted to generate answers to postoperative patient questions after gynecologic surgery. The questions were developed to simulate common patient questions after gynecologic surgery, drawing on expert opinion and on posts from anonymous Reddit users (r/endometriosis). Questions focused on six topics: endometriosis, vaginal bleeding, bowel/bladder function, incision care, resumption of activities, and sexual function. Questions were asked in a systematic three-step submission process, with the memory reset after each query. Responses were then blinded and independently assessed for accuracy and relevance on a 5-point Likert scale by four board-certified gynecologic surgeons with fellowship training in gynecologic surgery. Readability of the answers was calculated with the Flesch-Kincaid grade level calculator. Responses were also evaluated by three clinic nurses.

A total of 41 questions were posed to GPT-4 and Gemini three times. The responses were independently evaluated by four surgeons and three nurses, leading to a total of 1,968 evaluations for accuracy, relevance, helpfulness to the average patient, and readability. Surgeons and nurses graded Gemini responses as more accurate (4.23 vs. 4.03, p = 0.015) and more helpful (4.37 vs. 4.21, p = 0.025) than GPT-4 responses. Responses from both models were similarly rated as relevant or very relevant (4.45 vs. 4.36, p = 0.2). Most responses by GPT-4 (85%) and Gemini (87%) were consistent across all questions.
The average reading levels of GPT-4 and Gemini responses were 11th and 10th grade, respectively, both above the recommended 6th-grade reading level for patient information.

GPT-4 and Gemini provided overall accurate, relevant, and helpful responses to common postoperative patient questions for gynecologic surgery. Gemini outperformed GPT-4 in both accuracy and helpfulness and had objectively more readable responses.
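
The readability measure named in the abstract, the Flesch-Kincaid grade level, is computed from average sentence length and average syllables per word: 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The Python sketch below illustrates that formula only. It is not the calculator used in the study, which the abstract does not identify. The function names and the vowel-group syllable heuristic are assumptions made for illustration, so its scores may differ from those of other tools.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables with a vowel-group heuristic (an assumption,
    not the rule set of any particular readability calculator)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("care"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Return the Flesch-Kincaid grade level of an English text."""
    # Sentences are approximated by splitting on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are alphabetic runs, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

A score near 6 corresponds to the 6th-grade level recommended for patient information, while the 10th- and 11th-grade averages reported above correspond to scores of roughly 10 and 11. Longer sentences and longer words both raise the score, so shortening either is the usual way to make a response easier to read.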