Authors: Rocha, Henry; Chong, Yu Jeat; Thirunavukarasu, Arun James; Wong, Yee Ling; Wong, Shiao Wei; Chang, Yin-Hsi; Azzopardi, Matthew; Tan, Benjamin Kye Jyn; Song, Anna; Malem, Andrew; Jain, Nikhil; Zhou, Sean; Tan, Ting Fang; Rauz, Saaeha; Ang, Marcus; Mehta, Jodhbir S; Ting, Daniel Shu Wei; Ting, Darren Shu Jeng
Dates: 2025-12-17; 2025-12-17; 2025-11-13
Citation: Rocha H, Chong YJ, Thirunavukarasu AJ, Wong YL, Wong SW, Chang YH, Azzopardi M, Tan BKJ, Song A, Malem A, Jain N, Zhou S, Tan TF, Rauz S, Ang M, Mehta JS, Ting DSW, Ting DSJ. Performance of Foundation Models vs Physicians in Textual and Multimodal Ophthalmological Questions. JAMA Ophthalmol. 2025 Nov 13:e254255. doi: 10.1001/jamaophthalmol.2025.4255. Epub ahead of print.
ISSN: 2168-6173
DOI: 10.1001/jamaophthalmol.2025.4255
2841079
URL: https://westmid.openrepository.com/handle/20.500.14200/9309

Abstract:

Importance: There is a growing body of literature evaluating the clinical knowledge and reasoning performance of large language models (LLMs) in ophthalmology, but to date, investigations into their multimodal clinical abilities, such as interpreting images and tables, have been limited.

Objective: To evaluate the multimodal performance of the following 7 foundation models (FMs): GPT-4o (OpenAI), Gemini 1.5 Pro (Google), Claude 3.5 Sonnet (Anthropic), Llama-3.2-11B (Meta), DeepSeek V3 (High-Flyer), Qwen2.5-Max (Alibaba Cloud), and Qwen2.5-VL-72B (Alibaba Cloud) in answering offline Fellowship of the Royal College of Ophthalmologists part 2 written multiple-choice textual and multimodal questions, with head-to-head comparisons with physicians.

Design, setting, and participants: This cross-sectional study was conducted between September 2024 and March 2025 using questions sourced from a textbook used as an examination preparation resource for the Fellowship of the Royal College of Ophthalmologists part 2 written examination.

Exposure: FM performance.
Main outcomes and measures: The primary outcome measure was FM accuracy, defined as the proportion of answers generated by the model matching the textbook's labeled letter answer.

Results: For textual questions, Claude 3.5 Sonnet (accuracy, 77.7%) outperformed all other FMs (followed by GPT-4o [accuracy, 69.9%], Qwen2.5-Max [accuracy, 69.3%], DeepSeek V3 [accuracy, 63.2%], Gemini Advanced [accuracy, 62.6%], Qwen2.5-VL-72B [accuracy, 58.3%], and Llama-3.2-11B [accuracy, 50.7%]), ophthalmology trainees (difference, 9.0%; 95% CI, 2.4%-15.6%; P = .01), and junior physicians (difference, 35.2%; 95% CI, 28.3%-41.9%; P < .001), with performance comparable with that of expert ophthalmologists (difference, 1.3%; 95% CI, -5.1% to 7.4%; P = .72). GPT-4o (accuracy, 69.9%) outperformed GPT-4 (OpenAI; difference, 8.5%; 95% CI, 1.1%-15.8%; P = .02) and GPT-3.5 (OpenAI; difference, 21.8%; 95% CI, 14.3%-29.2%; P < .001). For multimodal questions, GPT-4o (accuracy, 57.5%) outperformed all other FMs (Claude 3.5 Sonnet [accuracy, 47.5%], Qwen2.5-VL-72B [accuracy, 45%], Gemini Advanced [accuracy, 35%], and Llama-3.2-11B [accuracy, 25%]) and the junior physician (difference, 15%; 95% CI, -6.7% to 36.7%; P = .18) but was weaker than expert ophthalmologists (accuracy range, 70.0%-85.0%; P = .16) and trainees (accuracy range, 62.5%-80%; P = .35).

Conclusions and relevance: Results of this cross-sectional study suggest that for textual questions, current FMs exhibited notable improvements in ophthalmological knowledge reasoning compared with older LLMs and ophthalmology trainees, with performance comparable with that of expert ophthalmologists. These models demonstrated potential for medical assistance in answering ophthalmological textual queries, but their multimodal abilities remain limited.
Further research or fine-tuning models with diverse ophthalmic multimodal data may lead to more capable applications with multimodal functionalities.

Language: en
Keywords: Ophthalmology; Artificial Intelligence
Title: Performance of foundation models vs physicians in textual and multimodal ophthalmological questions
Type: Article