AI for Healthcare: Benefits, Limits and Risks (2025)
Introduction
AI Chatbot Technology
A chatbot consists of two main components: a general-purpose AI system and a chat interface. This article considers specifically an AI system called GPT-4 (Generative Pretrained Transformer 4) with a chat interface; this system is widely available and in active development by OpenAI, an AI research and deployment company. (5)
Figure 1: An Example Conversation with GPT-4.
AI Chatbots and Medical Applications
OpenAI, with support from Microsoft, has been developing a series of increasingly powerful AI systems, among which GPT-4 is the most advanced that has been publicly released as of March 2023. Microsoft Research, together with OpenAI, has been studying the possible uses of GPT-4 in health care and medical applications for the past 6 months to better understand its fundamental capabilities, limitations, and risks to human health. Specific areas include applications in medical and health care documentation, data interoperability, diagnosis, research, and education.
Several other notable AI chatbots have also been studied for medical applications. Two of the most notable are LaMDA (Google)7 and GPT-3.5,8 the predecessor system to GPT-4. Interestingly, LaMDA, GPT-3.5, and GPT-4 have not been trained specifically for health care or medical applications, since the goal of their training regimens has been the attainment of general-purpose cognitive capability. Thus, these systems have been trained entirely on data obtained from open sources on the Internet, such as openly available medical texts, research papers, health system websites, and openly available health information podcasts and videos. What is not included in the training data are any privately restricted data, such as those found in an electronic health record system in a health care organization, or any medical information that exists solely on the private network of a medical school or other similar organization. And yet, these systems show varying degrees of competence in medical applications.
Figure 2: Using GPT-4 to Assist in Medical Note Taking.
DeepSeek:
- Updated Knowledge Base: As a newer model, DeepSeek may include more recent medical data and guidelines by default. It is like a more up-to-date ChatGPT.
- Advanced Reasoning: It is capable of handling complex, multi-layered clinical queries better than most models. Example: when asked, "For a 50-year-old patient with type 2 diabetes and CKD, what is the recommended BP target and why?", DeepSeek may offer a guideline-based response with explanations.
- More Engaging Outputs: In some tests, DeepSeek provided more compelling, specific responses, e.g., writing elevator pitches or outlining project proposals.
- Features: Requires manual activation of internet search via the "search" button, and of more advanced reasoning via the "deep think" button.
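To make the clinical-query example above concrete, here is a minimal sketch of sending that question to a chat-completion-style API. It assumes DeepSeek's OpenAI-compatible endpoint and the model name `deepseek-chat`; check the provider's documentation before relying on either, and note that the system prompt shown is an illustrative choice, not a vendor recommendation.

```python
import json
import urllib.request

# Assumed OpenAI-compatible endpoint; verify against DeepSeek's API docs.
DEEPSEEK_URL = "https://api.deepseek.com/chat/completions"

def build_clinical_query(question: str, model: str = "deepseek-chat") -> dict:
    """Build a chat-completion payload for a clinical question.

    The system prompt asks for guideline-based answers with reasoning,
    mirroring the BP-target example in the text.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are a clinical assistant. Cite the guideline "
                        "behind each recommendation and explain the rationale."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,  # low temperature for more consistent answers
    }

def ask(question: str, api_key: str) -> str:
    """Send the query and return the model's answer text."""
    req = urllib.request.Request(
        DEEPSEEK_URL,
        data=json.dumps(build_clinical_query(question)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

As with any model output discussed in this article, the returned answer would still need to be checked against the actual guideline before clinical use.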
Perplexity:
Unlike the standalone AI models, Perplexity is an AI-powered search engine—a blend of Google and ChatGPT.
- Real-Time Search with Citations: It answers questions in plain language and provides source links and references. This makes it ideal for quick fact-checking or accessing the latest research.
- Modes: Includes options like "academic" for scholarly sources or "web" for general browsing. You can also upload PDFs for summarization and analysis.
- Ease of Use: No login required for basic use; creating a free account unlocks additional features like saving history or using the mobile app.
- Use Case: If a medical paper or news item came out yesterday, Perplexity will likely find it; ChatGPT may not know about it unless told directly.
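The cited-answer behavior described above can be sketched programmatically. The endpoint URL, the `sonar` model name, and the `citations` field in the response are assumptions based on Perplexity's API conventions and should be verified against its documentation; the `format_with_sources` helper is purely illustrative.

```python
import json
import urllib.request

# Assumed OpenAI-style endpoint and model name; check Perplexity's API docs.
PPLX_URL = "https://api.perplexity.ai/chat/completions"

def format_with_sources(answer: str, sources: list) -> str:
    """Append a numbered source list to an answer, for display or notes."""
    lines = [answer, "", "Sources:"]
    lines += [f"[{i}] {url}" for i, url in enumerate(sources, start=1)]
    return "\n".join(lines)

def fact_check(question: str, api_key: str, model: str = "sonar") -> str:
    """Ask a question; return the answer together with its cited sources."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }).encode()
    req = urllib.request.Request(
        PPLX_URL, data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    answer = data["choices"][0]["message"]["content"]
    # Assumed response field: a list of source URLs backing the answer.
    return format_with_sources(answer, data.get("citations", []))
```

Keeping the source list attached to the answer makes the fact-checking step the article emphasizes much easier: each claim can be traced back to a link.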
ChatGPT could be your all-round assistant for drafting and general queries (fluent, but must be fact-checked). DeepSeek could be the new junior specialist with up-to-date knowledge and sharp reasoning (good for clinical queries). Perplexity could be your AI librarian, answering questions with references. You could use all three depending on the task at hand. For example, you could draft a patient referral letter or a patient education guide with ChatGPT, ask DeepSeek a complicated diagnostic question, and verify a medication dose with Perplexity.
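The division of labor just described can be sketched as a toy routing function. The keywords below are illustrative guesses, not a validated triage scheme, and the tool names are simply labels for the three services discussed above.

```python
def route(task: str) -> str:
    """Pick a tool for a task using a toy keyword heuristic.

    The keyword lists are illustrative only; a real workflow would rely on
    the clinician's judgment, not string matching.
    """
    t = task.lower()
    if any(k in t for k in ("verify", "dose", "latest", "reference")):
        return "perplexity"  # needs live, cited sources
    if any(k in t for k in ("diagnos", "differential", "guideline")):
        return "deepseek"    # complex clinical reasoning
    return "chatgpt"         # drafting and general queries

# Examples:
# route("draft a patient referral letter")  -> "chatgpt"
# route("complicated diagnostic question")  -> "deepseek"
# route("verify a medication dose")         -> "perplexity"
```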
Medical Note Taking
Final Words
Transcripts of conversations with GPT-4 that provide a more comprehensive sense of its abilities are provided in the Supplementary Appendix, including the examples that we reran using the publicly released version of GPT-4 to provide a sense of its evolution as of March of 2023. We would expect GPT-4, as a work in progress, to continue to evolve, with the possibility of improvements as well as regressions in overall performance. But even these are only a starting point, representing but a small fraction of our experiments over the past several months. Our hope is to contribute to what we believe will be an important public discussion about the role of this new type of AI, as well as to understand how our approach to health care and medicine can best evolve alongside its rapid evolution.
Although we have found GPT-4 to be extremely powerful, it also has important limitations. Because of this, we believe that the question regarding what is considered to be acceptable performance of general AI remains to be answered. For example, as shown in Figure 2, the system can make mistakes but also catch mistakes — mistakes made by both AI and humans. Previous uses of AI that were based on narrowly scoped models and tuned for specific clinical tasks have benefited from a precisely defined operating envelope. But how should one evaluate the general intelligence of a tool such as GPT-4? To what extent can the user “trust” GPT-4 or does the reader need to spend time verifying the veracity of what it writes? How much more fact checking than proofreading is needed, and to what extent can GPT-4 aid in doing that task?
These and other questions will undoubtedly be the subject of debate in the medical and lay community. Although we admit our bias as employees of the entities that created GPT-4, we predict that chatbots will be used by medical professionals, as well as by patients, with increasing frequency. Perhaps the most important point is that GPT-4 is not an end in and of itself. It is the opening of a door to new possibilities as well as new risks. We speculate that GPT-4 will soon be followed by even more powerful and capable AI systems — a series of increasingly powerful and generally intelligent machines. These machines are tools, and like all tools, they can be used for good but have the potential to cause harm. If used carefully and with an appropriate degree of caution, these evolving tools have the potential to help health care providers give the best care possible.
References
1. Ker J, Wang L, Rao J, Lim T. Deep learning applications in medical image analysis. IEEE Access 2018;6:9375-9389.
2. Han K, Cao P, Wang Y, et al. A review of approaches for predicting drug-drug interactions based on machine learning. Front Pharmacol 2022;12:814858.
3. Beaulieu-Jones BK, Yuan W, Brat GA, et al. Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians? NPJ Digit Med 2021;4:62.
4. Milosevic N, Thielemann W. Comparison of biomedical relationship extraction methods and models for knowledge graph creation. Journal of Web Semantics, August 7, 2022 (https://arxiv.org/abs/2201.01647).
5. OpenAI. Introducing ChatGPT. November 30, 2022 (https://openai.com/blog/chatgpt).
6. Corbelle JG, Bugarín-Diz A, Alonso-Moral J, Taboada J. Dealing with hallucination and omission in neural Natural Language Generation: a use case on meteorology. In: Proceedings and Abstracts of the 15th International Conference on Natural Language Generation, July 18–22, 2022. Waterville, ME: Arria, 2022.
7. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. arXiv, December 26, 2022 (https://arxiv.org/abs/2212.13138).
8. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2(2):e0000198.
9. Nuance. Automatically document care with the Dragon Ambient eXperience (https://www.nuance.com/healthcare/ambient-clinical-intelligence.html).
10. Kazi N, Kuntz M, Kanewala U, Kahanda I, Bristow C, Arzubi E. Dataset for automated medical transcription. Zenodo, November 18, 2020 (https://zenodo.org/record/4279041#.Y_uCZh_MI2w).
11. Cancarevic I. The US medical licensing examination. In: International medical graduates in the United States. New York: Springer, 2021.