Is ChatGPT ready to give CRC screening recommendations?

20 Nov 2024 by Stephen Padilla

ChatGPT performs relatively poorly when determining appropriate colorectal cancer (CRC) screening recommendations, giving inconsistent responses to several questions, a study reports.

“Large language models developed with artificial intelligence (AI) require further refinements to serve as reliable assistants in clinical practice,” the researchers said. [J Clin Gastroenterol 2024;58:1022-1027]

In this study, researchers entered 10 multiple-choice questions, comprising five CRC screening and five CRC surveillance clinical vignettes, into ChatGPT 3.5 in four separate sessions. They then recorded and reviewed the responses to determine the tool's reliability.
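To make the protocol concrete, the sketch below shows how a similar consistency check could be replayed through the OpenAI API. This is an approximation, not the authors' setup: the study used the ChatGPT 3.5 web interface, and the model name, vignette text, and session handling here are illustrative assumptions.

```python
# Hypothetical replay of the study protocol via the OpenAI API.
# The study itself used the ChatGPT 3.5 web interface; "gpt-3.5-turbo"
# and the vignette texts below are stand-ins, not the authors' materials.
from collections import defaultdict
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vignettes = {
    "Q1": "A 52-year-old patient at average risk of CRC asks when to begin screening ...",
    # ... the remaining nine multiple-choice vignettes would go here
}

N_SESSIONS = 4  # the study repeated each question in four separate sessions
answers = defaultdict(list)

for _session in range(N_SESSIONS):
    for qid, vignette in vignettes.items():
        # Each stateless create() call approximates a fresh chat session.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": vignette}],
        )
        answers[qid].append(response.choices[0].message.content)

# A question counts as inconsistent if the sessions disagree.
inconsistent = [qid for qid, given in answers.items() if len(set(given)) > 1]
print(f"Questions with inconsistent answers: {inconsistent}")
```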

Subsequently, researchers compared ChatGPT's mean number of correct answers with that of a control group of gastroenterologists and colorectal surgeons, who answered the same questions with and without the help of a previously validated CRC screening mobile app.

ChatGPT achieved an overall performance of 45 percent on average, with a mean of 2.75 (95 percent confidence interval [CI], 2.26–3.24) correct answers for screening, 1.75 (95 percent CI, 1.26–2.24) for surveillance, and 4.5 (95 percent CI, 3.93–5.07) overall. The tool was also inconsistent, giving different answers to four of the questions across sessions.
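For readers checking the arithmetic, the snippet below shows one way such a mean and 95 percent CI can be derived from per-session scores. The scores are hypothetical, since the raw per-session data are not given in this summary, and the normal-approximation interval is an assumption about the authors' method.

```python
# Reconstructing a mean and 95 percent CI from per-session scores.
# The scores below are hypothetical placeholders, not the study's data.
import numpy as np
from scipy import stats

total_correct = np.array([5, 4, 4, 5])  # hypothetical correct answers per session (of 10)

mean = total_correct.mean()
sem = stats.sem(total_correct)  # sample standard error of the mean

# Normal-approximation 95 percent CI (z ~= 1.96); a t-based interval
# (stats.t.interval) would be wider at n = 4.
z = stats.norm.ppf(0.975)
print(f"mean {mean:.2f}, 95% CI {mean - z * sem:.2f} to {mean + z * sem:.2f}")
# -> mean 4.50, 95% CI 3.93 to 5.07
```

With these illustrative scores the interval happens to match the reported total CI, but the authors' exact computation is not described in this report.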

A total of 238 physicians answered the same questions; of these, 115 (48.3 percent) used the mobile app and 123 (51.7 percent) did not. ChatGPT had a significantly lower mean number of total correct answers than physicians both with (7.71; 95 percent CI, 7.39–8.03; p<0.001) and without the app (5.62; 95 percent CI, 5.32–5.92).
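The report does not name the statistical test behind the p-value; the sketch below uses a two-sample Welch t-test as one plausible choice, with all numbers simulated placeholders rather than study data.

```python
# Illustrative group comparison; the article reports p<0.001 but not the
# test used. A Welch t-test is one plausible choice. All data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
chatgpt_totals = np.array([5, 4, 4, 5])                 # hypothetical per-session totals
app_group = rng.normal(loc=7.71, scale=1.7, size=115)   # simulated physician scores

t_stat, p_value = stats.ttest_ind(chatgpt_totals, app_group, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p = {p_value:.3g}")
```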

“Advanced language models developed by AI, such as ChatGPT, may provide useful assistance in various domains of healthcare and medicine,” the researchers said. “Unfortunately, current evidence suggests that this platform requires further refinements to match the performance of qualified physicians.” [Front Artif Intell 2023;6:1169595]

In previous tests, ChatGPT also demonstrated a relatively poor performance on the Royal College of General Practitioners Applied Knowledge Test and on the United States Medical Licensing Examination. It also failed the American College of Gastroenterology self-assessment test. [JMIR Med Educ 2023;9:e46599; JMIR Med Educ 2023;9:e45312; Am J Gastroenterol 2023;118:2280-2282]

"Furthermore, ChatGPT was not able to meticulously interpret the interaction of multiple variables in complex questions (eg, question 5 with a hereditary CRC syndrome case or question 10 describing a patient with serrated polyposis syndrome), which dictates the appropriate screening recommendation,” the researchers said. “In these questions, the answer was incorrect in all the attempts.”

This study, however, had several limitations. First, the set of 10 clinical vignettes was relatively small for assessing accuracy. Second, the ChatGPT version used was not the latest model at the time of publication. Third, no fine-tuning was attempted. Finally, recommendations may differ across medical societies, which could affect the accuracy of the answers.

"The use of validated mobile apps with decision-making algorithms could serve as more reliable assistants until large language models developed with AI are further refined,” the researchers said.