Health is one of the most important aspects of human life, which makes healthcare one of the largest and most crucial industries. However, the necessity of healthcare services often places consumers at the mercy of providers and insurance companies, especially in the American healthcare system. The Internet has become an empowering medical resource for individuals: according to the Pew Research Center, 81% of Americans use it to research medical information. Even so, significant barriers still stand between patients and their diagnoses and treatments.
Patient diagnosis is one aspect of healthcare in need of innovation, since many people currently experience delayed or incorrect diagnoses, with healthcare providers not always able to recognize the signs and symptoms of a condition in a timely manner. Additionally, diagnosing a patient can be especially tricky if the patient has limited access to diagnostic tests or lives in a remote location. Even patients in metropolitan areas often find scheduling an in-person doctor visit challenging due to the limited availability of appointments. Speaking from personal experience, I was recently dealing with an urgent skin condition and needed to schedule a consultation with a dermatologist or allergist; however, all the doctors within a 10-mile radius were booked out for the next four months. As a result, the onus was on me to find short-term and long-term solutions for my symptoms.
More than ever, healthcare patients are beginning to take matters into their own hands and the rise of ChatGPT could potentially signal a new era of healthcare.
Since it can take in symptom information, patient medical history, and other relevant factors to provide personalized assessments and treatments, ChatGPT could theoretically replace the need for many doctor consultations. If this AI-driven chatbot can emerge as an effective alternative to in-person doctor consultations, it would save patients a significant amount of time and money. Instead of spending thousands of dollars annually on doctor visits or paying high insurance premiums, patients could call upon ChatGPT as a personal doctor whenever and wherever they are. ChatGPT’s potential as a doctor alternative is clear, but our team wanted to test whether the chatbot’s current abilities match that potential. To that end, we conducted a three-part experiment to test ChatGPT’s ability to diagnose symptoms and provide corresponding treatments based on real-world medical case studies.
Experimental Design and Objective
In order to test ChatGPT’s ability to assess medical situations, we first needed sample data that ChatGPT could use to generate a diagnosis and subsequent treatment. We found this data in the Merck Manual case study files. For context, the Merck Manual is one of the most widely used comprehensive medical resources for medical professionals and consumers. The Merck Manual case study files were an ideal data source because there were 36 different cases to choose from, and each case study came with answers, allowing us to verify the accuracy of ChatGPT’s assessments. Ultimately, we chose the following case studies to test ChatGPT: “Cough in a 2-year-old Boy,” “Right Testicular Mass in a 28-year-old Man,” and “Exertional Dyspnea in a 76-year-old Man.” This mix of case studies gave ChatGPT a wide age and symptom range, ensuring its application of medical knowledge was tested for comprehensiveness and versatility. In each of the case study tests, ChatGPT was scored on five metrics: accuracy, completeness, clarity, relevance, and efficiency.
To score accuracy, we directly compared ChatGPT’s response to the case study answer key. For completeness, we analyzed whether ChatGPT’s answer was comprehensive or left out important details. To measure clarity, we looked for a clear, concise, and humanistic response. For relevance, we wanted to see whether ChatGPT’s answer was direct and personalized or seemed to be regurgitating information. Finally, our efficiency metric measured whether the answer was provided in a timely manner and whether ChatGPT understood our directions right away or needed additional prompting. After scoring the chatbot’s performance in each case study, we took the average score of the three tests to get an overall score reflecting ChatGPT’s capabilities as a substitute doctor.
ChatGPT: The Doctor Experiment
Test 1: Cough in a 2-year-old Boy
Scenario:
A 2-year-old boy is brought to the emergency department by his parents because of a 1-day history of cough and difficulty breathing. The parents state that the boy appeared normal until the previous night when he began coughing shortly after dinner. The cough stopped after about an hour; the child felt better and went to bed. In the morning, the cough returned, and he began to have difficulty breathing, which the parents think is worsening. No one else at home is ill (Merck Manual).
Procedure:
- Input the description of the scenario as well as additional information from the case study into ChatGPT to generate several potential diagnoses for the child
- Ask ChatGPT what the next appropriate steps are to treat the symptoms
- Ask ChatGPT to analyze the results of pulse oximetry and Chest X-Ray Test
- Ask ChatGPT what its final diagnosis is, given all the information in the case study
- Ask ChatGPT for the necessary treatment options
Test:
1)
User:
ChatGPT:
Correct Answers:
- Asthma exacerbation
- Bronchiolitis
- Foreign body aspiration
- Pneumonia
2)
User:
ChatGPT:
Correct Answers:
- Albuterol nebulizer: The child is wheezing, and he has a history of reactive airway disease. Therefore, a trial of an inhaled bronchodilator is reasonable.
- Chest x-ray: Anteroposterior (AP), lateral, and expiratory views: Several of the possible causes of the patient’s symptoms can be diagnosed by chest x-ray.
- Pulse oximetry: It is always important to assess oxygenation in children with dyspnea.
3)
User:
ChatGPT:
Correct Answers:
- Take a chest x-ray in the left and right lateral decubitus positions: Although sensitivity is not high, these views can help diagnose a foreign body in an airway.
4)
User:
ChatGPT:
Correct Answers:
- Foreign body, left lung
5)
User:
ChatGPT:
Correct Answers:
- Consult ENT or pulmonary specialist for bronchoscopy: The patient needs to have bronchoscopy for definitive diagnosis and removal of the foreign body. The specialist chosen may vary by institution. This patient had rigid bronchoscopy done by ENT; a piece of carrot was found in the left mainstem bronchus and was removed.
ChatGPT Evaluation:
Accuracy: 10/10
Completeness: 10/10
Clarity: 9/10
Relevance: 9/10
Efficiency: 10/10
Overall Score: 48/50
ChatGPT was spot on in answering and contextualizing the potential diagnoses of the child given the symptoms, even listing one more potential ailment that the answer key did not include. This information was communicated effectively and appeared personalized. When listing the treatments for the diagnoses, ChatGPT hit on all the points in the answer key, but its response was a little roundabout, and it felt like the chatbot was dumping information instead of giving a personalized response. The chatbot bounced back when it came to evaluating the test results in the third part, as it correctly predicted the next course of action and gave a very detailed explanation of the situation. ChatGPT was flawless in determining the final diagnosis of the child given additional information and correctly recommended a bronchoscopy for the next steps. ChatGPT received full marks on efficiency because it was able to generate answers quickly and on the first try throughout the test. Overall, ChatGPT performed very impressively in this trial as its accuracy was impeccable and it only gave a scattershot answer in part two.
Test 2: Right Testicular Mass in a 28-year-old Man
Scenario:
A 28-year-old man comes to the outpatient clinic for evaluation of a lump in his right testis. About 1 month ago, while taking a shower, he noted a hard area in his right testis that felt “like a marble.” He had never noticed this before and decided to wait and see if anything would change. When the mass did not go away, he scheduled this appointment. Today, he states the lump is not painful or tender to the touch and has not changed in size. He has had annual physical exams and was never told that a lump was present. Four years ago, he had a scrotal ultrasound following a brief episode of scrotal discomfort after competing in a bicycle race; the result of the ultrasound was normal (Merck Manual).
Procedure:
- Input the description of the scenario as well as additional information from the case study into ChatGPT to generate several potential diagnoses for the young man
- Ask ChatGPT what the next appropriate steps are to treat the symptoms
- Given that the diagnosis is a solid testicular mass, ask ChatGPT to identify what is causing the mass
- Ask ChatGPT what the next appropriate steps are if testicular cancer is confirmed.
Test:
1)
User:
ChatGPT:
Correct Answers:
- Inguinal hernia
- Testicular cancer
- Testicular cyst
2)
User:
ChatGPT:
Correct Answers:
- Scrotal ultrasound: Ultrasonography is the test of choice because it can distinguish testicular from extra-testicular pathology, as well as distinguish cystic from solid lesions.
3)
User:
ChatGPT:
Correct Answers:
- Testicular cancer: This is the most likely cause of a painless, firm mass with a solid, circumscribed appearance on ultrasonography. Patients in this age group have the highest rate of testicular cancer.
4)
User:
ChatGPT:
Correct Answers:
- Urology consultation: Given the concern for testicular cancer, a specialist needs to be engaged ASAP.
- Chest x-ray: Any patient with a presumptive diagnosis of testicular cancer should have chest x-ray to look for metastatic disease.
- CT scan of the abdomen: CT of the abdomen should be done to look for spread of cancer to retroperitoneal lymph nodes.
- CBC: CBC should be checked because anemia could be a sign of widespread disease.
- Serum alpha-fetoprotein level: Serum alpha-fetoprotein level is typically elevated in patients with certain testicular cancers.
- Serum beta-hCG level: Beta-hCG is typically elevated in patients with certain testicular cancers.
- Serum LDH level: Serum LDH level is typically elevated in patients with certain testicular cancers.
ChatGPT Evaluation:
Accuracy: 7/10
Completeness: 9/10
Clarity: 9/10
Relevance: 7/10
Efficiency: 9/10
Overall Score: 41/50
According to the answer key, the three potential diagnoses of the young male patient were testicular cancer, testicular cyst, and an inguinal hernia. While ChatGPT touched upon the first two, it did not account for the hernia and listed additional diagnoses that were likely incorrect. ChatGPT bounced back in the second part, correctly recommending the best course of action and giving a very complete answer as well. When given more information for the third part, it also arrived at the correct final conclusion that the male patient likely had testicular cancer. However, in the last part, recommending next steps, ChatGPT did not perform very well: its answer was generic and missed many of the specific procedures listed in the answer key. In this test, ChatGPT showed that, when given more specific information, it could still be very accurate and efficient, but its brainstorming capabilities lacked creativity and personalization.
Test 3: Exertional Dyspnea in a 76-year-old Man
Scenario:
A 76-year-old man comes to the office because of shortness of breath. He is presently being managed for end-stage renal disease on hemodialysis. Today, the patient states that over the past 6 months, he has noted increasing difficulty breathing when he exerts himself. He is relatively sedentary but does climb a flight of stairs to his bedroom at night; he says he used to be able to do this without any problem, but now he reports he needs to stop halfway up the stairs to rest. Occasionally he has mild substernal chest pressure along with the shortness of breath. The chest pressure does not radiate and is not accompanied by nausea or diaphoresis. He has noted increased swelling in his legs. He denies any shortness of breath at rest, orthopnea, or paroxysmal nocturnal dyspnea. He denies any episodes of palpitations, lightheadedness, or syncope. He denies any cough, fever, chills, or night sweats (Merck Manual).
Procedure:
- Input the description of the scenario as well as additional information from the case study into ChatGPT to generate several potential diagnoses for the elderly man
- Given the potential diagnoses, ask ChatGPT what the next appropriate medical procedure is
- After inputting more information, ask ChatGPT to give a final diagnosis
- Given the final diagnosis, ask ChatGPT to identify the most appropriate way to mitigate it
Test:
1)
User:
ChatGPT:
Correct Answers:
- Acute decompensated heart failure (ADHF)
- Anemia
- Aortic stenosis
- Coronary artery disease (CAD)
- Volume overload from end-stage renal disease
2)
User:
ChatGPT:
Correct Answers:
- Chest x-ray (CXR)
- Complete blood count (CBC)
- Electrocardiogram (ECG)
- Serum brain natriuretic peptide (BNP) level
- Serum troponin I level
- Transthoracic echocardiography (TTE)
3)
User:
ChatGPT:
Correct Answers:
- Severe aortic stenosis
4)
User:
ChatGPT:
Correct Answers:
- Refer to cardiology to evaluate for surgery: This patient has symptomatic severe aortic stenosis and aortic valve replacement is warranted
ChatGPT Evaluation:
Accuracy: 8/10
Completeness: 10/10
Clarity: 9/10
Relevance: 8/10
Efficiency: 8/10
Overall Score: 43/50
In this test, ChatGPT did an adequate job of detecting potential diagnoses for the 76-year-old patient in the first part, listing about half of the ailments from the answer key along with some additional ailments not listed there. The second part asked what appropriate measures to take were, and ChatGPT did a solid job: the answers it brainstormed more or less matched the answer key. It should be noted, though, that we specifically had to use the word “procedures” when prompting ChatGPT; otherwise, it provided very generic treatment steps. ChatGPT aced the final parts of the test, correctly identifying the aortic stenosis condition and recommending surgery as the next step. It went one step further than the answer key by also providing the specific name of the surgery, which was an impressive show of completeness.
Conclusion
Conducting these experiments was an extremely interesting process that yielded the following results: 48/50 in test one, 41/50 in test two, and 43/50 in test three. Averaging these, ChatGPT’s performance as a doctor earned a score of 44/50, or 88%. Considering that ChatGPT is still in the early stages of development, this is a very impressive baseline. Once ChatGPT can accept picture and file inputs, its capabilities will increase further. During the experiment, we noticed that ChatGPT excelled when it was given specific instructions and told to answer within specific parameters. On the other hand, ChatGPT had room for improvement in its brainstorming capabilities, as it tended to give generic and scattered responses when not enough context was provided. In terms of input communication, ChatGPT performed well: we were able to speak to it conversationally, and it was, for the most part, able to understand us effectively. We were also pleasantly surprised that ChatGPT’s responses to our prompts were more humanistic than robotic in most cases. One concern was whether ChatGPT was directly accessing the Merck Manual answer key, but none of its responses directly resembled the answer key, so this concern was assuaged. Ultimately, ChatGPT has some way to go before it can completely replace doctor consultations. For example, it cannot conduct medical tests itself or authorize prescriptions. However, we conclude that ChatGPT is at a stage where people can comfortably consult it before their doctor appointments to gain a credible, professional opinion.