Introduction
Isaac (Zak) Kohane is a renowned leader in biomedical informatics at Harvard Medical School, helping the medical community bridge the digital transformation. He is the co-author of The AI Revolution in Medicine: GPT-4 and Beyond (with Peter Lee and Carey Goldberg). He is also the founding editor-in-chief of NEJM AI, an online journal that spun out of the New England Journal of Medicine (NEJM) in 2023.
In this interview, recorded for the State of AI in Precision Oncology virtual summit (December 16, 2025), Doug Flora sits down with Zak Kohane for the first time to talk about the ethics of AI in cancer and publishing.
This interview has been lightly edited for length and clarity by AI and human editors.
Doug Flora: Zak, as the leader of NEJM AI, you are leading conversations about academic publishing in this area. We’re no longer just talking about the technical accuracy of AI-assisted research or about hallucinations, which thankfully are going down. I want to discuss some of the ethical dimensions of its use.
I wanted to start with something you put in Chapter 10 of your book, a science fiction short story called The Little Black Bag, by Cyril Kornbluth. He talks about an inept alcoholic physician who receives a bag of medical devices from the future and then is immediately able to heal patients using newer tools. This is a cool construct for where we are in AI today, because now we have those tools. The lesson in the story was don’t let the tools be compromised by the human vices of ignorance and greed and exploitation. How did you happen upon that?
Zak Kohane: I was always a big science fiction reader, as was my co-author, Carey Goldberg. When she mentioned the story about the bag, I said, “Perfect, it is the right metaphor.” It is remarkably useful to have a science fiction background for understanding our current moment.
I trained as a pediatric endocrinologist but also have a PhD in computer science, working on AI back in the 1980s. If you told me back then that the main way to understand how AI works is by asking it questions rather than delving into the neural network, I would have been surprised. The only way we have a good sense of what these AI models are is by asking them as if they are human beings. This follows the path of Isaac Asimov’s I, Robot series. One unheralded character is Susan Calvin, who is described as a robo-psychologist. She debugs the robots because they are too complicated to understand at the neuronal level, similar to how individual neurons do not add up to our thoughts. Many metaphors in science fiction are helpful in thinking about the current moment…
There is another part that is more relevant: a lot of science fiction talks about the disappearance of expertise. One of the pioneers of AI, back in 2016, was predicting that we would be replacing radiologists within 4–6 years. Of course, we have not. In fact, we have the opposite problem: we don’t have enough doctors. We don’t fill half the pediatric endocrinologist slots nationwide every year, and this is true of many pediatric subspecialties. In primary care, the issue is pronounced. In Boston, our residents at Mass General Hospital can’t find primary care, as all the primary care practices are closed. Early science fiction stories talked about the disappearance of expertise and the need for this, portraying people as de-skilled dummies. I think we’re heading in that direction.
Flora: In that short story, they talked about ‘the marching morons’. I don’t want to become one of those, and that is why we both edit journals—because we want to disseminate knowledge… You have spent most of your career harnessing data to advance medicine. NEJM AI was a huge institutional commitment, particularly for the NEJM group—the epitome of academic rigor and conservatism. I’d love to know about the forces that convinced the institution that you could not contain these ideas within the flagship journal and needed its own.
Kohane: That’s a great question. From the outside, I would have asked exactly the same question. It turns out it’s just the opposite. About 6–7 years ago, Jeffrey Drazen and Eric Rubin approached me to start a journal in AI, and I said, ‘Absolutely not!’ I told them there were not enough good publications that clinicians would care about. They asked me to join the editorial board to better deal with submissions, so I joined the editorial board of the New England Journal of Medicine.
In 2020, I finally agreed that we could start thinking about it. They were very gung-ho. As we were getting ready to start the journal, I got a call from Peter Lee (head of Microsoft Research) in October 2022. He was letting me in on a secret, saying if I answered his phone call, I would find it worthwhile. To make a long story short, I accepted the call, and before I had heard of ChatGPT, he introduced me to GPT-4 (then codenamed DaVinci 3). I asked it a bunch of hard pediatric endocrine questions, and it got them right. Back then, it was not as aligned to behave in an obsequious way as it is currently, and it argued with me about diagnoses in an impertinent way, which I enjoyed. I told Jeff and Eric (now editor-in-chief at NEJM) that they needed to see this. They immediately saw that this was the future. When we started, everybody was wowed by it, and smart doctors in February 2023 were already using it to get prior authorization letters out to insurers. We got lucky about the timing as well.
Flora: It’s a great story. I was going through the same thing with oncology; we tend to lean forward into technologies a lot. Many FDA-approved devices at that time were radiology or pathology devices, which are fundamental to what I do. Now we live in a system where everybody’s having their mammograms over-read by AI, as well as CT scans, lung nodule detection, sepsis predictors, and GI Genius colonoscopy. It has entered the fold rapidly.
Kohane: Doug, in your institution, are you really using AI to over-read the radiology?
Flora: Overread is probably a relative thing. We use tools to augment the radiologist. On the mammography tool, for instance, it’s red, yellow, and green. The green has the negative predictive value of a D-dimer for PE. Red has some false positives that require adult nuance in a board-certified, thoughtful, intuitive way. Yellow makes them slow down and look. The tools are getting better, and you have a strong one coming out of Connie Lehman’s lab at Mass General Hospital.
Kohane: I am absolutely gung-ho about the future of AI in medicine. We are seeing rapid acceleration in the use of it for increased billing and moderation of reimbursement. The surprise is that health care systems are willing to pay for ambient dictation. However, for clinical applications, adoption has not been as widespread. Convolutional neural networks showed good performance in dermatology, radiology, and pathology back in 2018, but they are still not being used widely. This tells us where the health care system and doctors think there is value. If you go into most hospitals, you see open screens to an AI on most nursing stations and laptops that have not been regulated or viewed by any hospital authorities.
This AI is also sending targeted advertisements to doctors. Doctors love OpenEvidence, even though they know it sometimes hallucinates, because it is so useful. They keep using it even though it is not approved by any hospital. The contrast between the slow uptake of radiology, pathology, and dermatology AI, and the overnight adoption of OpenEvidence (a $12-billion company), tells us something about the underlying motivations. Oncologists are leading the clinical use, both because the stakes are high and because oncology has a culture of knowledge organizations. The NCCN [National Comprehensive Cancer Network] guidelines, for example, have a level of detail not seen in other professions. This is a reflection of what is being adopted and what is not.
Flora: We’re seeing the same thing across the country. Most doctors I know are using OpenEvidence, and other tools like Clinical Key AI through Epic and UpToDate AI are coming. Debra Patt (Texas Oncology) has 6 years of data using clinical decision support for over 1,000 clinicians. They are publishing frequently on the role of these tools in reducing ER utilization and readmission rates. The technology is hurtling forward. We need to make sure that we don’t let the hype outpace the reality, especially concerning vendors versus published data. Studies have shown that at least 13% of indexed abstracts have been processed by, or written with contributions from, an LLM. A recent major journal had to retract 129 articles due to undisclosed LLM use. The integrity of the academic record is currently under some strain, and we are trying to stay ahead of that conversation.
Since 2022, we have been reading the word ‘delve’, along with ‘beacon’ and ‘tapestry’, far more often in the medical literature. Authors are using these tools, and as an editor-in-chief, you have to set an editorial direction for what you will and won’t tolerate. How do you approach this from an authorship standpoint, and how is your board deciding what level of disclosure to require from authors? There is a huge difference between an article completely generated on Claude and somebody who uses Grammarly to remove extra commas.
Kohane: This is a hot-button issue where well-intentioned individuals can take different perspectives. When our journal started, Science magazine mandated no use of AI by authors. Our editorial board, which has many AI researchers, told me we should allow it, so long as people declare the use of it. I decided that NEJM AI is not publishing literature; we are publishing scientific results, and we want our authors to stand behind their science. We realize there is a fine line. If someone is overusing these models, the science itself could be compromised.
As an example, I recently showed data at the Congress of Peer Review demonstrating that papers by authors with a higher H-index were more likely to be withdrawn. Statistical tests showed these data were highly likely to be manufactured. I confirmed that I had manufactured the data and, after several tries, was able to coerce GPT-5 to create better, cleaned-up data that supported the hypothesis and could pass every statistical test. Therefore, while using AI to help generate text is acceptable, and I argue that it levels the playing field for non-native English speakers who might otherwise lack professional phrasing, there is a very blurred line. You could easily cross into having the AI make your science look better than it really deserves to look, or even into outright fraud.
Flora: I love that you mention non-first-language English speakers. We get many submissions from India that are high-quality science but not readable. There is a blurred, ethical line between disclosable AI use and minor language refinement. I made a point in the acknowledgment section of my book by detailing exactly what ChatGPT did for my outline and using Perplexity for historical data. Disclosing this is not terribly different than using a medical student or research assistant. We need to stop apologizing for using these tools. Instead, we should thank users for giving time back by doing a job in an hour and a half that might otherwise take 2 weeks.
Kohane: When we wrote our book, the early, less aligned version of GPT-4 insisted it should be a first author. We refused, though it had a point! Many studies have shown that for equal contributions by AI, humans give much less credit to the AI.
Flora: Your journal policy is clear: authors are accountable. Chatbots are not authors, but disclosure is important. We think disclosure is key, but many still fear a negative perception or retaliation, so AI use is very underreported.
Kohane: Let’s talk about the other bugaboo: recent studies suggest that in many journals, 20–25% of reviews are AI-assisted. Many of my students use various AI tools, notably Google’s NotebookLM, to quickly read articles in their research area. This tool generates little podcasts for each article, allowing them to catch up. On the other hand, editors know we receive many crappy reviews where reviewers either misunderstood the paper or did not ask the hard questions…
At a conference I run (SAIL), I had invited editors from AI journals, including myself, Lancet Digital Health, JAMA AI, and Nature Medicine. When asked when we would allow AI reviews, I said, within 2 years. My board went for the most aggressive scenario. We selected preprints we thought had a high probability of being published and reached out to the authors. We guaranteed a decision (accept or reject) in 7 days, but they had to agree to let us use an AI as a reviewer. We used one human editor and two AIs to do the review. We selected two good ambient dictation randomized controlled trials, and both accepted. The human editor did a good job, identified strengths and weaknesses, and recommended publication with minor revision. The two AIs made no mistakes but focused on different things. One focused on statistical analysis, and the other focused on generalizability. All three agreed, and the editorial board discussed the paper and agreed to publish, incorporating recommendations from both LLMs. The two papers, along with all the reviews by the humans and the two AIs, will be available online as supplements this coming month. Frankly, it was very good.
Flora: Fascinating. As our facility with the tools improves, more people are adept at writing thoughtful prompts. You can build in commands to take out the use of words like delve and tapestry. This stylistic drift, or excess vocabulary, is actually more concerning to me, because it is crappy writing.
Kohane: That’s right. We also share all the prompts that we used. We are approaching an interesting era where AI can actually speed up and make the review process more rigorous (with humans supervising), but it can also make cheating easier and make crappy science look better than it is. This is a huge challenge. I would much rather be part of the group that explores this with our readers and reviewers than wait to be hit in an unexpected way.
Flora: We need to be careful what we use for our control group when talking about the quality of peer review. I’ve gotten some lazy peer reviews that are disappointing. The ones that are generated by large language models are decent. How can we shift the view of AI use in responsible scientific academic publishing from a deficit to a responsible enhancement?
Kohane: For the near future, we need to keep a well-seasoned reviewer-editor in the loop. The AIs will point out things the human hasn’t thought about, and things the AIs get wrong can be corrected by the human. It behooves us to take advantage of AI because it is an extender and catches things that humans miss. We should use these tools with a human in the loop to deliver an important value: getting the reviews out faster. It is sad that in fast-moving fields, things that make a difference take half a year or a year to publish. If we use AI responsibly, we get things out faster. But here is the problem: there are a bunch of predatory journals. The editors of these journals may not have concerns about integrity. They can churn out publications really fast, incentivized to use a robotic editor and reviewer. There is a certainty that many journals will be contaminating our literature with unsupervised publications.
Flora: Let’s go back to The Little Black Bag. In the story, this physician’s practice was elevated by the tools he received from the future. Ultimately, his greedy partner wanted to take advantage of the tools for her own wealth and did not have the best interests of the patient at heart. As we look forward, we will continue to publish the best science we can within the restrictions we’ve discussed. Where do you see NEJM AI going in the future?
Kohane: Look up NEJM AI Grand Rounds; we try to get state-of-the-art reviews. Medicine is becoming more corporatized, leading to financial pressures on both institutions and payers. AI is currently being used on both sides for billing and reimbursement. We have ads directed to doctors through tools like OpenEvidence. If you don’t think there is going to be a thumb on the scale on all sides using AI, you are mistaken. As these models become core parts of the health care business, all actors will try to maximize their own interests. Functionally, this means pushing things in directions that may not be in the patient’s best interests. My research in the Human Values Project focuses on finding out the values behind these AIs, especially as actors figure out how to maximize their own interests.
Flora: Zak, as a less experienced editor, I appreciate your leadership! We appreciate efforts to make sure we are responsible, avoid the hype, and publish the things you need to read, ensuring there is a human at the wheel.
Kohane: Thank you.