Abstract

We recently read the joint report from the Nuclear Threat Initiative and the World Economic Forum titled “Biosecurity Innovation and Risk Reduction: A Global Framework for Accessible, Safe and Secure DNA Synthesis” and found it thought provoking. 1 As someone who has been working on this issue for over a decade, I (RC) agree that much needs to be done to globalize the US government's Screening Framework Guidance for Providers of Synthetic Double-Stranded DNA, 2 to foster implementation, and to update the guidance to accommodate changes in technology, such as the advent of desktop synthesizers. This white paper is a step in the right direction.
However, one concern I have is that the report repeats the oft-cited desire for a database of sequences of concern, a desire generally premised on the idea that a system that flags a housekeeping gene from a pathogen is of little value. In fact, an order from an unknown customer for housekeeping genes that happen to be more similar to those of a dangerous pathogen than to those of any near neighbor is a significant signal of malicious intent.
A legitimate scientist studying a host-pathogen relationship will often use the components of the pathogen that are involved in that system, and therefore many flagged orders are due to legitimate research on pathogens. However, if one wants to study DNA replication or cell signaling, one generally studies these phenomena in nonpathogenic organisms, and there is very little scientific reason to request the sequence from a pathogen (unless the research involves compiling all known enzymes of a type, but this effort would be immediately apparent in the other sequences ordered by the same customer).
An order from an unknown actor for a housekeeping gene (or a whole genome segment that is not involved in pathogenicity) that is more similar to that of a pathogen than to that of any near neighbor is of questionable scientific value and therefore highly suspicious. Such a hit might not be detected when screening against a curated database of sequences of concern. It would be detected under the current Screening Framework Guidance, which recommends screening against a public database such as GenBank to identify the most similar known sequence (the best match).
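The difference between the two approaches can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: the sequences are invented toy strings, `identity` is a crude position-by-position comparison rather than a real alignment tool, and the 0.8 cutoff is an arbitrary assumption. It is meant only to show why a pathogen housekeeping gene can slip past a curated list yet be caught by a best-match search against all known sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    name: str
    from_pathogen: bool
    sequence: str


def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of matching positions (not a real alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest


def best_match(query: str, references: list[Reference]) -> Reference:
    """Return the most similar known sequence, as in a best-match screen."""
    return max(references, key=lambda r: identity(query, r.sequence))


def flag_by_best_match(query: str, references: list[Reference]) -> bool:
    """Flag the order if its closest known relative comes from a pathogen."""
    return best_match(query, references).from_pathogen


def flag_by_curated_database(query: str, curated: list[Reference],
                             threshold: float) -> bool:
    """Flag the order only if it resembles a listed sequence of concern."""
    return any(identity(query, r.sequence) >= threshold for r in curated)


# Hypothetical toy data: all names and sequences are invented.
PATHOGEN_HOUSEKEEPING = Reference("pathogen_housekeeping", True,
                                  "ATGGCTAAAGGTCTGACCGA")
NEIGHBOR_HOUSEKEEPING = Reference("neighbor_housekeeping", False,
                                  "ATGGCTAAAGGACTTACGGA")
PATHOGEN_TOXIN = Reference("pathogen_toxin", True,
                           "ATGCCCTTTGAGAGTCCATG")

ALL_KNOWN = [PATHOGEN_HOUSEKEEPING, NEIGHBOR_HOUSEKEEPING, PATHOGEN_TOXIN]
CURATED = [PATHOGEN_TOXIN]  # a curated list holds only pathogenicity genes

# An order that is a near copy of the pathogen's housekeeping gene.
QUERY = "ATGGCTAAAGGTCTGACCGT"

best_match_flag = flag_by_best_match(QUERY, ALL_KNOWN)
curated_flag = flag_by_curated_database(QUERY, CURATED, threshold=0.8)
```

In this toy case the best-match screen flags the order because its closest known relative is the pathogen's own housekeeping gene, while the curated screen stays silent because the only listed sequence is an unrelated toxin gene.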
Similarly, a significant and often overlooked step in building a useful database of sequences of concern is developing the similarity thresholds needed to screen against it (a step that is unnecessary when screening against all known sequences). To limit false positives and false negatives, a threshold must be set on a pathogen-by-pathogen basis, because some pathogens are highly diverse while others are very similar to nonpathogenic near neighbors.
Worse still, as the community sequences more and more pathogens, the diversity of pathogens that must be captured increases, forcing the threshold down. Simultaneously, as more and more near neighbors are sequenced (and strains are discovered that are similar to pathogens but don't cause disease), the threshold is pushed up. The ongoing curation and bioinformatics effort this requires is surprisingly burdensome.
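The squeeze on a per-pathogen threshold can be made concrete with a small sketch. It assumes a deliberately simplified model in which each sequenced strain is summarized by a single similarity score to a reference sequence of concern; the function name `threshold_window` and all of the numbers are hypothetical.

```python
def threshold_window(pathogen_scores, neighbor_scores):
    """Return the (lower, upper) range of workable thresholds, or None.

    A threshold catches every known pathogen strain only if it is at or
    below the lowest pathogen score, and it avoids flagging nonpathogenic
    near neighbors only if it is above the highest neighbor score.
    """
    upper = min(pathogen_scores)   # more pathogen diversity pulls this down
    lower = max(neighbor_scores)   # more sequenced neighbors push this up
    return (lower, upper) if lower < upper else None


# Hypothetical scores before further sequencing: a workable window exists.
early = threshold_window([0.99, 0.97], [0.90, 0.92])

# After a divergent pathogen strain (0.93) and a closer nonpathogenic
# neighbor (0.94) are sequenced, no single threshold separates the groups.
later = threshold_window([0.99, 0.97, 0.93], [0.90, 0.92, 0.94])
```

Under these assumed numbers the early window is the range between 0.92 and 0.97, and the later call returns `None`, which is the situation in which curators must either accept errors or redesign the entry for that pathogen.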
Furthermore, even if the community still regards a database as having value in this context, more thought should be given to its biosecurity implications. As the NTI/WEF report acknowledges, a database of all genes known to endow or enhance pathogenicity would be of significant value to a proliferator. How could such a database be shared with countries outside the Australia Group that could misuse such information? A database of sequences of concern has significant value to the intelligence and biosecurity community but is unsuited to undergird a global sequence screening regime, partly because of the power it affords.
The NTI/WEF report focused primarily on initial sequence screening, arguing that the development of a common DNA sequence screening mechanism would significantly reduce the costs of screening for DNA providers. However, the real cost of DNA screening is the human-intensive detective work of following up on flagged orders, done by people who are usually highly compensated because of their scientific training. More global discussion is needed on how modern tools such as machine learning and natural language processing could be used to automate much of this human detective work.
