A Clonal Evolution Simulator for Planning Somatic Evolution Studies

Abstract

Somatic evolution plays a key role in development, cell differentiation, and normal aging, but also in diseases such as cancer. Understanding mechanisms of somatic mutability and how they can vary between cell lineages will likely play a crucial role in biological discovery and medical applications. This need has led to a proliferation of new technologies for profiling single-cell variation, each with distinctive capabilities and limitations that can be leveraged alone or in combination with other technologies. The enormous space of options for assaying somatic variation, however, presents unsolved informatics problems in selecting optimal combinations of technologies and designing appropriate studies for a particular scientific question. Versatile simulation tools are needed to explore and optimize potential study designs if researchers are to deploy multiomic technologies most effectively. In this study, we present a simulator allowing for the generation of synthetic data from a wide range of clonal lineages, variant classes, and sequencing technology choices, intended to provide a platform for effective study design in somatic lineage analysis. Users can input various properties of the somatic evolutionary system, mutation classes, and biotechnology options, and then generate samples of synthetic sequence reads and their corresponding ground truth parameters for a given study design. We demonstrate the utility of the simulator for testing and optimizing study designs for various experimental queries.

1. INTRODUCTION

Advanced sequencing technologies have made it possible to profile genetic variation at the single-cell level on population scales, revealing in part that the human body is a continuously evolving genetic mosaic (García-Nieto et al., 2019; Abascal et al., 2021). Genetic and epigenetic modifications in somatic cells over many generations of cell growth and replication result in heterogeneity between cells, tissues, and organs in normal aging and development, as well as in disease conditions such as neurodegeneration and, most notably, cancer (Olafsson and Anderson, 2021). Accumulating genomic data has made it apparent that somatic mutability is much more complicated than early models of tumor clonal evolution first suggested (Coorens et al., 2021) and far more extensive in even healthy tissues (cf. Colom et al., 2021).

Somatic variation produces complex patterns of “mutational signatures” (Alexandrov et al., 2020) reflecting different endogenous and exogenous mechanisms of mutability. In cancers and precancerous conditions, high levels of somatic mutability are frequently observed due to damage to cell replication or error correction machinery (Salk et al., 2010). The resulting alterations may include not just single nucleotide variations (SNVs) but also potentially extensive copy number alterations (CNAs) and structural variations (SVs), including complex chromosomal rearrangement patterns and genome duplication events (Li et al., 2020).

As we have come to understand the extent and importance of somatic evolution, enormous effort has been put into developing biotechnological tools for profiling somatic variability at ever greater scales, precision, and accuracy (Ellis et al., 2021). No one technology is able to comprehensively characterize somatic variability across a complex tissue with precision and accuracy and at low cost. Rather, investigators attempting to characterize somatic variation processes now have available to them a vast array of technologies (for example, short read versus long read versus single-cell sequencing, liquid biopsy versus tissue biopsy, and whole genome versus whole exome versus targeted sequencing), each with distinct properties and tradeoffs (Slatko et al., 2018).

Current work increasingly depends on multiomic biotechnology combinations (long read and short read, single cell, whole genome and exome, etc.), along with various other involved study design choices (e.g., number of biopsy replicates), with uncertain knowledge of how these choices together with analysis software will influence one's ability to quantify any particular feature of the somatic evolution process (Koboldt, 2020). There is currently little empirical or theoretical basis on which an investigator planning a study can select a combination of technologies and study design well suited for any particular investigation.

Simulation presents a viable solution to these issues by allowing for efficient tests of various study designs with direct knowledge of most biological parameters of interest. Synthetic data have held a long tradition of use in computational biology, allowing for testing and algorithm design without the expense of carrying out an experimental protocol. One prominent example is the popular BAMSurgeon simulator (Ewing et al., 2015), which has been valuable in testing tumor variant calling algorithms.

However, current simulators fail to capture the broad range of hypermutability processes that occur in cancer cell populations, and often focus on one particular aspect of cancer evolution (e.g., copy number or spatial analysis). Furthermore, no simulator currently exists that allows for the exploration of widely varying study designs and multiomic technologies. For a comparison between features available in our simulator versus others, see Table 1.

Table 1.
A Feature Comparison Between Our Simulator and a Few Other Sequencing Simulators with Related Functions

	Our simulator	pSITE	CellCoal	BAMSurgeon
Lineage properties	✓	✓	✓	✘
Single cell	✓	✓	✓	✘
Mutational signatures	✓	✘	✓	✘
Variable read length	✓	✘	✘	✘
Variable error rate	✓	✘	✘	✘
SNP/indel variation	✓	✓	✓	✓
Resource usage	Medium	Medium	Low	Medium
Multiple sites/samples	✓	✘	✓	✘
Runtime length	Medium	High	Low	Medium
CNV	✓	✓	✘	✓
Rare structural variation	✓	✘	✘	✓
Whole genome reads	✓	✓	✘	✓
Whole exome reads	✓	✓	✘	✓
Targeted sequencing reads	✓	✘	✘	✓
Mutation frequencies	✓	✓	✓	✘
Mutation distributions	✓	✘	✓	✘
Clonal frequency parameters	✓	✘	✓	✘
Sequencer quality consideration	✘	✓	✓	✓
RNA sequencing	✘	✘	✘	✓
Liquid biopsy	✓	✘	✘	✘

Here, we seek to meet the needs of sequencing study design for somatic variation studies through a new clonal evolution simulator. Our simulator links a coalescent model of clonal evolution to a versatile model of read generation with user-configurable variant classes, mutation rates, evolutionary models, sequencing setups, and study design decisions. Our framework focuses on general properties of sequencing that allow for the design of better experiments and future sequencing technologies.

It also introduces a wide variety of features important to somatic variation studies that are not, to our knowledge, found in any other current simulator, such as capturing broad classes of complex SV that have been implicated in certain cancers. We demonstrate utility of this simulator through application to a series of hypothetical questions in testing and optimizing study design for somatic evolution studies.

2. MATERIALS AND METHODS

The complete simulator consists of four main modules: (1) sampling an evolutionary lineage tree for the clonal evolution, (2) sampling mutation events on the lineage tree, (3) simulating sequence reads, and (4) sampling reads based on experimental design decisions. Hereunder, we describe each module in turn. Each module has a number of user-tunable parameters to control different biological parameters of the presumed cell lineages as well as experimental parameters of the sequencing strategy. Additional information on mutation types and module implementations is provided in the Supplementary Data. The major tunable parameters are summarized in Table 3.

2.1. Lineage simulation and mutation events

For each simulation, we generate a cell lineage assuming that mutations are selectively neutral and generally follow the assumptions of the standard coalescent model (Nordborg, 2019). User-definable parameters include a total population size $(N_{e})$ , as well as a number of clones (k) to be sampled. Coalescent sampling of tree topologies and edge lengths is implemented using msprime (Kelleher et al., 2016). The unit of time we use is a generation, that is, the time of a single cell division. A single sample from this process represents a single-cell lineage tree, each cell of which is taken to be representative of some “clone” in the tree.

The simulator supports commonly discovered types of somatic variations, particularly those implicated in the development of cancer (Dentro et al., 2021). These currently include the following: SNVs, CNAs, insertions, deletions, kataegis, chromothripsis, chromoplexy, aneuploidy, translocations, inversions, and breakage fusion bridge cycles. Each mutation type is implemented by sampling from various probability distributions for location and length while simultaneously maintaining constraints encapsulating our knowledge of the mechanism for each mutation type. For a short description of these mutations, see Table 2.

Table 2.
Description of Available Mutation Types

Mutation type	Description
SNV	A single base is changed to a different nucleotide
Single base signature	A single base is mutated with frequency according to its trinucleotide context, termed a mutational signature
CNV	A random region of the genome is repeated a number of times
Inversion	A random region of the genome has its nucleotide sequence reversed
Deletion	A random region of the genome is removed
Translocation	A trailing portion of one chromosomal sequence is either added or exchanged with a trailing part of another chromosome sequence
Insertion	A random sequence of a given length is inserted at a random position in the genome
Aneuploidy	A chromosome is either deleted or copied a random number of times
Kataegis	Within a randomly sampled region of the genome, a particular type of single base substitution (e.g., C>T) occurs with high frequency
Chromothripsis	A region of a single chromosome is broken into many pieces, these pieces are deleted with a certain frequency, and finally rearranged randomly to replace the selection region
Chromoplexy	Multiple chromosomes are shattered into pieces and rearranged in an order consistent with telomere endings
Breakage fusion bridge	A chromosome loses a telomeric segment and proceeds to repeatedly break and fuse with itself to create new chromosomes

CNV, copy number variation; SNV, single nucleotide variation.

The scale of many of these forms of variations can be tuned, but with default values set based on estimated distributions of sizes found in current studies of cancer genomes (Li et al., 2020). Size distributions for SVs are modeled as a truncated mixture of negative binomials to represent small-, medium-, and large-scale events. Each mutation type also has a rate at which that mutation appears that we assume may differ between tumor stages and in healthy tissues per the “mutator phenotype” hypothesis (Loeb, 2001). The simulator also currently supports simulating mutations drawn from single base substitution signatures derived from the COSMIC data set (Forbes et al., 2010). Distributions over mutation size and location are also flexible.

Once a lineage is simulated, we apply mutations to this lineage going forward from the most recent common ancestor of all the clones to be sampled. Mutation rates for each class of variation are defined as a uniform discrete distribution of potential rates, M_i. For each edge of the lineage, a specific rate is generated through $r_{i} \sim M_{i}$ and mutation times are generated through a Poisson process, with rate r_i. This is done for each edge of the phylogeny independently for each class of mutation. The end result of this process is a list of times for each mutation type and its occurrence on each edge of the phylogeny.
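The per-edge sampling step can be illustrated with a minimal Python sketch. It is not the simulator's code: the function name `sample_edge_mutations`, the dictionary layout, and the use of rates already aggregated to events per generation (rather than per locus) are assumptions made for illustration. A rate is drawn uniformly from each class's list, and event times on the edge are generated from exponential waiting times, which is one standard way to realize a Poisson process.

```python
import random

def sample_edge_mutations(edge_length, rate_lists, rng):
    """Return {mutation class: increasing event times} for a single edge.

    edge_length: branch length in generations.
    rate_lists:  {class name: list of candidate rates}, here assumed to be
                 expressed in events per generation (illustrative units).
    """
    events = {}
    for mut_class, candidate_rates in rate_lists.items():
        rate = rng.choice(candidate_rates)   # r_i ~ M_i, uniform over the list
        times, t = [], 0.0
        while rate > 0:
            t += rng.expovariate(rate)       # exponential waiting time
            if t > edge_length:
                break
            times.append(t)
        events[mut_class] = times            # empty when the drawn rate is 0
    return events

rng = random.Random(0)
edge_events = sample_edge_mutations(
    100.0, {"SNV": [0.5, 1.0], "deletion": [0.0]}, rng
)
```

Each class is sampled independently, matching the description above; times are measured from the start of the edge.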

Next, the simulator imposes each of these mutations on a reference genome to establish the sequences of clones at all nodes of the lineage tree. Because mutations were simulated independently for each mutational class, we first merge and sort the mutational events by time. We then compute all root-to-leaf paths in the tree, which allows us to generate all potential clonal genomes. We start with the root node as the reference and impose the sorted mutations along each root-to-leaf path. The end result of this process is a stored genomic sequence for each clone, including those at internal nodes, which is later sampled in the sequencing step.
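A small Python sketch of the path enumeration and merge step follows. The tree encoding (`children` as a dictionary of child lists), the keying of events by the child node of each edge, and the function names are all hypothetical; event times are assumed to be absolute times measured forward from the root.

```python
def root_to_leaf_paths(children, root):
    """Enumerate every root-to-leaf path of a tree given as {node: [children]}."""
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:
            paths.append(path)           # reached a leaf
        for kid in kids:
            stack.append(path + [kid])
    return paths

def merged_events(path, edge_events):
    """Merge all mutation classes along one path into a single time-sorted list.

    edge_events: {child node: {class name: [absolute event times]}}, where each
    key identifies the edge leading into that node (an assumed layout).
    """
    merged = [
        (t, mut_class, node)
        for node in path[1:]             # the root has no incoming edge
        for mut_class, times in edge_events.get(node, {}).items()
        for t in times
    ]
    return sorted(merged)

children = {"root": ["A", "B"], "A": ["A1", "A2"]}
events = {"A": {"SNV": [3.0, 1.0], "deletion": [2.0]}, "A1": {"SNV": [5.0]}}
paths = sorted(root_to_leaf_paths(children, "root"))
ordered = merged_events(["root", "A", "A1"], events)
```

Applying the sorted list in order to a copy of the reference would then yield the genome of each node on the path.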

Pseudocode for this procedure is shown in Algorithms 1 and 2.

2.2. Sequencing implementation and experimental decisions

Sequencing procedures differ depending on the type of sequencing chosen (e.g., whole genome sequencing or targeted sequencing). The general strategy, however, is similar. First, a clone from the tree is sampled from a Dirichlet process with distribution $g(k)$ and concentration parameter $\alpha$, that is, $k \sim G \sim \mathrm{DP}(g(k), \alpha)$. This allows for a flexible, and potentially skewed, distribution of the sample clones, as may occur in a biopsy sample. The sampled clone may be either a leaf node or an internal node, but not the root node. For this selected clone, stochastic fragment lengths are drawn according to the read length, that is, $fl \sim \mathrm{TruncatedNegBin}(rl)$.
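As an illustration of the clone sampling step, the sketch below relies on a standard property of the Dirichlet process: when the base distribution $g(k)$ is supported on a finite set of clones, the random clone frequencies follow a Dirichlet distribution with parameters $\alpha\, g(k)$. The function names and the example values are assumptions for illustration, not the simulator's interface.

```python
import random

def sample_clone_frequencies(base_probs, alpha, rng):
    """Draw clone frequencies G ~ Dirichlet(alpha * g(k)) via normalized gammas.

    base_probs: baseline clonal frequencies g(k), all strictly positive.
    alpha:      concentration parameter (> 0).
    """
    draws = [rng.gammavariate(alpha * g, 1.0) for g in base_probs]
    total = sum(draws)
    if total == 0.0:
        # Numerical underflow at very small alpha: put all mass on one clone,
        # chosen according to the baseline distribution.
        winner = rng.choices(range(len(base_probs)), weights=base_probs)[0]
        return [1.0 if i == winner else 0.0 for i in range(len(base_probs))]
    return [d / total for d in draws]

def sample_clone(frequencies, rng):
    """Pick the index of the clone to sequence next, k ~ G."""
    return rng.choices(range(len(frequencies)), weights=frequencies)[0]

rng = random.Random(1)
even = sample_clone_frequencies([0.5, 0.3, 0.2], alpha=1000.0, rng=rng)
skewed = sample_clone_frequencies([0.5, 0.3, 0.2], alpha=0.5, rng=rng)
clone_index = sample_clone(skewed, rng)
```

With a large $\alpha$ the drawn frequencies stay close to the baseline, whereas a small $\alpha$ tends to concentrate most of the sample on one or two clones.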

Using these fragment lengths, operations are performed on fragmented clonal reads defined by the user parameters. Specifically, clone k is loaded, “chopped” according to fragment length, subsetted based on read length and paired sequencing type, seeded with errors according to the error rate, and written to a FASTQ file, a standardized genomic file format developed by the Wellcome Sanger Institute. After this particular clone is sequenced, the coverage of the simulator is updated according to the fraction of the genome covered depending on the read and fragment length. This process repeats with independent repeats of each stochastic process until the desired coverage is reached.
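The loop described above can be sketched in a few lines of Python. This is a simplified single-end illustration under stated assumptions: fragment lengths are drawn from a uniform stand-in rather than the truncated negative binomial, errors are modeled as substitutions only, every base receives a fixed placeholder quality character, and the function names are hypothetical.

```python
import random

def simulate_reads(genome, read_len, error_rate, target_cov, rng):
    """Chop a clone genome into fragments and emit reads until coverage is met."""
    if len(genome) < read_len:
        raise ValueError("genome must be at least one read long")
    reads, bases_read = [], 0
    while bases_read / len(genome) < target_cov:
        pos = 0
        while pos < len(genome):
            # Stand-in for the truncated negative binomial fragment length.
            frag_len = rng.randint(read_len, 3 * read_len)
            fragment = genome[pos:pos + frag_len]
            pos += frag_len
            if len(fragment) < read_len:
                continue                      # trailing piece too short to read
            read = "".join(
                rng.choice([b for b in "ACGT" if b != base])
                if rng.random() < error_rate else base
                for base in fragment[:read_len]
            )
            reads.append(read)
            bases_read += read_len            # coverage update after each read
    return reads

def fastq_record(name, seq):
    """Format one read as a four-line FASTQ record with placeholder qualities."""
    return f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n"

rng = random.Random(7)
genome = "ACGT" * 250
reads = simulate_reads(genome, read_len=50, error_rate=0.0, target_cov=2.0, rng=rng)
record = fastq_record("read_0", reads[0])
```

In this sketch coverage is checked after each full pass over the genome, so the final coverage can slightly exceed the target.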

The process for exome and targeted sequencing additionally requires identifying reads that align to exon or targeted gene sequences. This approximate string matching problem is computationally infeasible over every possible read, so we make a few simplifying assumptions. First, we create k-mer sets for each of our target sequences and index these sequences for fast lookup using locality-sensitive min-hashing (Rajaraman and Ullman, 2011). In this technique, a min-hash function generates a lower-dimensional representation of the genomic sequence, and a locality-sensitive hashing algorithm quickly determines the match probability. Then, for genome locations surrounding our target sequence intervals, we calculate whether reads originating from these genome locations match some target sequence in the hashing index above a certain threshold. If a location matches some sequence, we add it to a list of locations.
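To make the min-hash idea concrete, here is a small standard-library sketch that builds k-mer sets, computes a min-hash signature, and estimates the Jaccard similarity between two sequences from their signatures. It is not the simulator's index: the banding step of locality-sensitive hashing is omitted, and the function names, the choice of BLAKE2b as the hash family, and the values of `k` and `num_hashes` are assumptions.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(kmer_set, num_hashes=64):
    """Min-hash signature: the minimum of each salted hash over a non-empty set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")     # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for km in kmer_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of set similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

target = "ACGTTGCAAGCTAGCTAGGATCCGATCGATCGTAGCTAGCTAGGCTA"
sig_target = minhash_signature(kmers(target, 8))
same = estimated_jaccard(sig_target, minhash_signature(kmers(target, 8)))
different = estimated_jaccard(
    minhash_signature(kmers("A" * 40, 8)), minhash_signature(kmers("C" * 40, 8))
)
```

A candidate read location would be accepted when its estimated similarity to some target sequence exceeds the chosen threshold.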

We repeat this sampling G_N times, where G_N is a user-set parameter, and use the resulting list combined with the original sequence locations to generate a discrete probability density from which to sample read locations. We sample reads from this generated list with the given read length while simultaneously sampling different clones until our desired coverage is reached. Pseudocode for genome and exome sequencing is shown in Algorithms 3 and 4.

Modifications also occur for single-cell sequencing, where we do not continually resample cell clones, but instead sample only once and use the given clone until the desired coverage is reached. Finally, for liquid biopsy, we perform a similar iterative clonal sampling procedure, but do not chop the sequence uniformly. Rather, we draw random fragments from the genome strings of each clone, mix them at some frequency with reference DNA, and then return these in a read file.

The full simulation is defined by looping the lineage, mutation, and sequence sampling over tumors and samples. Specifically, we independently define and execute a separate lineage sampling and mutational frequency for the number of user-defined tumors generated. Similarly, we configure parameters related to sequencing decisions, and execute the procedures already listed according to the number of user-defined samples requested. The final result of the simulator is a labeled directory with subdirectories corresponding to reference reads, tumor reads, and sample reads as shown in Figure 1.

FIG. 1.

Example simulator output and directory structure for a single run.

Each of these directories holds ground truth parameters with information about the tumor and sequencing parameters. Although we described the simulation method for singular values of the parameters listed in Table 3, the practical implementation of the simulator encodes most of the parameters as a list of values and samples from each list to generate the total simulation. This random sampling procedure allows for grid search exploration of study design spaces more easily. We refer to each parameter with a subscript to denote the sample number. For example, n_i would denote the number of single cells on the ith sample. In cases where we survey multiple sites of somatic evolution, we use a double subscript to remove ambiguity. For example, $c_{i j}$ would denote the coverage of the jth sample on the ith tumor.
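The list-based parameter encoding can be pictured with the following sketch, in which each parameter is a list of candidate values and one value is drawn per list to define a single simulated sample. The parameter names and values are hypothetical placeholders and do not reflect the simulator's actual parameter file.

```python
import random

def sample_design(param_lists, rng):
    """Draw one value uniformly from each parameter's list of candidates."""
    return {name: rng.choice(values) for name, values in param_lists.items()}

# Hypothetical parameter lists; a single-element list fixes that parameter.
param_lists = {
    "read_length": [125, 150, 10000],
    "coverage": [1, 5, 15, 30],
    "error_rate": [0.0, 0.001, 0.01],
    "paired": [True],
}

rng = random.Random(42)
designs = [sample_design(param_lists, rng) for _ in range(5)]
```

Fixing a parameter to a one-element list while leaving the others broad gives the kind of partial search over study designs described above.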

Table 3.

Summary of the Main Tunable Parameters

Parameter	Symbol	Units	Description
Effective population size	N_e	Cells	Total cellular population of the region to be sampled. Impacts coalescent times
Number of clones	k	Clones	Number of distinct genetic somatic cell populations to be sampled
Mutation rate lists	M_i	Mutation events per locus per cell division	Lists for each variant class, defining rates per locus per cell division
Mutation size and location distributions	$S, z (x)$	Number of bases, none	Each mutation type can be tuned over size distributions and single base substitutions can be tuned over signature distributions
Number of tumors	t	Tumors	Number of distinct sites of somatic evolution to be sampled
Number of samples	s	Samples	Number of distinct tissue biopsies (regions) to be drawn from each “tumor” site
Read length	rl	Bases	Size of reads to be generated from sequencer
Fragment length	fl	Bases	Defines a superstring from which reads are derived during the sequencing process
Depth/coverage	c	Reads per base	Average number of times each nucleotide of the genome is sequenced
Error rate	e	Fraction of incorrectly sequenced bases	Rate of incorrectly sequenced nucleotides
Dirichlet concentration, clonal frequency distribution	$α$ , $g (k)$	None, none	Parameter of a Dirichlet process that is used to derive the concentration of the baseline distribution in a sample. A high value will lead to approximately uniform sampling of clones during sequencing, whereas a low value would favor very uneven clonal frequencies. The clonal frequency distribution is the baseline distribution at high $α$
Number of single cells	n	Cells	Number of individually sequenced cells per sample
Paired end/single end	Paired	Boolean	Binary parameter describing whether the reads are paired end or single end. Paired-end reads have two related reads derived from the same fragment
Whole genome/whole exome/targeted sequencing	Genome	Boolean	Binary parameter describing whether the sequencer extracts genes from the entire genome or only a subset of the genome
Liquid biopsy	LiquidBiopsy	Boolean	True or false parameter to produce liquid biopsy-sequenced reads
Sampling number	G_N	Genome positions	Number of random positions on the genome to sample to build a hash table for approximate string matching for exon/targeted sequencing

2.3. Runtime, space, and resource analysis

We define $c_{ij}$ as the sequencing coverage of the jth sample on tumor t_i, $n_{ij}$ as the number of single-cell samples drawn on the jth sample of tumor t_i, and s_i as the number of samples drawn on tumor t_i. The run time is then approximately proportional to $O(\sum_{i=1}^{t} \sum_{j=1}^{s_{i}} c_{ij} (1 + n_{ij}))$, which approximates the number of times we traverse and sequence the genome.

The maximum memory usage is constrained by the genome size, read length, and sequencing type. In general, this amounts to roughly 5–10× the size of the genome in standard sequencing settings. To minimize the memory footprint, the user should set the batch and subblock size to 1, or avoid generating long-read exome-sequenced data. The disk storage used by the program is bounded by the storage of the clonal genomes during the mutational process as well as the sequencing FASTQ files. This is approximately proportional to $O([\sum_{i=1}^{t} k_{i} + \sum_{i=1}^{t} \sum_{j=1}^{s_{i}} c_{ij} (1 + n_{ij})] \times d)$, where d is the genome size, k_i denotes the number of clones in tumor t_i, and $c_{ij}$ and $n_{ij}$ are defined as already mentioned.
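The two scaling expressions can be evaluated directly for a candidate design, which may help when budgeting compute before a run. The sketch below returns the unitless quantities inside the $O(\cdot)$ terms, not wall-clock time or bytes; the function names and the nested-list encoding of a design are assumptions.

```python
def sequencing_passes(design):
    """Sum of c_ij * (1 + n_ij) over all tumors i and samples j.

    design: one list per tumor, each holding (coverage, n_single_cells)
    tuples, one tuple per sample.
    """
    return sum(c * (1 + n) for tumor in design for c, n in tumor)

def disk_units(design, clones_per_tumor, genome_size):
    """(sum of k_i + sequencing passes) * d, the disk-usage scaling term."""
    return (sum(clones_per_tumor) + sequencing_passes(design)) * genome_size

# One tumor with three 30x bulk samples and no single cells.
bulk_design = [[(30, 0), (30, 0), (30, 0)]]
passes = sequencing_passes(bulk_design)
# Two tumors, the second with one 10x sample plus two single cells.
mixed_passes = sequencing_passes([[(30, 0)], [(10, 2)]])
units = disk_units(bulk_design, clones_per_tumor=[5], genome_size=3)
```

Comparing these quantities between two designs gives a rough ratio of their expected cost, under the proportionality stated above.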

The code for the simulator is implemented in Python, and a single run of the simulator involves changing the parameter file and running the Python command. A single run of the simulator only uses one core, allowing for parallelization across cores or nodes, provided the system has enough memory. All our tests were performed on a multinode Ubuntu system with 184 cores, 850 GB of memory, and 10 TB of storage. As a reference point, three 30× paired-end whole genome sequenced (WGS) samples take ∼3.5 hours to generate on our system and use ∼250–300 GB of disk storage and <40 GB of memory.

2.4. Simulator usage

The simulation process already described has a number of tunable biological and experimental parameters (detailed in Table 3). The main biological parameters include mutation rate, number of clones, and number of tumors. The main experimental parameters include the number of samples and various parameters describing the sequencing modalities. Each major tunable parameter is encoded as a list, which is randomly sampled, allowing for the user to vary both the biology of the tissue and the different experimental setups. The intended usage of the simulator revolves around testing the limitations of various study design paradigms against different somatic evolution instances.

For example, if one were interested in finding a study design for testing presumably “healthy” tissue for mutation burden, he or she might fix the mutation rate list to a low rate and replicate a large number of tumors as well as search over a broad experimental design space. In contrast, if one were interested in the limitations of a particular study design (for instance, 30× WGS), he or she might replicate a broad range of tumors with various mutation rates and clone numbers while fixing the experimental strategy.

3. EXPERIMENTAL RESULTS

In this section, we demonstrate the utility of our simulator in evaluating or optimizing sequencing study designs for profiling clonal evolution. A primary motivation of this simulator is to plan study designs for evaluating questions about differences in somatic mutability between subsets of samples. The questions for study design are to test whether a difference can be detected between two subsets of samples under a given study design or to find a study design optimizing power to detect such differences. Here, study design might include changes in the types of sequencing applied, the informatics software used to evaluate it, and features such as the number of tumors and tumor sites or regions to be examined.

We would then assume that the study is being used to test for differences in biological parameters between subsets of samples. Such biological parameters might include mutation rate differences, presence of rare variations, variation in mutational signatures, phylogeny structures, or clonal frequencies.

3.1. Notation and performance measures

For the analyses presented here, we take the sequencing read outputs from our simulator and perform alignment to the hg38 reference genome, after which we call several forms of variation for analysis. The aligners used were minimap2 (Li, 2018), Bowtie (Langmead and Salzberg, 2012), and bwa-mem (Li, 2013); the callers used were Strelka (Kim et al., 2018) and Delly (Rausch et al., 2012). Throughout the tests, we reference our study design, which we formally define as a collection of matrices $X = \{X_{1}, \dots, X_{t}\}$, where t denotes the number of tumors.

Matrices X_i denote the sequencing decisions taken on tumor i and encapsulate every sample. Namely, each matrix X_i is of dimension $7 \times s_{i}$, where s_i denotes the number of samples taken on tumor i. The columns of X_i denote sequencing and informatics choices on a single sample. That is, the matrix is of the form:

$x_{i} = \begin{pmatrix} rl_{i}/fl_{i} \\ c_{i} \\ 1 - e_{i} \\ n_{i} \\ \mathrm{Paired}_{i} \\ \mathrm{Genome}_{i} \\ \mathrm{Informatics}_{i} \end{pmatrix}, \qquad X_{i} = \begin{bmatrix} | & | & | \\ \cdots & x_{i} & \cdots \\ | & | & | \end{bmatrix}. \qquad (1)$

Vector $x_{i}$ contains the variables defined in Table 3, along with a variable to encode informatics options. In our experimental tests, we consider the single tumor, single sample case. With this instance, we can collapse the collection of matrices down to a single vector $x$.

To judge a study design's utility, we require performance measures that assess whether the design is recovering a signal related to the intended hypothesis. Judging the performance of variant caller outputs is a challenging problem in its own right, especially in the case of highly mutated and heterogeneous somatic samples. Temporally layered mutation events add difficulty to pinpointing the original source of variation. This difficulty is compounded in the case of SV, where we have neither a standardized labeling scheme nor a way to trace back the location of an event with respect to the original genome.

However, we aimed to create measures that correlated with overall performance, particularly on the types of somatic populations we would expect in cancer patients. We use recall as the metric for highly mutated SNV samples and F1 score for less heavily mutated SNV samples. For structural variants, we use the measure described in the following text. Call the variant output locations for chromosome i, $C_{i} = \{(a, b)\}$, and call our ground truth set of structural variant locations, $D_{i} = \{(c, d)\}$. The measure is then defined as follows:

$J = \sum_{i} \frac{|C_{i} + D_{i}|}{\sum_{i'} |C_{i'} + D_{i'}|} \sum_{j \in D_{i}} \frac{\mathbb{1}(|j \cap C_{i}| > 0)}{|D_{i}|}$.

In this measure, we assign a score by checking whether each ground truth event overlaps with some called event (this is our rough estimate of a “correct” call). The score for each chromosome (the inner sum) is then the fraction of ground truth events that have been overlapped. The chromosome scores are combined into the total score as a weighted average, with each weight proportional to the total number of variant events on that chromosome. By judging the structural variant callers on overlap rather than precise matching, we are able to bypass some of the labeling fidelity problems mentioned previously. Finally, all of these measures take values in the unit interval [0, 1], which allows for straightforward construction of more nuanced measures.
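A compact Python sketch of the overlap measure J is given below. It makes two interpretive assumptions that the text does not spell out: the weight term $|C_{i} + D_{i}|$ is taken to be the number of called events plus the number of ground truth events on chromosome i, and a chromosome with no ground truth events contributes zero. Intervals are treated as closed, and the function names are illustrative.

```python
def intervals_overlap(a, b):
    """True when closed intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def sv_overlap_score(called, truth):
    """Weighted fraction of ground truth SV events overlapped by some call.

    called, truth: {chromosome: [(start, end), ...]}.
    """
    chroms = set(called) | set(truth)
    total = sum(len(called.get(c, [])) + len(truth.get(c, [])) for c in chroms)
    if total == 0:
        return 0.0
    score = 0.0
    for c in chroms:
        calls, events = called.get(c, []), truth.get(c, [])
        if not events:
            continue                      # assumed: no truth events, no credit
        hit_fraction = sum(
            any(intervals_overlap(e, x) for x in calls) for e in events
        ) / len(events)
        weight = (len(calls) + len(events)) / total
        score += weight * hit_fraction
    return score

called = {"chr1": [(100, 200), (500, 600)], "chr2": [(10, 20)]}
truth = {"chr1": [(150, 250), (900, 1000)], "chr2": [(15, 30)]}
score = sv_overlap_score(called, truth)   # chr1: 4/6 * 1/2, chr2: 2/6 * 1
perfect = sv_overlap_score(truth, truth)
```

In the example, one of the two chr1 events and the single chr2 event are recovered, giving a score of 2/3.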

3.2. Statistical test for mutation rate variation

We first evaluate whether a given study design would be able to detect a difference in SNV mutation rates between tumors. A motivating hypothesis for these tests is the idea that cancerous and precancerous tissues should typically exhibit hypermutability phenotypes, that is, elevated rates of particular kinds of variation that lead to genetic heterogeneity across cells. We would then wish to detect whether a specific study design would be powered to detect a hypothetical difference in mutation rate between two samples indicative of a hypermutability phenotype specific to one sample.

We assume that we have two independently sampled tissues, and we wish to determine whether the mutation rates of various mutation classes are different between these tissues. To create a specific scenario, we generated two sets of SNV data, one where the average rate was high and the other where it was comparatively low (denote the rates as $\lambda_{1} \approx 10^{-8}$, $\lambda_{2} \approx 10^{-10}$ mutations per nucleotide per generation). We specifically tested whether a 30× WGS screen on both tissues would be statistically powered to detect the mutation rate difference.

All other parameters were kept equal at reasonable values (0 error rate, paired-end sequencing, 0 single cells, with 1 sample per tissue). We assume that the first tissue has mutation count generated as $M_{1} \sim \mathrm{Poisson}(\lambda_{1} t_{1})$ and the second tissue has mutation count generated as $M_{2} \sim \mathrm{Poisson}(\lambda_{2} t_{2})$. We can then test for a rate difference using the following null and alternative hypotheses: $H_{0} : \frac{\lambda_{1}}{\lambda_{2}} \leq 1$ and $H_{1} : \frac{\lambda_{1}}{\lambda_{2}} > 1$. In our setting, we have an estimate of the mutation counts $M_{1}$ and $M_{2}$ from the output of our variant calling software, and we estimate $t_{1}$ and $t_{2}$ as biologically plausible tumor formation times as described hereunder.

Several statistics can be used to establish p values for a two-sample Poisson rate test (cf. Gu et al., 2008), but here we favor a conditional test based on the fact that the conditional distribution of $M_{1}$ given $M_{1} + M_{2}$ is binomial. Under the assumptions of the problem, the p value, which we denote $p(t_{1}, t_{2})$ to emphasize its dependence on each time, is:

$p(t_{1}, t_{2}) = 2 \min\left( P\left(X_{1} \geq M_{1} \mid n = M_{1} + M_{2},\ p = \frac{t_{1}/t_{2}}{1 + t_{1}/t_{2}}\right),\ P\left(X_{1} \leq M_{1} \mid n = M_{1} + M_{2},\ p = \frac{t_{1}/t_{2}}{1 + t_{1}/t_{2}}\right) \right)$.

Both probabilities are easily computed using packages that provide the binomial cumulative distribution function (cdf).
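For readers who want the step behind the binomial claim, a short derivation is sketched here. It assumes that $M_{1}$ and $M_{2}$ are independent, and the symbols $\mu_{i} = \lambda_{i} t_{i}$, $m$, and $n$ are introduced only for this sketch.

```latex
\begin{aligned}
P(M_1 = m \mid M_1 + M_2 = n)
  &= \frac{P(M_1 = m)\, P(M_2 = n - m)}{P(M_1 + M_2 = n)} \\
  &= \frac{\dfrac{e^{-\mu_1}\mu_1^{m}}{m!} \cdot
           \dfrac{e^{-\mu_2}\mu_2^{\,n-m}}{(n-m)!}}
          {\dfrac{e^{-(\mu_1+\mu_2)}(\mu_1+\mu_2)^{n}}{n!}} \\
  &= \binom{n}{m}
     \left(\frac{\mu_1}{\mu_1+\mu_2}\right)^{m}
     \left(\frac{\mu_2}{\mu_1+\mu_2}\right)^{n-m}.
\end{aligned}
```

The second line uses the fact that a sum of independent Poisson variables is Poisson with rate $\mu_{1} + \mu_{2}$. At the boundary of the null hypothesis, $\lambda_{1} = \lambda_{2}$, the success probability reduces to $\frac{t_{1}}{t_{1} + t_{2}} = \frac{t_{1}/t_{2}}{1 + t_{1}/t_{2}}$, which depends only on the two times.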

In our empirical test case, 244 mutations were called in the higher SNV rate data set and 2 were called in the lower SNV rate data set. Times $t_{1}$ and $t_{2}$ are estimated, with uncertainty, as depths of somatic lineages that could plausibly lead to a tumor; we model them as random variables $t_{1}, t_{2} \overset{\mathrm{ind.}}{\sim} \mathrm{Unif}[1, 30]$ years. Current evidence suggests that the time from a normal cell to clinical cancer sequencing could be as long as many decades, although studies charting somatic evolution in healthy and precancerous tissues are still in their nascence (Jolly and Van Loo, 2018). To generate a p value, we approximate the expected p value with respect to the times through a bootstrap procedure, that is, $\mathbb{E}_{t_{1}, t_{2}}[p(t_{1}, t_{2})] = \frac{1}{29^{2}} \int_{1}^{30} \int_{1}^{30} p(t_{1}, t_{2}) \, dt_{1} \, dt_{2} \approx \frac{1}{N} \sum_{i \in [N]} p(t_{i1}, t_{i2})$, where $1/29^{2}$ is the joint density of the two uniform times and $(t_{i1}, t_{i2})$ is the $i$th pair of sampled times.

In our test case, we took 500 random draws of times $t_{1}, t_{2}$ and computed an average p value of $6.9 \times 10^{-7}$ with variance $1.4 \times 10^{-10}$. Therefore, we can conclude in this test case that the given study design should be powered to provide a strongly significant detection of the given rate variation.
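The conditional test and the averaging over tumor formation times described above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library; it is not the code used to produce the reported values. The function names (`binom_tail_probs`, `conditional_p_value`, `expected_p_value`) and the fixed random seed are assumptions introduced here.

```python
import math
import random


def binom_logpmf(k, n, p):
    """Log-probability of k successes for X ~ Binomial(n, p), with 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def binom_tail_probs(k, n, p):
    """Return (P(X <= k), P(X >= k)) by direct summation of the pmf."""
    pmf = [math.exp(binom_logpmf(j, n, p)) for j in range(n + 1)]
    lower = min(1.0, sum(pmf[: k + 1]))
    upper = min(1.0, sum(pmf[k:]))
    return lower, upper


def conditional_p_value(m1, m2, t1, t2):
    """Doubled smaller tail of M1 given M1 + M2, as in the expression above."""
    n = m1 + m2
    p = t1 / (t1 + t2)  # identical to (t1/t2) / (1 + t1/t2)
    lower, upper = binom_tail_probs(m1, n, p)
    return min(1.0, 2.0 * min(lower, upper))


def expected_p_value(m1, m2, n_draws=500, t_min=1.0, t_max=30.0, seed=0):
    """Average p(t1, t2) over independent uniform draws of the two times."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    values = []
    for _ in range(n_draws):
        t1 = rng.uniform(t_min, t_max)
        t2 = rng.uniform(t_min, t_max)
        values.append(conditional_p_value(m1, m2, t1, t2))
    mean = sum(values) / n_draws
    variance = sum((v - mean) ** 2 for v in values) / (n_draws - 1)
    return mean, variance


if __name__ == "__main__":
    mean_p, var_p = expected_p_value(244, 2)
    print(f"mean p value = {mean_p:.3g}, variance = {var_p:.3g}")
```

The result depends on the random draws of $t_{1}$ and $t_{2}$ and on the number of draws, so this sketch should not be expected to reproduce the reported average exactly.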

3.3. Evaluating study parameter choices

A more involved use of the simulator would be to test a range of study designs and identify those powered to detect a hypothesized effect. To provide a concrete example of such a study design question, we first evaluate how varying coverage would change our ability to call SNVs. To do this, we generated eight sets of data, each with an average SNV rate of ∼$10^{-8}$ and a lower rate of ∼$10^{-10}$ for deletion and inversion variants. The read lengths for the data set were fixed at 125, the error rate at 0, zero single cells were generated, one tumor was generated with eight samples from that tumor, and paired-end whole genome sequencing was performed.

All biological parameters were fixed. The depth of coverage parameter was sampled from the set $\{1, 2, 5, 10, 15, 25, 30\}$ for each of the eight samples. The F1 score for variant calling as a function of depth of coverage is shown in Figure 2, with the main takeaway being that coverage below 15× has a significant negative effect on our ability to call SNVs, whereas only incremental gains are seen above 15× coverage.

FIG. 2.

F1 score for single nucleotide variation calling accuracy on simulated bulk sequencing data as a function of depth of coverage.
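For reference, the F1 score used in this comparison can be computed from a called variant set and the simulator's ground truth as sketched below. The representation of a variant as a `(chromosome, position, ref, alt)` tuple and the exact-match criterion are assumptions made for illustration; the actual comparison may normalize or match variants differently.

```python
def precision_recall_f1(called, truth):
    """Precision, recall, and F1 of a call set against a ground-truth set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical variants): two of three calls are correct,
# and two of four true variants are recovered.
truth = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr3", 900, "T", "G"),
}
called = {
    ("chr1", 100, "A", "T"),
    ("chr2", 75, "C", "T"),
    ("chr5", 10, "G", "A"),
}
precision, recall, f1 = precision_recall_f1(called, truth)
```

In this toy case the precision is 2/3, the recall is 1/2, and the F1 score is 4/7.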

3.4. Optimizing a study design

The most involved intended use of the simulator is to optimize a study design to evaluate a particular hypothesis about somatic variability. The presumed goal is to design a study that is optimally powered to detect the hypothesized signal within available resource constraints. Here, we demonstrate the use of our simulator to find an approximately optimal study design for a particular hypothesis. This goal can be framed as the following optimization:

$\arg\min_{x} \mathbb{E}_{q \sim Q} [c(x) + \lambda L_{q}(x)]$ (2)

$\text{s.t.} \quad d(x) \leq b$ (3)

$x \geq 0.$ (4)

Here $x$ is the study design vector defined in Equation $(1)$ . $c (x)$ is a cost function for a particular study design, $d (x)$ is a budget function that assigns a resource usage to a study design, and $b$ is a maximal budget vector. We also define a loss function $L_{q} (x)$ , which describes the error with which the study design answers the question we pose. The subscript q and the expectation term over the distribution Q are used to emphasize that instances of our simulator are evaluated on a biological parameter vector q drawn from a stochastic high-dimensional biological parameter distribution Q.

The expectation can be approximated by Monte Carlo methods, where a single study design, $x$, is evaluated on multiple instances $q_{i}$ and the results are averaged, that is, $\frac{1}{N} \sum_{i=1}^{N} [c(x) + \lambda L_{q_{i}}(x)]$.
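A minimal sketch of this Monte Carlo approximation is given below. The callables `cost_fn`, `loss_fn`, and `sample_q` are hypothetical placeholders: in practice `sample_q` would draw a biological parameter vector and `loss_fn` would run the simulator and variant callers, whereas the toy versions here only make the example runnable.

```python
import random


def monte_carlo_objective(x, cost_fn, loss_fn, sample_q, lam=1.0, n_draws=15, seed=0):
    """Approximate E_q[c(x) + lam * L_q(x)] by averaging over sampled q_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        q = sample_q(rng)  # one biological instance q_i drawn from Q
        total += cost_fn(x) + lam * loss_fn(x, q)
    return total / n_draws


# Toy stand-ins, purely illustrative and not part of the simulator.
def toy_sample_q(rng):
    return {"snv_rate": 10 ** rng.uniform(-10, -8)}


def toy_cost(x):
    return 0.0


def toy_loss(x, q):
    # Arbitrary toy rule: the loss shrinks as coverage grows.
    return 1.0 / (1.0 + x["coverage"])


value = monte_carlo_objective(
    {"coverage": 9}, toy_cost, toy_loss, toy_sample_q, lam=2.0, n_draws=5
)
```

Because each evaluation of the real loss requires a full simulation and variant-calling run, the number of draws per design is the main driver of computational cost.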

We assume that our study design variable $x$ can vary only in numerical read length, coverage, error rate, and a binary decision of whole genome versus whole exome sequencing. We assume the study is meant to recover inversions, deletions, and SNVs in a sample, so we define a scoring function intended to provide a balanced measure of performance at these tasks:

$\mathrm{Score} = 1 - L_{q}(x) = 0.4 \times \left[\frac{3\,\mathrm{Recall}_{snv} + \mathrm{F1}_{snv}}{4}\right] + 0.4 \times \left[\frac{J_{del} + J_{inv}}{2}\right] + 0.2 \times \mathbf{1}(\mathrm{Recall}_{snv} > 0.1 \land J_{del} > 0.1 \land J_{inv} > 0.1)$.
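The scoring function can be written directly as a small Python function, shown here as a sketch. The argument names are introduced for illustration, and the sketch assumes that $J_{del}$ and $J_{inv}$ are scalar accuracy measures in $[0, 1]$ for deletions and inversions.

```python
def design_score(recall_snv, f1_snv, j_del, j_inv, threshold=0.1):
    """Weighted score: 40% SNV term, 40% SV term, 20% all-classes-recovered bonus."""
    snv_term = (3.0 * recall_snv + f1_snv) / 4.0
    sv_term = (j_del + j_inv) / 2.0
    # Indicator: every variant class must clear the minimum threshold.
    all_recovered = (
        recall_snv > threshold and j_del > threshold and j_inv > threshold
    )
    bonus = 1.0 if all_recovered else 0.0
    return 0.4 * snv_term + 0.4 * sv_term + 0.2 * bonus


# Example: good SNV performance, but inversions are essentially missed,
# so the 0.2 bonus is not awarded.
example_score = design_score(recall_snv=0.8, f1_snv=0.4, j_del=0.5, j_inv=0.05)
```

The indicator term rewards designs that recover at least some signal in every variant class, so a design that excels at SNVs but misses one SV class entirely is capped at a score of 0.8.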

We fix our callers as Strelka and Delly, fix most biological parameters, and fix our number of samples at 1. We assume that all study designs have fixed cost $c(x) = 0$ and constrain our study design as follows: $rl \in [75, 2000]$; $c \in [2, 30]$; $e \in [0, 0.001]$; $\mathbf{1}(\mathrm{Paired}) \in \{\mathrm{True}\}$; $\mathbf{1}(\mathrm{Genome}) \in \{\mathrm{True}, \mathrm{False}\}$; $n \in \{0\}$. We replicated tumors independently with fixed biological parameters. For evaluation, ∼75 study design vectors $x$ were generated and evaluated from ∼15 tumors. With additional computational time and resources, a user might consider generating a larger number of tumor replicates, or exploring a larger study design space. The score function was averaged across all tumors with respect to study design to generate a final score for each study design.
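One way to explore this constrained design space is to sample candidate designs within the stated bounds and rank them by their score averaged over tumor replicates, as sketched below. The `StudyDesign` class, its field names, the uniform sampling scheme, and the `score_fn` callable are all assumptions made for illustration; the text does not specify how the candidate designs were drawn.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyDesign:
    read_length: int      # rl in [75, 2000]
    coverage: int         # c in [2, 30]
    error_rate: float     # e in [0, 0.001]
    paired: bool          # fixed to True
    whole_genome: bool    # True = genome, False = exome
    n_single_cells: int   # fixed to 0


def sample_design(rng):
    """Draw one design uniformly within the constraints (an assumed scheme)."""
    return StudyDesign(
        read_length=rng.randint(75, 2000),
        coverage=rng.randint(2, 30),
        error_rate=rng.uniform(0.0, 0.001),
        paired=True,
        whole_genome=rng.choice([True, False]),
        n_single_cells=0,
    )


def is_feasible(design):
    """Check a design against the constraints listed in the text."""
    return (
        75 <= design.read_length <= 2000
        and 2 <= design.coverage <= 30
        and 0.0 <= design.error_rate <= 0.001
        and design.paired
        and design.n_single_cells == 0
    )


def rank_designs(designs, tumors, score_fn):
    """Average score_fn(design, tumor) over tumors and sort best-first."""
    averaged = []
    for design in designs:
        mean_score = sum(score_fn(design, t) for t in tumors) / len(tumors)
        averaged.append((mean_score, design))
    return sorted(averaged, key=lambda pair: pair[0], reverse=True)


rng = random.Random(0)
designs = [sample_design(rng) for _ in range(75)]
# Hypothetical toy score; a real score_fn would run the simulator and callers.
ranked = rank_designs(designs, range(15), lambda d, t: d.coverage / 30.0)
```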

The best study design had a read length of 2000, a coverage of 25×, a 0.0 error rate, and was whole-genome sequenced. Further study design scores are given in Table 4 and in the GitHub repository. As expected, the best study designs were mostly whole-genome sequenced, which allowed the recovery of more variants across noncoding regions. Similarly, high coverage did seem to boost the power of the design, but there did not appear to be a large difference between 10× and 30× coverage.

Table 4.

Top Four Scoring and Bottom Four (Nonzero) Scoring Study Designs

Read length	Coverage	Error rate	Exome sequenced	Score
2000	25	0.0	False	0.587395
2000	30	0.001	False	0.496574
2000	10	0.0	False	0.391434
150	25	0.0	False	0.283998
500	2	0.0001	True	0.001818
150	30	0.001	True	0.005085
75	10	0.00001	True	0.011801
2000	2	0.00001	True	0.018933

Increasing the read length boosted our ability to detect structural variants in samples, in both exome- and genome-sequenced samples. The error rate parameter did affect the rate of false calls and the ability to detect SNVs, but the overall scores were not heavily affected by the error rate even though many more false SNV calls were generated. With respect to SVs, the higher error rate samples did not appear to do significantly worse. As a result, long-read, high-error-rate designs did well with respect to our scoring measure. There were also some sample size effects, as shown by the variance bars in Figure 3.

FIG. 3.

Visualizations of various parameters and their impact on the efficacy of a study design. Read length showed the strongest positive correlation with study design score (b), and similarly whole genome-sequenced data had a higher score than exome-sequenced data (c). Higher coverage had a generally positive, but not entirely consistent, effect on the study design score (d). Higher error rates also generally produced lower scoring study designs, but high error rates did not preclude a study from having a high score (a). Error bars in each panel indicate the confidence in our results and depend on the sample sizes from our experiment.

For instance, the point with highest error rate in Figure 3a had a relatively high variance, which might be due to its low sample size in conjunction with association with other advantageous study design parameters. We might expect more confident identification of trends with an increase in the number of replicated tumors. In particular, we might expect the “dip” seen in the coverage plot of Figure 3d to revert to a smooth logistic-type curve, and the error rate plot of Figure 3a to show a more clearly negative trend of score versus error rate.

The worst study designs were often exome sequenced, had poor coverage, or completely failed to recover a particular variant type due to noise. Figure 3 shows tradeoffs between score and the model design parameters in our exploration of the search space. Different conclusions may have been reached if we imposed budget or cost constraints for various parameters.

Aside from identifying efficacious study design vectors and study design parameters, our simulations yielded a number of tangentially interesting results. One finding that may have implications for future cancer informatics development was the effect of hypermutability and rearrangements on our ability to recall variants. High-frequency SV introduced a substantial amount of noise in our SNV caller report. As expected, the task of calling layered SVs at various frequencies was more challenging than that of SNV. With regard to analysis, the patterns that large scale hypermutability, high sequencing error, and imbalanced clonal samples can induce on sequencing reads should be considered when developing the next generation of cancer informatics tools.

In particular, the standard practice of genome alignment and variant calling may not return valuable information in the case of a highly structurally varied and low frequency sample. A final observation generated by the simulator was that ensembled and merged study designs often tended to perform better than single-sample designs with respect to profiling the variant spectrum of a tumor, though this often came at the cost of increased false positive mutations. Namely, in tumor samples where we generated multiple genomic sequencing samples on a single tumor, the merged samples often recovered significantly more forms of variation, even with relatively rudimentary sequencing protocols.

This has ramifications for researchers in practical use cases, as it suggests that relatively large improvements in signal can be gained at relatively low cost by resampling the tumor or using a different sequencing protocol. The raw data for the ensembled designs and each of their constituent sequencing designs are available as comma-separated value (CSV) files in the referenced GitHub repository.
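The tradeoff described above, in which merged samples recover more variants at the cost of more false positives, can be illustrated with a small sketch. Merging by set union is an assumption made here for illustration, and the variant identifiers are toy placeholders rather than simulator output.

```python
def merge_call_sets(call_sets):
    """Merge several call sets by taking their union (an assumed merge rule)."""
    merged = set()
    for calls in call_sets:
        merged |= set(calls)
    return merged


def recall_and_false_positives(called, truth):
    """Return (recall, number of false positive calls) against the truth set."""
    called, truth = set(called), set(truth)
    recall = len(called & truth) / len(truth) if truth else 0.0
    return recall, len(called - truth)


# Toy example: two samples of one tumor, each recovering different variants.
truth = {"v1", "v2", "v3", "v4"}
sample_a = {"v1", "v2", "x1"}  # x1 is a false call
sample_b = {"v3", "x2"}        # x2 is a false call

single = recall_and_false_positives(sample_a, truth)
merged = recall_and_false_positives(merge_call_sets([sample_a, sample_b]), truth)
```

In this toy case, recall rises from 0.5 for a single sample to 0.75 for the merged set, while the number of false positives rises from 1 to 2.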

3.5. Comparison to real cancer reads

We compared reads generated by our simulator with an Ion Torrent targeted cancer sequencing panel from a colorectal cancer patient, found in the Sequence Read Archive (SRA accession: SRX9731615). This particular case used single-end reads in the read-length range 50–300 base pairs. For a comparison case, we generated 150 base pair exome-sequenced reads. The guanine-cytosine (GC) content distribution and the overall GC percentage of the real reads closely mirrored those of our simulated reads, with a real GC content of 46% versus a simulated GC content of 44%. GC bias is likely platform specific (Benjamini and Speed, 2012) and may be less pronounced for this system than others.
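A GC content comparison of this kind can be made with a short function such as the one sketched below. The function names and the choice to ignore ambiguous bases such as N are assumptions made for illustration; parsing reads from FASTQ files is omitted.

```python
def gc_fraction(sequence):
    """Fraction of G/C among unambiguous bases (A, C, G, T) in one read."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    called = gc + seq.count("A") + seq.count("T")
    return gc / called if called else 0.0


def pooled_gc_fraction(reads):
    """GC fraction pooled over all bases of all reads."""
    gc = 0
    called = 0
    for read in reads:
        seq = read.upper()
        read_gc = seq.count("G") + seq.count("C")
        gc += read_gc
        called += read_gc + seq.count("A") + seq.count("T")
    return gc / called if called else 0.0


# Toy reads (hypothetical): the per-read values give the GC distribution,
# and the pooled value gives the overall GC percentage.
reads = ["GGCC", "ATAT", "atgc", "NNNN"]
per_read = [gc_fraction(r) for r in reads]
overall = pooled_gc_fraction(reads)
```

Applying the same functions to the real and simulated read sets gives both the per-read GC distribution and the overall GC percentage for a side-by-side comparison.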

The Ion Torrent reads used in the study had a read length range of $[25, 354]$ , whereas ours ranged within $[0, 150]$ by a user-set distribution. The final noteworthy difference was with respect to quality scores. In our simulator, we placed a uniform quality score on all bases. The real sequencing case had fairly uniform sequencing quality scores except for extremely long reads, which lost quality with longer extension.

Although the simulated read files do not exactly mirror the distributions found in current sequencing technology platforms, we are mainly concerned with invariance with respect to sequencing attributes. That is, a shift in signal accuracy caused by simulated sequencing parameters (say, read length) should produce a similar shift in accuracy in real sequencing technologies under that same shift in sequencing parameters. In a loose sense, our simulated reads can be viewed as a limiting case of current sequencing platforms, which does not over-represent parts of the genome and has a more defined distribution of read lengths.

This is not necessarily an issue for the purpose of this simulator, as it is conceivable that future technologies or sequencing protocols could be developed that do not possess the same read length distributions, read over-representations, or sequencing quality scores. In addition, these distributions vary depending on the sequencer and on random stochasticity; they could, however, plausibly be integrated into the simulator in future iterations. Our main conclusion is that it appears possible to mimic arbitrary sequencer properties by altering distributions within our simulator.

4. DISCUSSION AND CONCLUSION

In this study, we introduced a new simulation toolkit to generate sequencing reads from somatic variation processes under a wide range of biological and technological parameters. We used a bottom-up approach, encoding various aspects of somatic evolution and sequencing with user customizable probability densities. We demonstrated the utility of our simulator for several hypothetical questions in evaluating and optimizing study designs for profiling somatic variability in cell lineages.

A downside of our approach is that, in the service of modeling general classes of technologies, we may not encode some unique properties of specific sequencing platforms. Details such as distributions over sequencing quality and exon error tolerance are somewhat crudely approximated by our simulator and might need to be customized to specific current platforms in future study. In the exon sequencing case, we try to find subsets of reads that may match exon sequences well, but this is done in an inexact way due to computational considerations.

Another approximation in our experimental tests was the de facto alignment to a linear reference genome as the first part of our experiments. In the case of highly rearranged cancer genomes, alignments may not provide high-quality insights into the original mutation sources. Alternatives such as graph-based alignments or reference-free sketching ideas could be explored in the future. In the case of ultra-long reads, it may be computationally feasible to assemble the genome, raising further questions not explored here such as the lengths at which assembly becomes feasible.

The simulator might also be extended in various ways in future study. Although DNA sequencing has come to be the standard lens by which researchers view the cancer evolutionary system, a growing body of work on epigenetic theory demonstrates that some neoplasms may use epigenetic modifications to generate a selective advantage (Kanwal and Gupta, 2012). Incorporating various forms of epigenetic modifications—three-dimensional genome alterations, methylation, etc.—and the technological methods used to probe these changes could be a valuable addition to our simulator.

As our knowledge of the mechanisms of somatic evolution and mutagenesis changes, modifications could be made to our simulation system to incorporate these novel patterns. For instance, the understanding of complex patterns of SVs and the mechanisms by which they arise in somatic cells is still in its infancy, with new patterns of SV continuing to be discovered (Hadi et al., 2020). It may be possible for future iterations of the simulator to encode arbitrary rearrangements in the genome rather than only some defined arrangements.

In addition, novel mutational signatures are being discovered with a wide variety of endogenous and exogenous causes. Arbitrary mutational signatures and alternative forms of SV could be readily incorporated into our current framework by modifying the distributions over genome lengths and frequencies of mutation. Evolutionary modeling is another potential area of improvement. We utilized a neutral coalescent model to represent the evolutionary process stemming from a single cell, and a Dirichlet process to model the clonal frequencies in each sample. There is room here to incorporate various selection pressures, clonal dynamics, drift, and bottleneck effects with greater knowledge of how these processes act in the cancerous setting.

The primary goal of this simulator is to allow thorough exploration and optimization of spaces of study design decisions and evaluate their impacts on our power to detect significant patterns of somatic evolution. We are particularly interested in our ability to reconstruct evolutionary lineages, find their characteristic mutational signatures and rates, and detect patterns of SV. An important task going forward is to provide user-friendly software for study design inquiries. This software would allow a user to input properties they wish to detect in a cancer sample along with cost settings; the software would then return sets of study design parameters that allow for their detection under minimal cost.

Ideas from Bayesian hyperparameter optimization will likely prove useful to our optimization goals since each iteration of our output function is expensive to obtain. Ideally, we wish for a symbiotic loop between sequencing technology development and simulation study design optimization. That is, simulations could produce realistic sets of data of a neoplastic process, optimization techniques could then produce feasible sets of technological parameters with which details of this process are revealed, and finally sequencing technological development could then be targeted toward parameter sets that provide maximal amounts of information.

Footnotes

DISCLAIMER

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The Pennsylvania Department of Health specifically disclaims responsibility for any analyses, interpretations, or conclusions.

AVAILABILITY

The complete code for the simulator and current experiments can be found at: https://github.com/CMUSchwartzLab/MosaicSim. Additional pseudocode and testing information can be found in the Supplementary Material.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

Portions of this study have been funded by Pennsylvania Department of Health award FP00003273. Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award number R01HG010589.

SUPPLEMENTARY MATERIAL

References

1. Abascal, Harvey, Mitchell, et al. Somatic mutation landscapes at single-molecule resolution. Nature, 2021; 593(7859):405–410.

2. Alexandrov, Kim, Haradhvala, et al. The repertoire of mutational signatures in human cancer. Nature, 2020; 578(7793):94–101.

3. Benjamini, Speed. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res, 2012; 40(10):e72.

4. Colom, Herms, Hall, et al. Mutant clones in normal epithelium outcompete and eliminate emerging tumours. Nature, 2021; 598(7881):510–514.

5. Coorens, Moore, Robinson, et al. Extensive phylogenies of human development inferred from somatic mutations. Nature, 2021; 597(7876):387–392.

6. Dentro, Leshchiner, Haase, et al. Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes. Cell, 2021; 184(8):2239–2254.

7. Ellis, Moore, Sanders, et al. Reliable detection of somatic mutations in solid tissues by laser-capture microdissection and low-input DNA sequencing. Nat Protocols, 2021; 16(2):841–871.

8. Ewing, Houlahan, Hu, et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat Methods, 2015; 12(7):623–630.

9. Forbes, Tang, Bindal, et al. COSMIC (the catalogue of somatic mutations in cancer): A resource to investigate acquired mutations in human cancer. Nucleic Acids Res, 2010; 38(Suppl. 1):D652–D657.

10. García-Nieto, Morrison, Fraser. The somatic mutation landscape of the human body. Genome Biol, 2019; 20(1):1–20.

11. Gu, Ng HKT, Tang, Schucany. Testing the ratio of two Poisson rates. Biom J, 2008; 50(2):283–298.

12. Hadi, Yao, Behr, et al. Distinct classes of complex structural variation uncovered across thousands of cancer genome graphs. Cell, 2020; 183(1):197–210.

13. Jolly, Van Loo. Timing somatic events in the evolution of cancer. Genome Biol, 2018; 19(1):1–9.

14. Kanwal, Gupta. Epigenetic modifications in cancer. Clin Genet, 2012; 81(4):303–311.

15. Kelleher, Etheridge, McVean. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput Biol, 2016; 12(5):e1004842.

16. Kim, Scheffler, Halpern, et al. Strelka2: Fast and accurate calling of germline and somatic variants. Nat Methods, 2018; 15(8):591–594.

17. Koboldt. Best practices for variant calling in clinical sequencing. Genome Med, 2020; 12(1):1–13.

18. Langmead, Salzberg. Fast gapped-read alignment with Bowtie 2. Nat Methods, 2012; 9(4):357–359.

19. Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997; 2013.

20. Li. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 2018; 34(18):3094–3100.

21. Li, Roberts, Wala, et al. Patterns of somatic structural variation in human cancer genomes. Nature, 2020; 578(7793):112–121.

22. Loeb. A mutator phenotype in cancer. Cancer Res, 2001; 61(8):3230–3239.

23. Nordborg. Coalescent theory. In: Handbook of Statistical Genomics: Two Volume Set. 2019; pp. 145-30.

24. Olafsson, Anderson. Somatic mutations provide important and unique insights into the biology of complex diseases. Trends Genet, 2021; 37(10):872–881.

25. Rajaraman, Ullman. Mining of Massive Datasets. Cambridge University Press: New York, NY, USA; 2011.

26. Rausch, Zichner, Schlattl, et al. Delly: Structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 2012; 28(18):i333–i339.

27. Salk, Fox, Loeb. Mutational heterogeneity in human cancers: Origin and consequences. Annu Rev Pathol Mech Dis, 2010; 5:51–75.

28. Slatko, Gardner, Ausubel. Overview of next-generation sequencing technologies. Curr Protoc Mol Biol, 2018; 122(1):e59.
