Ancestral Recombinations Graph: A Reconstructability Perspective Using Random-Graphs Framework

Abstract

We present a random graphs framework to study pedigree history in an ideal (Wright-Fisher) population. This framework relates the underlying mathematical objects used in the literature, for example the pedigree graph, the mtDNA or NRY Chr tree, the ARG (Ancestral Recombinations Graph), and HUD, within a single unified random graph framework. It also gives a natural definition, based solely on the topology, of an ARG, one of the most interesting as well as useful mathematical objects in this area. The random graphs framework gives an alternative parametrization of the ARG that does not use the recombination rate ρ and instead uses a parameter M based on (an estimate of) the number of non-mixing segments in the extant units. This seems more natural in a setting that attempts to tease apart the population dynamics from the biology of the units. This framework also gives a purely topological definition of the GMRCA, analogous to the MRCA on trees (which has a purely topological description, i.e., it is a root, graph-theoretically speaking, of a tree). Secondly, with a natural extension of the ideas from random graphs, we present a sampling (simulation) algorithm to construct random instances of the ARG/unilinear transmission graph. This is the first (to the best of the author's knowledge) algorithm that guarantees uniform sampling of the space of ARG instances, reflecting the ideal population model. Finally, using a measure of reconstructability of the past historical events given a collection of extant sequences, we conclude that, for a given set of extant sequences, the joint history of local segments along a chromosome is reconstructible.

1. Introduction

This study is motivated by the question: Given a collection of N chromosomal segments, under the best computational scenario (in terms of time, space and sophistication of algorithms), is the joint history of the N (extant) sequences reconstructible? The presence of recombinations in the evolutionary history of the sequences makes the reconstruction process nontrivial and understanding the manifestations of this genetic exchange event in the chromosome sequences has been the subject of intense and careful study (Hudson, 1983; Griffiths, 1999; Hein et al., 2005).

The effect of recombinations on the traditional phylogenetic tree reconstruction (Schierup and Hein, 2000), on combinatorial complexity, in terms of deviations and error bounds (Wiuf and Hein, 1999b), and on ancestral relationships overall (Wiuf and Hein, 1999a,b; Griffiths, 1999; Davies et al., 2007) has been studied in the literature. The evolution of the statistical properties of an ideal population, with genealogical relationships between sequences in a diploid population, can be understood through simulations. One of the most important mathematical objects in this context is the Ancestral Recombinations Graph (ARG) introduced by Griffiths and Marjoram (1997). The underlying ideas have been used in simulation algorithms (Gabriel et al., 2002), with migrations, population subdivision, and other influencing factors layered in, to simulate human population evolution. Algorithmic approaches to estimate the ARG are discussed in Parida et al. (2008, 2009), and these have been applied to the study of genetic variations in human populations (Mele et al., 2009).

To address the ARG reconstructability question in its generality, as a point of reference we use the assumption that the maternal (or paternal) pedigree tree is completely reconstructible for all practical purposes (Jobling et al., 2004; Hein et al., 2005). Thus, there is a need for a unified framework to enable the comparison of the tree structures with ARGs, as well as the location of the common ancestors (Griffiths, 1999). Random graph theory in general studies the properties of graphs defined in a probabilistic setting. For our purposes, we consider random directed acyclic graphs on a countably infinite vertex set. This framework allows the embedding of the pedigree tree of purely maternal lineage (via, say, mitochondrial DNA) or purely paternal lineage (via, say, the NRY chromosome) into the general pedigree graph, including the ARG model. This allows for a comparison, under an ideal population (Wright-Fisher) setting, of the unilinear tree with the biparental pedigree graph, and lets us address the reconstructability question.

The results are as follows. Firstly, the unified framework gives a natural definition, based solely on the topology, of the ARG. It gives an alternative parametrization of the ARG that does not use the recombination rate ρ and instead uses a parameter M, an upper bound on the number of non-mixing segments in the extant units. This is more natural in a setting that attempts to isolate (as far as possible) population dynamics from the biology of the units under study. This framework also gives a purely topological or graph-theoretic definition of the GMRCA. Recall that the MRCA on trees has a purely topological description, i.e., it is a root, graph-theoretically speaking, of a tree. Secondly, we identify a natural measurable space for the pedigree graph instances, as well as the ARG and unilinear transmission trees. Thirdly, with a natural extension of the ideas used to define the measurable space, we present a simulation algorithm that uniformly samples the space of ARG (as well as unilinear transmission tree) instances. This is the first algorithm that guarantees uniform sampling of the space of ARG instances, reflecting the ideal population model. Finally, using a measure of reconstructability of the ARG, we conclude that, for a given set of extant sequences, the joint history of local segments along a chromosome is reconstructible.

The Population Model: Wright-Fisher. The ideal population or Wright-Fisher Model assumes three properties of the evolving population: (1) constant size, (2) non-overlapping generations, and (3) panmictic with random mating and no selection. While the first two properties appear unrealistic at first glance, these assumptions are reasonable for the purposes of the study of the genetic variations at the population level. In fact, models with varying population size and/or overlapping generations can be reparameterized for an equivalent Wright-Fisher Model (Hein et al., 2005; Jobling et al., 2004; Bürger, 2000). Panmictic means that there is no substructuring of the population due to mating restrictions caused by mate selection, geography or any other such factors. Thus, the model assumes equal sex ratio and equal fecundity.

Roadmap. The rest of the article is organized as follows. Section 2 describes the modeling of the general pedigree graph as an infinite random graph. Section 3 models the pedigree trees and the ARG as two classes of subgraphs respectively: one simply on monochromatic vertices and the other on mixed vertices with a restricted number per generation. Next, for both classes of subgraphs, a natural measurable space is defined in Section 4. Using this measure, Section 5 discusses the computation of the expected size of the subgraph in terms of the depth of the most recent common ancestor. As a natural consequence of this model, an algorithm uniformly sampling the space of ARG instances (as well as pedigree tree instances) is described in Section 6, and we conclude in Section 7.

2. Random Graph Framework: Pedigree Graph G_PG(K, N)

Let V be the set of vertices and E the set of edges in a directed graph G(V,E). Each vertex v corresponds to an individual or a unit in the population. The edges denote the flow of genetic material between the units (Fig. 1). The characteristics of these two sets are as follows.

FIG. 1.

The first 10 generations of an instance of a relevant pedigree graph G_PG(K,N) with K = 4 and N = 8. The solid (blue) dots represent one gender, say males, and the hollow (red) dots represent the other gender (females). Each row is a generation, with the direction on edges indicating the flow of the genetic material, and the four extant units are in the bottom row, i.e., row 0. Under the Wright-Fisher population model, there is an equal number of males and females in each row, and the two distinct parents, one male and one female from the immediately preceding generation, are randomly chosen.

Vertices (K,N): The vertex set V is a countably infinite set. We suppose that the vertices are organized in rows, each of fixed size. Each row represents a generation, and the rows are numbered $0, 1, 2, 3, \ldots$. Row 0 has K vertices and each row g ≥ 1 has 2N (N ≥ 1) vertices, N of which are colored blue and N of which are colored red, denoting the gender of the units. The fixed size of 2N per row (or generation) is due to the constant population size model. The panmictic nature of the WF population dictates that the number of blue and red vertices be equal (see Section 1).

The vertex set in row 0 has K elements whose color is immaterial: these K nodes are also called the extant vertices.

The N vertices of each color in each row g can be labeled by a pair (g, j) where 1 ≤ j ≤ N. A graph instance is vertex-labeled if this label is associated with every vertex v of the graph instance.

Edges: All the edges in E are directed and are only between vertices of adjacent rows. Also, the direction of the edge is from the vertex at row g + 1 to the vertex at row g. Let $(v, u) \in E$ be a directed edge from v to u. The direction indicates the flow of genetic information and v is called the parent of u and similarly u is the child of v. When a node u at row g has two incoming edges (v₁,u) and (v₂,u), then v₁ and v₂ (of row g + 1) must have different colors.

The parent vertex (vertices) is chosen at random reflecting the panmictic nature of the WF population. A parent and child are in adjacent rows due to the non-overlapping generations in the model (see Section 1). However, this can be easily relaxed to have overlapping generations and the essence of each discussion below still holds.

Note that one needs to distinguish a specific instance or realization (sometimes called a replica in simulation parlance) of the random graph from the random graph itself, which is a probability measure (to be specified later) on the space of infinite directed graphs with a countably infinite set of vertices. An instance of the random graph is obtained by executing the edge construction procedure below.

Repeat for each row g: For each vertex u in row g, pick exactly one blue vertex v₁ and exactly one red vertex v₂ at random from row g + 1. The two directed edges are v₁u and v₂u.

Then,

Every instance of G_PG(K,N) is a directed acyclic graph (DAG). This follows from the fact that no vertex can be an ancestor of itself.

An instance of G_PG(K,N) corresponds to the entity termed pedigree graph in literature (Steel and Hein, 2006).

The number of vertices in row g is denoted by k_g. Note that when g = 0, k_g = K.
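The edge construction procedure above can be sketched for a finite number of rows as follows. This is only an illustrative sketch: the function name `sample_pedigree`, the `(row, color, index)` labeling, and the truncation parameter `depth` are assumptions made for the example, and the code builds the full (not the relevant) pedigree graph.

```python
import random


def sample_pedigree(K, N, depth, seed=None):
    """Sample the edges of the first `depth` generations of a pedigree graph.

    Vertices in rows g >= 1 are labeled (g, color, j) with color in
    {"blue", "red"} and 1 <= j <= N; the K extant vertices are (0, None, j).
    Returns a dict mapping each child to its (blue_parent, red_parent) pair.
    """
    rng = random.Random(seed)
    rows = {0: [(0, None, j) for j in range(1, K + 1)]}
    for g in range(1, depth + 1):
        rows[g] = [(g, c, j) for c in ("blue", "red") for j in range(1, N + 1)]

    parents = {}
    for g in range(depth):
        for u in rows[g]:
            # One blue and one red parent, each chosen uniformly from row g + 1.
            blue = (g + 1, "blue", rng.randint(1, N))
            red = (g + 1, "red", rng.randint(1, N))
            parents[u] = (blue, red)
    return parents


# Hypothetical parameters matching the caption of Fig. 1 (K = 4, N = 8).
parents = sample_pedigree(K=4, N=8, depth=10, seed=0)
```

Every child in rows 0 through `depth - 1` receives exactly two parents of different colors from the next row, so the truncated instance is a DAG by construction.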

Forbidden structures. Can the pedigree graph be monochromatic? It turns out that if the color is not taken into account, then the graph has certain forbidden structures (Fig. 2). These are topologies where the parents of a set of vertices cannot be colored so as to satisfy the condition that the two parents of a vertex must be of different colors. For simplicity of computations, we will assume that the pedigree graph retains the two distinct colors of the vertices.

FIG. 2.

Can the pedigree graph be monochromatic? Forbidden structure in an instance of the pedigree graph. There exists no consistent assignment of red and blue colors (different genders) to the parents of the three vertices in the bottom row.

2.1. Least common ancestor (LCA)

The vertex set of an instance of the pedigree graph can be trimmed by focussing only on the flow of genetic information to the extant vertices. This is termed the relevant pedigree graph. A vertex v_a is an ancestor of vertex v if there exists a directed path from v_a to v. In graph-theoretic terms, it means that any vertex on the relevant pedigree must be an ancestor of at least one extant vertex. However, a relevant pedigree graph is also an infinite object. In the rest of the discussion, a pedigree graph is always a relevant pedigree graph.
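The trimming step can be illustrated on a finite toy graph. The sketch below assumes the graph is stored as a hypothetical `parents` dictionary (child to tuple of parents); the vertex names and helper names are made up for the example and do not come from the article.

```python
def relevant_vertices(parents, extant):
    """Return every vertex that is an extant vertex or an ancestor of one."""
    seen = set(extant)
    stack = list(extant)
    while stack:
        u = stack.pop()
        for v in parents.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def ancestors(parents, u):
    """Return all vertices with a directed path to u (u itself excluded)."""
    return relevant_vertices(parents, [u]) - {u}


# Toy instance: x1 and x2 are extant; d and s have no path to an extant vertex.
toy_parents = {
    "x1": ("a", "b"),
    "x2": ("b", "c"),
    "a": ("p", "q"),
    "b": ("p", "q"),
    "c": ("r", "q"),
    "d": ("r", "s"),
}
extant = ["x1", "x2"]

kept = relevant_vertices(toy_parents, extant)
common = set.intersection(*(ancestors(toy_parents, x) for x in extant))
```

Here `kept` drops `d` and `s`, and `common` collects the vertices that are ancestors of both extant vertices, which are the candidates for the common-ancestor definitions that follow.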

A common ancestor v_a of vertices $v_1, v_2, \ldots, v_k$ is called the least common ancestor (LCA) if for all common ancestors $v_a^{\prime}\ (\ne v_a)$, $v_a^{\prime}$ is an ancestor of v_a (Cormen et al., 1990). In the rest of the article, an LCA always refers to an LCA of all the K extant vertices in the pedigree graph.

Note that in every instance of the pedigree graph the indegree and outdegree of every vertex are bounded (indegree by 2 and outdegree by 2N), and each row has at most 2N vertices; even so, the number of possible LCAs might not be finite. Let Z(K,N) be the random variable denoting the number of LCAs in G_PG(K,N). Then $$0 \leq Z(K, N) \leq \infty.$$

Moreover,

Lemma 1

1. There exist instances where Z(K,N) attains the value 0.

2. There exist instances where Z(K,N) attains the value ∞ .

Proof Sketch: We first prove the following two lemmas.

Lemma 2

Let v_l at row g > 1 be an LCA.

(a) For an LCA v at row g′ > g, there exists no path from v to v_l in the pedigree graph.

(b) At every row g′ < g there exist at least two distinct vertices $v_1^{\prime}$ and $v_2^{\prime}$, neither of which is an LCA (termed path-vertices), such that there is a path from v_l to $v_1^{\prime}$ and a path from v_l to $v_2^{\prime}$ in the pedigree graph.

Proof. (a) This follows directly from the definition of the LCA, i.e., it is not possible to have more than one LCA on any path in the pedigree graph. (b) Assume the contrary. Case 1: There exists no vertex in row g′ < g with a path from v_l. This is a contradiction since in a pedigree graph there must exist a path from v_l to at least one of the extant vertices. Case 2: There is only one vertex v′ at row g′ < g such that there is a path from v_l. Since v_l is an LCA, there exist paths to all the extant vertices and thus there exist paths from v′ to all the extant vertices, contradicting the fact that v_l is an LCA.

Lemma 3

Let an LCA occur at depth g_l with at least one LCA at a depth g′ < g_l and with at least one LCA at a depth g″ > g_l. Then row g_l must have at least five vertices.

Proof. Observe that for each vertex v at any row g, each row above g contains at least two vertices, say v₁ and v₂, with paths from v₁ and v₂ to v. If v is an LCA, we call v₁ and v₂ of each such row block-vertices in the proof below. Thus, row g_l has at least two block-vertices corresponding to the LCA at row g′. Also, by Lemma 2(b), row g_l has at least two path-vertices corresponding to the LCA at row g″. The result follows from the fact that these four vertices and the LCA at row g_l are all distinct.

Corollary 1

For 1 ≤ N ≤ 2, there exists no instance with infinitely many LCAs.

Proof. Assume the contrary. Then there must exist at least three rows g′ < g_l < g″ with LCAs in each of these rows. Then row g_l must have at least five vertices, leading to a contradiction.

Back to the proof of Lemma 1: Case 1: We construct an instance of the pedigree graph with no LCAs, i.e., Z(K,N) = 0. Let N ≥ 2. This construction is done in the following two steps. Let N = N₁ + N₂ and K = K₁ + K₂ with N₁,N₂ ≥ 1 and K₁,K₂ ≥ 1. In Step 1, construct an instance G¹ of the pedigree graph G_PG(K₁,N₁) and an instance G² of the pedigree graph G_PG(K₂,N₂). In Step 2, construct the union of the two graphs assuming that the two vertex sets are nonoverlapping (distinct labels in the two instances). It can be easily verified that this union is an instance of the pedigree graph G_PG(K,N) and since it has at least two connected components, it has no LCAs.

Case 2: By Corollary 1, N > 2. We create an instance of the random graph G_PG(4,3) that has an infinite number of LCAs, i.e., Z(K,N) = ∞ (Fig. 3). The construction is as follows. Row 0 has the four extant vertices. The outgoing edges from vertices in Row 0 and 1 are constructed as shown in the figure. The vertices in row 3 and higher are of three categories: (1) two vertices of different colors called the blocked-vertices (two left vertices in the figure); (2) one vertex of any color, called the LCA-vertex (the middle vertex in the figure); (3) two vertices of the same color, different from the color of the LCA-vertex, called the path-vertices (two right vertices in the figure). The edge constructions follow a simple pattern as shown in the figure. Under this construction, the following can be verified: (1) the instance of the pedigree graph is valid, i.e., the two parents of a node are of different colors, and (2) every LCA-vertex of row 2 and higher is indeed the LCA of all the extant vertices.

FIG. 3.

An instance of the pedigree graph G_PG(4,3) (i.e., 4 extant vertices and population of size 3 for each gender at every generation) with an infinite number of LCAs. A possible coloring (or gender assignment) is shown for rows 1 and above; the colors of vertices of row 0 are immaterial. Only the first six rows are shown, marked from 1 to 5 (bottom row is 0). The LCAs are shown with an extra concentric ring. Rows 2 or higher: The block-vertices are the two leftmost vertices; the path-vertices are the two vertices in the right; the LCA-vertex is in the center. The same pattern of edges can be followed for all rows to define an infinite number of LCAs.

Corollary 2

For fixed parameters, K > 1 and N > 2,

1. there are infinitely many instances of G_PG(K,N), each with no LCAs.

2. there are infinitely many instances of G_PG(K,N), each with an infinite number of LCAs.

Next we make the assumption that any ancestor of an LCA is not of consequence and can be excised from the relevant pedigree graph. Is the pedigree graph after excising all the ancestors of all the LCAs finite? To summarize the answer to the question of finiteness of the excised pedigree graph with fixed parameters K(>1) and N(>2):

It is possible that an instance of the (excised) pedigree graph is infinite.

Further, there are infinitely many such instances.

3. Pedigree Subgraphs

It is perhaps not very surprising to note that a pedigree graph may have multiple LCAs. However, it is rather surprising to note that even in a finite population model, the number of LCAs could be infinite (Lemma 1). This counterintuitive characteristic of the pedigree graph can be addressed by exploiting the biparental mode and coupling genetic exchange information with the topology of the pedigree graph.

Population dynamics versus biology. Usually, a mutation rate θ is associated with a population and different populations can be Wright-Fisher populations with different mutation rates. Note that different values of θ do not (and should not) affect the population dynamics under such population models. Also, θ is not a direct observable: it is inferred from the observed mutations or allele values. In fact mutations, reflected as allele frequencies, can be even viewed as external markers (say like Lagrangian markers in fluid dynamics) to study the evolution of the statistical properties of the ideal population. Ideally, θ does not affect the topology of the unilinear transmission trees: it only affects the sequences that each unit represents. Since θ is “external” to the population, this additional parameter does not affect the modeling (or understanding) of the dynamics of the population.

Then, how about the parameter recombination rate ρ in a biparental model? Analogous to mutations (and other duplication-model genetic events), this should not affect the population dynamics but only the sequences of the units. Again, it is (by current biotechnologies) not a direct observable but can be inferred from the sequences in the population. However, just as in the unilinear transmission model the vertices that have no paths to an extant unit are not relevant for the study, so in the biparental model vertices that have no genetic material ancestral to any in the extant units are not of relevance. The definition of the relevant pedigree graph is now extended to exclude those vertices that do not carry any genetic material to the extant units (although there may be a path in the graph to an extant vertex). It now seems more natural to annotate the vertices of the biparental graph with the nonmixing segments (i.e., segments that are inherited completely from the mother or the father but not mixed by the two parents) of genetic material. Thus, instead of ρ, a more natural parameter seems to be M, the number of nonmixing units in the extant population. We claim that the parameter M models the biparental mode as a natural extension of the (well-accepted) unilinear transmission mode and ρ continues to be “external” to the population.

We identify two classes of subgraphs of the pedigree graph G_PG(K,N) as our objects of study (Fig. 4 shows the different subgraph models of the pedigree graph):

1. Unilinear Transmission: A monochromatic subgraph G_PT(K,N) is induced on the vertices of one color (either only blue or only red). Thus each vertex has exactly one parent. The biological interpretation of a monochromatic subgraph is as follows. The genetic material that is transmitted only through the blue vertices (father) is the nonrecombining Y chromosome (NRY). Similarly, the genetic material that is transmitted only through the red vertices (mother) is the mitochondrial DNA. These subgraphs of the pedigree graph actually represent the duplications-only model. Lemma 5 has an interesting consequence: All the genetic material in the extant sequences can be traced back to a unique vertex in the pedigree graph. Topologically, this vertex is the LCA and is called the most recent common ancestor (MRCA).

2. Genetic Exchange Model: In this model, genetic material is additionally associated with the vertices. M is an upper bound on the number of non-mixing (or genetic exchange) segments in the extant units, which is used as an additional parameter. A mixed subgraph G_PGE(K,N,M) is induced on some blue and some red vertices. Thus the vertices may have either one or two parents. Lemma 7 has an interesting consequence: All the genetic material in the extant sequences can be traced back to a unique vertex in the pedigree graph. Topologically, this vertex is the LCAA (defined in Section 3.2.1). This is called the grand most recent common ancestor (GMRCA).

FIG. 4.

(a) An instance of a pedigree graph. (b) Monochromatic subgraph induced on the blue nodes (each node has exactly one parent) (c) Mixed subgraph induced on a subset of the blue and the red nodes (each node has one or two parents).

3.1. Unilinear transmission: monochromatic subgraphs G_PT(K,N)

Mutation events or genetic events such as the ones leading to Short Tandem Repeat (STR) polymorphisms are modeled as duplication events or simply non-recombining events (Hein et al., 2005; Jobling et al., 2004). Hence, this is also called the duplications-only model and each vertex has only one parent. Since we do not model any gender-specific characteristics, the duplications-only model is equivalent to the monochromatic (all vertices of the same color) model in our general setting. The following is easily verified.

Lemma 4

The monochromatic subgraph is a forest (tree), i.e., the graph has no closed paths.

Hence the monochromatic subgraph is written as G_PT(K,N).

Lemma 5

Given an instance of a monochromatic subgraph G_PT(K,N):

1. The number of vertices can neither increase with depth nor be zero at any row, i.e., $$K \geq k_1 \geq k_2 \geq k_3 \geq \cdots \geq 1.$$

2. The number of LCAs is at most 1.

3. (k_g = 1) ⇔ The vertex at row g is the LCA.

Proof Sketch: 1. This follows from the fact that the graph is a tree. 2. Assume the contrary that an instance has l > 1 LCAs. Let v₁ and v₂ be two distinct LCAs. Then there must exist vertices u₁ and u₂ (possibly with u₁ = u₂) where u₂ is an extant vertex and there is a path from v₁ to u₁, a path from v₂ to u₁ and a path from u₁ to u₂. Further let u₁ be such that there is no other vertex u′ with a path from v₁ to u′, v₂ to u′ and u′ to u₁ (if such is the case then we call u′ as u₁). Observe that each vertex of the monochromatic subgraph has at most one parent. However, u₁ must have at least two parents (each on the two distinct paths to the two LCAs v₁ and v₂) contradicting this fact. Hence the assumption must be wrong and the number of LCAs l ≤ 1. 3. This follows from 2.
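The behavior described in Lemma 5 can be observed in a small simulation. The sketch below is an illustration under assumptions: each vertex picks its single parent uniformly among the N same-colored vertices of the next row, and the function name `lineage_counts` and the cutoff `max_depth` are hypothetical choices for the example.

```python
import random


def lineage_counts(K, N, seed=None, max_depth=10_000):
    """Trace K extant lineages backward through one color; return [k_0, k_1, ...].

    Stops at the first row with a single vertex (the LCA of Lemma 5) or
    once max_depth rows have been added, whichever comes first.
    """
    rng = random.Random(seed)
    current = set(range(K))  # labels of the extant vertices in row 0
    counts = [len(current)]
    while len(current) > 1 and len(counts) <= max_depth:
        # Each vertex picks one parent; distinct parents form the next row.
        current = {rng.randrange(N) for _ in current}
        counts.append(len(current))
    return counts


counts = lineage_counts(K=6, N=8, seed=1)
```

The returned sequence never increases and never reaches zero, and if its last entry is 1 then `len(counts) - 1` is the depth of the LCA in that instance.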

Corollary 3

For fixed parameters, K > 1 and N ≥ 1, there are infinitely many monochromatic subgraphs with no LCAs.

3.2. Genetic exchange model: mixed subgraph G_PGE(K,N,M)

Given K extant sequences, the most recent common ancestor (MRCA) is a sequence S from some generation such that the genealogy of every segment (nucleotide) of every extant sequence can be traced back to S. Further, amongst all such common ancestors, S is the most recent one. In population genetics, usually the term MRCA is used for the duplications-only model, and the term grand MRCA (GMRCA) is used when the genetic events include rearrangements of the sequence, such as recombinations (Hein et al., 2005). In the following, we use M, an upper bound on the number of non-mixing segments in the extant units, to parameterize a general genetic exchange model. A Mixed Subgraph G_PGE(K,N,M) is defined as follows. For each instance G of the mixed subgraph:

Each vertex in G is annotated with M nonmixing segments and must have genetic material that flows to at least one of the extant vertices.

This implies that a vertex may have only one parent (if the other parent has no genetic material flowing to an extant unit).

Each genetic mixing event is equally likely to occur.

This is only a simplifying assumption, in the same spirit as the random mating or panmictic condition of the Wright-Fisher population. A deviation from this assumption can be handled quite simply by modifying the set definitions in Section 6.3.2.

Genetic Material (gm) Notation. The genetic material of a unit v is gm(v) and the flow of the genetic material through an edge e is gm(e). The M nonmixing units are written as $\{1, 2, \ldots, M\}$ and are associated with each extant vertex v. The genetic material may consist of nonconsecutive segments, say, 2, 3, 7, i.e., gm(v) = {2, 3, 7}. Thus for all v and e of the subgraph instance $$gm(v), gm(e) \subset \{1, \ldots, M\}.$$

The flow of the genetic material gm(e) through an edge e and the genetic material gm(v) of a vertex v are not independent and are related by the two rules.

Rule 1: Let u be a vertex with d ascendant (incoming) edges e_i, i = 1, …, d (the valid values of d are 1 or 2). Then $$gm(u) = \begin{cases} gm(e_1), & \text{if } d = 1, \\ gm(e_1) \coprod gm(e_2), & \text{if } d = 2. \end{cases}$$

(Note that S = S₁ ∐ S₂ denotes that S is the disjoint union of S₁ and S₂, i.e., S₁ ∩ S₂ = ∅.)

Rule 2: Let v be a vertex with d descendant (outgoing) edges e_i, i = 1, …, d. Then $$gm(v) = \bigcup_{i = 1}^{d} gm(e_i).$$
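The two rules can be checked mechanically on a finite annotated graph. The following sketch assumes a hypothetical representation in which `gm_vertex` maps each vertex to its set of segments and `gm_edge` maps each (parent, child) pair to the set of segments flowing along that edge; the vertex names are invented for the example.

```python
def check_rules(gm_vertex, gm_edge):
    """Check Rule 1 and Rule 2 on a finite annotated graph.

    gm_vertex: dict vertex -> set of segment labels
    gm_edge:   dict (parent, child) -> set of segment labels
    Returns (rule1_ok, rule2_ok).
    """
    incoming, outgoing = {}, {}
    for (v, u), segs in gm_edge.items():
        incoming.setdefault(u, []).append(segs)
        outgoing.setdefault(v, []).append(segs)

    # Rule 1: gm(u) is the disjoint union of the flows on its 1 or 2 incoming edges.
    rule1_ok = True
    for u, flows in incoming.items():
        union = set().union(*flows)
        disjoint = sum(len(f) for f in flows) == len(union)
        if len(flows) > 2 or not disjoint or union != gm_vertex[u]:
            rule1_ok = False

    # Rule 2: gm(v) is the (not necessarily disjoint) union of its outgoing flows.
    rule2_ok = all(set().union(*flows) == gm_vertex[v]
                   for v, flows in outgoing.items())
    return rule1_ok, rule2_ok


# Toy instance with M = 3: x has two parents, y has a single parent.
gm_vertex = {"x": {1, 2, 3}, "y": {1, 2, 3}, "f": {1, 2, 3}, "m": {3}}
gm_edge = {("f", "x"): {1, 2}, ("m", "x"): {3}, ("f", "y"): {1, 2, 3}}
ok = check_rules(gm_vertex, gm_edge)

# Overlapping flows into x violate the disjointness required by Rule 1.
bad_edge = dict(gm_edge)
bad_edge[("m", "x")] = {2, 3}
bad = check_rules(gm_vertex, bad_edge)
```

Vertices at the top of a truncated graph have no incoming edges recorded, so Rule 1 is simply not evaluated for them in this sketch.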

Let m graphs G_i(V_i,E_i) with vertex set V_i and edge set E_i be defined on (labeled) vertices, 1 ≤ i ≤ m. Then the induced graph on vertices $V = \bigcup\nolimits_{i = 1}^m V_i$ (with edges $E = \bigcup\nolimits_{i = 1}^m E_i$) is written as $$G = G_1 \cup G_2 \cup \ldots \cup G_m.$$

Lemma 6

Given G, an instance of a mixed subgraph G_PGE(K,N,M), the following hold.

1. For each vertex v and each edge e of G, $gm(v) \ne \emptyset$ and $gm(e) \ne \emptyset$.

2. Let $V_m = \{v \mid m \in gm(v)\}$, for a nonmixing unit 1 ≤ m ≤ M, with induced graph T_m on V_m.

(a) T_m is a forest for all 1 ≤ m ≤ M.

(b) $G = T_1 \cup T_2 \cup \ldots \cup T_M$.

3. Let the set of vertices at depth g be V_g. The following holds for each depth g.

(a) |V_g| ≤ KM.

(b) For each nonmixing unit 1 ≤ m ≤ M, $|\{v \in V_g \mid m \in gm(v)\}| \leq K$.

Proof Sketch: (1) This follows from the definition of the mixed subgraph (each node must have genetic material that flows to at least one extant vertex). (2a) Assume that the result is not true: for some m, T_m has a closed path. By the direction of the edges in the pedigree graph, there then exists a vertex v with two distinct paths P₁ and P₂ to a vertex u. Without loss of generality, let the two paths be nonintersecting, except at v and u. By Rule 1, the two incoming edges e₁ and e₂ on u cannot satisfy both $m \in gm(e_1)$ and $m \in gm(e_2)$. This is a contradiction, so the assumption is false. (2b) Let the vertex set of G be V and the edge set be E. Then

\begin{align*} V_1 \cup V_2 \cup \ldots \cup V_M & \subseteq V, && \hbox{(by definition of $V_m$)} \\ V & \subseteq V_1 \cup V_2 \cup \ldots \cup V_M. && \hbox{(by (1), each $v$ of $G$ must belong to at least one $T_m$)} \end{align*}

Thus

\begin{align*} V & = V_1 \cup V_2 \cup \ldots \cup V_M, \\ E & = E_1 \cup E_2 \cup \ldots \cup E_M. \qquad \hbox{(by similar arguments)} \end{align*}

Hence, the result. (3a, 3b) In each T_m the number of vertices per row does not exceed K. Also, each vertex in T_m must be annotated with the genetic element m. Thus, by 2(b), the number of vertices in any row of G cannot exceed KM, and the number of vertices in a row carrying nonmixing element m cannot exceed K.
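The decomposition in part 2 of Lemma 6 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the edge annotation is stored as a dictionary from `(child, parent)` pairs to sets of nonmixing units; the helper names `unit_subgraph` and `is_forest` are hypothetical. The forest test is a generic undirected cycle check with union-find, not a procedure taken from the article.

```python
def unit_subgraph(gm_edge, m):
    """Edges of T_m: those whose annotation contains the nonmixing unit m."""
    return [e for e, units in gm_edge.items() if m in units]

def is_forest(edges):
    """Return True if the undirected graph on these edges has no closed path."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:                         # a and b already connected: cycle
            return False
        parent[ra] = rb
    return True

# Toy "diamond": x has parents p and q, which share the parent r.
gm_edge = {
    ("x", "p"): {1}, ("x", "q"): {2},
    ("p", "r"): {1}, ("q", "r"): {2},
}
trees = {m: unit_subgraph(gm_edge, m) for m in (1, 2)}

print(is_forest(list(gm_edge)))                    # whole graph G
print({m: is_forest(t) for m, t in trees.items()}) # each T_m
print(set().union(*map(set, trees.values())) == set(gm_edge))  # G = T_1 ∪ T_2
```

In this toy instance G itself contains a closed path (x, p, r, q), yet each T_m is a chain, and the edge sets of T_1 and T_2 together recover G.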

See Equation 1 for all possible gm annotations for some fixed values of M, K, k_g. An example is shown in Figure 5 of the embedded trees in an instance of a mixed subgraph. An instance of a graph where each vertex v has genetic material gm(v) defined is said to be gm-annotated. Note that in a monochromatic subgraph, for any two distinct vertices v₁ and v₂, gm(v₁) = gm(v₂). Thus, a monochromatic subgraph can be considered to be always gm-annotated. Also, we write the gm annotation of vertex v_i simply as $gm_i$ (instead of gm(v_i)). If vertex v_i is in row g, the annotation may also be written as $gm_{g,i}$.

FIG. 5.

(a) An instance of a mixed subgraph G_GE(4,N,7). (b) For clarity, the chain-paths have been replaced by a single edge. (c) G with possible annotation of genetic units {1,2,3,4,5,6,7}. (d–j) The 7 distinct trees (forests), induced by each nonmixing unit i, embedded in G. In other words, $G = \bigcup\nolimits_{i = 1}^7 T_i$.

3.2.1 Topological definition of GMRCA: least common ancestor with ancestry (LCAA)

Conceptually the term LCA is equivalent to MRCA. However, LCA is not equivalent to GMRCA, which is additionally ancestral to the genetic material in the extant vertices. This is due to the following fact: If node v_a is an ancestor of some node v in G_PG(K,N), then it is possible that not all the genetic material of v_a is ancestral to the genetic material of v. It is also possible that v_a is ancestral to no genetic material of v. In the latter case a topological ancestor is not a “genetic” ancestor. For example, see Figure 6. A natural question is whether there exists a purely topology-based definition that captures the notion of a GMRCA. We call this the LCA with ancestry, or LCAA, a graph-theory based term for the GMRCA (it is defined in the lemma below). Given G, an instance of a mixed subgraph G_PGE(K,N,M), by Lemma 6, let

\begin{align*} G = T_1 \cup T_2 \cup \ldots \cup T_M. \end{align*}

FIG. 6.

Ancestor without ancestry: Each node is annotated with the genetic material labeled as a combination of A, B, C, D, or -. The symbol “-” denotes material that is not ancestral to any in the extant units. The LCA (shown with an extra concentric ring) has no ancestral material although it is the LCA of the two extant vertices with genetic material AB and CD, respectively. Thus this LCA is an ancestor without any ancestry.

Then,

Lemma 7

1. The following two definitions of LCAA are equivalent:

(a) (population genetics based) the least, or most recent, common ancestor of the K extant units that is also ancestral to all the genetic material in the K units.

(b) (graph-theory based) the LCA of the LCAs of $T_1, T_2, \ldots, T_M$.

2. The number of LCAAs is at most 1.

3. (k_g = 1) ⇔ The vertex at row g is the LCAA.

Proof Sketch: We prove the second statement first. 2. Let every genetic unit (say, a nucleotide) be tagged by a two-tuple: its position in the chromosome and the label of its extant vertex. Thus, assuming there are c nucleotides and K extant units, there are cK distinct tuples. Next, the genetic flow from vertex to vertex through the edges is marked by the tuples. It is easy to see that a path marked with a specific tuple is a chain (that does not branch). Thus, if vertex v is a GMRCA, then by definition v is on all the cK paths. Hence there cannot exist more than one GMRCA, since all the marked paths are chains. 1. Let v″ be the LCAA by definition (a). Note that in the tree T_m, 1 ≤ m ≤ M, the LCA of T_m, say vertex v_m, is also the LCAA of the extant units with respect to genetic material m. Further, let v′ be the LCA of $v_1, v_2, \ldots, v_M$. Then, clearly, v′ is a CAA (common ancestor with ancestry) of the K extant units. Case 1: If v″ is an ancestor of v′, this contradicts the definition of LCAA, and so v′ = v″. Case 2: If v′ is an ancestor of v″, this contradicts the definition of the LCA of $v_1, v_2, \ldots, v_M$, and so v′ = v″. Case 3: There is no path between v′ and v″. Then both are LCAAs, which contradicts 2., and so v′ = v″. Next, let v″ be the LCAA by definition (b). Then v″ is also a CAA of the K extant units. Let v′ be an LCAA. By considering the three cases as before, we show that v′ = v″. 3. This follows from 1. and 2.

Lemma 8

The effective value of M for a pedigree graph with N blue or red vertices at some depth is:

\begin{align*} M = NK^{-1}. \end{align*}

4. Probability Space of (Infinite) Pedigree Subgraphs: Uniform Distribution on [0,1]

In this section, we address the problem of mapping the instances of the pedigree graph to a measurable space with the goal of defining a natural probability space. The Wright-Fisher population model suggests a uniform probability space as an appropriate choice. The treatments of the two models, monochromatic and mixed subgraphs, are very similar; to avoid repeating the arguments, all the models are treated simultaneously in the following discussion. Our definition of the probability measure is inspired by (and along similar lines to) the classical construction of finitely additive, invariant measures on infinite groups such as the integers using Følner (1955) sequences.

Note that the set of graph instances is not enumerable (this can be seen by a mapping of the instances to the reals). To enable computations, let the graph have a fixed height h (>1), i.e., a finite number of generations. Since the graph is then finite, the number of instances is finite. The limit as h → ∞ gives us a basis for defining a probability space (details in Section 4.2).

4.1. Fixed depth subgraphs

Let ${\cal G}$ be the set of all (1) vertex-labeled and (2) additionally gm-annotated (in the genetic exchange model) graphs. Define a restriction mapping

\begin{align*} \pi_h: {\cal G} \rightarrow {\cal G}_h, \tag{1} \end{align*}

where $G \in {\cal G}$ is mapped to the graph of finite depth h, π_h(G), that is identical to G up to depth h. Note that the number of distinct trees on a finite number of (labeled) vertices is called the Cayley number, which has a closed form formula (Parida, 2007). Here we address the problem of counting and enumerating the vertex-labeled (see Section 2), gm-annotated (see Section 3.2) graphs. Let L_h denote the maximum number of vertices at depth h. Then, for h ≥ 1,

\begin{align*} L_h = \begin{cases} K, & \hbox{for monochromatic subgraphs} \ G_{\rm PT}(K, N), \\ \min\{2^{h - 1} K, MK, 2N\}, & \hbox{for mixed subgraphs} \ G_{\rm PGE}(K, N, M). \end{cases} \end{align*}

For sufficiently large h, L_h = L, where

\begin{align*} L = \begin{cases} K, & \hbox{for monochromatic subgraphs} \ G_{\rm PT}(K, N), \\ \min\{MK, 2N\}, & \hbox{for mixed subgraphs} \ G_{\rm PGE}(K, N, M). \end{cases} \tag{2} \end{align*}
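To make the saturation of L_h at L concrete, here is a small sketch of the two formulas. The function name `max_row_width` and the convention of passing `M=None` to select the monochromatic model are assumptions of this example only.

```python
def max_row_width(h, K, N, M=None):
    """L_h: the maximum number of vertices at depth h >= 1.

    M=None selects the monochromatic model G_PT(K, N);
    otherwise the mixed model G_PGE(K, N, M) is used.
    """
    if h < 1:
        raise ValueError("L_h is defined for h >= 1")
    if M is None:
        return K
    return min(2 ** (h - 1) * K, M * K, 2 * N)

# Mixed model with K = 4 extant units, N = 100, M = 7 nonmixing units.
widths = [max_row_width(h, K=4, N=100, M=7) for h in range(1, 7)]
print(widths)   # doubles per generation until it reaches min(MK, 2N)
```

With these values the bound doubles (4, 8, 16) and then stays at min{MK, 2N} = 28, which is the limit L of Eqn 2.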

In order to compute and enumerate all the configurations, we define the following set-valued functions.

\begin{align*}
{\cal Q}_0(k_0) &= \left\{ gm \,\middle|\, \begin{array}{l} gm \ \hbox{is a genetic material annotation of} \ k_0 \ \hbox{vertices:} \\ (1) \ \hbox{for} \ 1 \leq i \leq k_0, \ gm_i \subset \{1, 2, \ldots, M\}, \\ (2) \ \hbox{for} \ 1 \leq m \leq M, \ 1 \leq |\{i \mid m \in gm_i\}| \leq K \\ \hbox{(corresponding to conditions 1 and 3 of Lemma 6)} \end{array} \right\}, \\
{\cal Q}_1(k_0, k_1) &= \left\{ q \,\middle|\, \begin{array}{l} q \ \hbox{is a configuration between two adjacent rows, say} \ g \ \hbox{and} \ g + 1, \\ \hbox{with} \ k_0 \ \hbox{vertices at row} \ g \ \hbox{and} \ k_1 \ \hbox{vertices at row} \ g + 1, \\ \hbox{and} \ gm\hbox{-annotated in both rows} \ g \ \hbox{and} \ g + 1 \end{array} \right\}, \\
& \ \ \vdots \\
{\cal Q}_h(k_0, k_1, \ldots, k_h) &= \left\{ q \,\middle|\, \begin{array}{l} q \ \hbox{is a configuration between} \ h + 1 \ \hbox{adjacent rows,} \\ \hbox{with} \ k_i \ \hbox{vertices at the $i$th row} \ (0 \leq i \leq h) \end{array} \right\}.
\end{align*}

Recall that N is the number of vertex-labeled vertices in every row. Thus for k vertices at any row, define a function, which we call the weight function, as follows (see Example 5 for illustrations):

\begin{align*} wt(k) = {N \choose k}. \tag{3} \end{align*}

The numbers (cardinalities) associated with each of the set-valued functions are as follows. For h ≥ 1,

\begin{align*} Q_h(k_0, k_1, \ldots, k_h) = \left| {\cal Q}_h(k_0, k_1, \ldots, k_h) \right| \prod_{i = 1}^h wt(k_i). \tag{4} \end{align*}
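As a hedged arithmetic sketch of Eqns 3 and 4, the weight function and the weighted count can be written as below. The cardinality of the configuration set is not computed here; it is passed in as `config_count`, a placeholder the caller must supply. The names `wt` (with N as an explicit argument) and `weighted_count` are choices made for this example.

```python
from math import comb, prod

def wt(k, N):
    """Eqn 3: number of ways to choose which k of the N labeled vertices are used."""
    return comb(N, k)

def weighted_count(config_count, ks, N):
    """Eqn 4: Q_h = |Q_h-set| * prod_{i=1..h} wt(k_i), with ks = (k_0, ..., k_h).

    config_count stands in for the cardinality of the configuration set,
    which this sketch does not derive.
    """
    return config_count * prod(wt(k, N) for k in ks[1:])

# Toy numbers: N = 5 labeled vertices per row, signature (k_0, k_1, k_2) = (2, 2, 1),
# and an assumed 3 unlabeled configurations.
print(wt(2, 5))
print(weighted_count(3, (2, 2, 1), 5))
```

Note that the product starts at i = 1, so the extant row k_0 contributes no weight, matching the index range in Eqn 4.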

Example 1

Example 2

To avoid clutter, in the following Q₂( · , · ) is written simply as Q( · , · ). Thus, ${\cal Q}(k_g, k_{g + 1})$ enumerates the distinct gm-annotated (in the genetic exchange model) configurations in two adjacent rows, say g and g + 1, with k_g vertices at g and k_{g+1} at g + 1. Note that this function does not depend on the value of depth g. A more precise definition of ${\cal Q}(\cdot, \cdot)$ is as follows:

\begin{align*} {\cal Q}(k_g, k_{g + 1}) = \begin{cases} \coprod_{gm_g \in {\cal Q}_0(k_g)} {\cal Q}_{ab}(gm_g, k_g, k_{g + 1}), & \hbox{for mixed subgraphs} \ G_{\rm PGE}(K, N, M) \ \hbox{(see Eqn 6 below)}, \\ {\cal Q}_{cd}(k_g, k_{g + 1}), & \hbox{for monochromatic subgraphs} \ G_{\rm PT}(K, N) \ \hbox{(see Eqn 7 below)}. \end{cases} \tag{5} \end{align*}

Mixed subgraph model. ${\cal Q}_{ab}$ is the set of all the configurations with k_g vertices at some row g, with genetic material annotation gm_g, and k_{g+1} vertices at the adjacent row g + 1. For N ≥ k_g, k_{g+1} ≥ 1:

\begin{align*} {\cal Q}_{ab}(gm_g, k_g, k_{g + 1}) = \left\{ (q, gm_{g + 1}) \,\middle|\, \begin{array}{l} q = (s_1, s_2, \ldots, s_{k_{g + 1}}), \ \hbox{where} \ s_k \ne \emptyset, \ 1 \leq k \leq k_{g + 1}, \ \hbox{satisfying conditions:} \\ (1) \ s_1 \cup s_2 \cup \ldots \cup s_{k_{g + 1}} = \{1, \ldots, k_g\}. \\ (2) \ \hbox{Let} \ x_i \in s_{k_1} \cap s_{k_2}; \ \hbox{then} \\ \quad (a) \ \hbox{if} \ k_1 \ne k_2 \ \hbox{then} \ |gm_{g, i}| > 1 \ \hbox{and} \ x_i \notin s_k, \ k \ne k_1, k_2, \\ \quad (b) \ (gm_{g, i} \cap gm_{g + 1, k_1}) \cup (gm_{g, i} \cap gm_{g + 1, k_2}) = gm_{g, i}. \\ (3) \ \hbox{If} \ |gm_{g, i}| = 1 \ \hbox{then there is a unique} \ k \ \hbox{with} \ x_i \in s_k. \end{array} \right\}. \tag{6} \end{align*}

Condition (1) ensures that every vertex in row g has at least one parent (in row g + 1); condition 2(a) states that only a vertex carrying more than one nonmixing unit can have two parents, and that it has at most two; condition 2(b) states that the genetic material is split between the parent(s); condition (3) states that a vertex carrying a single nonmixing unit has exactly one parent. Further details are presented in Section 6.3.2.

Monochromatic subgraph model. ${\cal Q}_{cd}$ is the set of all the configurations with vertex set X_g at some row g, each vertex having exactly one parent, and k_{g+1} vertices at the adjacent row g + 1. For K ≥ |X_g| ≥ k_{g+1} ≥ 1:

\begin{align*} {\cal Q}_{cd}(X_g, k_{g + 1}) = \left\{ q \,\middle|\, \begin{array}{l} q = (s_1, s_2, \ldots, s_{k_{g + 1}}), \ \hbox{where} \ \emptyset \ne s_k \subset X_g, \ 1 \leq k \leq k_{g + 1}, \ \hbox{satisfying conditions:} \\ (1) \ s_1 \cup s_2 \cup \ldots \cup s_{k_{g + 1}} = X_g. \\ (2) \ s_{k_1} \cap s_{k_2} = \emptyset \ \hbox{for all} \ 1 \leq k_1 < k_2 \leq k_{g + 1}. \end{array} \right\}. \tag{7} \end{align*}

Condition (1) ensures that every vertex in row g has a parent (in row g + 1); condition (2) ensures that no vertex has more than one parent. Further details are presented in Section 6.3.1.
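For small inputs, the set in Eqn 7 can be enumerated by brute force. The sketch below reads Eqn 7 literally, treating q as an ordered tuple of blocks; that reading is an assumption. If the blocks are meant to be unordered (with labels supplied later through the weight function), each configuration is counted k_{g+1}! times and the totals should be divided accordingly. The function name `q_cd` is hypothetical.

```python
from itertools import product

def q_cd(X, k):
    """All tuples (s_1, ..., s_k) of nonempty, pairwise disjoint sets with union X.

    Each vertex of X is assigned to exactly one block (its single parent),
    so conditions (1) and (2) of Eqn 7 hold by construction; only
    nonemptiness of every block has to be checked.
    """
    X = sorted(X)
    configs = []
    for assign in product(range(k), repeat=len(X)):
        if len(set(assign)) == k:            # every block s_j is nonempty
            blocks = tuple(
                frozenset(x for x, j in zip(X, assign) if j == b)
                for b in range(k)
            )
            configs.append(blocks)
    return configs

configs = q_cd({1, 2, 3}, 2)
print(len(configs))
for q in configs:
    print([sorted(s) for s in q])
```

For three vertices and two parents this yields six ordered tuples, or three configurations if block order is ignored.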

Back to ${\cal Q}(\cdot, \cdot)$. Let Q(k_g, k_{g+1}) be the number of configurations between two gm-annotated rows with the second row being additionally vertex-labeled. Thus, using Eqn 3 (also equivalently Eqn 4),

\begin{align*} Q(k_g, k_{g + 1}) = wt(k_{g + 1}) \left| {\cal Q}(k_g, k_{g + 1}) \right|. \tag{8} \end{align*}

Also,

Lemma 9

For all k_g and k_{g+1},

\begin{align*} 1 \leq Q(k_g, k_{g + 1}) = {\cal O}(N^{k_{g + 1}}). \end{align*}

4.1.1. Equivalence classes

We identify isomorphic graphs, i.e., the graphs that are isomorphic after forgetting the labels, but not the color. Recall that k_g is the number of vertices at row g of $G \in {\cal G}_h$. Furthermore, these isomorphism classes of graphs are characterized by the signature of the graphs contained in them, where the signature of a graph G is defined as follows:

\begin{align*} \hbox{sig}(G) = (k_1, k_2, k_3, \ldots, k_h). \end{align*}

See Figure 7 for an example of the forgetful graphs. Let

\begin{align*} {\cal G}_{(k_1, k_2, \ldots, k_h)} = \left\{ G \in {\cal G}_h \mid \hbox{sig}(G) = (k_1, k_2, \ldots, k_h) \right\}. \end{align*}

FIG. 7.

To limit the number of equivalence classes, here we focus only on the ones where no vertex has only one parent. Then with K = 2, N = 2 and depth h = 2, the equivalence classes are shown with the unlabeled, forgetful member graphs.

To aid in computing the expected location of LCAA in the models, we also define

\begin{align*} {\cal F}(g, k) & = \left\{ G \in {\cal G}_g \mid G \ \hbox{has} \ k \ \hbox{vertices at depth} \ g \right\} \\ & = \coprod_{(k_1, k_2, \ldots, k_g = k)} {\cal G}_{(k_1, k_2, \ldots, k_g)}, \quad \hbox{and} \\ F(g, k) & = \left| {\cal F}(g, k) \right|. \tag{9} \end{align*}

4.2. Measurable space

The Wright-Fisher model implies every instance of the vertex-labeled (and additionally, gm-annotated in the genetic exchange model) graph is equally likely to occur. We make a few remarks about the underlying probability measure on ${\cal G}$ that we use in the article (a non-mathematical reader may skip this section without loss of continuity of exposition).

Each instance of the (infinite) graph is mapped to a real number in the interval [0,1]. More precisely, there exists a bijection

\begin{align*} \phi: {\cal G} \rightarrow [0, 1], \end{align*}

which has the property that ${\cal G}_{(k_1, k_2, \ldots, k_h)}$, the set of graphs that are identical up to a depth h and having signature $(k_1, k_2, \ldots, k_h)$, is mapped to a subinterval of [0, 1] of length (due to the implications of the Wright-Fisher model):

\begin{align*} \hbox{len}(k_1, k_2, \ldots, k_h) = \left( \sum_{(i_1, i_2, \ldots, i_h) \in [1 \ldots L]^h} Q_h(K, i_1, \ldots, i_h) \right)^{-1} Q_h(K, k_1, \ldots, k_h), \tag{10} \end{align*}

using Eqns 2 and 4. Then the following holds.

Lemma 10

For each h > 0,

\begin{align*} \sum_{{\cal G}_{(k_1, k_2, \ldots, k_h)} \subset {\cal G}_h} \hbox{len}(k_1, k_2, \ldots, k_h) = 1. \end{align*}
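The normalization in Eqn 10, and the fact that the lengths sum to 1 as Lemma 10 states, can be checked numerically with a short sketch. The weighted count Q_h is passed in as a callable; the `toy_Q` used in the demonstration is a placeholder that sets every configuration-set cardinality to 1 and is not the article's actual count. All function names here are assumptions of the example.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod

def len_of_signature(sig, K, L, Q_h):
    """Eqn 10: Q_h(K, k_1..k_h) divided by the sum of Q_h over [1..L]^h."""
    h = len(sig)
    total = sum(Q_h((K, *s)) for s in product(range(1, L + 1), repeat=h))
    return Fraction(Q_h((K, *sig)), total)

def toy_Q(ks, N=4):
    """Placeholder weighted count: only the product of wt(k_i), i >= 1."""
    return prod(comb(N, k) for k in ks[1:])

K, L, h = 2, 2, 2
lengths = {
    sig: len_of_signature(sig, K, L, toy_Q)
    for sig in product(range(1, L + 1), repeat=h)
}
for sig, length in lengths.items():
    print(sig, length)
print(sum(lengths.values()))   # the subintervals tile [0, 1]
```

Exact rational arithmetic is used so that the sum can be compared with 1 without floating-point tolerance.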

Then the uniform probability measure μ on ${\cal G}$ is nothing but the pull-back by ϕ of the usual uniform probability measure (i.e., Lebesgue measure) on [0, 1]. Recall the restriction mapping

\begin{align*} \pi_h: {\cal G} \rightarrow {\cal G}_h. \end{align*}

For any $G \in {\cal G}_h$, by the definition of μ( · ) from above,

\begin{align*} \mu\left( \pi_h^{-1}(G) \right) = \hbox{len}(\hbox{sig}(G)). \tag{11} \end{align*}

For some (event) $\mathcal{H} \subset \mathcal{G}$ and for each fixed h > 0, let
\begin{align*}
P_{\mathcal{H}}(h) = \frac{\lvert \pi_h(\mathcal{H}) \rvert}{\lvert \pi_h(\mathcal{G}) \rvert}.
\end{align*}

Suppose that
\begin{align*}
\lim_{h \rightarrow \infty} P_{\mathcal{H}}(h) = p.
\end{align*}

Then,

Lemma 11

Proof. For each h > 0, let $\mathcal{H}_h \subset \mathcal{G}$ denote the measurable set $\pi_h^{-1}(\pi_h(\mathcal{H}))$. Then

Using the bijection ϕ,
\begin{align*}
\mu(\mathcal{H}_h) &= \sum_{G \in \pi_h(\mathcal{H})} \mu\left(\pi_h^{-1}(G)\right) \\
&= \sum_{G \in \pi_h(\mathcal{H})} \mathrm{len}(\mathrm{sig}(G)). \quad \text{(by Eqn 11)}
\end{align*}

Thus
\begin{align*}
\mu(\mathcal{H}_h) = \frac{\lvert \pi_h(\mathcal{H}) \rvert}{\lvert \pi_h(\mathcal{G}) \rvert} = P_{\mathcal{H}}(h).
\end{align*}

Since each $\mathcal{H}_h$ is measurable, it follows from the above that $\mu(\mathcal{H}) = p$.

5. Reconstructability of Pedigree History

Usually, simplifying assumptions about genetic events (say, mutations or recombinations), such as the infinite-sites model (Kimura, 1969) or the infinite-alleles model (Kimura and Crow, 1964), are made, either from a reconstructability or a modeling perspective. The former model implies that any two events at different points in time (generations) do not occur at the exact same location on the sequence. In a similar spirit, the latter model permits multiple mutations at a location, but each produces a unique allele. Thus the models avoid recurrent events at the same location, which can be distracting for a framework that is grappling with reconstructing the history, that is, the identification and time ordering (over generations) of all the genetic events in all the sequences. However, recurrence of genetic events is observed in nature (back mutations, parallel mutations, recombination hotspots, and so on). Nevertheless, these models are reasonable and are widely accepted. Also, note that the HUD (Wiuf and Hein, 1999a) is a more restrictive ARG model, with a view to reconstructability. These restrictions can be translated to topological properties, such as galled trees and related graph theoretic ideas (Gusfield et al., 2007).

However, we take a much less conservative view: here we simply use the size of the pedigree history structure as a measure of reconstructability. A large structure is hard to reconstruct, whereas a small structure is more amenable to faithful reconstruction (also, a network is harder to reconstruct than a tree). Let the LCAA occur at depth g for an instance of the graph. The size is parameterized by the depth of the LCAA, since the vertices at a greater depth (at row > g) are not informative for either ancestry or reconstruction purposes. To this end, we compute the expected depth of the LCAA in both subgraph models.

5.1. On LCAA probabilities

By Lemmas 5 and 7, if the number of LCAAs < ∞ in an instance of the mixed or monochromatic subgraph model, then it is a singleton vertex in some row of the instance. Consider Eqn. 9: F( · , · ) satisfies the following for 1 ≤ k ≤ L (see Eqn. 2 for L) and g > 1.
\begin{align*}
F(1, k) &= Q(K, k), \\
F(g + 1, k) &
\begin{cases}
= \sum_{i = k}^{L} F(g, i)\, Q(i, k), & \text{for monochromatic model } G_{\rm PT}(K, N) \text{ (follows from definition)}, \\
\leq \sum_{i = 1}^{L} F(g, i)\, Q(i, k), & \text{for mixed model } G_{\rm PGE}(K, N, M) \text{ (see Eqn 17 in appendix)}.
\end{cases}
\tag{12}
\end{align*}

5.1.1. Common Ancestor with Ancestry (CAA)

Lemma 12

In a relevant pedigree graph (for both models),

1. if there exists a CAA at depth g, then the same is true for all g′ ≥ g, and,

2. if there exists no CAA at depth g, then the same is true for all g′ ≤ g.

Further, using Lemmas 5 and 7, the probability of a common ancestor with ancestry at depth h of a graph $G \in \mathcal{G}_h$ is given by:
\begin{align*}
\frac{F(h, 1)}{F(h, 1) + F(h, 2) + \ldots + F(h, L)} = T(h). \tag{13}
\end{align*}

Monochromatic Subgraph Model. To get a workable form of F(g), it is convenient to expand Eqn. 12 for monochromatic models as:
\begin{align*}
F(g + 1, 1) &= F(g, K) Q(K, 1) + \ldots + F(g, 3) Q(3, 1) + F(g, 2) Q(2, 1) + F(g, 1) Q(1, 1), \\
F(g + 1, 2) &= F(g, K) Q(K, 2) + \ldots + F(g, 3) Q(3, 2) + F(g, 2) Q(2, 2), \\
F(g + 1, 3) &= F(g, K) Q(K, 3) + \ldots + F(g, 3) Q(3, 3), \\
&\;\;\vdots \\
F(g + 1, K) &= F(g, K) Q(K, K).
\end{align*}

Then,
\begin{align*}
T(g) = \frac{F(g, K) Q(K, 1) + \ldots + F(g, 2) Q(2, 1) + F(g, 1) Q(1, 1)}{F(g, K) \sum_{i = 1}^{K} Q(K, i) + \ldots + F(g, 2) \sum_{i = 1}^{2} Q(2, i) + F(g, 1) Q(1, 1)}.
\end{align*}
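The monochromatic case of Eqn. 12 and the ratio in Eqn. 13 can be tabulated numerically. The following is a minimal sketch, not the author's implementation: the function names are hypothetical, and `q_assumed` is only an assumed stand-in for Q(i, k) (the number of ways i labeled children map onto exactly k of N labeled parents). The paper's own Q, defined earlier, should be passed in its place.

```python
from fractions import Fraction
from functools import lru_cache
from math import perm


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: set partitions of n items into k blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def q_assumed(i, k, N):
    """ASSUMED stand-in for Q(i, k): i labeled children mapped onto exactly
    k of N labeled parents (block structure times ordered choice of parents)."""
    return stirling2(i, k) * perm(N, k)


def f_table(K, N, depth, Q=q_assumed):
    """Return F[g][k] for 1 <= g <= depth, using the monochromatic case of Eqn. 12."""
    F = {1: {k: Q(K, k, N) for k in range(1, K + 1)}}
    for g in range(1, depth):
        F[g + 1] = {
            k: sum(F[g][i] * Q(i, k, N) for i in range(k, K + 1))
            for k in range(1, K + 1)
        }
    return F


def T(F, h):
    """Eqn. 13: F(h, 1) divided by the sum of F(h, k) over k."""
    return Fraction(F[h][1], sum(F[h].values()))
```

For instance, `T(f_table(2, 2, 2), 1)` and `T(f_table(2, 2, 2), 2)` give the ratios at depths 1 and 2 for K = N = 2 under the assumed Q. Exact rational arithmetic is used so that the ratios are not affected by rounding.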

Mixed Subgraph Model. Expand Eqn. 12 for mixed subgraph models as:
\begin{align*}
F(g + 1, 1) &\leq F(g, L) Q(L, 1) + \ldots + F(g, 3) Q(3, 1) + F(g, 2) Q(2, 1) + F(g, 1) Q(1, 1), \\
F(g + 1, 2) &\leq F(g, L) Q(L, 2) + \ldots + F(g, 3) Q(3, 2) + F(g, 2) Q(2, 2) + F(g, 1) Q(1, 2), \\
F(g + 1, 3) &\leq F(g, L) Q(L, 3) + \ldots + F(g, 3) Q(3, 3) + F(g, 2) Q(2, 3) + F(g, 1) Q(1, 3), \\
&\;\;\vdots \\
F(g + 1, L) &\leq F(g, L) Q(L, L) + \ldots + F(g, 3) Q(3, L) + F(g, 2) Q(2, L) + F(g, 1) Q(1, L).
\end{align*}

Back to Computations. Then for both models (see Eqn. 18 in Appendix),
\begin{align*}
T(g) \geq \frac{F(g, L) + \ldots + F(g, 2) + F(g, 1)}{\mathcal{O}(N^L)\,(F(g, L) + \ldots + F(g, 2) + F(g, 1))} = \Omega(N^{-L}).
\end{align*}

5.1.2. Least Common Ancestor with Ancestry (LCAA)

Next we explore the event of the common ancestor at depth h being the most recent as well, i.e., additionally there is no common ancestor below the depth of h. Recall that the LCAA for the monochromatic subgraph is just the LCA. We handle both the cases together. Let P(h) denote the probability of the LCAA at depth h for a graph $G \in \mathcal{G}_h$. In other words, CAA occurs at depth h and has not occurred at a depth < h.

By Lemma 12, the event of occurrence of the LCA at depth h is not independent of the occurrence at h′ ≠ h. This dependence is captured in the following. Let F₁(h) be the number of instances in $\mathcal{G}_h$ with CAA at depth h and no CAA at all depths < h. Let F₂(h) be the number of instances in $\mathcal{G}_h$ with no CAA at all depths ≤ h. Then using L from Eqn 2 and using Lemma 9, we have
\begin{align*}
F_1(1) &\geq 1, & F_2(1) &\leq \mathcal{O}(N^L - 1), \\
F_1(2) &\geq \Omega(N^L - 1), & F_2(2) &\leq \mathcal{O}((N^L - 1)^2), \\
F_1(3) &\geq \Omega((N^L - 1)^2), & F_2(3) &\leq \mathcal{O}((N^L - 1)^3), \\
F_1(4) &\geq \Omega((N^L - 1)^3), & F_2(4) &\leq \mathcal{O}((N^L - 1)^4), \\
&\;\;\vdots & &\;\;\vdots
\end{align*}

Also note that,
\begin{align*}
\lvert \mathcal{G}_h \rvert \leq \mathcal{O}(N^{hL}).
\end{align*}

Then for all h ≥ 1,
\begin{align*}
\mathbf{P}(h) \geq \frac{\Omega((N^L - 1)^{h - 1})}{\mathcal{O}(N^{hL})}. \tag{14}
\end{align*}

5.2. Expected depth $\mathbb{E}[D]$ of LCAA

Let D be a random variable that denotes the depth (row or generation) of an LCA. Then
\begin{align*}
\mathbb{E}[D] = \sum_{h = 1}^{\infty} h\, \mathbf{P}(h).
\end{align*}

Using Eqn. 14, and writing $Y = (N^L - 1)/N^L$, the sum on the right is, up to the asymptotic constants:
\begin{align*}
\frac{1}{N^L} + 2Y \frac{1}{N^L} + 3Y^2 \frac{1}{N^L} + 4Y^3 \frac{1}{N^L} + \ldots
\end{align*}

Using the identity, for |x| < 1,
\begin{align*}
1 + 2x + 3x^2 + 4x^3 + \ldots = \frac{1}{(1 - x)^2},
\end{align*}

for both models we have,
\begin{align*}
\mathbb{E}[D] \geq \Omega(N^L).
\end{align*}
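The step from the series to this bound can be written out explicitly. The following is a sketch under the assumption that Y abbreviates $(N^L - 1)/N^L$, the ratio suggested by Eqn. 14, and that the constants hidden in the Ω and O notation are ignored:
\begin{align*}
\sum_{h = 1}^{\infty} h\, Y^{h - 1} \frac{1}{N^L}
&= \frac{1}{N^L} \cdot \frac{1}{(1 - Y)^2},
\qquad \text{where } 1 - Y = \frac{1}{N^L}, \\
&= \frac{1}{N^L} \cdot N^{2L} = N^L .
\end{align*}
Since 0 ≤ Y < 1 whenever N^L ≥ 1, the identity for |x| < 1 applies.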

Back to discussion on reconstructability. Using Lemma 8, for a pedigree graph
\begin{align*}
\mathbb{E}[D] \geq \Omega(N^N).
\end{align*}

Assuming that the tree (monochromatic subgraph), of expected depth ≥ Ω(N^K), is reconstructible, we conclude that a mixed subgraph can be reconstructed when M is small. In other words, for a given set of extant sequences, the joint history of local segments along a chromosome is reconstructible.

6. Sampling The Space of Pedigree Subgraphs

Here we address the problem of random generation of the combinatorial structure of a pedigree subgraph (monochromatic or mixed), weaving the combinatorial arguments presented earlier in this article into a random-instance construction algorithm. While approximate distribution functions can be used for computational efficiency, it is important to note that the only parameters that define the random instance of the mixed subgraph are K, N, and M, and the defining parameters for a monochromatic subgraph are K and N.

6.1. Sampling algorithm

We use two parameters, g_max (the maximum number of generations) and K_min (the minimum number of units per generation), as a stopping criterion for a loop that is theoretically infinite (since an instance may not have any LCAAs by Corollary 3). The random instance is constructed by the configurations that are defined at each step. For an example of interpreting a configuration encoding, see Figure 8.

FIG. 8.

Interpreting the encodings: (1–3) The number of vertices at row g is 3 labeled as x, y, and z. The vertices at row g + 1 are labeled implicitly by the position (first, second or third).

In the pseudocode, we use the following conventions: (1) Assignment: “Y ← y” is to be interpreted as variable Y being assigned the value y. (2) Autoincrement: “++ g” is to be interpreted as the variable g being first incremented by 1 before using it. (3) Loop: “REPEAT .code. UNTIL (condition)” is to be interpreted as repeating the code until the condition is satisfied. The rest should be clear from context.

The functions $\mathcal{Q}_a(), \mathcal{Q}_b(), \mathcal{Q}_c(), \mathcal{Q}_d()$, and $\mathcal{Q}_{cdcd}()$ used in the algorithm are described in Section 6.2. Note that the output is a graph, and the algorithm encodes it in terms of configurations between adjacent rows (Fig. 8). In practice, simplifying constraints are usually imposed, such as no vertices with more than two outgoing edges (descendant edges) or two incoming edges (ascendant edges), and so on. It is also possible to use approximating distributions for efficiency purposes.

6.2. Algorithm property

Proposition 1. Algorithm 1 (resp. 2) picks a monochromatic subgraph (resp. mixed subgraph) instance with uniform probability (according to the probability distribution defined in Section 4.2) as g_max → ∞.

Proof Sketch: This follows from the descriptions of $\mathcal{Q}_{\cdot}()$ and $Q_{\cdot}()$ used in the algorithm. $\mathcal{Q}_{\cdot}()$ is a set of configurations and $Q_{\cdot}()$ is a number or “weight” associated with the set $\mathcal{Q}_{\cdot}()$ that leads to uniform sampling of the space of the instances. The sets and the associated weights are defined below:

Algorithm 1.

Monochromatic subgraph model G_PT(K,N)

1. Initializations:

g ← 0; k_g ← K (extant vertices).

2. REPEAT

(a) Pick a random value of k_g₊₁ as follows. Set k_g₊₁ ← k′ with probability

(b) Pick a random value of q_c as follows. Set q_c ← q′ with probability:

3. UNTIL (k_g₊₁ ≤ K_min) OR (⁺⁺g > g_max).
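The control flow of Algorithm 1 (initialization, the REPEAT loop, and the two stopping conditions with the pre-increment of g) can be sketched as follows. This is a hedged illustration only. The weighted draws of steps (a) and (b) are not reproduced; instead each vertex picks one of N parent labels uniformly at random, which is an assumption made here for the sake of a runnable example and carries no claim of uniform sampling over instances. The function name and the `rng` parameter are hypothetical.

```python
import random


def sample_monochromatic_rows(K, N, K_min=1, g_max=1000, rng=None):
    """Return a list of rows; rows[g][j] is the parent label (0..N-1) chosen
    for the j-th vertex of row g. ASSUMPTION: parents are drawn uniformly and
    independently, standing in for steps (a) and (b) of Algorithm 1."""
    rng = rng or random.Random()
    g = 0
    k_g = K                          # step 1: g <- 0, k_g <- K (extant vertices)
    rows = []
    while True:                      # step 2: REPEAT
        parents = [rng.randrange(N) for _ in range(k_g)]
        rows.append(parents)
        k_next = len(set(parents))   # number of vertices in row g + 1
        k_g = k_next
        if k_next <= K_min:          # step 3, first stopping condition
            break
        g += 1                       # "++g": increment before comparing
        if g > g_max:                # step 3, second stopping condition
            break
    return rows
```

Passing a seeded `random.Random` makes a run repeatable, and `g_max` bounds the loop when no single common ancestor appears.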

Algorithm 2.

Mixed subgraph model G_PGE(K,N,M)

1. Initializations:

(i) g ← 0; k_g ← K (extant vertices).

(ii) Annotate genetic material

with each of the K vertices.

2. REPEAT

(a) Pick a random value of k_g₊₁ as follows. Set k_g₊₁ ← k′ with probability:

(b) Pick a random value of q_a as follows. Set q_a ← q′ with probability:

(d) Pick a random value of gm_g₊₁ as follows. Set gm_g₊₁ ← gm′ with probability:

3. UNTIL (k_g₊₁ ≤ K_min) OR (⁺⁺g > g_max).

\begin{align*}
\mathcal{Q}_a(gm_g, k_g, k_{g+1}) &= \left\{ q \;\middle|\;
\begin{array}{l}
q = (X_1, k_{1,g+1}, X_2, k_{2,g+1}), \text{ where} \\
(1)\ X_1 = C \cup S_{3,1},\ X_2 = C \cup S_{3,2}, \\
\quad (S_1 = \{ i \mid \lvert gm_i \rvert = 1 \},\ S_2 = \{ i \mid \lvert gm_i \rvert > 1 \}, \\
\quad\ C \subset S_2 \text{ with } \lvert S_2 \rvert \geq \lvert C \rvert \geq k_{g+1} - k_g, \text{ and} \\
\quad\ S_{3,1}, S_{3,2} \text{ are partitions of } (S_2 \setminus C) \cup S_1, \text{ i.e.,} \\
\quad\ S_{3,1} \coprod S_{3,2} = (S_2 \setminus C) \cup S_1) \\
(2)\ k_{1,g+1} + k_{2,g+1} = k_{g+1}, \text{ and} \\
\quad k_{1,g+1} \leq \lvert X_1 \rvert;\ k_{2,g+1} \leq \lvert X_2 \rvert
\end{array}
\right\}, \\
\mathcal{Q}_c(X_g, k_{g+1}) &= \left\{ q \;\middle|\;
\begin{array}{l}
q = i_1 \leq i_2 \leq \ldots \leq i_{k_{g+1}}, \\
\text{with } i_1 + i_2 + \ldots + i_{k_{g+1}} = \lvert X_g \rvert \\
(\text{note } \lvert X_g \rvert \geq k_{g+1})
\end{array}
\right\}, \\
\mathcal{Q}_d(q_c \in \mathcal{Q}_c) &= \left\{ q \;\middle|\;
\begin{array}{l}
q = \{ s_1, s_2, \ldots, s_{k_{g+1}} \}, \text{ with} \\
i_k = \lvert s_k \rvert,\ s_k \subset \{1, \ldots, k_g\}, \text{ for } 1 \leq k \leq k_{g+1}, \\
\text{where } q_c = i_1 \leq i_2 \leq \ldots \leq i_{k_{g+1}}
\end{array}
\right\}, \\
\mathcal{Q}_{cdcd}(q_a \in \mathcal{Q}_a) &= \left\{ q \;\middle|\;
\begin{array}{l}
q = q_1 \cup q_2 \text{ with} \\
q_1 \in \mathcal{Q}_{cd}(X_1, k_{1,g+1}),\ q_2 \in \mathcal{Q}_{cd}(X_2, k_{2,g+1}), \\
\text{where } q_a = (X_1, k_{1,g+1}, X_2, k_{2,g+1}) \\
(X_1 \text{ has } k_{1,g+1} \text{ blue parents, } X_2 \text{ has } k_{2,g+1} \text{ red parents})
\end{array}
\right\}, \\
\mathcal{Q}_b(gm_g, q \in \mathcal{Q}_{cdcd}) &= \{ gm \mid gm \text{ is a genetic material annotation of } q \}.
\end{align*}

Further, the associated numbers of the sets (cardinalities) used in the algorithm are:
\begin{align*}
Q_c(X_g, k_{g+1}) &= \sum_{q_c \in \mathcal{Q}_c} \lvert \mathcal{Q}_d(q_c) \rvert, \\
Q_{cdcd}(q_a \in \mathcal{Q}_a) &= \sum_{q_{cdcd} \in \mathcal{Q}_{cdcd}} \lvert \mathcal{Q}_b(q_{cdcd}) \rvert, \\
Q_a(gm_g, k_g, k_{g+1}) &= \sum_{q_a \in \mathcal{Q}_a} Q_{cdcd}(q_a).
\end{align*}

(See Examples 3–8 for illustrations of the combinatorial sets.)

6.3. Constructing local configurations between rows g and g + 1

Given a monochromatic model or a mixed model, Q(k_g,k_g₊₁) is the number of distinct vertex-labeled and gm-annotated configurations between rows g with k_g vertices and g + 1 with k_g₊₁ vertices. The following gives illustrative examples for all the sets and functions involved in their enumeration or computation.

6.3.1. Vertices in row g with 1 parent

Consider two adjacent rows g and g + 1 with k_g vertices in row g. When a vertex always has exactly one parent, the vertices in row g + 1 can be considered to be monochromatic (for computational purposes).

Example 3

(on $\mathcal{Q}_c$). Let X_k be such that |X_k| = k.
\begin{align*}
\mathcal{Q}_c(X_4, 1) &= \{ 4 \}, \\
\mathcal{Q}_c(X_4, 2) &= \{ 1 \leq 3,\ \ 2 \leq 2 \}, \\
\mathcal{Q}_c(X_4, 3) &= \{ 1 \leq 1 \leq 2 \}, \\
\mathcal{Q}_c(X_4, 4) &= \{ 1 \leq 1 \leq 1 \leq 1 \}.
\end{align*}
\begin{align*}
\mathcal{Q}_c(X_6, 1) &= \{ 6 \}, \\
\mathcal{Q}_c(X_6, 2) &= \{ 1 \leq 5,\ \ 2 \leq 4,\ \ 3 \leq 3 \}, \\
\mathcal{Q}_c(X_6, 3) &= \{ 1 \leq 1 \leq 4,\ \ 1 \leq 2 \leq 3,\ \ 2 \leq 2 \leq 2 \}, \\
\mathcal{Q}_c(X_6, 4) &= \{ 1 \leq 1 \leq 1 \leq 3,\ \ 1 \leq 1 \leq 2 \leq 2 \}, \\
\mathcal{Q}_c(X_6, 5) &= \{ 1 \leq 1 \leq 1 \leq 1 \leq 2 \}, \\
\mathcal{Q}_c(X_6, 6) &= \{ 1 \leq 1 \leq 1 \leq 1 \leq 1 \leq 1 \}.
\end{align*}
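The elements of $\mathcal{Q}_c(X_g, k_{g+1})$ are the partitions of the integer |X_g| into exactly k_{g+1} positive parts, written in non-decreasing order. A minimal sketch of their enumeration follows; the function name and the tuple representation are choices made here for illustration, not part of the original algorithm.

```python
def partitions_into_parts(n, k, smallest=1):
    """Yield non-decreasing tuples (i_1 <= ... <= i_k) of positive integers
    that sum to n, each part being at least `smallest`."""
    if k == 1:
        if n >= smallest:
            yield (n,)
        return
    # The first part is at most n // k; otherwise the remaining k - 1 parts,
    # each at least as large, would overshoot n.
    for first in range(smallest, n // k + 1):
        for rest in partitions_into_parts(n - first, k - 1, first):
            yield (first,) + rest
```

Here `list(partitions_into_parts(6, 3))` lists the three partitions shown above for $\mathcal{Q}_c(X_6, 3)$.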

Set $\mathcal{Q}_d$. For partition $q_c = i_1 \leq i_2 \leq \ldots \leq i_{k_{g+1}}$, let k be the number of distinct values with each repeated $l_1, l_2, \ldots, l_k$ times. Then the cardinality of set $\mathcal{Q}_d(\cdot)$ is:
\begin{align*}
\lvert \mathcal{Q}_d(q_c) \rvert = \binom{k_g}{i_1} \binom{k_g - i_1}{i_2} \binom{k_g - (i_1 + i_2)}{i_3} \cdots \binom{k_g - (i_1 + \ldots + i_{k_{g+1} - 2})}{i_{k_{g+1} - 1}} \frac{1}{l_1! \times l_2! \times \ldots \times l_k!}. \tag{15}
\end{align*}
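Eqn. 15 can be evaluated directly. The sketch below assumes that the parts of q_c sum to k_g (as in the monochromatic case). It also multiplies in the final binomial factor, which always equals 1 and is therefore omitted from Eqn. 15. The function name is hypothetical.

```python
from collections import Counter
from math import comb, factorial


def size_Qd(q_c, k_g):
    """|Q_d(q_c)| following Eqn. 15; q_c is a tuple of part sizes summing to k_g."""
    if sum(q_c) != k_g:
        raise ValueError("the parts of q_c must sum to k_g")
    count, remaining = 1, k_g
    for part in q_c:
        count *= comb(remaining, part)      # choose the members of this block
        remaining -= part
    for multiplicity in Counter(q_c).values():
        count //= factorial(multiplicity)   # equal-sized blocks are unordered
    return count
```

With k_g = 6, `size_Qd((1, 1, 2, 2), 6)` evaluates to 6 · 5 · 6 / (2! · 2!) = 45. Summing `size_Qd` over the two partitions in $\mathcal{Q}_c(X_4, 2)$ gives 4 + 3 = 7.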

Example 4

To avoid clutter, the set-of-sets notation is simplified: for example, the set
\begin{align*}
s = \{ \{u\}, \{v\}, \{w, x\}, \{y, z\} \}
\end{align*}

is written simply as
\begin{align*}
s = \text{u v wx yz}.
\end{align*}

Then the 45 elements of $\mathcal{Q}_d(q_c = 1 \leq 1 \leq 2 \leq 2)$ are as follows:
\begin{align*}
\mathcal{Q}_d(q_c) = \left\{ \begin{array}{lllll}
\text{u v wx yz,} & \text{u w vx yz,} & \text{u x vw yz,} & \text{u y vw xz,} & \text{u z vw xy,} \\
\text{u v wy xz,} & \text{u w vy xz,} & \text{u x vy wz,} & \text{u y vx wz,} & \text{u z vx wy,} \\
\text{u v wz xy,} & \text{u w vz xy,} & \text{u x vz wy,} & \text{u y vz wx,} & \text{u z vy wx,} \\
\text{v w ux yz,} & \text{v x uw yz,} & \text{v y uw xz,} & \text{v z uw xy,} & \text{w x uv yz,} \\
\text{v w uy xz,} & \text{v x uy wz,} & \text{v y ux wz,} & \text{v z ux wy,} & \text{w x uy vz,} \\
\text{v w uz xy,} & \text{v x uz wy,} & \text{v y uz wx,} & \text{v z uy wx,} & \text{w x uz vy,} \\
\text{w y uv xz,} & \text{w z uv xy,} & \text{x y uv wz,} & \text{x z uv wy,} & \text{y z uv wx,} \\
\text{w y ux vz,} & \text{w z ux vy,} & \text{x y uw vz,} & \text{x z uw vy,} & \text{y z uw vx,} \\
\text{w y uz vx,} & \text{w z uy vx,} & \text{x y uz vw,} & \text{x z uy vw,} & \text{y z ux vw}
\end{array} \right\}.
\end{align*}
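The list above can be reproduced mechanically. The following is a minimal sketch, assuming that an element of Q_d(q_c) is a partition of the label set {u, v, w, x, y, z} into unordered blocks whose sizes are given by q_c, with blocks of equal size treated as interchangeable. The function names `unordered_configurations` and `show` are hypothetical and are not taken from the article.

```python
from itertools import combinations


def unordered_configurations(labels, block_sizes):
    """Yield partitions of `labels` into blocks with the given sizes.

    `block_sizes` must be sorted so that equal sizes are adjacent. Blocks of
    equal size are interchangeable, so they are forced into increasing order
    and each partition is produced exactly once.
    """
    def place(remaining, sizes, previous):
        if not sizes:
            yield ()
            return
        size, rest = sizes[0], sizes[1:]
        for block in combinations(remaining, size):
            # Skip orderings that only permute two equal-size blocks.
            if previous is not None and len(previous) == size and block < previous:
                continue
            left = tuple(label for label in remaining if label not in block)
            for tail in place(left, rest, block):
                yield (block,) + tail

    yield from place(tuple(sorted(labels)), tuple(block_sizes), None)


def show(config):
    """Render a configuration in the simplified notation, e.g. 'u v wx yz'."""
    return " ".join("".join(block) for block in config)


configs = [show(c) for c in unordered_configurations("uvwxyz", (1, 1, 2, 2))]
print(len(configs))
print(configs[0])
```

The count factors as 15 choices for the two singleton blocks times 3 ways to pair up the remaining four labels, which agrees with the 45 elements listed.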

Wt( · ) function. See Eqn. 3.

Example 5

(on the wt(k) function). Let k_g = 3 and X = {x,y,z}. Let q_c = 1 ≤ 2, with q_d = ({x},{y,z}), i.e., k_{g+1} = 2. Let N = 4. Then the number of possible configurations is $$wt(2) = \binom{4}{2} = 6.$$

The weight 6 corresponds to the 6 configurations shown below, with the labels of row g + 1 implicit in their position (first, second, third, or fourth):
$$\begin{array}{|l|l|}
\text{x yz - -} & \text{- x yz -} \\
\text{x - yz -} & \text{- x - yz} \\
\text{x - - yz} & \text{- - x yz}
\end{array}$$

Note that for q_d = ({y, z}, {x}), the 6 possibilities are:
$$\begin{array}{|l|l|}
\text{yz x - -} & \text{- yz x -} \\
\text{yz - x -} & \text{- yz - x} \\
\text{yz - - x} & \text{- - yz x}
\end{array}$$
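The weight in Example 5 can be checked by direct enumeration. The sketch below assumes, as the arithmetic of the example suggests, that wt(k) counts the ways of choosing k of the N positions in row g + 1 for the k ordered blocks of q_d, keeping their left-to-right order. The helper name `placements` is hypothetical.

```python
from itertools import combinations
from math import comb


def placements(blocks, n):
    """Place the ordered `blocks` into `n` positions of row g + 1.

    The left-to-right order of the blocks is preserved; unused positions
    are rendered as '-'.
    """
    rows = []
    for slots in combinations(range(n), len(blocks)):
        row = ["-"] * n
        for slot, block in zip(slots, blocks):
            row[slot] = "".join(block)
        rows.append(" ".join(row))
    return rows


rows_x_first = placements([("x",), ("y", "z")], 4)   # q_d = ({x}, {y, z})
rows_yz_first = placements([("y", "z"), ("x",)], 4)  # q_d = ({y, z}, {x})

for row in rows_x_first:
    print(row)
```

Each ordering of q_d yields comb(4, 2) = 6 distinct rows, one for each pair of occupied positions.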

6.3.2. Vertices in row g with 1 or 2 parents.

Example 6

Thus S₁ = {y}, S₂ = {x,z}. The cardinality of C must satisfy the following: $$\left(|S_2| = 2\right) \geq |C| \geq \left(k_{g+1} - k_g = 1\right).$$

Thus the three possible C's are: C₁ = {x}, C₂ = {z}, C₃ = {x,z}.

Case C₁ = {x}: Let S₃ = (S₂ \ C) ∪ S₁ = {y,z}. The possibilities are:
\begin{align*}
S_{3,1} = \emptyset, \; S_{3,2} = \{y,z\}, & \quad X_1 = \{x\}, \; X_2 = \{x,y,z\}, \quad k_{1,g+1} = 1, \; k_{2,g+1} = 3. \\
S_{3,1} = \{y,z\}, \; S_{3,2} = \emptyset, & \quad X_1 = \{x,y,z\}, \; X_2 = \{x\}, \quad k_{1,g+1} = 3, \; k_{2,g+1} = 1. \\
S_{3,1} = \{y\}, \; S_{3,2} = \{z\}, & \quad X_1 = \{x,y\}, \; X_2 = \{x,z\}, \quad k_{1,g+1} = 2, \; k_{2,g+1} = 2. \\
S_{3,1} = \{z\}, \; S_{3,2} = \{y\}, & \quad X_1 = \{x,z\}, \; X_2 = \{x,y\}, \quad k_{1,g+1} = 2, \; k_{2,g+1} = 2.
\end{align*}

Case C₂ = {z}: Let $S_3 = (S_2 \setminus C) \cup S_1 = \{x,y\}$. The possibilities are:
\begin{align*}
S_{3,1} = \emptyset, \; S_{3,2} = \{y,x\}, & \quad X_1 = \{z\}, \; X_2 = \{x,y,z\}, \quad k_{1,g+1} = 1, \; k_{2,g+1} = 3. \\
S_{3,1} = \{y,x\}, \; S_{3,2} = \emptyset, & \quad X_1 = \{x,y,z\}, \; X_2 = \{z\}, \quad k_{1,g+1} = 3, \; k_{2,g+1} = 1. \\
S_{3,1} = \{y\}, \; S_{3,2} = \{x\}, & \quad X_1 = \{y,z\}, \; X_2 = \{x,z\}, \quad k_{1,g+1} = 2, \; k_{2,g+1} = 2. \\
S_{3,1} = \{x\}, \; S_{3,2} = \{y\}, & \quad X_1 = \{x,z\}, \; X_2 = \{y,z\}, \quad k_{1,g+1} = 2, \; k_{2,g+1} = 2.
\end{align*}

Case C₃ = {x, z}: Let S₃ = (S₂ \ C) ∪ S₁ = {y}. The possibilities are:
\begin{align*}
S_{3,1} = \emptyset, \; S_{3,2} = \{y\}, \quad X_1 = \{x,z\}, \; X_2 = \{x,y,z\}, \quad & k_{1,g+1} = 1, \; k_{2,g+1} = 3, \\
\text{or} \quad & k_{1,g+1} = 2, \; k_{2,g+1} = 2. \\
S_{3,1} = \{y\}, \; S_{3,2} = \emptyset, \quad X_1 = \{x,y,z\}, \; X_2 = \{x,z\}, \quad & k_{1,g+1} = 3, \; k_{2,g+1} = 1, \\
\text{or} \quad & k_{1,g+1} = 2, \; k_{2,g+1} = 2.
\end{align*}

Thus for the gm defined in this example, $|\mathcal{Q}_e(gm, 3, 4)| = 12$.
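The count of 12 can be reproduced by enumerating the three cases. The sketch below encodes one reading of the construction that is inferred from the rows listed in Example 6, not stated explicitly in this section: C ranges over subsets of S₂ with |C| ≥ k_{g+1} − k_g, S₃ is split into an ordered pair (S₃,₁, S₃,₂), X_i = C ∪ S₃,ᵢ, and the pair (k_{1,g+1}, k_{2,g+1}) must sum to k_{g+1} with 1 ≤ k_{i,g+1} ≤ |X_i|. The function names are hypothetical.

```python
from itertools import combinations


def subsets(items, min_size=0):
    """Yield all subsets of `items` with at least `min_size` elements."""
    items = sorted(items)
    for size in range(min_size, len(items) + 1):
        for chosen in combinations(items, size):
            yield set(chosen)


def example6_cases(s1, s2, k_next):
    """Enumerate (C, X1, X2, k1, k2) under the assumed reading of Example 6.

    s1: vertices of row g with one parent; s2: vertices with two parents;
    k_next: number of occupied vertices k_{g+1} in row g + 1.
    """
    k_g = len(s1) + len(s2)
    cases = []
    for c in subsets(s2, min_size=max(k_next - k_g, 0)):
        s3 = (s2 - c) | s1
        for s31 in subsets(s3):          # ordered split (S31, S32) of S3
            s32 = s3 - s31
            x1, x2 = c | s31, c | s32
            for k1 in range(1, len(x1) + 1):
                k2 = k_next - k1
                if 1 <= k2 <= len(x2):   # assumed constraint on k2
                    cases.append((sorted(c), sorted(x1), sorted(x2), k1, k2))
    return cases


cases = example6_cases(s1={"y"}, s2={"x", "z"}, k_next=4)
for case in cases:
    print(case)
```

Under these assumptions each of the three choices of C contributes four cases, matching the four rows listed per case above.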

Example 7

Thus, of the three vertices labeled x, y, z in some row g, only vertex y has one parent and the other two have two parents each in row g + 1. In the following we use the simplified notation of the configuration of Example 4. Then $q \in \mathcal{Q}_{cdcd}(q_a)$ with $$q = q_1 \cup q_2 = (\text{x z x yz}),$$

where
\begin{align*}
q_1 &= (\text{x z}) \in \mathcal{Q}_{cd}(X_1, 2), \\
q_2 &= (\text{x yz}) \in \mathcal{Q}_{cd}(X_2, 2).
\end{align*}

Set $\mathcal{Q}_b$. Recall that each configuration of the last step (with X₁ and X₂) implicitly assigns the number of ascendant edges on each vertex v in row g. If vertex v at row g has only one ascendant (incoming edge) e, then gm(e) is assigned gm_{g,v}. If vertex v at row g has two ascendants e₁ and e₂, then the segments of gm_{g,v} are split into two nonoverlapping subsets gm(e₁) and gm(e₂) with gm_{g,v} = gm(e₁) ∪ gm(e₂). Let $\mathcal{Q}_b$ denote all the distinct possibilities (as shown in the example below). $$\mathcal{Q}_b(q \in \mathcal{Q}_{cdcd}) = \{ gm \mid gm \text{ is the genetic material annotation for } q \}. \tag{16}$$
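The splitting step for a vertex with two ascendants can be illustrated by enumeration. The sketch below assumes that both subsets gm(e₁) and gm(e₂) are nonempty and that the order of the two edges matters; the segment labels `s1`, `s2`, `s3`, the function names, and the contiguity test are all hypothetical illustrations rather than the article's definitions.

```python
def two_parent_splits(segments):
    """Ordered splits of `segments` into two nonempty, disjoint subsets."""
    segs = list(segments)
    splits = []
    # Each bitmask selects the segments carried by the first edge; the
    # all-zeros and all-ones masks are skipped so both subsets are nonempty.
    for mask in range(1, 2 ** len(segs) - 1):
        first = [s for i, s in enumerate(segs) if (mask >> i) & 1]
        second = [s for i, s in enumerate(segs) if not ((mask >> i) & 1)]
        splits.append((first, second))
    return splits


def is_contiguous(part, segments):
    """True if `part` occupies consecutive positions within `segments`."""
    positions = sorted(segments.index(s) for s in part)
    return positions[-1] - positions[0] + 1 == len(positions)


segs = ["s1", "s2", "s3"]
splits = two_parent_splits(segs)
contiguous = [sp for sp in splits
              if is_contiguous(sp[0], segs) and is_contiguous(sp[1], segs)]
interleaved = [sp for sp in splits if sp not in contiguous]
print(len(splits), len(contiguous), len(interleaved))
```

For three segments this gives six splits, four in which both subsets are contiguous and two in which one subset is a middle segment. That is at least consistent with the grouping (1)–(4) and (5)–(6) mentioned in Example 8, although the figure would be needed to confirm the correspondence.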

Example 8

The different possibilities of genetic material assignment to the vertices with two parents are shown here.

Cases (1)–(4) model a recombination event and cases (5)–(6) model a gene conversion event at the node labeled x. The (unique) gm_{g+1} annotation for case (1) is shown in (c), with the vertices in row g + 1 labeled {t,u,v} (for convenience, edges incident on blue parents are shown in blue in (a) and edges incident on red parents are shown in red in (b)).

Note that handling blue and red parents separately ensures that there are no forbidden structures (Fig. 2) between rows g and g + 1.

7. Discussion

In conclusion, we have presented a unified random graphs framework to study pedigree history, with a focus on unilinear transmission trees and biparental transmission ARGs, the two interesting mathematical objects in this context. In the unified framework, the two are modeled as monochromatic and mixed subgraphs, respectively, of the pedigree graph. In the former, each vertex has no more than one parent, with all vertices of the same color (gender). In the latter, each vertex has one or two parents, dictated by genetic exchange and the subsequent flow of the genetic material through the nodes to the extant units. One interesting consequence of this approach is a purely topological definition of the GMRCA (called LCAA, by analogy with the graph-theoretic LCA). This is the first time that an ARG as well as the GMRCA are given a graph-theoretic description, together with an alternative parametrization of the ARG. The article also identifies a natural measure space, which then helps estimate the expected depth of an LCAA in a pedigree graph, ARG, or unilinear transmission tree.

The sampling algorithm presented in this article is a straightforward, direct application of the ideas used in defining the measurable space. It would be interesting to adapt these ideas to a coalescent (Hudson, 1990; Kingman, 1982) version of the algorithm that continues to guarantee uniform sampling of the ARG space. Another direction is the incorporation of selection, random environment variables, or other such (biasing) dynamics into the ideal population. The effect of these on the topology (including LCAA depth) and on the scattering patterns of the genetic material over the vertices in the subgraph models, as well as the design of sampling algorithms under these conditions, are interesting directions for exploration.

8. Appendix: A Mixed Subgraph Model

See Section 4 for the definitions used here. Let $\mathcal{S}_{k_0, k_1, \ldots, k_h} \subseteq \mathcal{Q}_h(k_0, k_1, \ldots, k_h)$, then define two functions as:
\begin{align*}
Q_e(gm, \mathcal{S}_{k_0, k_1, \ldots, k_h}) &= |\{ q \in \mathcal{S}_{k_0, k_1, \ldots, k_h} \mid q \text{ is a configuration with row } h \text{ annotated by } gm \}|, \\
Q_f(gm, \mathcal{S}_{k_0, k_1, \ldots, k_h}) &= |\{ q \in \mathcal{S}_{k_0, k_1, \ldots, k_h} \mid q \text{ is a configuration with row } 0 \text{ annotated by } gm \}|.
\end{align*}

Note that when M = 1 (monochromatic model), for all possible values of $k_0, k_1, \ldots, k_h$ and gm, $$Q_e(gm, \mathcal{S}_{k_0, k_1, \ldots, k_h}) = Q_f(gm, \mathcal{S}_{k_0, k_1, \ldots, k_h}).$$

For $1 \leq k_0, k_1, \ldots, k_{h+1} \leq 2N$ and $\mathcal{S}_{k_0, k_1, \ldots, k_h}$ as above, we define $$\mathcal{S}_{k_0, k_1, \ldots, k_h} \odot \mathcal{Q}(k_h, k_{h+1}) = \sum_{gm \in \mathcal{Q}_0(k_h)} Q_e(gm, \mathcal{S}_{k_0, k_1, \ldots, k_h}) \, Q_f(gm, \mathcal{Q}_{k_h, k_{h+1}}).$$

Then

Lemma 13

Thus for the mixed model, $$F(g+1, k) = \sum_{i=1}^{L} \mathcal{F}(g, i) \odot \mathcal{Q}(i, k) \leq \sum_{i=1}^{L} F(g, i) \, Q(i, k). \tag{17}$$

Also (see Eqn 13),
\begin{align*}
T(g) &= \frac{\mathcal{F}(g, L) \odot \mathcal{Q}(L, 1) + \ldots + \mathcal{F}(g, 1) \odot \mathcal{Q}(1, 1)}{\sum_{i=1}^{L} \mathcal{F}(g, L) \odot \mathcal{Q}(L, i) + \ldots + \sum_{i=1}^{L} \mathcal{F}(g, 1) \odot \mathcal{Q}(1, i)} \\
&\geq \frac{F(g, L) + \ldots + F(g, 1)}{\mathcal{O}(N^L) \left( F(g, L) + \ldots + F(g, 1) \right)}. \tag{18}
\end{align*}
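The inequality in Eqn. 17 can be illustrated on toy numbers. The sketch below assumes that F(g, i) and Q(i, k) are the total counts obtained by summing Q_e and Q_f, respectively, over all annotations gm, which is a reading suggested by the definitions above rather than stated in this appendix. The annotation labels and counts are invented purely for illustration, and `odot` is a hypothetical helper name.

```python
def odot(q_e, q_f):
    """Sum over annotations gm of Q_e(gm) * Q_f(gm).

    q_e maps each annotation of the shared row to the number of earlier
    configurations ending in it; q_f maps each annotation to the number of
    next-step configurations starting from it.
    """
    return sum(count * q_f.get(gm, 0) for gm, count in q_e.items())


# Toy counts (assumed, not from the article): annotations are plain labels.
q_e = {"a": 2, "b": 1, "c": 3}
q_f = {"a": 1, "b": 4, "d": 5}

matched = odot(q_e, q_f)                                # 2*1 + 1*4 + 3*0
unmatched = sum(q_e.values()) * sum(q_f.values())       # product of totals

# With a single possible annotation, matching is automatic and the two agree.
single = odot({"a": 3}, {"a": 4})
print(matched, unmatched, single)
```

Because every term is nonnegative, the matched sum can never exceed the product of the totals; the two coincide when only one annotation is possible, which mirrors the remark that the distinction disappears in the monochromatic model (M = 1).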

Acknowledgments

This work would not have been possible without the indulgence and support of experts from different fields. I am thankful to Jaume Bertranpetit (and colleagues) for sowing the seeds of skepticism that motivated the study. I am indebted to Jotun Hein, Paul Marjoram, and Carsten Wiuf for useful discussions on the population models and for helping distill my thoughts on their stochasticity and implications. I am very grateful to David Sankoff and Saugata Basu for enlightening discussions on measure theory. Also, many thanks go to Asif Javed and Ajay Royyuru for reading a preliminary version of the manuscript.

Disclosure Statement

No competing financial interests exist.

References

1. Bürger. 2000. The Mathematical Theory of Selection, Recombination, and Mutation. Wiley: New York.

2. Cormen T.H., Leiserson C.E., Rivest R.L. 1990. Introduction to Algorithms. MIT Press: Cambridge, MA.

3. Davies J.L., Simančík, Lyngsø, et al. 2007. On recombination-induced multiple and simultaneous coalescent events. Genetics, 177:2151–2160.

4. Følner. 1955. On groups with full Banach mean value. Math. Scand., 3:243–254.

5. Gabriel S.B., Schaffner S.F., Nguyen, et al. 2002. The structure of haplotype blocks in the human genome. Science, 296:2225–2229.

6. Griffiths R.C., Marjoram. 1997. An ancestral recombinations graph, 257–270. In Donnelly, Tavare, eds. Progress in Population Genetics and Human Evolution. IMA Volumes in Mathematics and its Applications, 87.

7. Griffiths R.C. 1999. The time to the ancestor along sequences with recombination. Theoret. Popul. Biol., 55:137–144.

8. Gusfield, Bansal, Bafna, et al. 2007. A decomposition theory for phylogenetic networks and incompatible characters. J. Comput. Biol., 14:1247–1272.

9. Hein, Schierup M.H., Wiuf. 2005. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford Press: New York.

10. Hudson R.R. 1983. Properties of a neutral allele model with intragenic recombination. Theoret. Popul. Biol., 23:183–201.

11. Hudson R.R. 1990. Gene Genealogies and the Coalescent Process. Oxford Surveys in Evolutionary Biology. Oxford University Press: Oxford, UK.

12. Jobling M.A., Hurles, Tyler-Smith. 2004. Human Evolutionary Genetics: Origins, Peoples and Disease. Mathematical and Computational Biology Series. Garland Publishing: New York.

13. Kimura, Crow J.F. 1964. The number of alleles that can be maintained in a finite population. Genetics, 49:725–738.

14. Kimura. 1969. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61:893–903.

15. Kingman J.F.C. 1982. On the genealogy of large populations. J. Appl. Probabil., 19A:27–43.

16. Mele, Javed, Calafell, et al. 2009. Recombination-based genomics: a genetic variation analysis in human populations.

17. Parida. 2007. Pattern Discovery in Bioinformatics: Theory and Algorithms. Chapman Hall: New York.

18. Parida, Melé, Calafell, et al. 2008. Estimating the ancestral recombinations graph (ARG) as compatible networks of SNP patterns. J. Comput. Biol., 15:1–22.

19. Parida, Javed, Mele, et al. 2009. Minimizing recombinations in consensus networks for phylogeographic studies. BMC Bioinform., 10:S72.

20. Schierup M.H., Hein. 2000. Consequences of recombination on traditional phylogenetic analysis. Genetics, 156:879–891.

21. Steel, Hein. 2006. Reconstructing pedigrees: a combinatorial perspective. J. Theoret. Biol., 240:360–367.

22. Wiuf, Hein. 1999a. Recombination as a point process along sequences. Theoret. Popul. Biol., 55:248–259.

23. Wiuf, Hein. 1999b. The ancestry of a sample of sequences subject to recombination. Genetics, 151:1217–1228.
