On the Number of Saturated and Optimal Extended 2-Regular Simple Stacks in the Nussinov

Abstract

It is known that both RNA secondary structures and protein contact maps can be represented by combinatorial diagrams, whose enumeration and related problems have been studied extensively. Motivated by previous enumeration work on saturated RNA secondary structures and extended stack structures of protein contact maps, we study the enumeration of saturated and optimal extended stacks in the Nussinov–Jacobson energy model, in which each base pair contributes energy −1. In this model, optimal structures are those with the most arcs, and locally optimal structures are exactly the saturated structures, in which no arc can be added without violating the structure definition. For saturated extended 2-regular simple stacks, whose degree configuration is related to protein folds in the two-dimensional honeycomb lattice, we obtain a generating function equation and an asymptotic formula for their number. Moreover, we obtain an explicit formula for the number of optimal extended 2-regular simple stacks.

1. INTRODUCTION

It is well known that the function of RNA depends on its 3D structure and dynamics, and the RNA 3D structure is largely determined by the secondary structure (Banerjee et al., 1993). The case of protein is analogous, and the contact map of protein can be seen as a counterpart of the secondary structure of RNA. Contact plays a fundamental role in the well-known hydrophobic-hydrophilic (HP) model for protein folding (Dill, 1990), and recently, contact map also plays an extremely important role in the high accuracy prediction of protein structure through the deep learning method (Xu, 2019; Yang et al., 2020).

Since the pioneering work of Waterman and Schmitt (1978) introduced combinatorial theory and methods into computational molecular biology, the combinatorial enumeration of various RNA secondary structures and protein contact maps has attracted extensive study. Schmitt and Waterman (1994) provided an explicit formula for the number of RNA secondary structures on n vertices and k arcs by establishing a bijection between RNA secondary structures and linear trees. Nebel (2002) derived the generating function of m-regular RNA secondary structures by using binary trees and the Horton–Strahler number. Clote (2006) obtained recurrence relations and asymptotic formulas for combinatorial problems related to the number of saturated secondary structures.

Jin et al. (2008) derived recursion formulas for 3-noncrossing RNA structures. A collection of new combinatorial and computational approaches to the study of RNA structures with pseudoknots was presented in a monograph (Reidys, 2011).

Recently, enumerative combinatorics has also made progress in the study of protein contact map. Based on the classic HP model (Dill, 1990), protein fold can be considered as a self-avoiding walk in some lattice model. Lattice models usually retain important features of protein structures, and enable us to focus on dominant aspects of the structure. Goldman et al. (1999) showed that for any protein fold in a two-dimensional (2D) square lattice self-avoiding walk HP model, the contact map can be decomposed into (at most) two stacks and one queue. Actually, this decomposition also holds for the case of 2D honeycomb lattice, which will be shown in Section 2. Agarwal et al. (2007) found a similar decomposition result for the contact maps of protein folds in the 3D cubic lattice self-avoiding walk HP model.

In combinatorial terms, a stack is a noncrossing diagram and a queue is a non-nesting diagram; they are the two elementary structures of protein contact map. When folding a protein in the lattice model, different lattice models will introduce different degree and arc length constraints to the corresponding contact map. For instance, on the 2D square lattice, each internal vertex in the contact map has maximum degree 2, while the two terminal vertices can have maximum degree 3, and the arc length is at least 3. On the 2D honeycomb lattice, the degree of each internal vertex and terminal vertex is at most 1 and 2, respectively, while the length of each arc is at least 5. For a survey of various lattice models used for protein folding see Pierri et al. (2008). A stack with arc length at least m is called m-regular, and a stack with degree of each vertex bounded by one is called simple. Clearly, RNA secondary structures can be viewed as 2-regular simple stacks.

Istrail and Lam (2009) proposed the question concerning generalizations of the Schmitt–Waterman counting formulas for RNA secondary structures (Schmitt and Waterman, 1994) to enumerating protein stacks and queues, and they pointed out that the enumeration of stacks and of queues could provide insights into computing rigorous approximations of the partition function of protein folding in HP models.

To attack this question, Chen et al. (2014) proposed the primary decomposition method to study the combinatorial enumeration of m-regular (with arc length at least m) linear stack (with the degree of each vertex bounded by two), which is a generalization of the classic RNA secondary structure, and got enumeration results in the form of generating function equation, recurrence relation, and asymptotic formula. Furthermore, combinatorial enumeration results were obtained for extended (with the degree of two terminal vertices bounded by three) m-regular linear stacks (Guo et al., 2016), m-regular linear stacks with n vertices and k arcs (Guo and Sun, 2018), and 2-regular and 3-regular simple (with the degree of each vertex bounded by one) queues (Guo et al., 2017).

The RNA and protein folding problems can be formulated as energy minimization problems in some energy model, for example, in the classic Nussinov–Jacobson free energy model (Nussinov and Jacobson, 1980), the energy function is the negative of the number of base pairs (for RNA) or contacts (for protein), and the structures with minimum energy are called optimal. Another concept closely related to the optimal structure is saturated structure, which means no base pairs (contacts) can be added without violating the definition of the structure (Clote, 2006). Saturated structures are actually local minima in the energy landscape. The combinatorial problems related to the number of saturated RNA secondary structures have attracted much interest (e.g., Clote, 2006; Clote et al., 2009, 2007; Fusy and Clote, 2014).

Stimulated by the enumeration work on saturated and optimal RNA secondary structures and on extended stacks of protein contact maps, we are interested in enumerating saturated and optimal extended 2-regular simple stacks, in which the degrees of the two terminal vertices are bounded by two and the degrees of the internal vertices are bounded by one. This degree configuration is related to the protein fold in the 2D honeycomb lattice. The honeycomb lattice alleviates the “sharp turn” problem and models certain aspects of the protein secondary structure more realistically with reduced combinatorial complexity (Jiang and Zhu, 2005; Guo et al., 2018). Denote the number and the generating function of all saturated extended 2-regular simple stacks with n vertices by $s (n)$ and $S (x)$ , respectively. Based on the techniques of Clote (2006) and the idea of the structure decomposition as given in Chen et al. (2014), we obtain the following equation satisfied by $S (x)$ .

Theorem 1. We have $p_{3} (x) S^{3} (x) + p_{2} (x) S^{2} (x) + p_{1} (x) S (x) + p_{0} (x) = 0 .$ (1)

where

Based on Equation (1), we furthermore obtain the following asymptotic formula for $s (n)$ through the singularity analysis method: $s (n) \sim 2.818464011 \times 2.354673605^{n} \cdot n^{- \frac{3}{2}} .$

The explicit expression for $s o (n)$ , the number of optimal extended 2-regular simple stacks with n vertices, is also obtained.

Theorem 2. We have for $n \geq 5$ , $s o (n) = \begin{cases} \frac{1}{12} (n^{3} - 3 n^{2} - 7 n + 69), & \text{if } n \text{ is odd}, \\ n - 3, & \text{if } n \text{ is even}. \end{cases}$ (2)

This article is organized as follows. In Section 2, we give basic definitions and notations, together with some previous results on which this article builds. In Section 3, we study the combinatorial enumeration of saturated extended 2-regular simple stacks. Finally, Section 4 focuses on the number of optimal extended 2-regular simple stacks.

2. BASIC DEFINITIONS AND NOTATIONS

First, we recall the definition of RNA secondary structure and protein contact map in a combinatorial way. A secondary structure on RNA sequence $r = r_{1} r_{2} \dots r_{n}$ is a set of ordered base pairs $(i, j)$ with the following three properties: (1) Each vertex can be bonded to at most one other vertex; (2) If r_i and r_j are bonded, then $j - i \geq m$ , where m means the minimum length of two bonded vertices; (3) If r_i and r_j are bonded, then any bonding of $r_{k} (i < k < j)$ must be with a vertex between r_i and r_j.

For the case of protein, when two nonconsecutive amino acid residues in a protein fold come very close to each other, say, closer than a predetermined threshold, they presumably form some kind of bond, which is called a contact. A protein contact map on protein sequence $p = p_{1} p_{2} \dots p_{n}$ is a set of ordered contact pairs $(i, j)$ , where p_i and p_j form a contact in the 3D structure of p.

In combinatorics, both RNA secondary structure and protein contact map can be presented by a diagram, that is, drawing vertices $1, 2, \dots, n$ on a horizontal line in an increasing order and connecting two vertices by an arc if they are bonded (for RNA) or in contact (for protein). In the diagram presentation, for any two arcs $(i, j)$ and $(k, l)$ , if $i < k < l < j$ , we say that they form a nesting; if $i < k < j < l$ , we say that they form a crossing. A noncrossing diagram is called a stack, and a non-nesting diagram is called a queue.
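The nesting and crossing conditions above amount to a pairwise test on arcs. The following Python sketch is illustrative only: the helper names `is_crossing`, `is_nesting`, and `classify` are hypothetical and do not come from the cited works, and arcs are assumed to be given as pairs $(i, j)$ with $i < j$.

```python
from itertools import combinations


def is_crossing(a, b):
    """True if arcs a and b satisfy i < k < j < l (after ordering them)."""
    (i, j), (k, l) = sorted([a, b])
    return i < k < j < l


def is_nesting(a, b):
    """True if arcs a and b satisfy i < k < l < j (after ordering them)."""
    (i, j), (k, l) = sorted([a, b])
    return i < k < l < j


def classify(arcs):
    """Report whether a diagram is a stack (noncrossing) and/or a queue (non-nesting)."""
    pairs = list(combinations(arcs, 2))
    return {
        "stack": not any(is_crossing(a, b) for a, b in pairs),
        "queue": not any(is_nesting(a, b) for a, b in pairs),
    }


print(classify([(1, 4), (2, 3)]))  # one nesting pair
print(classify([(1, 3), (2, 4)]))  # one crossing pair
```

Note that two arcs sharing an endpoint, such as $(1, 3)$ and $(1, 5)$, form neither a nesting nor a crossing under these strict inequalities.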

Stacks and queues are the two elementary structures of a protein contact map. Following Chen et al. (2014) and Guo et al. (2017), a structure (stack or queue) with arc length at least m is called m-regular, a structure with the degree of each vertex bounded by one is called simple, and a structure with the degree of each vertex bounded by two is called linear. Furthermore, an m-regular simple stack with the degrees of the two terminal vertices bounded by two is called an extended m-regular simple stack. Note that this degree configuration is related to the protein contact map in the 2D honeycomb lattice.
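For small n, the definition of an extended 2-regular simple stack can be explored by exhaustive search. The sketch below is a brute-force illustration under our reading of the definitions (arcs $(i, j)$ with $j - i \geq m$, no crossings, internal degree at most 1, terminal degree at most 2, and "saturated" meaning that no admissible arc can be added). The function names are hypothetical, and the search is only practical for small n.

```python
from itertools import combinations


def allowed_arcs(n, m=2):
    """All arcs (i, j) on vertices 1..n with arc length j - i >= m."""
    return [(i, j) for i in range(1, n + 1) for j in range(i + m, n + 1)]


def is_valid(arcs, n):
    """Check the degree bounds and the noncrossing condition."""
    deg = {}
    for i, j in arcs:
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    for v, d in deg.items():
        if d > (2 if v in (1, n) else 1):  # terminal vertices may have degree 2
            return False
    for (i, j), (k, l) in combinations(sorted(arcs), 2):
        if i < k < j < l:  # crossing
            return False
    return True


def saturated_stacks(n, m=2):
    """List all saturated extended m-regular simple stacks on n vertices."""
    arcs = allowed_arcs(n, m)
    found = []

    def extend(idx, chosen):
        if idx == len(arcs):
            # saturated: no remaining arc can be added
            if all(not is_valid(chosen + [a], n) for a in arcs if a not in chosen):
                found.append(tuple(chosen))
            return
        extend(idx + 1, chosen)  # skip arcs[idx]
        if is_valid(chosen + [arcs[idx]], n):
            extend(idx + 1, chosen + [arcs[idx]])  # take arcs[idx]

    extend(0, [])
    return found


for n in (3, 4, 5):
    print(n, len(saturated_stacks(n)))
```

For n = 3, 4, 5 this yields 1, 2, and 7 structures. For n = 5 the list includes $\{(1, 3), (1, 4)\}$, a saturated structure that does not contain the arc $(1, 5)$ because vertex 1 already has degree two.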

Recall that the contact map of any protein fold in 2D square lattice can be decomposed into (at most) two stacks and one queue (Goldman et al., 1999). We claim that the same decomposition also holds in the 2D honeycomb lattice.

Theorem 3. The contact map of any self-avoiding walk in the 2D honeycomb lattice can be decomposed into (at most) two stacks and one queue.

Proof. For each vertex in the walk, we assign a label O or U to its adjacent edges in the lattice, which are not edges in the walk. Edges in the lattice will then have multisets of labels consisting of 0, 1, or 2 members. Labels are assigned inductively as follows:

Label nonwalk edges adjacent to vertex 1 as follows: assign O to one of the edges and assign U to the other.

Label nonwalk edges adjacent to vertex i, where $2 \leq i \leq n$ , as follows: if the walk edge adjacent to vertex i lies in the same honeycomb lattice as an edge labeled by vertex $i - 1$ with label L, assign label L to it; otherwise, assign label $\{O, U\} \setminus \{L\}$ to it.

After the labeling procedure, each edge in the contact map is assigned exactly two labels. Following arguments similar to those of Goldman et al. (1999), the graph consisting of edges labeled by $\{O, O\}$ (or $\{U, U\}$ ) is a stack, and the graph consisting of edges labeled by $\{O, U\}$ is a queue.

See Figure 1 for an illustration.

FIG. 1.

The contact map of a protein fold in two-dimensional honeycomb lattice. The graph consisting of arcs (solid) labeled by $\{O, O\}$ (or $\{U, U\}$ ) is a stack, and the graph consisting of arcs (dotted) labeled by $\{O, U\}$ is a queue.

Next, let us recall some previous results by Clote (2006) on the enumeration of saturated and optimal RNA secondary structures. Call a vertex $i \in [n]$ visible if it is not covered by any arc. Let $a (n)$ denote the number of all saturated RNA secondary structures (2-regular simple stacks) of length n, and $b (n)$ denote the number of such stacks with no visible vertex. For $n < 0$ , we define $a (n) = b (n) = 0$ . For $n \geq 0$ , $a (n)$ and $b (n)$ satisfy the following recurrence relations.

Lemma 1. (Clote, 2006, Propositions 3 and 4)

Denote the generating functions of $a (n)$ and $b (n)$ by $y = \sum_{n \geq 0} a (n) x^{n}$ and $z = \sum_{n \geq 0} b (n) x^{n}$ , (3)

respectively. They satisfy the following relations.

Lemma 2. (Clote, 2006, Propositions 5 and 6)

$x^{2} y z = z - 1 + x^{2} z,$ (4)

$x^{2} y^{2} = y (x^{2} + 1) - z (x^{2} + x) - 1 .$ (5)

Moreover, the number of optimal 2-regular simple stacks of length n is given by the following formula.

Lemma 3. (Clote, 2006, Corollary 13) $L O_{0} (n) = \begin{cases} 1, & n = 2 m - 1, \\ \frac{m (m + 1)}{2}, & n = 2 m, \end{cases}$ (6)

where $m \geq 1$ and $L O_{0} (0) = 1$ .
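Equation (6) can be cross-checked for small n by counting 2-regular simple stacks by their number of arcs. The sketch below uses a standard decomposition on the last vertex (either vertex n is unpaired, or it is paired with some i with $n - i \geq 2$). It is our own illustration rather than Clote's method, and the names `optimal_counts` and `lo0` are hypothetical.

```python
def optimal_counts(n_max):
    """For n = 0..n_max, count 2-regular simple stacks with the maximum number of arcs."""
    # cnt[n] maps a number of arcs k to the number of structures on n vertices with k arcs
    cnt = [{0: 1}]
    for n in range(1, n_max + 1):
        cur = dict(cnt[n - 1])  # vertex n is unpaired
        for i in range(1, n - 1):  # vertex n is paired with i, so n - i >= 2
            left, inside = cnt[i - 1], cnt[n - i - 1]
            for k1, c1 in left.items():
                for k2, c2 in inside.items():
                    cur[k1 + k2 + 1] = cur.get(k1 + k2 + 1, 0) + c1 * c2
        cnt.append(cur)
    return [c[max(c)] for c in cnt]


def lo0(n):
    """Closed form of Equation (6), with LO_0(0) = 1."""
    if n == 0 or n % 2 == 1:
        return 1
    m = n // 2
    return m * (m + 1) // 2


print(optimal_counts(8))
print([lo0(n) for n in range(9)])
```

Both lines print the same list for n = 0, ..., 8, namely 1, 1, 1, 1, 3, 1, 6, 1, 10.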

3. SATURATED EXTENDED 2-REGULAR SIMPLE STACKS

In this section, we use the method of combinatorial structure decomposition to study the generating function for the number of saturated extended 2-regular simple stacks.

Following the structure decomposition idea proposed by Chen et al. (2014), we call the component containing both vertices 1 and n the primary component. For the primary components of the saturated extended 2-regular simple stacks with n vertices, we distinguish six classes according to the degrees of the two terminal vertices and whether they form an arc or not (Fig. 2).

FIG. 2.

The six cases of the primary components of the saturated extended 2-regular simple stacks.

Let $⟨ i, j ⟩$ denote the interval between i and j, that is, . We use the notation $⟨ i, j ⟩$ to distinguish it from the notation $(i, j)$ for an arc. Note that the interval is allowed to be empty.

As shown in Figure 2, the primary component splits $[n]$ into disjoint intervals, on which substructures can be constructed. To meet the restrictions of saturated extended 2-regular simple stack, the substructures can be classified into the following six types:

T₁: an isolated vertex followed by an arbitrary saturated structure with no visible vertex, or just an arbitrary saturated structure with no visible vertex; the reverse of a structure of type T₁ is said to be of type $T'_{1}$ ;

T₂: an arbitrary saturated structure;

T₃: an arbitrary nonempty saturated structure;

T₄: an isolated vertex followed by an arbitrary saturated structure with no visible vertex, or just an arbitrary nonempty saturated structure with no visible vertex; $T'_{4}$ denotes the type of the reverse of a structure of type T₄;

T₅: an arbitrary saturated structure with no visible vertex;

T₆: an isolated vertex, or an arbitrary nonempty saturated structure with no visible vertex.

Note that the substructures of types $T_{1}, T_{2}, T_{5}$ may be empty.

Let $s (n)$ denote the number of all saturated extended 2-regular simple stacks of length n, and $\bar{s} (n)$ denote the number of such stacks with no visible vertex. Denote the generating functions of $s (n)$ and $\bar{s} (n)$ by $S (x) = \sum_{n \geq 3} s (n) x^{n}$ and $\bar{S} (x) = \sum_{n \geq 3} \bar{s} (n) x^{n}$ ,

respectively. Denote the number of the structures with n vertices of Case $(i)$ in Figure 2 by $s_{i} (n)$ , and the number of such structures with no visible vertex by ${\bar{s}}_{i} (n)$ . Let $S_{i} (x) = \sum_{n} s_{i} (n) x^{n}$ and ${\bar{S}}_{i} (x) = \sum_{n} {\bar{s}}_{i} (n) x^{n}$ .

Using the method of primary component decomposition, we obtain that $S (x)$ , $\bar{S} (x)$ can be expressed in terms of the generating functions y and z (3).

Theorem 4. We have $\begin{aligned} S (x) = {} & 2 (x + 1) x^{5} y^{2} z^{2} + 2 x^{4} y^{2} z - 2 (x + 1) x^{5} y z^{2} + 2 x^{3} y z + x^{6} y^{5} - 2 x^{6} y^{4} \\ & + (x^{2} + 1) x^{4} y^{3} - 2 x^{4} y^{2} + x^{4} y - (2 x^{2} + 2 x - 1) x^{2} z + (x - 1) x^{2} . \end{aligned}$ (7)

Proof. For a saturated extended 2-regular simple stack with n vertices, suppose that its primary component splits $[n]$ into disjoint intervals $I_{1}, I_{2}, \dots$ from left to right, each having k_i vertices.

Obviously, Case (1) and Case (1′) are symmetric, and we thus consider only Case (1). As shown in Figure 2, there are three intervals $I_{1}, I_{2}, I_{3}$ of type T₁, T₂, and T₃, respectively. Note that $n \geq 5$ since there are four vertices in the primary component and I₃ cannot be empty. The total number of vertices in $I_{1}, I_{2}, I_{3}$ is $k_{1} + k_{2} + k_{3} = n - 4$ , where $k_{1}, k_{2} \geq 0$ since I₁ and I₂ can be empty, and $k_{3} \geq 1$ since I₃ is not allowed to be empty.

We have two cases for the substructures on I₁. For the case of an isolated vertex followed by an arbitrary saturated 2-regular simple stack with no visible vertex, there are $b (k_{1} - 1)$ substructures that can be built. For the case of an arbitrary (possibly empty) saturated substructure with no visible vertex, there are $b (k_{1})$ substructures that can be built. Clearly, the numbers of substructures that can be built on I₂ and I₃ are $a (k_{2})$ and $a (k_{3})$ , respectively. It follows that the number of structures of Case (1) is $s_{1} (n) = \sum_{k_{1} + k_{2} + k_{3} = n - 4, \; k_{1}, k_{2} \geq 0, \; k_{3} \geq 1} (b (k_{1} - 1) + b (k_{1})) \, a (k_{2}) \, a (k_{3}),$

from which we can obtain the generating function as follows: $S_{1} (x) = (x + 1) x^{4} y z (y - 1) .$ (8)

Case (2) and Case (2′) are also symmetric, and so we only consider Case (2). Following a discussion similar to that of Case (1), we have $n \geq 4$ . Moreover, there are two intervals $I_{1}, I_{2}$ of types T₁ and T₃ with k₁ and k₂ vertices, respectively, such that $k_{1} + k_{2} = n - 3$ and $k_{1} \geq 0, k_{2} \geq 1$ . Consequently, the number of structures of Case (2) is $s_{2} (n) = \sum_{k_{1} + k_{2} = n - 3, \; k_{1} \geq 0, \; k_{2} \geq 1} (b (k_{1} - 1) + b (k_{1})) \, a (k_{2}),$

whose generating function is $S_{2} (x) = (x + 1) x^{3} z (y - 1) .$ (9)

Case (3) and Case (3′) are symmetric, and so we only consider Case (3). There are four intervals $I_{1}, I_{2}, I_{3}, I_{4}$ of types T₄, T₅, T₂, and T₃, respectively. Note that $n \geq 7$ and the total number of vertices in $I_{1}, I_{2}, I_{3}, I_{4}$ is $k_{1} + k_{2} + k_{3} + k_{4} = n - 5$ , where $k_{1}, k_{4} \geq 1$ since I₁ and I₄ are not allowed to be empty, and $k_{2}, k_{3} \geq 0$ . Similar to Case (1), there are $b (k_{1} - 1) + b (k_{1})$ substructures that can be built on I₁. Clearly, the numbers of substructures that can be built on $I_{2}, I_{3}$ and I₄ are $b (k_{2}), a (k_{3})$ and $a (k_{4})$ , respectively. So the number of structures of Case (3) is $s_{3} (n) = \sum_{k_{1} + k_{2} + k_{3} + k_{4} = n - 5, \; k_{1}, k_{4} \geq 1, \; k_{2}, k_{3} \geq 0} (b (k_{1} - 1) + b (k_{1})) \, b (k_{2}) \, a (k_{3}) \, a (k_{4}),$

whose generating function is $S_{3} (x) = x^{5} y z (y - 1) ((x + 1) z - 1) .$ (10)

For Case (4), there is only one interval of type T₆. Obviously, $n \geq 3$ and the number of vertices in the interval is $n - 2$ . If $n = 3$ , the substructure that can be built on the interval is just an isolated vertex. If $n \geq 4$ , there are $b (n - 2)$ substructures that can be built. Therefore, the number of structures of Case (4) is $s_{4} (n) = \begin{cases} 1, & n = 3, \\ b (n - 2), & n \geq 4, \end{cases}$

with generating function $S_{4} (x) = x^{3} + x^{2} (z - 1) = x^{2} z + (x - 1) x^{2} .$ (11)

For Case (5), there are three intervals $I_{1}, I_{2}$ , and I₃ of types T₃, T₂, and T₃, respectively. Note that $n \geq 6$ and the total number of vertices in $I_{1}, I_{2}, I_{3}$ is $k_{1} + k_{2} + k_{3} = n - 4$ , where $k_{1}, k_{3} \geq 1$ and $k_{2} \geq 0$ . Thus, the number of structures of Case (5) is $s_{5} (n) = \sum_{k_{1} + k_{2} + k_{3} = n - 4, \; k_{1}, k_{3} \geq 1, \; k_{2} \geq 0} a (k_{1}) \, a (k_{2}) \, a (k_{3}),$

from which we can get the generating function as follows: $S_{5} (x) = x^{4} y (y - 1)^{2} .$ (12)

For Case (6), there are five intervals $I_{i} (1 \leq i \leq 5)$ , in which $I_{1}, I_{5}$ are of type $T_{3}$ and $I_{2}, I_{3}, I_{4}$ are of type T₂. Obviously $n \geq 8$ , and the total number of vertices in the intervals is $\sum_{i = 1}^{5} k_{i} = n - 6$ , where $k_{1}, k_{5} \geq 1$ and $k_{2}, k_{3}, k_{4} \geq 0$ . Thus, the number of structures of Case (6) is $s_{6} (n) = \sum_{k_{1} + \dots + k_{5} = n - 6, \; k_{1}, k_{5} \geq 1, \; k_{2}, k_{3}, k_{4} \geq 0} \prod_{i = 1}^{5} a (k_{i}),$

whose generating function is $S_{6} (x) = x^{6} y^{3} (y - 1)^{2} .$ (13)

In summary, substituting Equations (8)–(13) into $S (x) = 2 (S_{1} (x) + S_{2} (x) + S_{3} (x)) + S_{4} (x) + S_{5} (x) + S_{6} (x),$

we obtain Equation (7), which completes the proof.
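Equation (7) can be checked numerically by expanding everything as truncated power series. The sketch below is a plain-Python illustration, not the Maple code referred to later in the text: it solves Equation (16) for z by fixed-point iteration, obtains y from Equation (14), and substitutes both into Equation (7). The function name `saturated_series` and the truncation strategy are our own assumptions.

```python
def saturated_series(n_max):
    """Coefficients of x^0..x^n_max of S(x), computed from Equations (16), (14), and (7)."""
    N = max(n_max, 6) + 3  # truncation order, with spare terms for the division by x^2

    def mul(p, q):
        r = [0] * N
        for i, a in enumerate(p):
            if a:
                for j, b in enumerate(q[:N - i]):
                    r[i + j] += a * b
        return r

    def add(*ps):
        return [sum(cs) for cs in zip(*ps)]

    def xpoly(coeffs):
        """Polynomial in x given as {power: coefficient}."""
        p = [0] * N
        for k, c in coeffs.items():
            p[k] = c
        return p

    one = xpoly({0: 1})
    # Equation (16) rearranged: z = 1 + x^2 z^2 - x^2 z + (x + 1) x^3 z^3
    z = one
    for _ in range(N):
        z2 = mul(z, z)
        z = add(one, mul(xpoly({2: 1}), z2), mul(xpoly({2: -1}), z),
                mul(xpoly({3: 1, 4: 1}), mul(z2, z)))
    # power series inverse of z (constant term 1)
    inv = [1] + [0] * (N - 1)
    for n in range(1, N):
        inv[n] = -sum(z[k] * inv[n - k] for k in range(1, n + 1))
    # Equation (14): y = (x^2 z + z - 1) / (x^2 z)
    num = add(mul(xpoly({2: 1}), z), z, xpoly({0: -1}))
    y = mul(num[2:] + [0, 0], inv)
    y2 = mul(y, y); y3 = mul(y2, y); y4 = mul(y3, y); y5 = mul(y4, y)
    z2 = mul(z, z)
    # Equation (7), term by term
    S = add(
        mul(xpoly({5: 2, 6: 2}), mul(y2, z2)),
        mul(xpoly({4: 2}), mul(y2, z)),
        mul(xpoly({5: -2, 6: -2}), mul(y, z2)),
        mul(xpoly({3: 2}), mul(y, z)),
        mul(xpoly({6: 1}), y5),
        mul(xpoly({6: -2}), y4),
        mul(xpoly({4: 1, 6: 1}), y3),
        mul(xpoly({4: -2}), y2),
        mul(xpoly({4: 1}), y),
        mul(xpoly({2: 1, 3: -2, 4: -2}), z),
        xpoly({2: -1, 3: 1}),
    )
    return S[:n_max + 1]


print(saturated_series(6))
```

The coefficients of $x^{3}, \dots, x^{6}$ come out as 1, 2, 7, 12, in agreement with Table 1. The coefficients of $x^{0}, x^{1}, x^{2}$ in Equation (7) vanish, whereas Table 1 lists $s (0) = s (1) = s (2) = 1$ , presumably by convention for the cases $n < 3$ that the primary component decomposition does not cover. Larger values of `n_max` can be compared against the remaining entries of Table 1.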

Now, we are ready to prove Theorem 1 by eliminating the variables y and z from Equations (4), (5), and (7).

Proof of Theorem 1. To derive the functional equation satisfied by $S (x)$ in Theorem 1, by Equation (4) it is direct to see that $y = \frac{x^{2} z + z - 1}{x^{2} z} .$ (14)

Substituting Equation (14) into (7), we obtain

By computing the resultant of Equations (4) and (5) with respect to y, we have $(x + 1) x^{3} z^{3} + x^{2} z^{2} - (x^{2} + 1) z + 1 = 0 .$ (16)

Then eliminating z by computing the resultant of Equations (15) and (16), we obtain (1).
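The elimination of y that leads to Equation (16) can also be verified by hand. The following worked derivation is our own check, with the auxiliary symbol $N$ introduced only for this purpose; it uses nothing beyond Equations (4) and (5).

```latex
\begin{align*}
&\text{Let } N = (x^{2} + 1) z - 1, \text{ so that (4) reads } x^{2} y z = N. \\
&\text{Multiplying (5) by } x^{2} z^{2} \text{ and substituting } x^{2} y z = N \text{ gives} \\
&\qquad N^{2} = (x^{2} + 1) z N - (x + 1) x^{3} z^{3} - x^{2} z^{2}. \\
&\text{Since } (x^{2} + 1) z - N = 1, \text{ this is equivalent to} \\
&\qquad (x + 1) x^{3} z^{3} + x^{2} z^{2} = N \bigl( (x^{2} + 1) z - N \bigr) = N = (x^{2} + 1) z - 1, \\
&\text{which rearranges to } (x + 1) x^{3} z^{3} + x^{2} z^{2} - (x^{2} + 1) z + 1 = 0, \text{ i.e., (16).}
\end{align*}
```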

From the functional equation (1) of $S (x)$ , we can deduce the following recurrence relation of $s (n)$ by using the Maple commands algfuntodiffeq and diffeqtorec in the gfun package (Salvy and Zimmermann, 1994).

Corollary 1. The number of saturated extended 2-regular simple stacks on n vertices $s (n)$ satisfies the following recurrence relation

in which $\begin{matrix} p_{0} (n) = - 80 n^{2} + 560 n - 960, p_{1} (n) = - 424 n^{2} + 1976 n - 2112, \\ p_{2} (n) = 224 n^{2} - 940 n + 2120, p_{3} (n) = 5920 n^{2} - 2844 n + 3600, \\ p_{4} (n) = 14236 n^{2} + 28268 n + 6320, p_{5} (n) = 1042 n^{2} + 7068 n - 52316, \\ p_{6} (n) = - 66150 n^{2} - 453864 n - 903872, p_{7} (n) = - 166716 n^{2} - 1543338 n - 3600520, \\ p_{8} (n) = - 192882 n^{2} - 2175708 n - 5799422, p_{9} (n) = - 74278 n^{2} - 855668 n - 1405342, \\ p_{10} (n) = 94943 n^{2} + 1787565 n + 9945244, p_{11} (n) = 187411 n^{2} + 3757775 n + 20379846, \\ p_{12} (n) = 187813 n^{2} + 4238697 n + 23752086, p_{13} (n) = 157739 n^{2} + 4068271 n + 23817290, \\ p_{14} (n) = 177075 n^{2} + 5271923 n + 35846974, p_{15} (n) = 251299 n^{2} + 8276931 n + 65487426, \\ p_{16} (n) = 289002 n^{2} + 10257420 n + 89937374, p_{17} (n) = 228044 n^{2} + 8645464 n + 82079842, \\ p_{18} (n) = 104272 n^{2} + 4199356 n + 42802142, p_{19} (n) = 1624 n^{2} + 71614 n + 1164296, \\ p_{20} (n) = - 35127 n^{2} - 1578391 n - 17593210, p_{21} (n) = - 26613 n^{2} - 1262021 n - 14949286, \\ p_{22} (n) = - 9296 n^{2} - 465608 n - 5843760, p_{23} (n) = - 254 n^{2} - 16130 n - 255584, \\ p_{24} (n) = 1234 n^{2} + 66010 n + 880404, p_{25} (n) = 590 n^{2} + 33210 n + 466900, \\ p_{26} (n) = 120 n^{2} + 7020 n + 102600 \end{matrix} .$

The first 29 initial values of $s (n)$ are given in Table 1. The saturated extended 2-regular simple stacks of lengths five and six are listed in Figures 3 and 4, respectively.

Next, we use the singularity analysis method to derive the asymptotic formula of $s (n)$ . Here, we give a sketch of the computation process, and refer to the classic book Flajolet and Sedgewick (2009) for more details.

First, by Equation (1), we find the dominant singularity of $S (x)$ to be $ζ = 0.4246873104 \dots$ , and then by applying the Newton–Puiseux Expansion Theorem, we obtain the following expansion of $S (x)$ at $x = ζ$ , $S (x) = σ - α \sqrt{\frac{x - ζ}{- α}} + O (x - ζ),$

where $σ = 1.969506702 \dots$ , and $α = 235.0528747 \dots$ .

At last, leveraging the transfer theorem, we transfer the approximation of $S (x)$ near $x = ζ$ into the following approximation of its coefficients, $s (n) = [x^{n}] S (x) = \frac{γ}{2 \sqrt{π}} ω^{n} n^{- \frac{3}{2}} (1 + O (\frac{1}{n})),$

where $ω = \frac{1}{ζ} = 2.354673605 \dots, γ = \sqrt{α \cdot ζ} = 9.99119478 \dots$ .

Finally, we arrive at $s (n) \sim 2.818464011 \times 2.354673605^{n} \cdot n^{- \frac{3}{2}} .$
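The constants in the asymptotic formula follow from $ζ$ and $α$ by elementary arithmetic, which the short Python sketch below reproduces. It takes the truncated decimal values quoted above as inputs, so the results agree with the stated constants only up to rounding; the function name `s_asymptotic` is hypothetical.

```python
import math

zeta = 0.4246873104   # dominant singularity, as quoted in the text
alpha = 235.0528747   # expansion constant, as quoted in the text

omega = 1 / zeta                          # exponential growth rate
gamma = math.sqrt(alpha * zeta)           # gamma = sqrt(alpha * zeta)
c = gamma / (2 * math.sqrt(math.pi))      # leading constant gamma / (2 sqrt(pi))


def s_asymptotic(n):
    """Leading-order approximation c * omega^n * n^(-3/2) of s(n)."""
    return c * omega ** n * n ** -1.5


print(omega, gamma, c)
print(s_asymptotic(28) / 388146408)  # ratio to the exact value s(28) from Table 1
```

Since the transfer theorem gives a relative error of order $1 / n$ , the ratio printed in the last line should approach 1 only slowly as n grows.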

The Maple source codes of the above procedures can be found at https://github.com/aimerbam/asymptotic_formulas_RNA/tree/master.

For the saturated extended 2-regular simple stacks of length n with no visible vertex, we obtain the following expression for the generating function $\bar{S} (x)$ .

FIG. 3.

The saturated extended 2-regular simple stacks of length five.

FIG. 4.

The saturated extended 2-regular simple stacks of length six.

Table 1.

The First 29 Values of $s (n)$

n	0	1	2	3	4	5	6	7	8	9
$s (n)$	1	1	1	1	2	7	12	26	57	116
n	10	11	12	13	14	15	16	17	18
$s (n)$	251	545	1159	2517	5503	11962	26204	57711	127054
n	19	20	21	22	23
$s (n)$	280704	622425	1381923	3074897	6858928
n	24	25	26	27	28
$s (n)$	15323958	34293674	76885723	172630454	388146408

Corollary 2. We have $\begin{aligned} \bar{S} (x) = {} & x^{6} y^{4} z - 2 x^{6} y^{3} z + 2 (x + 1) x^{5} y^{2} z^{2} + (x - 2) x^{5} y^{2} z - 2 (x + 1) x^{5} y z^{2} \\ & + 2 (x^{2} + x + 1) x^{3} y z + x^{4} y^{3} - 2 x^{4} y^{2} + x^{4} y - (2 x^{2} + 2 x - 1) x^{2} z + (x - 1) x^{2} . \end{aligned}$ (18)

Proof. Similar to the proof of Theorem 4, we continue to use the method of primary component decomposition and discuss the six cases as shown in Figure 2.

Obviously, Case (1) cannot produce a stack with no visible vertex. From the proof of Theorem 4, we see that the structures of Cases (2), (2′), (3), (3′), (4), and (5) are all saturated 2-regular simple stacks with no visible vertex. Hence, for $i = 2, 3, 4, 5$ , ${\bar{S}}_{i} (x) = S_{i} (x) .$

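For Case (6), the interval I₃ of type T₂ must have no visible vertex. So the number of saturated extended 2-regular simple stacks of this case is ${\bar{s}}_{6} (n) = \sum_{k_{1} + \dots + k_{5} = n - 6, \; k_{1}, k_{5} \geq 1, \; k_{2}, k_{3}, k_{4} \geq 0} a (k_{1}) \, a (k_{2}) \, b (k_{3}) \, a (k_{4}) \, a (k_{5}),$

whose generating function is ${\bar{S}}_{6} (x) = x^{6} y^{2} z (y - 1)^{2} .$ (19)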

Substituting Equations (9)–(12) and (19) into $\bar{S} (x) = 2 (S_{2} (x) + S_{3} (x)) + S_{4} (x) + S_{5} (x) + {\bar{S}}_{6} (x),$

we obtain Equation (18), which completes the proof.

Similarly, by eliminating y and z from Equations (4), (5), and (18), we obtain the following functional equation satisfied by $\bar{S} (x)$ .

Theorem 5. We have $q_{3} (x) {\bar{S}}^{3} (x) + q_{2} (x) {\bar{S}}^{2} (x) + q_{1} (x) \bar{S} (x) + q_{0} (x) = 0,$ (20)

where

Consequently, the asymptotic formula of $\bar{s} (n)$ is $\bar{s} (n) \sim 1.673111891 \times 2.354673605^{n} \cdot n^{- \frac{3}{2}} .$

4. THE OPTIMAL EXTENDED 2-REGULAR SIMPLE STACKS

In this section, we consider the enumeration of optimal extended 2-regular simple stacks, which correspond to the structures with the maximum number of arcs in the Nussinov–Jacobson energy model.

Let us first consider the number of arcs in the optimal structures.

Lemma 4. The number of arcs of the optimal extended 2-regular simple stacks with n vertices is $⌊\frac{n}{2}⌋$ , where $n \geq 3$ .

Proof. We follow the six cases as shown in Figure 2 to discuss the number of arcs in the optimal structures.

We first consider the case when n is odd. Suppose that $n = 2 m - 1$ with $m \geq 1 .$ From the restrictions for the structures of each case as discussed in the proof of Theorem 4, we claim that the maximum number of arcs for the saturated extended 2-regular simple stacks is $m - 1$ . We explain only Case (3), the other cases are similar and therefore omitted.

In Case (3), the primary component splits $[n]$ into four intervals $I_{1}, I_{2}, I_{3}, I_{4}$ of types T₄, T₅, T₂, and T₃, respectively. The total number of vertices in these intervals is $2 m - 6$ . Since $I_{1}, I_{4}$ cannot be empty, the remaining $2 m - 6$ vertices can form at most $\frac{2 m - 6 - 2}{2} = m - 4$ arcs. Adding the three arcs in the primary component, the optimal structures of Case (3) have exactly $m - 1$ arcs.

When n is even, suppose that $n = 2 m$ with $m \geq 2$ . We claim that the maximum number of arcs for the saturated extended 2-regular simple stacks is m. Similar to the above discussions, it is easy to see that for the Cases (1), (3), and (4), we can construct saturated structures with at most $m - 1$ arcs, and for the Cases (2), (5), and (6), we can construct saturated structures with at most m arcs. Therefore, the maximum number of arcs for the saturated extended 2-regular simple stacks is m, namely, only Cases (2), (5), and (6) can be optimal.

In summary, the number of arcs of the optimal extended 2-regular simple stacks with n vertices is $⌊\frac{n}{2}⌋$ .

Now, we are ready to prove Theorem 2.

Proof of Theorem 2. Denote the number of optimal extended 2-regular simple stacks of length n by $s o (n)$ . When n is odd, suppose that $n = 2 m - 1$ . By Lemma 4, we see that the optimal structures have $m - 1$ arcs. So we only need to discuss the structures with $m - 1$ arcs in each case, as shown in Figure 2. Denote the number of optimal structures in Case (i) by $s o_{i} (n)$ for $1 \leq i \leq 6$ .

For Case (1), we see that $m \geq 3$ . The primary component splits $[n]$ into three intervals $I_{1}, I_{2}, I_{3}$ of types T₁, T₂, and T₃, respectively. To be optimal, the substructures on I₁ and I₂ must be empty, and the substructures on I₃ must be optimal on $2 m - 5$ vertices. By using Equation (6), we have $s o_{1} (2 m - 1) = L O_{0} (2 m - 5) = 1 .$

For Case (2), $m \geq 3$ , there are two intervals $I_{1}, I_{2}$ of types T₁ and T₃ with $k_{1}, k_{2}$ vertices, respectively. To be optimal, there are two possibilities: either the substructure on I₁ is empty and the substructure on I₂ is optimal with $2 m - 4$ vertices, or the substructures on I₁ and I₂ are both optimal and $k_{1}, k_{2}$ are odd with $k_{1} + k_{2} = 2 m - 4$ . By using Equation (6), we have $s o_{2} (2 m - 1) = L O_{0} (2 m - 4) + (m - 2) = \frac{(m - 2) (m + 1)}{2},$ where the term $m - 2$ counts the pairs of odd $k_{1}, k_{2} \geq 1$ with $k_{1} + k_{2} = 2 m - 4$ , each of which admits exactly one optimal choice.

For Case (3), $m \geq 4$ , there are four intervals I₁, I₂, I₃, I₄ of types T₄, T₅, T₂, and T₃, respectively. To be optimal, the substructures on I₂ and I₃ must be empty, and the substructures on I₁ and I₄ must be optimal, where $k_{1}, k_{4}$ are odd with $k_{1} + k_{4} = 2 m - 6$ . By using Equation (6), we have $s o_{3} (2 m - 1) = m - 3,$ the number of pairs of odd $k_{1}, k_{4} \geq 1$ with $k_{1} + k_{4} = 2 m - 6$ .

For Case (4), $m \geq 2$ , there is only one interval I₁ of type T₆, and the substructures on I₁ must be optimal with $2 m - 3$ vertices. By using Equation (6), we have $s o_{4} (2 m - 1) = L O_{0} (2 m - 3) = 1 .$ (23)

For Case (5), $m \geq 4$ , there are three intervals $I_{1}, I_{2}, I_{3}$ of types $T_{3}, T_{2}$ , and T₃, respectively. To be optimal, there are two possibilities: either the substructures on $I_{1}, I_{2}$ , and I₃ are all optimal and $k_{1}, k_{2}, k_{3}$ are odd with $k_{1} + k_{2} + k_{3} = 2 m - 5$ , or the substructure on I₂ is empty and the substructures on I₁ and I₃ are both optimal with $k_{1} + k_{3} = 2 m - 5$ . By using Equation (6), we have $s o_{5} (2 m - 1) = \binom{m - 2}{2} + \sum_{k_{1} + k_{3} = 2 m - 5, \; k_{1}, k_{3} \geq 1} L O_{0} (k_{1}) L O_{0} (k_{3}) = \binom{m - 2}{2} + \frac{(m - 1) (m - 2) (m - 3)}{3} .$

For Case (6), $m \geq 5$ , there are five intervals of types $T_{3}, T_{2}, T_{2}, T_{2}$ , and T₃, respectively. To be optimal, the possible structures are as follows: either the substructures on I₁, I_i, and I₅ are all optimal and $k_{1}, k_{i}, k_{5}$ are odd with $k_{1} + k_{i} + k_{5} = 2 m - 7$ , where $2 \leq i \leq 4$ and the other two intervals are empty; or the substructures on $I_{2}, I_{3}$ , and I₄ are empty and the substructures on I₁ and I₅ are both optimal with $k_{1} + k_{5} = 2 m - 7$ . By using Equation (6), we have $s o_{6} (2 m - 1) = 3 \binom{m - 3}{2} + \sum_{k_{1} + k_{5} = 2 m - 7, \; k_{1}, k_{5} \geq 1} L O_{0} (k_{1}) L O_{0} (k_{5}) = 3 \binom{m - 3}{2} + \frac{(m - 2) (m - 3) (m - 4)}{3} .$

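Substituting the above expressions for $s o_{i} (2 m - 1)$ , $1 \leq i \leq 6$ , into $s o (2 m - 1) = 2 (s o_{1} (2 m - 1) + s o_{2} (2 m - 1) + s o_{3} (2 m - 1)) + s o_{4} (2 m - 1) + s o_{5} (2 m - 1) + s o_{6} (2 m - 1),$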

we obtain $s o (2 m - 1) = \frac{1}{3} (2 m^{3} - 6 m^{2} + m + 18),$

Thus, for n being odd, we have that $s o (n) = \frac{1}{12} (n^{3} - 3 n^{2} - 7 n + 69) .$

When n is even, suppose that $n = 2 m$ . By Lemma 4, we see that the optimal structures have m arcs and we only need to discuss the cases (2), (2′), (5), and (6), as shown in Figure 2.

For Case (2), $m \geq 2$ , to be optimal, the substructure on interval I₁ must be empty, the substructures on interval I₂ must be optimal with $2 m - 3$ vertices. Thus, we have $s o_{2} (2 m) = L O_{0} (2 m - 3) = 1 .$ (27)

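For Case (5), $m \geq 3$ , to be optimal, the substructure on I₂ must be empty, and the substructures on I₁ and I₃ must be optimal, where $k_{1}, k_{3}$ are odd numbers with $k_{1} + k_{3} = 2 m - 4$ . Thus, we have $s o_{5} (2 m) = \sum_{k_{1} + k_{3} = 2 m - 4, \; k_{1}, k_{3} \text{ odd}} L O_{0} (k_{1}) L O_{0} (k_{3}) = m - 2 .$ (28)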

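For Case (6), $m \geq 4$ , to be optimal, the substructures on intervals $I_{2}, I_{3}, I_{4}$ must be empty, and the substructures on intervals I₁ and I₅ must be optimal, where $k_{1}, k_{5}$ are odd numbers with $k_{1} + k_{5} = 2 m - 6$ . Thus, we have $s o_{6} (2 m) = \sum_{k_{1} + k_{5} = 2 m - 6, \; k_{1}, k_{5} \text{ odd}} L O_{0} (k_{1}) L O_{0} (k_{5}) = m - 3 .$ (29)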

Substituting Equations (27)–(29) into $s o (2 m) = 2 s o_{2} (2 m) + s o_{5} (2 m) + s o_{6} (2 m),$

we obtain $s o (2 m) = 2 m - 3 .$

Thus, for n being even, $s o (n) = n - 3 .$

We complete the proof of Theorem 2.

The first 22 values of $so(n)$ are given in Table 2.

Table 2. The First 22 Values of $so(n)$

n	0	1	2	3	4	5	6	7	8	9	10
$so(n)$	1	1	1	1	2	7	3	18	5	41	7
n	11	12	13	14	15	16	17	18	19	20	21
$so(n)$	80	9	139	11	222	13	333	15	476	17	655
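The two closed forms can be checked mechanically against Table 2. The following is a minimal Python sketch; the names `TABLE_2`, `so_closed_form`, and `mismatches` are illustrative and do not come from the paper. It assumes the closed forms are meant for $n \geq 5$: the tabulated entries for $n \leq 4$ differ from them, so those small cases are treated as lying outside the range of the formulas.

```python
# Values of so(n) transcribed from Table 2, indexed by n = 0, ..., 21.
TABLE_2 = [1, 1, 1, 1, 2, 7, 3, 18, 5, 41, 7,
           80, 9, 139, 11, 222, 13, 333, 15, 476, 17, 655]


def so_closed_form(n: int) -> int:
    """Evaluate the closed form for so(n) stated in the proof.

    Odd n:  (n^3 - 3n^2 - 7n + 69) / 12
    Even n: n - 3
    """
    if n % 2 == 1:
        numerator = n**3 - 3 * n**2 - 7 * n + 69
        # The numerator is divisible by 12 for every odd n.
        assert numerator % 12 == 0
        return numerator // 12
    return n - 3


def mismatches(start: int = 5) -> list[int]:
    """Return every n >= start where the closed form and Table 2 disagree."""
    return [n for n in range(start, len(TABLE_2))
            if so_closed_form(n) != TABLE_2[n]]


if __name__ == "__main__":
    print("disagreements for n >= 5:", mismatches())
    print("disagreements for n >= 0:", mismatches(0))
```

Running the sketch reports no disagreement for $5 \leq n \leq 21$ and flags only $n = 0, \ldots, 4$ when the small cases are included.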

For $n = 5$, the 7 saturated extended 2-regular simple stacks shown in Figure 3 are all optimal. For $n = 6$, the last 3 structures with three arcs in Figure 4 are the optimal ones among the 12 saturated structures.

Finally, the growth curves of the numbers of saturated and optimal extended 2-regular simple stacks are drawn together in Figure 5.

FIG. 5. The growth curves of the numbers of saturated and optimal extended 2-regular simple stacks.

ACKNOWLEDGMENTS

We would like to thank the anonymous referees for valuable comments and suggestions.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no competing financial interests.

FUNDING INFORMATION

This work was supported by the National Natural Science Foundation of China (Grant Nos. 12071235, 11771222, and 11501307), the Fundamental Research Funds for the Central Universities, and the Natural Science Foundation of Tianjin (Grant No. 16JCQNJC09400), China.

References

Agarwal P.K., Mustafa N.H., and Wang. 2007. Fast molecular shape matching using contact maps. J. Comput. Biol. 14, 131–143.

Banerjee A.R., Jaeger J.A., and Turner D.H. 1993. Thermal unfolding of a group I ribozyme: The low-temperature transition is primarily disruption of tertiary structure. Biochemistry, 32, 153–163.

Chen W.Y.C., Guo Q.-H., Sun L.H., et al. 2014. Zigzag stacks and m-regular linear stacks. J. Comput. Biol. 21, 915–935.

Clote. 2006. Combinatorics of saturated secondary structures of RNA. J. Comput. Biol. 13, 1640–1657.

Clote, Kranakis, Krizanc, et al. 2007. Asymptotic expected number of base pairs in optimal secondary structure for random RNA using the Nussinov-Jacobson energy model. Discrete Appl. Math. 155, 759–787.

Clote, Kranakis, Krizanc, et al. 2009. Asymptotics of canonical and saturated RNA secondary structures. J. Bioinform. Comput. Biol. 7, 869–893.

Dill K.A. 1990. Dominant forces in protein folding. Biochemistry, 29, 7133–7155.

Flajolet, and Sedgewick. 2009. Analytic Combinatorics. Cambridge University Press, Cambridge.

Fusy, and Clote. 2014. Combinatorics of locally optimal RNA secondary structures. J. Math. Biol. 68, 341–375.

Goldman, Istrail, and Papadimitriou C.H. 1999. Algorithmic aspects of protein structure similarity (extended abstract). 40th Annual Symposium on Foundations of Computer Science (New York, 1999), 512–521, IEEE Computer Soc., Los Alamitos, CA, 1999.

Guo Q.-H., and Sun L.H. 2018. Combinatorics of contacts in protein contact maps. Bull. Math. Biol. 80, 385–403.

Guo Q.-H., Sun L.H., and Wang. 2016. Enumeration of extended m-regular linear stacks. J. Comput. Biol. 23, 943–956.

Guo Q.-H., Sun L.H., and Wang. 2017. Regular simple queues of protein contact maps. Bull. Math. Biol. 79, 11–35.

Guo Q.-H., Wang, and Xu. 2018. Approximation algorithms for protein folding in the hydrophobic-polar model on 3D hexagonal prism lattice. J. Comput. Biol. 25, 487–498.

Istrail, and Lam. 2009. Combinatorial algorithms for protein folding in lattice models: A survey of mathematical results. Commun. Inf. Syst. 9, 303–346.

Jiang, and Zhu. 2005. Protein folding on the hexagonal lattice in the HP model. J. Bioinform. Comput. Biol. 3, 19–34.

Jin E.Y., Qin, and Reidys C.M. 2008. Combinatorics of RNA structures with pseudoknots. Bull. Math. Biol. 70, 45–67.

Nebel M.E. 2002. Combinatorial properties of RNA secondary structures. J. Comput. Biol. 9, 541–573.

Nussinov, and Jacobson A.B. 1980. Fast algorithm for predicting the secondary structure of single stranded RNA. Proc. Natl. Acad. Sci. U S A. 77, 6309–6313.

Pierri C.L., De Grassi, and Turi. 2008. Lattices for ab initio protein structure prediction. Proteins, 73, 351–361.

Reidys. 2011. Combinatorial Computational Biology of RNA Pseudoknots and Neutral Networks. Springer-Verlag, New York.

Salvy, and Zimmerman. 1994. GFUN: A Maple package for the manipulation of generating and holonomic functions in one variable. ACM T. Math. Softw. 20, 163–177.

Schmitt W.R., and Waterman. 1994. Linear trees and RNA secondary structure. Discrete Appl. Math. 51, 317–323.

Waterman, and Smith. 1978. RNA secondary structure: A complete mathematical analysis. Math. Biosci. 42, 257–266.

Xu. 2019. Distance-based protein folding powered by deep learning. Proc. Natl. Acad. Sci. U S A. 116, 16856–16865.

Yang, Anishchenko, Park, et al. 2020. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. U S A. 117, 1496–1503.
