Phylogenetic and Chemical Probing Information as Soft Constraints in RNA Secondary Structure Prediction

Abstract

Extrinsic, experimental information can be incorporated into thermodynamics-based RNA folding algorithms in the form of pseudo-energies. Evolutionary conservation of RNA secondary structure elements is detectable in alignments of phylogenetically related sequences and provides evidence for the presence of certain base pairs that can also be converted into pseudo-energy contributions. We show that the centroid base pairs computed from a consensus folding model such as RNAalifold result in a substantial improvement of the prediction accuracy for single sequences. Evidence for specific base pairs turns out to be more informative than a position-wise profile for the conservation of the pairing status. A comparison with chemical probing data, furthermore, strongly suggests that phylogenetic base pairing data are more informative than position-specific data on (un)pairedness as obtained from chemical probing experiments. In this context we demonstrate, in addition, that the conversion of signal from probing data into pseudo-energies is possible using thermodynamic structure predictions as a reference instead of known RNA structures.

1. INTRODUCTION

Many functional RNAs exhibit well-conserved secondary structures (Gardner et al., 2011). Sequence variations on large evolutionary timescales are constrained such that they preserve base pairs. These selective constraints can be identified as consensus structures in multiple sequence alignments (MSA) (Bernhart et al., 2008; Hofacker et al., 2002; Tagashira and Asai, 2022; Will et al., 2012). Covariance and mutual information measures are suitable to detect consensus base pairs directly in sufficiently large MSAs. In MIfold a column-wise score of this type is used in place of an energy model (Freyhult et al., 2005). Direct evidence for the presence of individual helices is computed in ShapeSorter using a probabilistic model (Tsybulskyi and Meyer, 2022).

Different members of an RNA family nevertheless show structural variations that occasionally also violate consensus base pairs. To account for larger structural variations, the Rfam database collects multiple families in “clans” (Gardner et al., 2011) that share common ancestry and some common structural features. The consensus structure of a few related sequences is usually much more accurate than the secondary structure prediction for a single sequence (Gardner and Giegerich, 2004; Hajiaghayi et al., 2012). However, it does not account for the structural variability within a family.

The variability within an RNA family can be accommodated by using the consensus structure inferred from an alignment of family members as a constraint for the prediction of an individual structure. The consensus structure of an MSA can be computed, for example, with RNAalifold, which extends the thermodynamic model from single sequences to alignments by averaging energy contributions over alignment columns (Bernhart et al., 2008; Hofacker et al., 2002). As an alternative, Pfold (Knudsen and Hein, 2003; Sükösd et al., 2012) uses an MSA and a phylogenetic tree to compute a global consensus structure using a stochastic context-free grammar (SCFG) instead of the Turner energy model of RNA structures (Turner and Mathews, 2010). The construction of the MSA and the inference of the consensus structure are combined in structure-based alignment tools such as locarna (Will et al., 2012; Will et al., 2007). Alignment and consensus structure of an RNA family can be combined and summarized in a covariance model (CM) (Eddy and Durbin, 1994). The CM can then be aligned to a sequence of interest using cmalign (Nawrocki and Eddy, 2013), resulting again in a projection of the consensus structure. The R2DT pipeline (Sweeney et al., 2021) implements this workflow to predict and visualize RNA secondary structures in RNAcentral. The projection of the consensus structure to an individual sequence of interest can be realized using hard constraints (Lorenz et al., 2016) that enforce all base pairs of the consensus when computing the secondary structure. This procedure is implemented in the script refold.pl, which is part of the utilities distributed through the ViennaRNA GitHub site. This approach is problematic, however, if individual structures do not fit well with the consensus structure or if the consensus structure covers only part of the sequence.

Early implementations of both mfold (Zuker et al., 1991) and the ViennaRNA Package (Hofacker et al., 1994) used pseudo-energies to force or exclude base pairs based on external evidence and as a means of handling exceptions to the general energy model. An early example included bonus energies for extra-stable tetraloops (Mathews et al., 1999). In RNAalifold, bonus energies are used to reward co-variation of base pairs and to discourage the inclusion of nonstandard pairs into consensus structures (Hofacker et al., 2002). In TurboFold (Harmanci et al., 2011), pseudo-energies are used to incorporate conservation information in a way that avoids a fixed MSA. Instead, for each potential base pair (i, j) in sequence x, a “proclivity” is computed by aggregating the probabilities that sequence positions potentially homologous to i and j in the related sequences are paired. This “proclivity” is then converted into a pseudo-energy that is added to every secondary structure of x in which (i, j) is paired. The computation is iterated until the “proclivities,” and thus the pseudo-energies and the resulting structure predictions, converge. This amounts to position-specific modifications of the standard energy model (Turner and Mathews, 2010) for each of the input sequences.

Pseudo-energies have become the method of choice to incorporate chemical probing data into secondary structure prediction algorithms. Position-specific stabilizing energies for paired bases, for instance, are derived from SHAPE reactivities (Deigan et al., 2009; Hajdin et al., 2013; Zarringhalam et al., 2012). Other types of chemical probing, such as DMS probing (Cordero et al., 2012) and lead probing (Kolberg et al., 2023), as well as enzymatic probing methods, such as PARS (Kertesz et al., 2010; Wan et al., 2014), also yield signals that are readily converted into pseudo-energies for either paired or unpaired sequence positions. In a more elaborate approach, position-specific pseudo-energies are computed indirectly by minimizing a total error estimate for both experimental signal and thermodynamic parameters (Washietl et al., 2012). SCFG-based folding algorithms, such as PPfold (Sükösd et al., 2012) and ProbFold (Sahoo et al., 2016), are similarly modified to incorporate external evidence. A fully probabilistic approach was suggested by Eddy (2014) and implemented by Deng et al. (2016).

The performance of RNA structure predictions with empirical pseudo-energy contributions is a consequence of the accuracy of the standard energy model (Turner and Mathews, 2010). As it provides an excellent approximation, the predicted energy of a biologically relevant RNA structure cannot differ drastically from that of the predicted ground state, although the predicted structure itself might be very different. Small modifications of the energy model are therefore sufficient to nudge the ensemble of structures toward the structures that harbor the features supported by external empirical evidence.

In Section 2 of this contribution, we consider pseudo-energy contributions as a systematic means of representing extrinsic information on conserved secondary structures. Despite the generally good performance of TurboFold and RNAalifold, pseudo-energies derived directly from MSAs do not seem to have been used for the folding of single RNA sequences so far. Such an approach is easy to implement in the ViennaRNA package because it features a generic interface for position- and base pair-specific pseudo-energies (Lorenz et al., 2016). The ensemble of alignment-based consensus structures computed by RNAalifold can be used to derive pseudo-energies for base pairs in a very natural manner. We show in Section 3 that these pseudo-energies yield substantial improvements in the accuracy of secondary structure predictions. We then proceed to compare the consensus-based information with the evidence obtainable from chemical probing data.

2. THEORY

2.1. Secondary structures

We consider an RNA sequence of interest X. Moreover, we denote by $A$ a (pairwise or multiple) sequence alignment that contains X as one of its rows. A secondary structure s (on X or $A$ ) is a set of base pairs such that every sequence or alignment position i is contained in at most one base pair. Two base pairs (i, j) and (k, l) are said to be crossing if $k < i < l < j$ or $i < k < j < l$ . Throughout, we consider only crossing-free secondary structures.

A secondary structure s is compatible with X, if every base pair (i, j) in s adheres to certain base pairing rules. In the standard RNA model, only the pairs GC, CG, AU, UA, GU, and UG are allowed, whereas all other combinations of nucleotides are forbidden. For alignments $A$ , the pairing rule is defined in terms of two alignment columns $A_{i}$ and $A_{j}$ .

RNAalifold, for instance, requires that $A_{i} (S)$ and $A_{j} (S)$ form one of these six canonical base pairs for almost all rows S of $A$ while tolerating a small number of exceptions (Bernhart et al., 2008). It is possible to allow also arbitrary “nonstandard” base pairs and to associate them with energies or scores that discourage noncanonical pairings. This route is often taken in SCFG-based approaches. The set of all secondary structures that are compatible with X will be denoted by $Ω = Ω (X)$ . Analogously, we write $Ω (A)$ for the set of all structures compatible with the alignment $A$ .
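The definitions of this section translate directly into a few set operations. The following minimal Python sketch assumes 1-based positions and base pairs stored as tuples (i, j) with i < j; the helper names are hypothetical and are not part of any published package.

```python
# Canonical pairs allowed in the standard RNA model.
CANONICAL = {"GC", "CG", "AU", "UA", "GU", "UG"}


def is_crossing(pair_a, pair_b):
    """True if (i, j) and (k, l) cross: k < i < l < j or i < k < j < l."""
    (i, j), (k, l) = pair_a, pair_b
    return k < i < l < j or i < k < j < l


def is_secondary_structure(pairs):
    """Every position occurs in at most one pair and no two pairs cross."""
    pairs = sorted(pairs)
    positions = [x for pair in pairs for x in pair]
    if len(positions) != len(set(positions)):
        return False
    return not any(
        is_crossing(pairs[a], pairs[b])
        for a in range(len(pairs))
        for b in range(a + 1, len(pairs))
    )


def is_compatible(sequence, pairs):
    """Pairing rule for a single sequence X (positions are 1-based)."""
    return all(sequence[i - 1] + sequence[j - 1] in CANONICAL for i, j in pairs)
```

For an alignment, `is_compatible` would have to be replaced by a column-wise rule that tolerates a small number of exceptions, as described above for RNAalifold.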

2.2. Features in secondary structures

Consider the set of secondary structures Ω for a fixed sequence X. Intuitively, a feature μ in an RNA structure s is a pattern of base pairs and/or unpaired bases. Examples of features are paired positions (i), unpaired positions $\neg (i)$, base pairs (i, j), the loops appearing in the recursions for secondary structure prediction (Lorenz et al., 2016; Zuker and Stiegler, 1981), entire helices, and the abstract shapes in the sense of Giegerich et al. (2004). From a formal point of view, it is useful to identify a feature with the set μ of secondary structures that carry this feature and to write $s \in μ$ if the secondary structure s has feature μ. For a given sequence, thus $μ \subseteq Ω$. A secondary structure s has both features μ and $μ'$ if $s \in μ \cap μ'$.

Definition 1.
Two features $μ'$ and $μ''$ are incompatible if $μ' \cap μ'' = \emptyset$ .

Paired and unpaired nucleotides (i) and $\neg (i)$ are incompatible features. Similarly, the base pair (i, j) is incompatible with each of the unpaired positions $\neg (i)$ and $\neg (j)$ . Another example of incompatible features includes crossing base pairs.

Denote by p(s) the probability of observing a particular secondary structure s for a fixed sequence X. A feature μ is then observed with the probability $p [μ] := \sum_{x \in μ} p (x)$. The probability of not observing feature μ is then $p [\neg μ] = p [Ω ∖ μ] = 1 - p [μ]$.

Definition 2.
A feature μ is dominating if $p [μ] > p [\neg μ]$, i.e., if $p [μ] > 1 / 2$.

Lemma 1.
Let $D$ be a set of dominating features. Then $D$ does not contain a pair of incompatible features.

Proof. Suppose $μ'$ and $μ''$ are incompatible and $μ' \in D$ , that is, $p [μ'] > 1 / 2$ . Then $μ'' \subseteq Ω ∖ μ'$ and thus $p [μ''] \leq p [\neg μ'] = 1 - p [μ'] < 1 / 2$ and thus $μ'' \notin D$ .
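Lemma 1 can be checked numerically on a toy ensemble. The sketch below enumerates all crossing-free structures of a short sequence by brute force, assigns a purely illustrative energy of -2 kcal/mol per base pair (not the Turner model), and collects the base pairs with probability above 1/2. The sequence, the energy value, the value of RT, and all function names are assumptions made for the example.

```python
import math

CANONICAL = {"GC", "CG", "AU", "UA", "GU", "UG"}
RT = 0.616  # kcal/mol near 37 °C; assumed value for illustration


def enumerate_structures(seq, min_loop=3):
    """All crossing-free structures on seq as frozensets of 0-based pairs."""
    def rec(i, j):
        if i >= j:
            return [frozenset()]
        result = list(rec(i + 1, j))                # position i unpaired
        for k in range(i + min_loop + 1, j + 1):    # position i paired with k
            if seq[i] + seq[k] in CANONICAL:
                for inner in rec(i + 1, k - 1):
                    for outer in rec(k + 1, j):
                        result.append(inner | outer | {(i, k)})
        return result
    return rec(0, len(seq) - 1)


def pair_probabilities(seq, energy_per_pair=-2.0):
    """Boltzmann base pair probabilities under the toy energy model."""
    structures = enumerate_structures(seq)
    weights = [math.exp(-energy_per_pair * len(s) / RT) for s in structures]
    z = sum(weights)
    probs = {}
    for s, w in zip(structures, weights):
        for pair in s:
            probs[pair] = probs.get(pair, 0.0) + w / z
    return probs


def incompatible(p, q):
    """Two base pairs are incompatible if they share a position or cross."""
    (i, j), (k, l) = p, q
    shares_position = len({i, j, k, l}) < 4
    crossing = k < i < l < j or i < k < j < l
    return shares_position or crossing


probs = pair_probabilities("GGGGAAAACCCC")
dominating = sorted(pair for pair, p in probs.items() if p > 0.5)
conflicts = [
    (p, q)
    for a, p in enumerate(dominating)
    for q in dominating[a + 1:]
    if incompatible(p, q)
]
```

Whatever energy model is plugged in, `conflicts` stays empty, which is exactly the statement of the lemma restricted to base pair features.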

Lemma 1 generalizes the observation in Ding et al. (2005) that the centroid structure of a Boltzmann ensemble, which by definition consists of all base pairs with $p [(i, j)] > 1 / 2$, is crossing free:

Corollary 1.
The dominating set of base pairs $C = \{(i, j) \mid p [(i, j)] > 1 / 2\}$ is crossing free.

2.3. Energies and pseudo-energies

The standard energy model for nucleic acids (Turner and Mathews, 2010) comprises contributions for stacked base pairs and loops that depend on their sequence. These energy contributions derive from a large number of precise thermodynamic measurements. The free energy $G_{0} (s)$ of a given secondary structure s is given as the sum of these contributions. A pseudo-energy $Γ_{μ}$ for a feature μ, in contrast, is an additive contribution to the energy G(x) of every secondary structure $x \in μ$. It is usually derived from some form of evidence, such as the reactivity of a probing reagent or a conservation measure, that in general is not associated with a measurement of thermodynamic quantities. It is motivated by the observation that the probability of encountering a secondary structure with feature μ in the Boltzmann ensemble is determined by the partial partition function $Z [μ] = \sum_{s \in μ} \exp (- G_{0} (s) / R T)$. Adding the pseudo-energy $Γ_{μ}$ to the energy model, $G (x) = G_{0} (x) + Γ_{μ}$ for $x \in μ$, changes the partial partition function for structures with feature μ by a factor $\exp (- Γ_{μ} / R T)$ and thus increases the abundance of the feature whenever $Γ_{μ} < 0$. By additivity, pseudo-energies $Γ_{μ}$ can be included simultaneously for arbitrary feature sets $M$: $G (x) = G_{0} (x) + \sum_{μ \in M : x \in μ} Γ_{μ}$ (1)
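The effect of a single pseudo-energy on the abundance of its feature follows from the scaling of the partial partition function: $Z [μ]$ is multiplied by $\exp (- Γ_{μ} / R T)$, whereas the contribution of all structures without μ is unchanged. A minimal sketch of this bookkeeping, with an assumed value of RT and a hypothetical function name:

```python
import math

RT = 0.616  # kcal/mol near 37 °C; assumed value for illustration


def reweighted_probability(p_mu, gamma, rt=RT):
    """Probability of feature mu after adding pseudo-energy gamma to every
    structure that carries mu (structures without mu keep their weight)."""
    boost = math.exp(-gamma / rt)  # factor applied to Z[mu]
    return p_mu * boost / (p_mu * boost + (1.0 - p_mu))
```

In terms of log-odds, the pseudo-energy simply shifts $\ln (p [μ] / p [\neg μ])$ by $- Γ_{μ} / R T$, so a bonus of about 1 kcal/mol already changes the odds by a factor of roughly five at this temperature.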

Chemical probing experiments typically provide position-specific pseudo-energies $Γ_{(i)}$ or $Γ_{\neg (i)}$ for paired or unpaired nucleotides, respectively. As another example, TurboFold includes pseudo-energies of the form $Γ_{(i, j)} = - a \ln Π_{(i, j)}$ derived from pairing “proclivities” $Π_{(i, j)}$ that, in turn, are computed from the base pair probabilities of related sequences (Harmanci et al., 2011).

The recursions appearing in the dynamic programming algorithms for RNA folding can incorporate certain types of pseudo-energy contributions very easily. This pertains, in particular, to unpaired positions $\neg (i)$ , paired positions (i), base pairs (i, j), as well as specific loops. Entire hairpins enclosed by pair (i, j) or interior loops delimited by two pairs (i, j) and (k, l) are also consistent with the folding algorithms. The ViennaRNA package provides a generic interface for this purpose (Lorenz et al., 2016).

The conversion of extrinsic information to pseudo-energies usually uses empirical expressions (Deigan et al., 2009; Harmanci et al., 2011), often motivated by some fitting procedure. In case the extrinsic information can be quantified as the probability $p [μ]$ that the feature μ is present, the pseudo-energy $Γ_{μ}$ for feature μ is given by a scaled log-odds-ratio (Cordero et al., 2012): $Γ_{μ} = - R T \ln \frac{p [μ]}{p [\neg μ]}$ (2)

For chemical probing data, $p [(i)]$ or $p [\neg (i)]$ can be estimated directly by a comparison of the empirically determined probing signals with known secondary structures (Kolberg et al., 2023; Sükösd et al., 2013).

Whenever the secondary structures s can be associated with an energy(-like) quantity $ε (s)$, the Boltzmann distribution assigns a probability $p_{ε} (s) = \exp (- β ε (s)) / Z_{ε}$, where $β = 1 / (R T)$ is the inverse temperature and the partition function $Z_{ε} := \sum_{s \in Ω} \exp (- β ε (s))$ serves as normalization factor. In our setting the energy model ε will not be the biophysically realistic Turner energy model (Turner and Mathews, 2010). Instead we assume that ε describes the pertinent extrinsic information. The exponential relation between $ε (s)$ and p(s) implicitly assigns energies to secondary structures also in a setting where probabilities are computed, for example, from SCFGs with parameters inferred with some learning scheme as in Do et al. (2006) and Dowell and Eddy (2004). Introducing the constrained partition function $Z_{ε} [μ] := \sum_{s \in μ} \exp (- β ε (s))$, we can express feature probabilities as $p [μ] = Z_{ε} [μ] / Z_{ε}$. Partition functions are related to the corresponding ensemble free energies by $G_{ε} = - (1 / β) \ln Z_{ε}$. A short computation shows that Eq. (2) can be expressed equivalently as $Γ_{μ} = - (1 / β) \ln \frac{p [μ]}{p [\neg μ]} = G_{ε} [μ] - G_{ε} [\neg μ],$ (3) where $G_{ε} [μ]$ and $G_{ε} [\neg μ]$ are the ensemble free energies with regard to the model ε constrained to structures containing and not containing μ, respectively. The pseudo-energy for feature μ thus has a simple interpretation as the difference of the two constrained ensemble free energies for the energy model describing the extrinsic evidence. Returning to the notion of dominating features, we observe the following:
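The short computation behind Eq. (3) can be spelled out as follows. It assumes that the constrained ensemble free energies are defined in analogy to the unconstrained case, $G_{ε} [μ] := - (1 / β) \ln Z_{ε} [μ]$ and $G_{ε} [\neg μ] := - (1 / β) \ln Z_{ε} [\neg μ]$ with $Z_{ε} [\neg μ] = Z_{ε} - Z_{ε} [μ]$:

$\begin{array}{rcl} Γ_{μ} & = & - \frac{1}{β} \ln \frac{p [μ]}{p [\neg μ]} = - \frac{1}{β} \ln \frac{Z_{ε} [μ] / Z_{ε}}{Z_{ε} [\neg μ] / Z_{ε}} \\ & = & - \frac{1}{β} \ln Z_{ε} [μ] + \frac{1}{β} \ln Z_{ε} [\neg μ] = G_{ε} [μ] - G_{ε} [\neg μ] \end{array}$

The normalization $Z_{ε}$ cancels, so only the two constrained partition functions enter the pseudo-energy.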

Corollary 2.

The pseudo-energy $Γ_{μ}$ is negative if and only if the feature μ is dominating.

Thus, secondary structures are stabilized by pseudo-energy contributions only for dominating features.

2.4. Estimation of feature probabilities

Given an MSA, probabilities $p [(i, j)]$ of base pairs can be obtained directly from a probabilistic model for computing a consensus structure for the aligned sequences. In RNAalifold, the energy of a consensus secondary structure, defined as a crossing-free set of pairs of alignment columns, is defined as the average of energies of the secondary structures for the aligned sequences, plus an optional bonus term for sequence covariation (Bernhart et al., 2008; Hofacker et al., 2002). Partition functions for this model computed with McCaskill’s algorithm (McCaskill, 1990) readily yield base pairing probabilities $p [(i, j)]$ . Pfold (Knudsen and Hein, 2003) uses SCFGs instead of the thermodynamic energy model and accounts for the phylogenetic relationships among the input sequences instead of using a simple average. LocaRNA-P (Will et al., 2012) computes base pairing probabilities from combined sequence–structure alignments using a sparsified version of the Sankoff algorithm (Hofacker and Stadler, 2004; Sankoff, 1985).

It is also possible to convert empirical base pair propensities into corresponding feature probabilities $p [(i, j)]$ for base pairs. This can be achieved either by comparing the propensities with reference structures in the same way as for paired or unpaired nucleotides or by following the idea of MIfold (Freyhult et al., 2005). There, mutual information or covariance measures were used as base pairing propensities $ε_{i j}$ from which a secondary structure was computed using the maximum weighted circular matching model (Nussinov and Jacobson, 1980), that is, as the crossing-free set of base pairs maximizing the sum of the pairing propensities. The corresponding partition functions $Z_{i j}$ for the interval from alignment column i to alignment column j satisfy the recursion $Z_{i j} = Z_{i + 1, j} + \sum_{i < k \leq j} Z_{i + 1, k - 1} Z_{k + 1, j} \exp (- β ε_{i k}), \quad Z_{i i} = 1,$ (4) where β is a parameter corresponding to the inverse temperature. The backward recursion of McCaskill’s algorithm (McCaskill, 1990) then yields base pair probabilities $p [(i, j)]$.
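The forward recursion of Eq. (4) can be sketched in a few lines of Python. In each term of the sum, column i is paired with column k, and the Boltzmann weight is taken for that pair. Two conventions are assumptions of this sketch rather than statements from the text: empty intervals contribute a factor of 1, and pairs without an entry in the score table get a score of 0. The function name is hypothetical.

```python
import math
from functools import lru_cache


def circular_matching_partition_function(eps, n, beta=1.0):
    """Partition function over columns 1..n for pair scores eps[(i, k)].

    Pairs missing from `eps` are scored 0 and empty intervals contribute 1
    (both conventions are assumptions of this sketch).
    """
    @lru_cache(maxsize=None)
    def z(i, j):
        if i >= j:                       # empty interval or single column
            return 1.0
        total = z(i + 1, j)              # column i unpaired
        for k in range(i + 1, j + 1):    # column i paired with column k
            weight = math.exp(-beta * eps.get((i, k), 0.0))
            total += z(i + 1, k - 1) * z(k + 1, j) * weight
        return total

    return z(1, n)
```

With all scores equal to zero the recursion simply counts crossing-free matchings; for four columns there are nine of them, which gives a quick sanity check of an implementation.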

In Section 3 we will restrict ourselves to the base pairing probabilities provided by RNAalifold as a measure of phylogenetic conservation.

2.5. Restrictions on bonus terms for consensus structures

In models such as RNAalifold that compute a global consensus structure, a sequence position i appears as unpaired for two very different reasons, namely, (1) if i is unpaired in the structure of each of the sequences in the MSA and (2) if i is paired with different, and therefore in general incompatible, pairing partners in the different sequences. In other words, an unpaired position in a consensus structure may simply reflect the absence of a conserved base pair. A large value of $p [\neg (i)]$ computed with RNAalifold, therefore, must not be misinterpreted as evidence of conserved unpairedness. As an extreme case, alignments of sufficiently different sequences without conserved secondary structure yield no predicted base pairs in RNAalifold (see Fig. 1). Clearly, this is not an indication of conserved unpairedness throughout the alignments. Extensive computational studies showed that residual structural similarity vanishes when sequences differ by more than about 20% of randomly placed substitutions (Fontana et al., 1993). Unpaired bases, therefore, will be predicted also outside local regions that harbor conserved secondary structure elements, for example, in mRNAs. This argument was also the motivation in Gruber et al. (2008) for measuring structural conservation as the ratio of the consensus folding energy and the average energy of the (unconstrained) secondary structure of the individual sequences. This structure conservation index effectively estimates the average fraction of the folding energy that is explained by the consensus structure. More precisely, it evaluates the consensus base pairs, as the open structure without base pairs serves as a reference with energy 0. As a consequence, only the paired positions in the consensus but not their unpaired counterparts are informative.

FIG. 1.

Larger MSAs of similar sequences with randomly placed substitutions (here 70%, 80%, and 90% identity with a common reference of length 80 nt) do not exhibit consensus base pairs. Both the number of base pairs in the minimum energy structure (consensus bps) by RNAalifold and the number of positions that are paired with a probability $p (i) > 0.5$ decrease rapidly with the number N of sequences in the alignment. As expected, the fraction of base pairs in consensus structures strongly depends on the average sequence similarity. For instance, at 80% pairwise identity, virtually no significant base pairs are left for N > 10. MSAs, multiple sequence alignments.

A similar argument can be made for probing experiments. For a position at which a probing reagent designed to detect, say, unpairedness does not yield a signal, it is impossible to distinguish from the data whether the position is truly paired or whether the signal is missing for technical reasons. From this point of view, the commonly used way of analyzing SHAPE-MaP data in terms of pseudo-energies that explicitly stabilize paired positions with low reactivity (Low and Weeks, 2010) might not be optimal.

In most situations it will be plausible to assume that an experiment can determine the presence of a feature μ, whereas the absence of μ cannot be distinguished from missing data. As there is positive support for μ only if $p [μ] > p [\neg μ]$, supporting pseudo-energy terms are associated only with dominating features. In other words, only negative pseudo-energy contributions are considered. Therefore, we have $Γ_{μ} = \min \left\{ - R T \ln \frac{p [μ]}{p [\neg μ]}, 0 \right\}$ (5)
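Eq. (5) amounts to a clipped log-odds conversion. The following sketch uses an assumed value of RT and additionally clips probabilities away from 0 and 1 to keep the logarithm finite; this numerical safeguard and the function name are not taken from the text.

```python
import math

RT = 0.616  # kcal/mol near 37 °C; assumed value for illustration


def bonus_energy(p, rt=RT, clip=1e-9):
    """Pseudo-energy of Eq. (5): negative for dominating features, else 0."""
    p = min(max(p, clip), 1.0 - clip)   # numerical safeguard (assumption)
    return min(-rt * math.log(p / (1.0 - p)), 0.0)
```

A feature with p = 0.9 receives a bonus of about -1.35 kcal/mol under these assumptions, whereas any feature with p at or below 1/2 contributes nothing.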

In the case of the RNAalifold model, therefore, we consider only the set $C$ of dominating base pairs, which, as argued previously, coincides with the centroid structure in the RNAalifold model.

3. IMPLEMENTATION AND EVALUATION

3.1. Datasets

In order to test and benchmark conservation-based pseudo-energies, we selected nine seed alignments $A$ from Rfam v14.9 that contained the largest number of seed sequences with known 3D structures and were free of pseudoknots. A complete list is provided in the Supplementary Data S1. If the focal sequence X was not contained in the MSA $A$, then we added X to $A$ using mafft v7.310 (Katoh and Frith, 2012). The MSAs were reduced by iteratively removing sequences with a similarity of more than 80% to any other sequence, as very similar sequences share structural similarities even in the absence of a consensus structure that is under selective constraint. This also ensures that phylogenetically closely related sequences are not overrepresented in the phylogenetic data. The lengths of the 28 focal sequences X varied between 74 and 184 nucleotides (see Supplementary Data S1 for details).

Reference structures for the focal sequences X were retrieved from RNAcentral. By definition these structures are free of pseudoknots. Positions of pseudoknots therefore remain “unpaired” in the reference structures. Details can be found in the Supplementary Data S1. The accuracy of a predicted secondary structure was quantified by the Matthews correlation coefficient (MCC) based on the comparison of predicted base pairs and the base pairs of the reference structures.
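A base pair-level evaluation of this kind can be sketched as follows. The text does not specify how true negatives are counted; the sketch assumes that every one of the n(n-1)/2 position pairs of a sequence of length n that is neither predicted nor in the reference counts as a true negative. The function name is hypothetical.

```python
import math


def pair_metrics(predicted, reference, n):
    """MCC, F-value, PPV, and sensitivity for two sets of base pairs (i, j).

    True negatives are taken from all n*(n-1)/2 position pairs, which is an
    assumption of this sketch.
    """
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    tn = n * (n - 1) // 2 - tp - fp - fn
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f_value = (
        2 * ppv * sensitivity / (ppv + sensitivity) if ppv + sensitivity else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"MCC": mcc, "F": f_value, "PPV": ppv, "sensitivity": sensitivity}
```

Because true negatives vastly outnumber all other classes for realistic sequence lengths, the MCC computed this way is close to the geometric mean of PPV and sensitivity.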

In order to compare the information provided by chemical probing data with the impact of phylogenetic information, we used published datasets generated with different chemical probing methods for Escherichia coli. Comparisons of structure predictions with and without inclusion of probing data were restricted to transcripts with sufficient coverage.

DMS-seq comprises DMS probing data generated by Burkhardt et al. (2017) to investigate the link between operon structure, translation efficiency, and RNA structure. Data were obtained through the RASP atlas of transcriptome-wide RNA secondary structure probing data (Li et al., 2020). We selected 19 sequences of lengths 77 to 682 nt in 8 Rfam alignments that were covered by the data. The 16S ribosomal RNA (rRNA) and 23S rRNA sequences were separated according to their domains as described in Jaeger et al. (1989) and Mathews et al. (1999).

Led-Seq comprises lead probing data generated by Kolberg et al. (2023). Comparisons with reference structures were carried out for 24 sequences of lengths 74 to 377 nt contained in 5 Rfam alignments.

SHAPE-MaP comprises SHAPE-MaP data generated by Mustoe et al. (2018) for a transcriptome-wide survey of mRNA structures. The dataset was downloaded from RASP (Li et al., 2020). A comparison with reference structures was possible for 12 sequences of lengths 74 to 377 nt from 4 Rfam alignments.

Details on the investigated transcripts and reference structures can be found in the Supplementary Data S1.

3.2. Software

RNAsoftcons implements the following workflow:

1. Using an MSA $A$ as input, the consensus structure ensemble for $A$ is computed using RNAalifold -p. The set $C$ of centroid base pairs and their corresponding probabilities $p (i', j')$ are extracted from the corresponding base pair probability matrix. In addition, the probabilities $p_{i'} := \sum_{j' < i'} p (j', i') + \sum_{j' > i'} p (i', j')$ (6) that alignment column $i'$ is part of a consensus base pair are computed.
2. The alignment columns $i'$ and $j'$ of the MSA $A$ are translated to the corresponding sequence positions i and j of the focal sequence X. If row X of $A$ shows a gap in column $i'$ or $j'$, then the consensus base pair $(i', j')$ is discarded. Similarly, if the nucleotides $X_{i}$ and $X_{j}$ at positions i and j cannot form a Watson–Crick or wobble pair, the base pair is ignored. As a result, we obtain a set $C_{X}$ of base pairs and associated probabilities $p [(i, j)]$. The latter are converted into bonus energies $Γ_{(i, j)}$ according to Eq. (5). Analogously, the probabilities $p_{i'}$ that alignment columns are paired are associated with the projected positions i in the focal sequence X. These are likewise converted to pseudo-energies using Eq. (5). In the following we refer to these pseudo-energies as base pair-wise or position-wise phylogenetic soft constraints, respectively.
3. RNAfold is used to compute the secondary structure of X using $Γ_{(i, j)}$ for $(i, j) \in C_{X}$ as soft constraints. Alternatively, the position-wise phylogenetic soft constraints $Γ_{(i)} < 0$ are used.
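Steps 1 and 2 of this workflow can be illustrated with a small, self-contained sketch. It is not the RNAsoftcons implementation: it assumes that the consensus base pair probabilities are already available as a dictionary keyed by 1-based alignment columns (i', j') with i' < j', that gaps are written as "-" or ".", and that RT has the value given in the code. All function names are hypothetical.

```python
import math

CANONICAL = {"GC", "CG", "AU", "UA", "GU", "UG"}
RT = 0.616  # kcal/mol near 37 °C; assumed value for illustration


def bonus_energy(p, rt=RT, clip=1e-9):
    """Clipped log-odds conversion of Eq. (5)."""
    p = min(max(p, clip), 1.0 - clip)
    return min(-rt * math.log(p / (1.0 - p)), 0.0)


def column_to_position(aligned_row, gap_chars="-."):
    """Map 1-based alignment columns to 1-based positions (None for gaps)."""
    mapping, pos = {}, 0
    for col, ch in enumerate(aligned_row, start=1):
        if ch in gap_chars:
            mapping[col] = None
        else:
            pos += 1
            mapping[col] = pos
    return mapping


def paired_column_probabilities(bpp):
    """Eq. (6): probability that a column is part of a consensus base pair."""
    p_col = {}
    for (ci, cj), p in bpp.items():
        p_col[ci] = p_col.get(ci, 0.0) + p
        p_col[cj] = p_col.get(cj, 0.0) + p
    return p_col


def project_constraints(bpp, aligned_row):
    """Return base pair-wise and position-wise bonus energies for row X."""
    col2pos = column_to_position(aligned_row)
    seq = "".join(ch for ch in aligned_row if ch not in "-.").upper()
    seq = seq.replace("T", "U")
    pairwise = {}
    for (ci, cj), p in bpp.items():
        i, j = col2pos[ci], col2pos[cj]
        if p <= 0.5 or i is None or j is None:
            continue                      # not a centroid pair, or gap in X
        if seq[i - 1] + seq[j - 1] not in CANONICAL:
            continue                      # X cannot form this pair
        pairwise[(i, j)] = bonus_energy(p)
    positionwise = {}
    for col, p in paired_column_probabilities(bpp).items():
        i = col2pos[col]
        if i is not None and p > 0.5:
            positionwise[i] = bonus_energy(p)
    return pairwise, positionwise
```

The resulting dictionaries hold the soft constraints of step 3; passing them to a folding engine is left out here, since that part depends on the interface of the folding library.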

The source code and implementation details for RNAsoftcons can already be accessed at https://www.github.com/ViennaRNA/softconsensus. Starting with the next release (v2.6.5), it will further be included in the ViennaRNA Python bindings available from https://pypi.org/project/ViennaRNA. A full integration into the ViennaRNA Package and more detailed example code in the official ViennaRNA Manual (https://viennarna.readthedocs.io) will follow later.

The script refold.pl uses the consensus structure $s^{*}$ of the MSA $A$ computed by RNAalifold as a set of hard constraints. More precisely, the base pairs of $s^{*}$ are projected to the focal sequence X such that base pairs mapping to a gap character in the row X of the MSA are omitted. In addition, a base pair is removed if its projection to X consists of two bases that do not form one of the six canonical base pairs. The remaining base pairs are used as hard constraints for folding X using RNAfold -C. The script is part of the utilities distributed through the ViennaRNA GitHub site.

For comparison we also used PETfold (Seemann et al., 2008), which identifies base pairs that have high probabilities of being conserved and of being energetically favorable and extracts the consensus structure using a maximum expected accuracy scoring. Data were processed using the PETfold web service (Seemann et al., 2011).

In order to compare with TurboFold, we extracted the individual sequences for the (possibly augmented) MSAs $A$ by removing all gap characters. The sequences were then used as input for TurboFold v6.3 (Harmanci et al., 2011).

The inclusion of phylogenetic information consistently improves the accuracy of the prediction (measured in terms of base pairs) from about 65% for the pure thermodynamic model to well above 90% (see Table 1). The differences between the different consensus and consensus-aware methods are small. We observe a small advantage for the approach advocated here, presumably because it uses the consensus information as efficiently as the alternative and, in addition, accounts for the small sequence-specific deviations from the consensus structure.

Table 1.
Performance Metrics for RNA Secondary Structure Prediction for Different Approaches That Incorporate Phylogenetic Information

Method       MCC    F-val  PPV    Sensitivity
RNAfold      0.662  0.666  0.629  0.703
refold.pl    0.936  0.937  0.919  0.955
RNAsoftcons  0.948  0.949  0.922  0.977
TurboFold    0.931  0.932  0.923  0.940
PETfold      0.936  0.937  0.921  0.954

See text for details on the dataset.

MCC, Matthews correlation coefficient; PPV, Positive Predictive Value.

Bold represents best performance.

Using RNAsoftcons, there is an improvement in accuracy in the majority of predicted structures. We consistently reach a level of at least 80% and sometimes perfect predictions. The average improvement in MCC is 0.29 (standard deviation [SD] = 0.28) overall and 0.45 (SD = 0.23) for sequences where unconstrained folding yields an $MCC < 0.8$ (see von Löhneysen et al., 2023). In the top three panels of Figure 2, we compare the predictions with and without pseudo-energies separately for the reference structures with sufficient coverage in the chemical probing datasets to allow for a direct comparison with the experimental information described in Section 3.1. Using information on conserved base pairs, we observe an improvement in accuracy in the majority of the cases. Moreover, we have encountered only a single RNA (16S rRNA domain IV) for which the prediction deteriorates, that is, where the phylogenetic information is misleading.

FIG. 2.
Scatter plots comparing the Matthews correlation coefficients (MCC) of secondary structure predictions with and without extrinsic information. Each data point is an RNA with known reference structure with sufficient coverage in one of the three chemical probing datasets (DMS-seq, Led-Seq, SHAPE-MaP). Data points in the upper-left triangle indicate an improvement because of the inclusion of pseudo-energies. Top: Structure predictions using pseudo-energies from base pair-wise phylogenetic conservation. Middle: Structure predictions using pseudo-energies from position-wise conserved pairedness. Note that the data in these two rows are independent of chemical probing and differ only in the selection of RNA transcripts. Bottom: Structure predictions using pseudo-energies from chemical probing data.

3.3. Base pair-wise versus position-wise pseudo-energies

Method	MCC	F-val	PPV	Sensitivity
RNAfold	0.662	0.666	0.629	0.703
refold.pl	0.936	0.937	0.919	0.955
RNAsoftcons	0.948	0.949	0.922	0.977
TurboFold	0.931	0.932	0.923	0.940
PETfold	0.936	0.937	0.921	0.954
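The accuracy measures tabulated above can be computed from sets of base pairs. The following Python sketch assumes pseudoknot-free dot-bracket input and, as one common convention that may differ from the one used for the table, counts every index pair i < j that is neither predicted nor present in the reference as a true negative. The function names are illustrative and not part of any of the tools compared here.

```python
from math import sqrt


def pairs_from_dot_bracket(structure):
    """Return the set of base pairs (i, j), 1-based, of a dot-bracket string."""
    stack, pairs = [], set()
    for pos, char in enumerate(structure, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def accuracy(reference, predicted):
    """PPV, sensitivity, F-value, and MCC for two structures of equal length."""
    n = len(reference)
    ref = pairs_from_dot_bracket(reference)
    pred = pairs_from_dot_bracket(predicted)
    tp = len(ref & pred)   # predicted pairs that are in the reference
    fp = len(pred - ref)   # predicted pairs that are not in the reference
    fn = len(ref - pred)   # reference pairs that were missed
    # Assumption: all remaining index pairs i < j are true negatives.
    tn = n * (n - 1) // 2 - tp - fp - fn
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    f_val = 2 * ppv * sens / (ppv + sens) if ppv + sens else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"MCC": mcc, "F-val": f_val, "PPV": ppv, "Sensitivity": sens}


# Toy example: the prediction misses the innermost pair of a hairpin.
scores = accuracy("((((...))))", "(((.....)))")
```

In this toy case all predicted pairs are correct (PPV of 1), while one of the four reference pairs is missed (sensitivity of 0.75).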

In contrast to phylogenetic information, which identifies consensus base pairs, chemical probing methods, reviewed, for example, by Strobel et al. (2018), Mitchell et al. (2019), and Solayman et al. (2022), yield position-wise information and thus also position-wise pseudo-energy contributions. Knowledge of paired positions, however, does not necessarily imply knowledge of base pairs. Consider, for example, the following artificial RNA sequence, which can form two structures (even with comparable free energies) that differ in 10 of their 13 base pairs but have exactly the same paired and unpaired positions. $\begin{array}{ll} \mathtt{GCGCGATTAACGCGCTATGCGGGAAACCCGCGATTACGCGC} & \\ \mathtt{(((((.....)))))...(((((...)))(((....)))))} & -9.30 \\ \mathtt{(((((.....(((((...))(((...))))))....)))))} & -8.50 \end{array}$ (7)
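The relation between the two structures in (7) can be checked mechanically. The short Python sketch below parses both dot-bracket strings, with the spaces introduced by typesetting removed, and compares their base pairs and their paired positions; the helper names are illustrative only.

```python
def base_pairs(structure):
    """Return the set of base pairs (i, j), 1-based, of a dot-bracket string."""
    stack, pairs = [], set()
    for pos, char in enumerate(structure, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def paired_positions(structure):
    """Return the set of positions that are involved in any base pair."""
    return {pos for pair in base_pairs(structure) for pos in pair}


SEQUENCE = "GCGCGATTAACGCGCTATGCGGGAAACCCGCGATTACGCGC"
STRUCTURE_A = "(((((.....)))))...(((((...)))(((....)))))"
STRUCTURE_B = "(((((.....(((((...))(((...))))))....)))))"

# Base pairs present in both structures.
shared_pairs = base_pairs(STRUCTURE_A) & base_pairs(STRUCTURE_B)
# True if both structures pair exactly the same positions.
same_profile = paired_positions(STRUCTURE_A) == paired_positions(STRUCTURE_B)
```

For the two strings as printed, the sets of paired positions are identical, whereas only the three pairs of the inner hairpin (21-29, 22-28, 23-27) occur in both structures. A purely position-wise signal therefore cannot separate the two alternatives.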

By definition, the two structures cannot be distinguished by means of a position-wise probing signal. This example raises the question of how much structural information can be inferred from position-wise data compared with pair-wise constraints. For the phylogenetic approach outlined previously, this question can be answered directly and empirically: It suffices to compute position-wise bonus energies according to Eq. (6) and to compare the accuracy of the resulting structure predictions with that obtained using pseudo-energies for RNAalifold centroid base pairs.

On the entire reference set, we observed an improvement for position-wise pseudo-energies that is slightly more moderate than for base-pairing information, with an average increase of the MCC by 0.26 (SD = 0.25) overall and 0.41 (SD = 0.20) for RNAs with an unconstrained MCC < 0.8. The data are shown in this form in the preliminary version of this work (von Löhneysen et al., 2023). The middle row in Figure 2 shows the same effect separately for the reference structures in the three chemical probing data sets. Notably, the number of sequences with perfect predictions decreases in comparison with the use of base pairing information.

It is not unexpected that predictions that are fairly accurate already without extrinsic information (accuracy $> 80\%$) do not show improvement. This can presumably be explained by the relatively small number of consensus pairs with $\Gamma [(i, j)] < 0$, which typically are already predicted correctly for such sequences. The pseudo-energy contribution then has little influence on the predicted structure.

3.4. Chemical probing versus phylogenetic information

Most currently available probing methods provide direct evidence only for cleavage at flexible, and thus preferentially unpaired, positions. The raw data, for example, counts of read ends, are first normalized over a locus or transcript such that the local expression level is taken into account. The normalized position-dependent probing signal $S_i$ must then be converted into a probability $p_{i} = p (S_{i})$ that position i is unpaired. Following Kolberg et al. (2023), the function p(S) can be estimated from a histogram that records, among the positions with a normalized signal in the interval $[S - \Delta S, S + \Delta S]$, the fraction $\hat{p}$ that is unpaired in the reference structure. A sigmoidal curve is then fitted to the histogram. Here we use $p (S) = \frac{1}{1 + \exp (b - a S)} + c,$ (8) which requires the estimation of the three parameters a, b, and c. The parameter c determines the saturation level for large signals and, thus, the combined level of errors in reference structures and the probing experiment.
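The histogram step and the functional form of Eq. (8) can be sketched in a few lines of Python. The sketch assumes non-overlapping bins of width $2 \Delta S$ and uses made-up toy data; the function names are illustrative, and the fit of a, b, and c to the histogram (for instance by a least-squares routine) is deliberately left out.

```python
from math import exp


def sigmoid(signal, a, b, c):
    """Eq. (8): p(S) = 1 / (1 + exp(b - a * S)) + c."""
    return 1.0 / (1.0 + exp(b - a * signal)) + c


def unpaired_fraction_histogram(signals, unpaired, bin_width):
    """Fraction of unpaired positions per signal bin.

    signals   -- normalized probing signal S_i for each position
    unpaired  -- True where position i is unpaired in the reference structure
    bin_width -- width of a bin, i.e. 2 * Delta S (assumed non-overlapping)
    Returns a dict {bin_center: fraction of positions that are unpaired}.
    """
    counts = {}
    for s, u in zip(signals, unpaired):
        index = int(s // bin_width)
        total, n_unpaired = counts.get(index, (0, 0))
        counts[index] = (total + 1, n_unpaired + int(u))
    return {(index + 0.5) * bin_width: n_unpaired / total
            for index, (total, n_unpaired) in sorted(counts.items())}


# Made-up toy data, for illustration only.
toy_signals = [0.05, 0.10, 0.15, 0.30, 0.35, 0.60, 0.65, 0.90]
toy_unpaired = [False, False, True, False, True, True, True, True]
histogram = unpaired_fraction_histogram(toy_signals, toy_unpaired, 0.25)
```

The resulting pairs of bin center and unpaired fraction are the data points to which the sigmoid is fitted.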

The bottom row in Figure 2 shows that chemical probing data of all three datasets result in a systematic improvement of the predicted secondary structures. However, the beneficial effect of the corresponding pseudo-energies is, in general, smaller than that of including structural conservation, even if only (un)pairedness is considered. Although this may be surprising at first glance, it is worth recalling that chemical probing is by no means a fail-safe way to distinguish paired from unpaired positions, that is, p(S) is not even close to a step function.

The excellent performance in particular of the pairing status inferred from phylogenetic information is at least, in part, the consequence of highly accurate, manually curated Rfam seed alignments. Pure sequence alignments recomputed with mafft reached comparable performance only when including about a dozen homologous sequences. We expect that pairwise sequence similarity will play a crucial role in particular for alignments covering only a moderate number of sequences. For details, we refer to von Löhneysen et al. (2023).

3.5. Influence of reference sets for probing data

The set of RNAs with accurately known structures is rather limited, in particular if transcripts are studied in a system other than a small number of model organisms. As noted in Stadler et al. (2024), it is possible to circumvent the need for a large enough set of reference structures by defining p(S) as the average probability that a position with signal in $[S - \Delta S, S + \Delta S]$ is unpaired in structure predictions based on the thermodynamic model. McCaskill’s partition function algorithm (McCaskill, 1990) can be used to compute the probability that a given sequence position is unpaired. As transcript boundaries are often not known exactly, and the accuracy of global folding is limited for long transcripts, we use a local folding approach here. RNAplfold (Bernhart et al., 2006) restricts the span of base pairs to a local interval and averages base pairing probabilities over all sequence intervals that contain a putative base pair, resulting in a more robust prediction of the probability $p^{*} (S)$ that a position with normalized empirical signal S is unpaired. More precisely, we computed the probabilities that a position is unpaired using RNAplfold -W 200 -L 180 -u 1 with both genomic strands as input and extracted $p^{*} (S)$ from all positions with a valid probing score in the datasets. We observed no relevant difference in the $p^{*} (S)$ curves between this local folding approach and unpaired probabilities generated by McCaskill’s partition function algorithm using RNAfold -p.

The predictive power of this alternative approach hinges on the accuracy of the standard energy model for RNA folding, which is far from perfect even though it is sufficient for some applications (see, e.g., Bugnon et al., 2022; Xu et al., 2012). Thus, $p^{*} (S)$ may serve as an approximation for p(S). The two quantities are related by $p (S) = p (u | u^{*}, S) p^{*} (S) + p (u | p^{*}, S) (1 - p^{*} (S))$ (9) where $a (S) : = p (u | u^{*}, S)$ and $b (S) : = p (u | p^{*}, S)$ are the conditional probabilities that a position with signal S is unpaired given that RNAplfold predicts it to be unpaired and paired, respectively. This equation can be rewritten in the form $p (S) = b (S) + (a (S) - b (S)) p^{*} (S)$ (10)

The key idea is now that a(S) and b(S) are less strongly dependent on the signal strength S. One can estimate a(S) and b(S) for each experimental protocol using datasets for which reference structures are readily available and then transfer these estimates to other systems, where reference structures are sparse or even unknown.
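The transfer step amounts to evaluating Eq. (10) bin by bin. The following Python sketch assumes that $p^{*} (S)$, a(S), and b(S) are available as tables indexed by the same signal bins; the numerical values are made up for illustration, and the function names are not taken from any existing tool.

```python
def combine(p_star, a, b):
    """Eq. (10): p(S) = b(S) + (a(S) - b(S)) * p*(S)."""
    return b + (a - b) * p_star


def estimate_p(p_star_curve, a_curve, b_curve):
    """Apply Eq. (10) to every signal bin present in all three curves.

    Each curve is a dict {bin_center: value}. p_star_curve comes from the
    thermodynamic prediction for the system of interest; a_curve and b_curve
    are protocol-specific estimates transferred from a reference dataset.
    """
    return {s: combine(p_star_curve[s], a_curve[s], b_curve[s])
            for s in sorted(p_star_curve)
            if s in a_curve and s in b_curve}


# Hypothetical values, for illustration only.
p_star_curve = {0.125: 0.30, 0.375: 0.50, 0.625: 0.70}
a_curve = {0.125: 0.80, 0.375: 0.85, 0.625: 0.90}
b_curve = {0.125: 0.20, 0.375: 0.40, 0.625: 0.60}
p_curve = estimate_p(p_star_curve, a_curve, b_curve)
```

By construction, the estimate interpolates linearly between b(S), reached when the prediction considers a position certainly paired, and a(S), reached when it considers the position certainly unpaired.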

Empirically, we observe that p(S) can be estimated well using Eq. (10) from the predicted pairing probabilities and the technology-dependent estimates a(S) and b(S) of the probabilities that positions predicted to be unpaired and paired, respectively, are in fact unpaired (see Fig. 3). In general, a(S) saturates quickly at values above 0.8, indicating that bases that are both predicted as unpaired and exhibit a large chemical probing signal are very likely to be unpaired in reality. This is true even when the signal S becomes very small. In contrast, the b(S) curve suggests that positions predicted as paired but showing large signals S are nevertheless unpaired with probabilities of about 60%.

FIG. 3.

Inference of the probability p(S) that a position with probing signal S is unpaired. Top: Conditional probabilities that bases are unpaired given that they are predicted to be paired and unpaired according to the thermodynamic model and a signal strength S is observed in probing experiments. Bottom: Estimates of p(S) from reference structures and with the help of Eq. (10). The curves $p^{*} (S)$ are shown for comparison.

3.6. Comparison of overall performance

Both phylogenetic information and chemical probing data can be used to refine and improve secondary structure predictions. Figure 4 summarizes the accuracies that are achieved for the secondary structures using various sources of extrinsic information.

FIG. 4.

Comparison of different methods for secondary structure prediction: (a) chemical probing data alone; (b) thermodynamic folding with Turner energy model (RNAfold). Chemical probing data used as pseudo-energies in thermodynamic folding: (c) p(S) inferred from reference structures, (d) p(S) estimated according to Eq. (10), (e) including low probing scores as evidence for pairedness. Phylogenetic information used as pseudo-energies in thermodynamic folding: (f) base pair-wise information, (g) position-wise information, and (h) combination of chemical probing and phylogenetic information.

The Led-Seq data provide two independent measurements for unpairedness. Although the 5′-OH data completely cover the 5′-end of each transcript, there is no signal for about the last dozen nucleotides. Complementarily, the 2′,3′-cP library covers the 3′-end and has a blind spot at the 5′-end. We computed the pseudo-energies as the average for all positions for which data were available from both libraries. The complete coverage may explain the good performance of Led-Seq in comparison with the other probing methods. We emphasize that the data shown here should not be misconstrued as a benchmarking of probing technologies. The datasets differ in age and coverage, and no effort has been made to account for such differences. The point we do wish to make here is that all of the probing methods fall short of the performance of the phylogenetic information by a quite substantial margin. At the same time, there is not much to be gained by combining probing and phylogenetic data.

4. DISCUSSION

Conservation of RNA secondary structures over evolutionary timescales causes sequence covariations through which consensus base pairs become detectable. Although reliable detection of covariation requires alignments of large numbers of RNA sequences, it is possible to identify likely consensus base pairs by extending secondary structure prediction algorithms to operate directly on the alignment. Embedding the sequence of interest into the alignment yields a projection of the consensus on the query. These data can then be used to improve the structure prediction specifically for the query sequence. This principle has long been known, and several tools have been available to compute consensus-based structures. Here we have shown that the conversion of the consensus structure into pseudo-energy contributions yields accurate predictions in an efficient and transparent manner. This conceptually simple approach is competitive in comparison with complex workflows, such as TurboFold, or more sophisticated statistical models, such as PETfold.

Revisiting the theoretical foundations, we observe that only positive information, that is, paired positions and base pairs in the consensus structure, can meaningfully be used to derive pseudo-energy contributions. Unpaired positions, in contrast, may also arise simply from the lack of a structural consensus. The benefits of phylogenetic pseudo-energies, moreover, depend critically on the quality of the alignments used to derive the consensus base pairs. This is not surprising: alignment errors imply conflicts between predicted base pairs and thus a reduction of the signal-to-noise ratio in the consensus prediction.

Information on conserved base pairs is somewhat more informative than position-wise data on the (un)pairedness, although the differences appear to be moderate. It is advisable, nevertheless, to make use of conserved base pairs, even though the use of pairedness status may seem easier from a data analysis point of view.

A closely related topic is the comparison of the relative effects, and possibly the combinations, of structural information inferred from conservation and chemical probing measurements. As a first step, we compare examples of probing data obtained with different chemistries: DMS-seq, Led-Seq, and SHAPE-MaP. In all cases we observe on average less accurate secondary structures from probing data than from conservation. It should be noted that the RNAs for which reference structures are available are expected to have conserved structures covering the entire transcript. In mRNAs and long non-coding RNAs, however, such global consensus structures cannot be expected, and hence, the consensus-based approach is not applicable.

Moreover, some structured RNAs require transitions between different conformations for their biological function. Transcriptional riboswitches, for example, operate by forming alternative terminator and anti-terminator hairpins (Helmling et al., 2018; Scull et al., 2020). These structures are conserved but incompatible with each other. The conservation-based pseudo-energies as used in RNAsoftcons cannot account for such situations. A very general probabilistic framework to incorporate such evidence into structure prediction has been proposed by Eddy (2014). In a simpler setting, Washietl et al. (2012) proposed to determine position-wise pseudo-energies such that a loss function is minimized that combines the discrepancy between observed and predicted probabilities of (un)pairedness and the magnitude of the pseudo-energy terms. Assuming sufficiently accurate data, alternative structures are suitably represented in the resulting Boltzmann ensemble. As noted, for example, by Ritz et al. (2013), conserved alternative conformations sometimes can be found as suboptimal local minima in the Boltzmann ensemble. For chemical probing data, several methods for deconvolving structural ensembles have been developed (reviewed by Spitale and Incarnato, 2023), which may also be applicable to covariance data. It is also possible in some cases to obtain direct evidence for the conservation of helices that are incompatible with the most stable structure (Tsybulskyi and Meyer, 2022). A promising framework for aggregating information on incompatible conformations, possibly supported by different sources of information, is the ensemble tree proposed by Li and Reidys (2020). Alternative structures thus remain an interesting topic for future research.

Footnotes

AUTHORS’ CONTRIBUTIONS

S.V.L.: Conceptualization, software, formal analysis, visualization, and writing. T.S.: Software. Y.V.: Software. H.-T.Y.: Software. R.L.: Conceptualization, validation, and writing. I.H.: Conceptualization, validation, and writing. P.F.S.: Conceptualization, validation, supervision, and writing.

AUTHOR DISCLOSURE STATEMENT

The authors declare that they have no conflicting financial interests.

FUNDING INFORMATION

This work was funded by the Deutsche Forschungsgemeinschaft (grant number STA 850/48-1) and by the Austrian Science Fund (grant number F-80 and I-4520).

SUPPLEMENTARY MATERIAL

References

1. Bernhart, Hofacker, Stadler. Local RNA base pairing probabilities in large sequences. Bioinformatics, 2006; 22(5):614–615; doi: 10.1093/bioinformatics/btk014

2. Bernhart, Hofacker, Will, et al. RNAalifold: Improved consensus structure prediction for RNA alignments. BMC Bioinf, 2008; 9:474; doi: 10.1142/s0219720008003886

3. Bugnon, Edera, Prochetto, et al. Secondary structure prediction of long noncoding RNA: Review and experimental comparison of existing approaches. Brief Bioinf, 2022; 23:bbac205; doi: 10.1093/bib/bbac205

4. Burkhardt, Rouskin, Zhang, et al. Operon mRNAs are organized into ORF-centric structures that predict translation efficiency. eLife, 2017; 6:e22037; doi: 10.7554/eLife.22037

5. Cordero, Kladwang, VanLang, et al. Quantitative dimethyl sulfate mapping for automated RNA secondary structure inference. Biochemistry, 2012; 51(36):7037–7039; doi: 10.1021/bi3008802

6. Deigan, Li, Mathews, et al. Accurate SHAPE-directed RNA structure determination. Proc Natl Acad Sci USA, 2009; 106:97–102; doi: 10.1073/pnas.080692910

7. Deng F, Ledda, Vaziri, et al. Data-directed RNA secondary structure prediction using probabilistic modeling. RNA, 2016; 22(8):1109–1119; doi: 10.1261/rna.055756.115

8. Ding, Chan, Lawrence. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 2005; 11(8):1157–1166; doi: 10.1261/rna.2500605

9. Do, Woods, Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 2006; 22(14):e90–e98; doi: 10.1093/bioinformatics/btl246

10. Dowell, Eddy. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinf, 2004; 5:71; doi: 10.1186/1471-2105-5-71

11. Eddy. Computational analysis of conserved RNA secondary structure in transcriptomes and genomes. Annu Rev Biophys, 2014; 43:433–456; doi: 10.1146/annurev-biophys-051013-022950

12. Eddy, Durbin. RNA sequence analysis using covariance models. Nucleic Acids Res, 1994; 22(11):2079–2088; doi: 10.1093/nar/22.11.2079

13. Fontana, Stadler, Bornberg-Bauer, et al. RNA folding landscapes and combinatory landscapes. Phys Rev E, 1993; 47:2083–2099; doi: 10.1103/PhysRevE.47.2083

14. Freyhult, Moulton, Gardner. Predicting RNA structure using mutual information. Appl Bioinf, 2005; 4(1):53–59; doi: 10.2165/00822942-200504010-00006

15. Gardner, Daub, Tate, et al. Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res, 2011; 39(Database issue):D141–D145; doi: 10.1093/nar/gkq1129

16. Gardner, Giegerich. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinf, 2004; 5:140; doi: 10.1186/1471-2105-5-140

17. Giegerich, Voß, Rehmsmeier. Abstract shapes of RNA. Nucleic Acids Res, 2004; 32:4843–4851; doi: 10.1093/nar/gkh779

18. Gruber, Bernhart, Hofacker, et al. Strategies for measuring evolutionary conservation of RNA secondary structures. BMC Bioinf, 2008; 9:122; doi: 10.1186/1471-2105-9-122

19. Hajdin, Bellaousov, Huggins, et al. Accurate SHAPE-directed RNA secondary structure modeling, including pseudoknots. Proc Natl Acad Sci USA, 2013; 110(14):5498–5503; doi: 10.1073/pnas.1219988110

20. Hajiaghayi, Condon, Hoos. Analysis of energy-based algorithms for RNA secondary structure prediction. BMC Bioinf, 2012; 13:22; doi: 10.1186/1471-2105-13-22

21. Harmanci, Sharma, Mathews. TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences. BMC Bioinf, 2011; 12:108; doi: 10.1186/1471-2105-12-108

22. Helmling, Klötzner, Sochor, et al. Life times of metastable states guide regulatory signaling in transcriptional riboswitches. Nat Commun, 2018; 9:944; doi: 10.1038/s41467-018-03375-w

23. Hofacker, Fekete, Stadler. Secondary structure prediction for aligned RNA sequences. J Mol Biol, 2002; 319(5):1059–1066; doi: 10.1016/S0022-2836(02)00308-X

24. Hofacker, Fontana, Stadler, et al. Fast folding and comparison of RNA secondary structures. Chemical Monthly, 1994; 125:167–188; doi: 10.1007/BF00818163

25. Hofacker, Stadler PF. The partition function variant of Sankoff’s algorithm. In: Computational Science - ICCS 2004, volume 3039 of Lecture Notes in Computer Science. (Bubak, van Albada, Sloot PMA, et al., eds.) Springer: Berlin, Heidelberg; 2004; pp. 728–735; doi: 10.1007/978-3-540-25944-2_94

26. Jaeger, Turner, Zuker. Improved predictions of secondary structures for RNA. Proc Natl Acad Sci USA, 1989; 86(20):7706–7710; doi: 10.1073/pnas.86.20.7706

27. Katoh, Frith. Adding unaligned sequences into an existing alignment using MAFFT and LAST. Bioinformatics, 2012; 28(23):3144–3146; doi: 10.1093/bioinformatics/bts578

28. Kertesz, Wan, Mazor, et al. Genome-wide measurement of RNA secondary structure in yeast. Nature, 2010; 467(7311):103–107; doi: 10.1038/nature09322

29. Knudsen, Hein. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res, 2003; 31(13):3423–3428; doi: 10.1093/nar/gkg614

30. Kolberg, von Löhneysen, Ozerova, et al. Led-seq – ligation-enhanced double-end sequence-based structure analysis of RNA. Nucleic Acids Res, 2023; 51(11):e63; doi: 10.1093/nar/gkad312

31. Li, Zhou, Xu, et al. RASP: An atlas of transcriptome-wide RNA secondary structure probing data. Nucleic Acids Res, 2020; 49(D1):D183–D191; doi: 10.1093/nar/gkaa880

32. Li TJX, Reidys. On an enhancement of RNA probing data using information theory. Algorithms Mol Biol, 2020; 15:15; doi: 10.1186/s13015-020-00176-z

33. Lorenz, Hofacker, Stadler. RNA folding with hard and soft constraints. Algorithms Mol Biol, 2016; 11:8; doi: 10.1186/s13015-016-0070-z

34. Low, Weeks. SHAPE-directed RNA secondary structure prediction. Methods, 2010; 52(2):150–158; doi: 10.1016/j.ymeth.2010.06.007

35. Mathews, Sabina, Zuker, et al. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol, 1999; 288(5):911–940; doi: 10.1006/jmbi.1999.2700

36. McCaskill. The equilibrium partition function and base pairing probabilities for RNA secondary structures. Biopolymers, 1990; 29(6–7):1105–1119; doi: 10.1002/bip.360290621

37. Mitchell 3rd, Assmann, Bevilacqua. Probing RNA structure in vivo. Curr Opin Struct Biol, 2019; 59:151–158; doi: 10.1016/j.sbi.2019.07.008

38. Mustoe, Busan, Rice, et al. Pervasive regulatory functions of mRNA structure revealed by high-resolution SHAPE probing. Cell, 2018; 173(1):181–195.e18; doi: 10.1016/j.cell.2018.02.034

39. Nawrocki, Eddy. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 2013; 29(22):2933–2935; doi: 10.1093/bioinformatics/btt509

40. Nussinov, Jacobson. Fast algorithm for predicting the secondary structure of single stranded RNA. Proc Natl Acad Sci USA, 1980; 77:6309–6313; doi: 10.1073/pnas.77.11.6309

41. Ritz, Martin, Laederach. Evolutionary evidence for alternative structure in RNA sequence co-variation. PLoS Comput Biol, 2013; 9(7):e1003152; doi: 10.1371/journal.pcbi.1003152

42. Sahoo, Świtnicki JMP, Pedersen. ProbFold: A probabilistic method for integration of probing data in RNA secondary structure prediction. Bioinformatics, 2016; 32(17):2626–2635; doi: 10.1093/bioinformatics/btw175

43. Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J Appl Math, 1985; 45:810–825; doi: 10.1137/0145048

44. Scull, Dandpat, Romero, et al. Transcriptional riboswitches integrate timescales for bacterial gene expression control. Front Mol Biosci, 2020; 7:607158; doi: 10.3389/fmolb.2020.607158

45. Seemann, Gorodkin, Backofen. Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res, 2008; 36(20):6355–6362; doi: 10.1093/nar/gkn544

46. Seemann, Menzel, Backofen, et al. The PETfold and PETcofold web servers for intra- and intermolecular structures of multiple RNA sequences. Nucleic Acids Res, 2011; 39:W107–W111; doi: 10.1093/nar/gkr248

47. Solayman, Litfin, Singh, et al. Probing RNA structures and functions by solvent accessibility: An overview from experimental and computational perspectives. Brief Bioinf, 2022; 23(3):bbac112; doi: 10.1093/bib/bbac112

48. Spitale, Incarnato. Probing the dynamic RNA structurome and its functions. Nat Rev Genet, 2023; 24(3):178–196; doi: 10.1038/s41576-022-00546-w

49. Stadler, von Löhneysen, Mörl. Limits of experimental evidence in RNA secondary structure prediction. Front Bioinf, 2024; 4:1346779; doi: 10.3389/fbinf.2024.1346779

50. Strobel, Yu, Lucks. High-throughput determination of RNA structures. Nat Rev Genet, 2018; 19(10):615–634; doi: 10.1038/s41576-018-0034-x

51. Sükösd, Knudsen, Kjems, et al. PPfold 3.0: Fast RNA secondary structure prediction using phylogeny and auxiliary data. Bioinformatics, 2012; 28(20):2691–2692; doi: 10.1093/bioinformatics/bts488

52. Sükösd, Swenson, Kjems, et al. Evaluating the accuracy of SHAPE-directed RNA secondary structure predictions. Nucleic Acids Res, 2013; 41(5):2807–2816; doi: 10.1093/nar/gks1283

53. Sweeney, Hoksza, Nawrocki, et al. R2DT is a framework for predicting and visualising RNA secondary structure using templates. Nat Commun, 2021; 12(1):3494.

54. Tagashira, Asai. ConsAlifold: Considering RNA structural alignments improves prediction accuracy of RNA consensus secondary structures. Bioinformatics, 2022; 38(3):710–719; doi: 10.1093/bioinformatics/btab738

55. Tsybulskyi, Meyer. ShapeSorter: A fully probabilistic method for detecting conserved RNA structure features supported by SHAPE evidence. Nucleic Acids Res, 2022; 50(15):e85; doi: 10.1093/nar/gkac405

56. Turner, Mathews. NNDB: The nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res, 2010; 38:D280–D282; doi: 10.1093/nar/gkp892

57. von Löhneysen, Spicher, Varenyk, et al. Phylogenetic information as soft constraints in RNA secondary structure prediction. In: ISBRA 2023: Bioinformatics Research and Applications, volume 14248 of Lecture Notes in Computer Science. (Guo, Mangul, Patterson, et al., eds.) Springer: Singapore; 2023; pp. 267–279; doi: 10.1007/978-981-99-7074-2_21

58. Wan, Qu, Zhang, et al. Landscape and variation of RNA secondary structure across the human transcriptome. Nature, 2014; 505(7485):706–709; doi: 10.1038/nature12946

59. Washietl, Hofacker, Stadler, et al. RNA folding with soft constraints: Reconciliation of probing data and thermodynamic secondary structure prediction. Nucleic Acids Res, 2012; 40(10):4261–4272; doi: 10.1093/nar/gks009

60. Will, Joshi, Hofacker, et al. LocARNA-P: Accurate boundary prediction and improved detection of structured RNAs for genome-wide screens. RNA, 2012; 18(5):900–914; doi: 10.1261/rna.029041.111

61. Will, Missal, Hofacker, et al. Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol, 2007; 3:e65; doi: 10.1371/journal.pcbi.0030065

62. Xu, Almudevar, Mathews. Statistical evaluation of improvement in RNA secondary structure prediction. Nucleic Acids Res, 2012; 40(4):e26; doi: 10.1093/nar/gkr1081

63. Zarringhalam, Meyer, Dotu, et al. Integrating chemical footprinting data into RNA secondary structure prediction. PLoS One, 2012; 7(10):e45160; doi: 10.1371/journal.pone.0045160

64. Zuker, Jaeger, Turner. A comparison of optimal and suboptimal RNA secondary structures predicted by free energy minimization with structures determined by phylogenetic comparison. Nucleic Acids Res, 1991; 19(10):2707–2714; doi: 10.1093/nar/19.10.2707

65. Zuker, Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res, 1981; 9(1):133–148; doi: 10.1093/nar/9.1.133
