HattCI: Fast and Accurate attC site Identification Using Hidden Markov Models

Abstract

Integrons are genetic elements that facilitate the horizontal gene transfer in bacteria and are known to harbor genes associated with antibiotic resistance. The gene mobility in the integrons is governed by the presence of attC sites, which are 55 to 141-nucleotide-long imperfect inverted repeats. Here we present HattCI, a new method for fast and accurate identification of attC sites in large DNA data sets. The method is based on a generalized hidden Markov model that describes each core component of an attC site individually. Using twofold cross-validation experiments on a manually curated reference data set of 231 attC sites from class 1 and 2 integrons, HattCI showed high sensitivities of up to 91.9% while maintaining satisfactory false-positive rates. When applied to a metagenomic data set of 35 microbial communities from different environments, HattCI found a substantially higher number of attC sites in the samples that are known to contain more horizontally transferred elements. HattCI will significantly increase the ability to identify attC sites and thus integron-mediated genes in genomic and metagenomic data. HattCI is implemented in C and is freely available at http://bioinformatics.math.chalmers.se/HattCI.

1. Introduction

Integrons are genomic elements that facilitate the horizontal transfer of genes between bacteria. Each integron carries an array of gene cassettes that can be incised (Hall et al., 1991) and excised (Collis and Hall, 1992) from the genome using site-specific recombination. All integrons share a common structure, which includes an intI gene that encodes for an integrase enzyme responsible for the gene transfer, a common attI recombination site where exogenous cassettes are incorporated, and an array of gene cassettes (Boucher et al., 2007). Integrons located on plasmids and other horizontally transferable elements typically contain up to six gene cassettes, while chromosomally located integrons (superintegrons) may carry up to a couple of hundreds of genes (Cambray et al., 2010). During the gene transfer, the bottom strand of the attC site folds into a secondary structure (Francia et al., 1999; Johansson et al., 2004), which, together with the attI site, is recognized by the integrase. Both sites are then cleaved and recombined to incise the gene cassette into the integron (Hall and Collis, 1995; Mazel 2006). Many gene cassettes are known to regularly carry antibiotic resistance genes (Partridge et al., 2009). In fact, integrons have been suggested to be an important mechanism behind the rapid spread of antibiotic-resistant bacteria (Mazel, 2006; Poirel et al., 2009). Identification of genes mobilized by integrons in bacterial genomes and metagenomes is therefore important to understand bacterial evolution and the promotion of antibiotic resistance (Stalder et al., 2012).

The attC sites are from 55 to 141 nucleotide (nt) long imperfect palindromic sequences that show a low degree of conservation across different gene cassettes. These sites carry two pairs of complementary motifs. The first pair, R″ and R′, is each 7 nt long and located in the 5′ and 3′ end of the attC site, respectively. The motifs of both R″ and R′ are partially conserved between attC sites from different gene cassettes and have the form of RYYYAAC and GTTRRRY, respectively (where R is a purine and Y is a pyrimidine). The second pair, the L″ and L′ motifs, is 8 and 7 nt long and located downstream of R″ and upstream of R′, respectively. The L″ and L′ motifs show a lower degree of conservation than the R″ and R′. The regions between R″ and L″ and between L′ and R′ consist of short spacers of variable length, typically 7–8 nt. These spacers are neither complementary nor do they show any considerable degree of conservation. Finally, the central motif between L″ and L′, often denoted the loop, is of length between 14–102 nt and shows a low degree of conservation, but forms an imperfect inverted repeat (Stokes et al., 1997). The attC sites are vital for mobilization of gene cassettes in and out of integrons (Hall et al., 1991) and can therefore serve as markers for genes that are horizontally transferable.

Several methods have been developed for the identification and analysis of integrons in DNA sequences. The INTEGRALL database contains information about the general structure of integrons, including the integrase gene (intI) and the present gene cassettes (Moura et al., 2009). The Repository for Antibiotic resistance Cassettes (RAC) is specialized in organizing gene cassettes carrying antibiotic resistance genes (Tsafnat et al., 2011) and has implemented an integrated annotation system called Attacca for automatic identification of gene cassettes in DNA sequences. Attacca uses context-driven grammars, where a set of deterministic rules are used to match the characteristic motifs of the start and end of attC sites in the presence of open reading frames and other features of the integron such as the attI site (Tsafnat et al., 2009). Another method is ACID, which identifies the structure of the full integron by recognizing the intI gene and the attC sites of each gene cassette. In this method, intI is identified using BLAST (Altschul et al., 1990), and the attC sites are identified by verifying the presence of their complementary motifs (Joss et al., 2009).

While these methods have provided satisfactory results, they are in essence deterministic and can thus not efficiently handle technical noise and the large biological variability encountered in DNA sequences. Instead, we argue that the identification of attC sites seems more suitably tackled by probabilistic approaches, which can appropriately model different types of variation. Moreover, the existing methods rely on the identification of the attC site, together with other parts of the integron structure, such as the open reading frame in the corresponding gene cassette or the integrase gene (intI), which requires long contiguous DNA stretches for reliable identification. Modern high-throughput DNA sequencing technology is, however, known to produce highly fragmented data with high error rates and where the assembly of longer contiguous sequences is often challenging.

This is particularly true in metagenomics, where entire communities of microorganisms are studied by observing random fragments of their genomes. The assembly of complete integrons from metagenomic data can be especially problematic due to its many repetitive regions, combined with the often low sequencing coverage (Kristiansson et al., 2011; Wu et al., 2012; Bengtsson-Palme and Larsson 2015). Hence, information of the complete integron structure is thus often not available. Also, none of the existing methods is designed to analyze large data sets. In particular, Attacca relies on a manual step for verification of the predictions, which makes it inapplicable to large volumes of genomic and metagenomic data that may contain many thousands of attC sites. Novel methods for identification of attC sites, which are also applicable to fragmented metagenomic sequence data, are therefore required and would enable an accurate analysis of which bacterial genes can be horizontally transferred by integrons, independent of the data context in which it is considered.

In this article, we present HattCI, a new method for fast identification of attC sites in DNA sequence data based on hidden Markov model (HMM). The method divides the motifs of an attC site (R″, R′, L″, L′, spacers, and loop) into separate states and models their specific sequence characteristics, including both their length and the internal dependences of their nucleotide composition. The model parameters are estimated from a reference data set containing 231 nonredundant attC sites from class 1 and 2 integrons, which was created by careful manual assessment of previously reported attC sites. Cross-validation showed that HattCI has sensitivity of up to 91.9% for detecting the attC sites, while maintaining satisfactory false-positive rates (FPR). Moreover, HattCI was applied to a total of 64 gigabases of metagenomic data sequenced from bacterial communities collected in 35 different environments. The relative abundance of attC sites was estimated up to 2.33 occurrences per megabase, where the highest abundances were found in samples that have been previously shown to be associated with high frequencies of horizontal gene transfer elements.

2. Materials and Methods

2.1. Modeling attC sites with generalized HMMs

In this article, we present a generalized hidden Markov model (gHMM) for the identification of attC sites in DNA sequence. A standard discrete-time HMM is a stochastic process formed by two interrelated processes. The first process is hidden from the observer and is a Markov chain that jumps between states in a state space at each time step. The second process is observed and generates one output symbol per step as a function of the underlying hidden state. Through a number of self-transitions into the same state, we can then observe a state length (i.e., a sequence stretch corresponding to a DNA motif) of random length, which, by definition, will follow a geometric distribution, something that is not always a suitable model.

In a gHMM, the length distribution is generalized to any distribution of our choice, and in each time step, the model emits an entire stretch of observed sequence of random length \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$d$$ \end{document} drawn from this distribution. In this study, we assume that the observed process \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$X = \{ {X_1} , \ldots , {X_U} \} $$ \end{document} is a DNA sequence consisting of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$U$$ \end{document} nucleotides and that the hidden process \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$Y = \{ {Y_1} , \ldots , {Y_T} \} $$ \end{document} is the corresponding state sequence, where the states are the corresponding DNA motifs. The hidden process \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$Y$$ \end{document} is a Markov process that jumps between states \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$i$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$j$$ \end{document} in a state space \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$S = \{ {s_1} , \ldots , {s_N} \} $$ \end{document} according to transition probabilities \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${a_{ij}} = P ( {Y_t} = j \vert {Y_{t - 1}} = i )$$ \end{document} , and where the first state is chosen according to an initial distribution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \pi _i} = P ( {Y_1} = i )$$ \end{document} , \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$i , i \in S$$ \end{document} . The observed process \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$X$$ \end{document} is a function of the hidden process such that if the hidden state \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${Y_t} = j$$ \end{document} has length \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${d_t}$$ \end{document} chosen from a generalized distribution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_j} ( {d_t} )$$ \end{document} , it emits an observed subsequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$X_{u + 1}^{u + {d_t}} = \{ {X_{u + 1}} , \ldots , {X_{u + {d_t}}} \} $$ \end{document} according to the emission probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${b_j} ( X_{u + 1}^{u + {d_t}} \vert X_1^u ) = P ( {X_{u + 1}} , \ldots , {X_{u + {d_t}}} \vert {X_1} , {X_2} , \ldots , {X_u} , {Y_t} = j )$$ \end{document} , which, for a DNA sequence, describes the nucleotide composition and dependencies. Note that as a consequence of the state length being generalized, the indices of the hidden and the observed processes differ. To keep track of the differing indices, we introduce partial sums \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${p_t} = \sum \nolimits_{k = 1}^t {d_k}$$ \end{document} , where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${p_0} = 0$$ \end{document} , and we assume that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\sum \nolimits_{t = 1}^T {d_t} = U$$ \end{document} . The joint probability of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$X$$ \end{document} , \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$Y$$ \end{document} , and the sequence of state lengths \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$d = {d_1} , \ldots , {d_T}$$ \end{document} is then given as follows (Russell and Moore, 1985; Levinson, 1986): \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} p ( X , Y , d ) = \prod \limits_{t = 1}^T {a_{{Y_{t - 1}} , {Y_t}}}\,{f_{{Y_t}}} ( {d_t} ) {b_{{Y_t}}} ( X_{{p_{t - 1}} + 1}^{{p_t}} \vert X_1^{{p_{t - 1}}} ) . \tag{1} \end{align*} \end{document}

where we let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${a_{{Y_0} , i}} = { \pi _i}$$ \end{document} , \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$i \in S$$ \end{document} .

In our attC model, we assume that the Markov chain \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$Y$$ \end{document} has a state space of eight possible states, where states one to seven describe the successive different motifs of the attC site and the eighth state model stretches that are not part of an attC site (Fig. 1) such that our state space becomes S = {R″, spacer″, L″, loop, L′, spacer′, R′, non-attC}. In this way, the characteristics of each motif of the attC site such as its evolutionary conservation, nucleotide composition, and length distribution can be individually modeled (for a summary of the model, see Supplementary Material—Section A). In the gHMM framework, the prediction of attC sites in a given input DNA sequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$X$$ \end{document} translates into determining the underlying hidden state sequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$Y$$ \end{document} and the corresponding state lengths that maximize the joint probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$P ( X , Y , d )$$ \end{document} (Axelson-Fisk, 2015). In short, the HattCI model describes each motif of an attC site as one individual state. The HattCI states are defined by a length definition and a description of the nucleotide found in the corresponding motif.

FIG. 1.

Overview of the HattCI HMM model. (a) The different motifs of the attC sites are shown. (b) The HattCI model is shown with its eight states and arrows indicating the possible state transitions. The motifs of the attC sites are represented by seven different states, and the remaining non-attC state represents sequences in between the attC sites. Circular states correspond to fixed-length motifs, while rectangular states depict variable-length motifs. HMM, hidden Markov model.

2.1.1. The R″ and R′ states

The R″ and R′ motifs have the form of RYYYAAC and GTTRRRY, where R symbol is purine and Y pyrimidine (Stokes et al., 1997). Thus, their corresponding states have a fixed length of 7 nt (Fig. 1a), with three conserved positions (AAC for R″ and GTT for R′) across all attC sites. Since a substantial number of attC sites does not follow the purine/pyrimidine pattern for either R″ or R′ (Partridge et al., 2009) (see also our reference data set below), we relaxed these restrictions on the remaining four positions by using a pseudocount-induced empirical distribution for all possible 4-tuple combinations.

2.1.2. The spacer states

As motifs, the spacers are nonconserved elements with variable lengths of about 5 nt (Fig. 1a). The length was modeled using an empirical distribution for lengths between 3 and 7 nt for the leftmost spacer (spacer″) and between 3 and 12 nt for the rightmost spacer (spacer′). The nucleotide composition was described by a first-order Markov model \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$P ( {X_u} \vert {X_{u - 1}} )$$ \end{document} .

2.1.3. The L″ and L′ states

The L″ and L′ motifs have fixed lengths of 8 and 7 nt, respectively (Fig. 1a), and exhibit a lower degree of conservation across attC sites in comparison to the R-motifs. One important feature of the L-motifs is the presence of an extrahelical nucleotide in L″ due to the difference in the length of L″ and L′. The bulge formed by this extra nucleotide, which is most often a guanine with variation happening eventually (Larouche and Roy, 2011), has been shown to be crucial for the binding of the integrase (Johansson et al., 2004; MacDonald et al., 2006). The lower conservation leads to many more possible combinations for each motif, in comparison to the R-sites. Thus, the L-motifs were modeled using position-specific scoring matrices (PSSMs) (Staden, 1984). A PSSM represents a motif by assigning a probability for each nucleotide in each motif position. In this way, the model of the L″ is capable of capturing the information about the bulge it forms when folded.

2.1.4. The loop state

The loop of an attC site is a 14–102 nt imperfect palindromic region between the L″ and L′ motifs (Fig. 1a). The loop is long relatively to the other motifs of the attC site and thus providing an important contribution to the total likelihood despite its low conservation. However, the imperfect palindromic nature of the loop makes it less straightforward to model, and therefore, we evaluated Markov models of different orders (first to fourth order) to capture the loop nucleotide dependencies. In addition, the length of the loop varies substantially between attC sites, and we therefore compared three different length models: (1) a geometric distribution (i.e., a standard HMM), (2) a gamma distribution, and (3) an empirical distribution. The gamma and the empirical distributions were both truncated between 6 and 120 nt.

2.1.5. The non-attC state

This state, which models the regions outside the attC site, was defined by a standard HMM (with unit output resulting in geometrically distributed state lengths), with a first-order Markov model as distribution to describe the nucleotide composition and dependencies of the non-attC site regions.

2.2. Predicting attC sites

HattCI predicts attC sites in an observed DNA sequence by using the Viterbi algorithm to predict the optimal underlying state sequence. For each state \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$j \in S$$ \end{document} and position \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$u$$ \end{document} in the observed sequence, the recursive relationship of the algorithm is given by the Viterbi variables \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} {V_j} ( u ) = \mathop { \max } \limits_{i , d} {a_{ij}}{f_j} ( d ) {b_j} ( X_{u - d + 1}^u \vert X_1^{u - d} ) {V_i} ( u - d ) \tag{2} \end{align*} \end{document}

and with pointers \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \Psi _j} ( u ) = {{ \mathop{ \rm argmax} \nolimits} _i}{V_j} ( u )$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \Phi _j} ( u ) = \arg { \max\nolimits_d}{V_j} ( u )$$ \end{document} , keeping track of the previous state \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$i$$ \end{document} and current state length \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$d$$ \end{document} that maximized the current Viterbi variable, respectively. The predicted sequence is then retrieved by following the pointers back through the sequence, starting in state \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${i^ * } = {{ \mathop{ \rm argmax} \nolimits} _i}{V_j} ( U + 1 )$$ \end{document} , which has length \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \Phi _{{i^ * }}} ( U + 1 )$$ \end{document} , and which is preceded by state \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \Psi _{{i^ * }}} ( U + 1 )$$ \end{document} , and so on.

In practice, at each state index \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$j$$ \end{document} and sequence position \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$u$$ \end{document} , the model “votes” in favor for the attC site if the probability is greater than that of the non-attC state (for the DNA sequence up to that state and base). Thus, the Viterbi variables (Eq. 2) can be seen as a likelihood ratio, or call it Viterbi score, with cutoff one. By replacing this fixed cutoff by a variable significance cutoff \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k$$ \end{document} , we can control how strict our model is in assigning a sequence as attC site. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} { \rm { Viterbi } } \ { \rm { score } } = { \frac { P ( { Y_t } \vert X_ { { p_t } - 1 } ^ { { p_t } } = attC \ site ) } { P ( { Y_t } \vert X_ { { p_t } - 1 } ^ { { p_t } } = non - attC \ site ) } } > k \tag { 3 } \end{align*} \end{document}

In this study, we require all the motifs of the attC site to be present in the analyzed sequence for the identification, but this could easily be relieved by allowing the model to start in any state (instead of just the non-attC site). However, since the individual motifs of the attC site are rather short, we expect that this would substantially reduce the performance of the model.

2.3. Parameter estimation

Transition probabilities and nucleotide composition distributions were estimated using maximum likelihood, which, within the gHMM framework, corresponds to the relative frequencies of the corresponding events in the training set (Axelson-Fisk, 2015). Pseudocounts were added to the estimation of the parameters to allow for rare events not present in the training data. The optimal number of pseudocounts was investigated using cross-validation (Section 2.5). The gamma distribution for the loop length was fitted to the observed lengths using numerical maximum likelihood. The empirical distribution was estimated from the observed loop lengths and smoothed using the Gaussian kernel interpolation, applying a bandwidth of 5.358 (van der Vaart, 2000).

2.4. Reference data set

The HattCI model was trained and evaluated on a reference data set consisting of 231 attC sites. This data set was based on the gene cassettes previously described by Partridge et al. (2009) and contains mainly class 1 and 2 integrons. Using their accession numbers, the entire nucleotide sequences containing the reported attC sites were downloaded from GenBank (Benson et al., 2009) in September 2015. The coordinates for all attC sites in Partridge et al. (2009), together with the attC sites of other gene cassettes present in the sequences, were extracted, those that were redundant were discarded, and the remaining nonredundant data set was annotated as follows. The R″/R′ motifs were identified at the edges of each attC site. Then, L″ and L′ were annotated by searching for complementary pairs and a bulge within the range of possible distances from R″ and R′, respectively. The reference data set and its full annotation are available in Supplementary Material—Section B.

2.5. Model evaluation

The performance of our model was evaluated using the model sensitivity (Sn), which measures the capability to detect the attC sites in the reference data set. The sensitivity is defined as follows (Burset and Guigo, 1996): \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} Sn = { \frac { TP } { TP + FN } } , \tag { 4 } \end{align*} \end{document}

where true positives (TP) are the number of correctly predicted attC sites in our reference data set and false negatives (FN) are the number of missed attC sites. We also evaluated the model based on the FPR, which is the number of false predictions made in sequences where no attC sites are present. The FPR is defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} FPR = { \frac { FP } { DS } } , \tag { 5 } \end{align*} \end{document}

where false positives (FP) are the number of falsely predicted attC sites and DS (database) is the size (in megabases) of the sequence database searched. This database was created by random reshuffling of the Escherichia coli K-12 genome (GenBank accession number NC_000913.3). The model performance (Sn versus FPR) can be adapted by changing the significance cutoff \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k$$ \end{document} (Eq. 3), since by increasing the cutoff, the sensitivity is reduced, but FPR improves, and vice versa.

Twofold cross-validation experiments were performed by randomly splitting the reference data set into two equally sized subsets and using one for parameter training and the other for validation. This procedure was repeated 100 times. The significance cutoff \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k$$ \end{document} was varied between 0 and 30.0 at a step size of 0.5. For each cutoff, Sn and FPR were estimated. The results were analyzed using receiver operating characteristic curves calculated by setting Sn on the \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$y$$ \end{document} -axis and FPR on the \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x$$ \end{document} -axis and varying the cutoff \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k$$ \end{document} .

2.6. Metagenomic data

Quality assessed metagenomic sequence data were downloaded from the CAMERA database (Seshadri et al., 2007). Only samples with an average fragment length of 150 base pairs and at least 40,000 fragments were included in the study, resulting in 35 data sets comprising 64 gigabases in total (see Supplementary Material—Section C for a list of included data sets). HattCI was used to search for attC sites in both strands—in all, 168,749,505 fragments in the data sets. A significance cutoff of 7.5 was used. For each data set, the relative abundance of attC sites was calculated by comparing the number of identified sites to the total number of nucleotides.

3. Results

3.1. Model evaluation

The HattCI model was evaluated based on the sensitivity and FPR estimated through cross-validation experiments performed on our reference data set containing 231 manually annotated attC sites (see Materials and Methods). The performance of the model was investigated in relation to (a) the number of pseudocounts added to the parameters of the nucleotide composition of R″ and R′, (b) the number of pseudocounts added to the parameters of the nucleotide composition of the L″ and L′, (c) the order of the Markov model describing the nucleotide dependencies in the loop, and (d) the distribution used to model the length of the loop. First, the amount of pseudocounts for the nucleotide composition of the R″ and R′ was varied between 1 and 20, while keeping the number fixed to one for the other motifs. Increasing the number of pseudocounts for R″ and R′ resulted in a decreased model performance (Fig. 2a). At a fixed FPR of 1 hit/Mbase, the sensitivity was 0.829 and 0.793 for 1 and 5 pseudocounts, respectively. The sensitivity decreased further when more pseudocounts were added, resulting in a sensitivity of 0.765 for 10 pseudocounts and 0.742 for 20 pseudocounts. This decrease in performance is likely due to the high conservation of the motifs, and we therefore used pseudocount 1 in the final R-motif models.

FIG. 2.

Receiver operating characteristic (ROC) curves describing the sensitivity and the false-positive rate (FPR) for the model under different settings. (a) The number of pseudocounts for the nucleotide composition of R″ and R′ was varied between 1 and 20. (b) The number of pseudocounts for the nucleotide composition of L″ and L′ was varied between 1 and 20. (c) The order of the Markov model describing the nucleotide dependencies in the loop was varied between a first- and a fourth-order model. (d) Three different distributions were examined for the loop length as follows: (1) a smoothed empirical distribution, (2) a gamma distribution, and (3) a geometric distribution (standard HMM).

Similarly, we investigated the addition of 1 to 20 pseudocounts to the nucleotide composition parameters of L″ and L′, while the pseudocounts used in the other motifs were set to 1. In contrast to R″ and R′, adding pseudocounts to the L″ and L′ distributions resulted in an increased performance (Fig. 2b). For an FPR of 1 hit/Mbase, the sensitivity was 0.829, 0.835, 0.836, and 0.839 for 1, 5, 10, and 20 pseudocounts, respectively. Since the sensitivity at 20 pseudocounts was substantially reduced at both lower and higher FPR compared to the other options, we decided to use 10 pseudocounts for the L-motifs. The increase in performance when adding pseudocounts reflects the more variable nature of the L-motifs and prevents overfitting.

Next, the model for the loop state was evaluated. First, the order of the Markov model for the nucleotide dependencies was varied between first and fourth order (the pseudocounts for the nucleotide composition was fixed at 1 for all motifs). The cross-validation showed that the model with second-order dependencies resulted in the highest sensitivity (Fig. 2c), 0.815, at an FPR of 1 hit/Mbase. The corresponding performance for first-, third-, and fourth-order Markov models was 0.779, 0.796, and 0.760, respectively. We then investigated the impact of the length distribution on the model performance, in which the standard geometric distribution was compared to the more flexible gamma and the smoothed empirical distributions (see Materials and Methods). At an FPR of 1 hit/Mbase, the geometric distribution resulted in a sensitivity of 0.815, while the gamma and the smoothed empirical distributions both had higher sensitivities of 0.826 and 0.829, respectively (Fig. 2d). To maximize the sensitivity, we decided to model the loop sequence using a second-order Markov model and a smoothed empirical length distribution.

3.2. Final model performance

The final HattCI model was set to use pseudocount 1 for the R-motifs and the loop and pseudocounts 10 for the L-motifs and the spacers, respectively (see Supplementary Material—Section D for details on the pseudocounts of the spacers and the loop). For the loop, the nucleotide dependencies were modeled using a second-order Markov model and the length described by a smoothed empirical distribution. Cross-validation showed that the final model had a sensitivity of 0.853, at a fixed FPR of 1 hit/Mbase (Fig. 3a). The corresponding Viterbi scores discriminated between the true and the FP well (Fig. 3b). By changing the significance cutoff \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k$$ \end{document} , the sensitivity and FPR can be adapted (Table 1). At the least stringent cutoff of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k = 0$$ \end{document} , the sensitivity was estimated at 0.919 with an FPR of 26.4 hits/Mbase. The stricter cutoff of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k = 7.5$$ \end{document} resulted in substantial decrease of the FPR to only 0.011 hit/Mbase while maintaining an acceptable sensitivity of 0.665.

FIG. 3.

Cross-validation of the performance of the final model. (a) ROC curve for the final model based on 100 twofold cross-validation iterations. Sensitivity in the reference data set ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$y$$ \end{document} -axis) is plotted against the FPR estimated based on the reshuffled Escherichia coli K-12 genome ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x$$ \end{document} -axis), for a varying significance cutoff \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k$$ \end{document} . (b) A histogram of the Viterbi scores for the true positives (TP) and false positives (FP). The Viterbi scores for the TP (white) are substantially higher and relatively well separated from the FP (gray). The vertical line indicates the significance cutoff of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k = 7.5$$ \end{document} .

Table 1.

Summary of the Performance for Different Significance Cutoffs \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k$$ \end{document}

\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k$$ \end{document}	N	TP	FN (partial)	FN (total)	Sn (%)	FPR (hits/Mbase)
0 0	11,430	10,501	411	518	91.9	26.4
3.0	11,430	9897	228	1305	86.6	1.54
7.5	11,430	7601	76	3753	66.5	0.011
10.0	11,430	5855	0	5575	51.2	0.000

N indicates how many attC sites in the reference data set that were used for testing. TP gives the number of sites whose HattCI prediction agreed perfectly with the GenBank annotation. FN (partial) corresponds to the attC sites whose prediction coordinates just covered parts of the real attC site and were thus counted as FN. FN (total) is the number of attC sites that were not covered at all by any of the predictions, and FPR is the number of attC sites found per megabase in the reshuffled Escherichia coli genome.

FN, false negatives; FPR, false-positive rate; TP, true positives.

3.3. Metagenomic data analysis

HattCI does not require information of the overall integron structure for the prediction of attC sites and can therefore be applied directly to fragmented sequence data. In this study, we used HattCI to estimate the abundance of integron-mediated genes in 35 metagenomes representing different types of bacterial communities (Seshadri et al., 2007). The metagenomes consisted of 168,749,505 sequence fragments comprising 64 gigabases in total. A strict significance cutoff of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k = 7.5$$ \end{document} was used to minimize the number of FP while maintaining a satisfactory sensitivity. Overall, the number of attC sites per megabase was estimated at 0.291, ranging from 0 to 2.33 over the different data sets (Fig. 4). The highest levels of attC sites were found in the metagenomes sampled from sludges from a waste water treatment plant and from farm soil. These environments have previously been associated with a high number of bacteria carrying integron-mediated antibiotic resistance genes (Srinivasan et al., 2008; Gillings et al., 2015; Munck et al., 2015). There were also high levels of attC sites in the metagenome from the hydrothermal vent, which has previously been associated with a high abundance of bacteria carrying genetic elements for horizontal gene transfer (Xie et al., 2011). The lengths of the attC sites identified in the metagenomes followed approximately the distribution of the training data set (Supplementary Material—Section E). We noted, however, an underrepresentation of shorter attC sites (55–80 bases) and an overrepresentation of longer attC sites (120–140 bases).

FIG. 4.

The number of attC sites found per megabase in metagenomic data. The number of attC sites found per megabase (y-axis) is shown for 35 metagenomic samples of the CAMERA project (Seshadri et al., 2007).

3.4. Availability and time performance

The HattCI model was implemented in C as a command line tool for the Unix/Linux computer environment. The model was parallelized by distributing the input data over different threads using OpenMP. Using one thread, HattCI can process ∼5 gigabases per hour (Supplementary Material—Section F). The analysis can be further accelerated by increasing the number of parallel processes used. The program, including documentation, tutorials, and source code, is freely available under the GPL v3 license at http://bioinformatics.math.chalmers.se/HattCI/.

4. Discussion

Integrons are genetic elements that facilitate the horizontal transfer of genes between bacterial genomes. The attC sites, which are essential for the incision and excision of gene cassettes in and out of integrons, are 55 to 141 nt long indistinct patterns. In this article, we present HattCI, a novel method for the fast and accurate prediction of attC sites in the DNA sequence. HattCI is based on an eight-state gHMM, with seven states representing the different motifs of attC sites and one state representing non-attC site regions. The performance of HattCI was evaluated by twofold cross-validation using a manually assessed reference data set. At the most sensitive setting, HattCI was able to identify 91.9% of all attC sites in this data set. HattCI was also used to estimate the abundance of attC sites in 35 metagenomes. The highest abundances were found in metagenomes from microbial communities, previously associated with a high level of integron-mediated genes.

The HattCI method describes the attC site by modeling each of its seven distinct motifs as separate states. Two states with a fixed length of 7 nt were used to describe R″ and R′, which are the most conserved motifs of the attC site. They have three out of their seven positions perfectly conserved in all attC sites, making it possible to parameterize the probability of all possible combinations of the remaining four nucleotides. Similarly, two fixed-length states of 8 and 7 nt, respectively, were used to represent L″ and L′. In contrast to the R-motifs, the L-motifs show much less conservation between different attC sites, which made it impractical to parametrize all possible variants. Instead, the nucleotide composition of the L″/L′ states was modeled using PSSMs describing the nucleotide distribution in each position of the motif. Furthermore, the number of pseudocounts added to the nucleotide counts in each state was evaluated using cross-validation. For the R-motifs, a pseudocount of 1 had the highest sensitivity, while the corresponding number for the L-motifs was 10, which reflects the different levels of conservation of these motifs. The R-motifs are well conserved thereby likely to be well represented by the training data sets. Thus, adding too many pseudocounts will blur the true distribution, and the accuracy of the model will deteriorate. The L-motifs are, on the contrary, more variable, and a larger number of pseudocounts were needed to reach the optimal sensitivity, likely because the training data set did not sufficiently cover all variants.

The modeling of the loop required attention due to its low conservation and variability in length. A number of different model assumptions for the loop nucleotide dependencies and the length distribution were therefore evaluated. The nucleotide dependencies were modeled by Markov models of different orders, from first to fourth order. Our results showed that the second-order Markov model had a substantially better performance than the other models. This indicates that the second-order Markov model could capture a considerable part of the local dependencies, while still keeping the number of parameters limited. It should be noted, however, that a Markov model of higher order could be more suitable when more comprehensive training data become available. For the length distributions, both the gamma and the empirical distributions performed considerably better than the geometric distribution that would result from using a standard HMM. The empirical distribution demonstrated the overall best performance, but the difference from using the gamma distribution was small. Since the gamma distribution has a substantial lower number of parameters, it is therefore a good alternative when the training data are more limited.

The final HattCI model demonstrated a good performance in the cross-validation experiments. At the most sensitive setting, HattCI could correctly identify 91.9% of the attC sites in the reference data set, while the number of FP was 26.4 hits per megabase. This makes HattCI highly suitable to pinpoint attC sites and annotate their individual motifs in smaller sequence data sets such as amplicons or plasmid-mediated integrons, which typically are in the order of 10,000 bases. In fact, in contrast to other methods, HattCI provides detailed information of the attC sites, including position of all its separate motifs. When HattCI was run at a higher stringency (with a higher significance cutoff), the FPR was substantially reduced while maintaining a satisfactory sensitivity. At a cutoff of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k = 3$$ \end{document} , the cross-validation showed a sensitivity of 86.6% and an FPR of 1.54 per megabase. When the significant cutoff was further increased to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k = 7.5$$ \end{document} , the FPR dropped to only 10 FP per gigabase, while the sensitivity was still an acceptable 66.5%. At this setting, HattCI is suitable for more exploratory analyses such as screening for attC sites in metagenomes or larger sets of bacterial genomes.

It should be emphasized that the performance estimated by the cross-validation is likely to be higher than what is to be expected in real genomic and metagenomic data. This is due to a potential bias in our reference data set caused by the overrepresentation of certain types of attC sites in the scientific literature. For instance, integrons carrying an integrase of class 1 are by far the most studied, while integrons with other classes of integrases (e.g., 2 and 3) have not yet received as much attention (Moura et al., 2009). Even though class 1 integrons can excise and incise many types of gene cassettes (Collis et al., 2001), it has a higher affinity for certain types of attC sites (Larouche and Roy, 2011). Thus, the model trained on the reference data set will therefore have the highest performance for identifying attC sites commonly located in class 1 integrons. The performance for identification of an attC site with a different nucleotide composition and/or loop length distribution, for example, attC located in superintegrons, may therefore be reduced. Note, however, that the model has been designed to be highly flexible, and when training data that more accurately reflect the full diversity of integrons, the performance for detecting atypical attC is likely to improve.

HattCI was used to estimate the abundance of attC sites in 35 metagenomes comprising over 168 million DNA fragments. Our results demonstrated a high relative abundance of attC sites in microbial communities sampled from sludges, farm soil, and hydrothermal vents. These environments have previously been associated with a high abundance of integron-mediated genes and other mechanisms for horizontal transfer of genetic materials (Srinivasan et al., 2008; Xie et al., 2011; Gillings et al., 2015; Munck et al., 2015). The metagenomic analysis shows that HattCI can identify attC sites directly in fragmented data without any additional information about the integron structure. Furthermore, several of the metagenomes known to be more pristine demonstrated a very low attC site abundance, in some cases very close to or even below the 0.011 FPR estimated in the reshuffled E. coli genome, which demonstrates that HattCI can maintain a low FPR when applied to real sequence data. Furthermore, the analysis of horizontally transferred elements in metagenomic data is challenging due to its fragmented nature. The assembly is often hard, especially in regions containing repetitive sequences such as present in integrons (Wu et al., 2012; Bengtsson-Palme and Larsson, 2015). HattCI eliminates the need for a high-quality assembly and can therefore be used directly on the fragmented data, to accurately identify and quantify the presence of integron-mediated mobile genes in bacterial communities.

HattCI is based on a gHMM, which is a flexible framework commonly used to efficiently identify patterns in biological sequences (Koski, 2001). The use of a probabilistic approach allows HattCI to explicitly model the variability of attC sites in an efficient way, enabling the identification of other types than the ones present in the reference data set. In addition, probabilistic models provide a likelihood score (the Viterbi score) of a potential attC site to belong to the model, which allows for fine tuning of the performance by adjusting the stringency of the significance cutoff between true attC sites and spurious ones. This probabilistic approach provides such good power in identifying attC sites that extra context of the surrounding integron parts will, in many cases, not be necessary. The gHMM framework has also computationally favorable properties. Indeed, our C implementation of HattCI is highly efficient and can, by using a single core on a modern computer, process over 5 gigabases of DNA sequences per hour. Moreover, HattCI is parallelized such that the input data can be distributed among multiple threads, which increases the performance further. This contrasts with the existing softwares for identification of gene cassettes, which are developed for small data sets and, in some cases, are also dependent on manual steps (Tsafnat et al., 2009).

5. Conclusion

HattCI is a new, versatile, and computationally efficient tool for accurate identification of attC sites in DNA sequence data. HattCI can be used as a stand-alone tool and can be applied directly to large and fragmented data sets such as metagenomes. It would also be possible to integrate HattCI into existing bioinformatic pipelines and thereby increase the overall performance for detecting and annotating complete integron structures, including all its gene cassettes.

Footnotes

Acknowledgments

This research was funded by the Swedish Research Council (VR), the Chalmers Life Science Area of Advance, and the Wallenberg Foundation.

Author Disclosure Statement

No competing financial interests exist.

References

Altschul

, Gish

, Miller

, et al. 1990. Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Axelson-Fisk

2015. Comparative Gene Finding: Models, Algorithms and Implementation, 2nd edition. Springer Verlag, London. ISBN 9781447166931.

Bengtsson-Palme

, and Larsson

2015. Antibiotic resistance genes in the environment: prioritizing risks. Nat. Rev. Microbiol. 13, 396.

Benson

, Karsch-Mizrachi

, Lipman

, et al. 2009. GenBank. Nucleic Acids Res. 37, D26–D31.

Boucher

, Labbate

, Koenig

, et al. 2007. Integrons: mobilizable platforms that promote genetic diversity in bacteria. Trends Microbiol. 15, 301–309.

Burset

, and Guigo

1996. Evaluation of gene structure prediction programs. Genomics, 34, 353–367.

Cambray

, Guerout

A.-M.

, and Mazel

2010. Integrons. Annu. Rev. Genet., 44, 141–166.

Collis

, and Hall

1992. Gene cassettes from the insert region of integrons are excised as covalently closed circles. Mol. Microbiol. 6, 2875–2885.

Collis

, Recchia

, Kim

, et al. 2001. Efficiency of recombination reactions catalyzed by class 1 integron integrase intI1. J. Bacteriol., 183, 2535–2542.

10.

Francia

, Zabala

, de la Cruz

, et al. 1999. The intI1 integron integrase preferentially binds single-stranded DNA of the attC site. J. Bacteriol., 181, 6844–6849.

11.

Gillings

, Gaze

, Pruden

, et al. 2015. Using the class 1 integron-integrase gene as a proxy for anthropogenic pollution. ISME J. 9, 1269–1279.

12.

Hall

, and Collis

1995. Mobile gene cassettes and integrons: capture and spread of genes by site-specific recombination. Mol. Microbiol., 15, 593–600.

13.

Hall

, Brookes

, and Stokes

1991. Site-specific insertion of genes into integrons: role of the 59-base element and determination of the recombination cross-over point. Mol. Microbiol., 5, 1941–1959.

14.

Johansson

, Kamali-Moghaddam

, and Sundström

2004. Integron integrase binds to bulged hairpin DNA. Nucleic Acids Res. 32, 4033–4043.

15.

Joss

M.J.

, Koenig

J.E.

, Labbate

, et al. 2009. ACID: annotation of cassette and integron data. BMC Bioinformatics. 10, 118–126.

16.

Koski

2001. Hidden Markov Models for Bioinformatics. Springer Verlag, New York, 2001. ISBN 9781402001352.

17.

Kristiansson

, Fick

, Janzon

, et al. 2011. Pyrosequencing of antibiotic-contaminated river sediments reveals high levels of resistance and gene transfer elements. PLoS One, 6, e17038.

18.

Larouche

, and Roy

2011. Effect of attC structure on cassette excision by integron integrases. Mob. DNA, 2, 3–15.

19.

Levinson

1986. Continuously variable duration hidden markov models for automatic speech recognition. Comput. Speech Lang., 1, 29–45.

20.

MacDonald

, Demarre

, Bouvier

, et al. 2006. Structural basis for broad DNA-specificity in integron recombination. Nature, 440, 1157–1162.

21.

Mazel

2006. Integrons: agents of bacterial evolution. Nat. Rev. Microbiol., 4, 608–620.

22.

Moura

, Soares

, Pereira

, et al. 2009. INTEGRALL: a database and search engine for integrons, integrases and gene cassettes. Bioinformatics, 25, 1096–1098.

23.

Munck

, Albertsen

, Telke

, et al. 2015. Limited dissemination of the wastewater treatment plant core resistome. Nat. Commun. 6, 8452.

24.

Partridge

, Tsafnat

, Coiera

, et al. 2009. Gene cassettes and cassette arrays in mobile resistance integrons. FEMS Microbiol. Rev., 33, 757–784.

25.

Poirel

, Carrër

, Pitout

, et al. 2009. Integron mobilization unit as a source of mobility of antibiotic resistance genes. Antimicrob. Agents Chemother. 53, 2492–2498.

26.

Russell

, and Moore

1985. Explicit modelling of state occupancy in hidden markov models for automatic speech recognition. Proc. IEEE Int. Conf. ASSP, 10, 5–8.

27.

Seshadri

, Kravitz

, Smarr

, et al. 2007. Camera: a community resource for metagenomics. PLoS Biol. 5, e75.

28.

Srinivasan

, Nam

, Sawant

, et al. 2008. Distribution of tetracycline and streptomycin resistance genes and class 1 integrons in enterobacteriaceae isolated from dairy and nondairy farm soils. Microb. Ecol. 55, 184–193.

29.

Staden

1984. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 12, 505–519.

30.

Stalder

, Barraud

, Casellas

, et al. 2012. Integron involvement in environmental spread of antibiotic resistance. Front. Microbiol. 3, 119.

31.

Stokes

, O'Gorman

, Recchia

, et al. 1997. Structure and function of 59-base element recombination sites associated with mobile gene cassettes. Mol. Microbiol., 26, 731–745.

32.

Tsafnat

, Coiera

, Partridge

S.R.

, et al. 2009. Context-driven discovery of gene cassettes in mobile integrons using a computational grammar. BMC Bioinformatics, 10, 281.

33.

Tsafnat

, Copty

, and Partridge

2011. RAC: Repository of antibiotic resistance cassettes. Database, 2011, bar054.

34.

van der Vaart

2000. Asymptotic Statistics. Part of Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Cambridge. ISBN 9780521784504.

35.

, Rho

, Doak

, et al. 2012. Oral spirochetes implicated in dental diseases are widespread in normal human subjects and carry extremely diverse integron gene cassettes. Appl. Environ. Microbiol. 78, 5288–5296.

36.

Xie

, Wang

, Guo

, et al. 2011. Comparative metagenomics of microbial communities inhabiting deep-sea hydrothermal vent chimneys with contrasting chemistries. ISME J. 5, 414–426.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.27 MB