A New Protein Structure Representation for Efficient Protein Function Prediction

Abstract

Predicting protein function is one of the challenging problems in bioinformatics. Protein function is the main key used to classify proteins. It can be inferred experimentally, with very low throughput, or computationally, with very high throughput. Computational methods are either sequence based or structure based, and structure-based methods produce more accurate function predictions. In this article, we propose a new protein structure representation for efficient protein function prediction. The representation is based on three-dimensional patterns of protein residues. In the analysis, we used protein function based on enzyme activity across six mechanistically diverse enzyme superfamilies: amidohydrolase, crotonase, enolase, haloacid dehalogenase, isoprenoid synthase type I, and vicinal oxygen chelate. We applied three classification methods, naïve Bayes, k-nearest neighbors, and random forest, to predict the enzyme superfamily of a given protein. The proposed representation outperforms a recently introduced representation that is based only on distance patterns. It achieved a prediction accuracy of up to 98%, with an improvement of about 10% on average.

1. Introduction

A huge number of protein sequences are available from many genome projects. These projects were developed to receive and maintain protein data through databases and information retrieval and analysis systems (Sayers et al., 2012; Vuong et al., 2012). The function of most discovered proteins is unknown. Therefore, the need for effective and accurate protein function prediction methods is growing. The main idea behind protein function prediction is to derive the function of an unknown protein from its similarity to proteins with known function. Protein function annotation is the main goal of protein function prediction methods, which can be sequence based or structure based. There are different classification schemes for characterizing protein function. Some annotation methods use a numerical functional classification such as Enzyme Commission (EC) numbers, which are based on protein enzymatic activity (Almonacid et al., 2010). Others provide more roles through their classification catalog (Mewes et al., 2002; Ruepp et al., 2004). Another classification scheme for annotating protein function is the gene ontology (Ashburner et al., 2000), which characterizes protein functions in three categories: biological process, molecular function, and cellular component. In this article, we propose a new protein representation method that uses the atom coordinates of protein residues and includes both angle and distance patterns. The proposed method uses only the protein structure, without any sequence information, and does not need any prior alignment process. We evaluate it on protein function prediction based on enzyme classification. The article is organized as follows: Related work is reviewed in section 2. Section 3 describes the representation method. Experimental results are discussed in section 4. Finally, the conclusion is stated in section 5.

2. Related Work

Structure-based protein function prediction depends on finding a relation between protein structures and their function. This may be based on analyzing the similarities between the structures of proteins that belong to the same functional class (Bray et al., 2009; Boaretoa et al., 2012) in order to predict the functions of unannotated proteins. Protein function prediction methods include statistical methods (Ewens and Grant, 2005), data mining approaches (Han and Kamber, 2008), and machine learning techniques (Larranaga et al., 2006; Cheng et al., 2008). The similarity comparison can be global, in which the whole structure is considered, or local, in which only substructures are considered; the latter is termed motif finding. In addition, the comparison may require an alignment technique (Csaba et al., 2008; Ritchie et al., 2012) or not (Pires et al., 2011). The accuracy and efficiency of protein function prediction depend on the protein representation and the prediction method. A protein structure can be represented by atom coordinates, as done by Xie and Bourne (2007) and Pires et al. (2011), who used C_α atoms to represent protein structures; by a topological structure that considers relationships between secondary structure elements (Gilbert et al., 1999, 2001; Veeramalai et al., 2008); or by a surface shape that considers overall surface features (Sael et al., 2008a,b; La et al., 2009; Kihara et al., 2011). Many methods have been proposed for structure-based protein function prediction. These include motif finding (Jia et al., 2009; Ku and Hu, 2012; Wang et al., 2013), binding site prediction (Xie and Bourne, 2007; Zhao et al., 2010, 2011; Alvarez and Yan, 2012; Nisius et al., 2012), and enzyme classification (Dobson and Doig, 2005; Erdin et al., 2010, 2013; Wang et al., 2011; Alvarez and Yan, 2012; Rahimi et al., 2013).

Motifs that appear in proteins sharing the same function can be important for function analysis and are known as functional sites. Jia et al. (2009) represented motifs as subgraphs and proposed the APproximate Graph Mining (APGM) algorithm, which identifies conserved structures by finding frequent subgraphs with approximate matching in a set of graph-represented proteins, with an accuracy of 78%. Ku and Hu (2012) proposed a two-stage framework for structural motif discovery. It first converts protein three-dimensional (3D) coordinates into structural alphabet sequences and then finds motifs using a sequence motif-finding tool. Wang et al. (2013) used local 3D structural motifs by implementing the structurally aligned local sites of activity (SALSA) method for predicting protein biochemical function. Some proteins function by binding to other proteins through binding sites; Nisius et al. (2012) reviewed binding site analysis methods in terms of their scope, limitations, and validation. Xie and Bourne (2007) proposed a new shape descriptor that represents protein structure based on its global shape and the surrounding environment of each residue using C_α atoms. For binding site prediction, they accurately identified 85% of known binding sites. Alvarez and Yan (2012) introduced a graphical method for protein structure representation based on the spatial clustering of amino acids and evolutionary profiles of the proteins. Using a support vector machine (SVM), the proposed method achieved an accuracy of 92% for the prediction of DNA-binding proteins. Zhao et al. (2010) achieved an accuracy of 98% for predicting DNA-binding proteins using a knowledge-based energy function and atom-type-dependent features.

Zhao et al. (2011) proposed a structure-based method to predict zinc-binding sites and applied it to 1888 protein structures with unknown function. For enzyme classification, Dobson and Doig (2005) used a set of global attributes calculated from protein structure and achieved an accuracy of 35% using support vector machines. Alvarez and Yan (2012) achieved an accuracy of 87% for enzyme prediction using their graphical method, which is based on the spatial clustering of amino acids and evolutionary profiles of the proteins. Using 3D templates (structure–function motifs or 3D function-associated motifs) and evolutionary information, Erdin et al. (2010) classified enzymes and nonenzymes using gene ontology terms (Ashburner et al., 2000) as functional classifications and achieved an accuracy of 94%. By letting each protein contribute multiple templates instead of one, Erdin et al. (2013) maintained an accuracy of over 91%. Rahimi et al. (2013) used 3D motifs to predict the full EC number of enzymes. The representative 3D motif for each EC number is determined by searching for the active site whose spatial arrangement has the minimum average distance to the other active sites belonging to the same class. Wang et al. (2011) proposed a method based on support vector machines to predict enzyme function. For the prediction of the first three EC digits, they achieved an accuracy ranging from 81% to 98%.

3. Methods

3.1. Cutoff scanning matrix

Pires et al. (2011) proposed a cutoff scanning matrix (CSM) method to represent a set of proteins using distance patterns between their residues. As shown in Figure 1, each row in the matrix refers to a feature vector of a protein, which represents the frequency distribution of pairs within a range of distances (starting from d_min to d_max with the distance step d_step) and a defined maximum distance threshold (cutoff). They used a cutoff of 30 Å for all proteins.

FIG. 1.

Cutoff scanning matrix (CSM) for a set of proteins.
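
As an illustration of the CSM description above, the following minimal Python sketch builds one row of the matrix for a single protein. It is not the implementation of Pires et al. (2011); the function name csm_row, the toy coordinates, and the choice of a cumulative count (pairs with distance at most d, following the "<= d" notation used later in Algorithms 1 and 2) are assumptions made for illustration.

```python
import math
from itertools import combinations


def csm_row(ca_coords, d_min=0.0, d_max=30.0, d_step=0.2):
    """Return one CSM row: for each distance threshold d, the number of
    C-alpha pairs whose Euclidean distance is <= d (assumed cumulative)."""
    # All pairwise Euclidean distances between C-alpha atoms.
    dists = [math.dist(a, b) for a, b in combinations(ca_coords, 2)]
    # Number of thresholds in [d_min, d_max], both ends included.
    n_steps = int(round((d_max - d_min) / d_step)) + 1
    row = []
    for i in range(n_steps):
        d = d_min + i * d_step  # index-based threshold avoids float drift
        row.append(sum(1 for x in dists if x <= d))
    return row


# Hypothetical toy input: three collinear "residues" spaced 3.8 Å apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
row = csm_row(coords)
print(len(row), row[0], row[-1])
```

With the default range of 0 to 30 Å in steps of 0.2 Å, the row has one entry per threshold, and the last entry counts every pair that lies within the cutoff.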

The CSM method fails to capture the 3D structure of the protein. The CSM encodes only the distance patterns between the residues in the protein, while the angles between the residues affect the protein structure dramatically. Therefore, we propose a new representation that builds on the same concept of the scanning matrix method and captures the 3D structure of the protein. The following section presents such a representation in detail.

3.2. Protein structure matrix

We propose a protein structure matrix (PSM) representation that captures the 3D structure of the protein. The PSM encodes both the distance and angle patterns between pairs of protein residues. The distance and the angle between a residue pair are computed as done by Gao and Zaki (2008), as shown in Figure 2.

FIG. 2.

The distance and angle between two residues (Gao and Zaki, 2008).

The PSM is generated as follows. We first compute the Euclidean distance D(C_α) and angles θ (C_α) between all residue pairs for each protein. Then, the feature vector of a protein is made of the frequency distribution of the distances between pairs F[D(C_α)], the average angle A[ θ (C_α)], and standard deviation S[ θ (C_α)] of the angles between residue pairs in each bin of the distribution. The number of features (number of columns) is made constant for each protein; that is, we have a variable distance step size for each protein. Figure 3 depicts the PSM representation.

FIG. 3.

Protein structure matrix (PSM) for a set of proteins.

The distance between a pair of residues is the Euclidean distance between their C_α atoms. The angle between a pair of residues is computed as done by Gao and Zaki (2008), who used the coordinates of the N, C_α, and C atoms of each residue. Each residue is represented by the plane containing its N–C_α–C triangle. The angle between two residues is defined as the angle between the normals of their planes, as shown in Figure 2.

The normal to the N–C_α–C plane is defined by

\begin{align*} \vec{n} = \frac{\overrightarrow{NC_\alpha} \times \overrightarrow{C_\alpha C}}{\left\| \overrightarrow{NC_\alpha} \times \overrightarrow{C_\alpha C} \right\|} \tag{1} \end{align*}

Hence, the angle between the two normals n₁ and n₂ is calculated as

\begin{align*} \cos \theta = \frac{\left\| \vec{n}_1 \right\|^2 + \left\| \vec{n}_2 \right\|^2 - \left\| \vec{n}_2 - \vec{n}_1 \right\|^2}{2 \times \left\| \vec{n}_1 \right\| \times \left\| \vec{n}_2 \right\|} \tag{2} \end{align*}
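
The two formulas above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names (sub, cross, norm, plane_normal, angle_between) and the toy atom coordinates are hypothetical, and returning the angle in radians is an assumption, since the unit is not stated here. Because each normal has unit length by Eq. (1), Eq. (2) is equivalent to taking the dot product of the two normals.

```python
import math


def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    """Euclidean length of a 3D vector."""
    return math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)


def plane_normal(n_atom, ca_atom, c_atom):
    """Unit normal to the N-CA-C plane, following Eq. (1)."""
    v1 = sub(ca_atom, n_atom)  # vector from N to C-alpha
    v2 = sub(c_atom, ca_atom)  # vector from C-alpha to C
    cr = cross(v1, v2)
    length = norm(cr)
    if length == 0.0:
        raise ValueError("N, CA, and C are collinear; no plane is defined")
    return tuple(x / length for x in cr)


def angle_between(n1, n2):
    """Angle between two normals via the law of cosines, Eq. (2).
    Returned in radians (an assumption of this sketch)."""
    diff = sub(n2, n1)
    cos_t = (norm(n1) ** 2 + norm(n2) ** 2 - norm(diff) ** 2) / (
        2.0 * norm(n1) * norm(n2))
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding error
    return math.acos(cos_t)


# Hypothetical coordinates for two residues whose planes are perpendicular.
n_a = plane_normal((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
n_b = plane_normal((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0))
print(math.degrees(angle_between(n_a, n_b)))
```

The collinearity check is a design choice of the sketch: a residue whose three backbone atoms do not define a plane has no normal, so it is rejected rather than assigned an arbitrary angle.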

Algorithm 1 shows the detailed steps of computing the PSM.

Algorithm 1. Computation of Protein Structure Matrix (PSM)
Inputs: Atom coordinates of the proteins (Proteins). Number of features (N).
Output: Protein structure matrix (PSM).
Function ComputePSM(Proteins, N)
    Initialization: PSM(number of proteins, N × 3)
    for all p ∈ Proteins do
        j ← 0
        Compute D(C_α)
        Compute θ(C_α)
        d_max ← max(D(C_α))
        d_step ← d_max / (N − 1)
        for d ← 0 to d_max step d_step do
            PSM[p][j] ← F(D(C_α) ≤ d)
            PSM[p][j + 1] ← A_d(θ(C_α))
            PSM[p][j + 2] ← S_d(θ(C_α))
            j ← j + 3
        end for
    end for
    return PSM
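
The inner loop of Algorithm 1 can be sketched in Python for a single protein as follows. This is a sketch under stated assumptions rather than the implementation used in the experiments: the function name psm_row and the toy input are hypothetical; the count F is taken to be cumulative (pairs with distance at most d); the angle statistics A_d and S_d are taken over the pairs falling in the bin that ends at d; the population standard deviation is used; and empty bins are given 0 for both statistics.

```python
import statistics


def psm_row(pairs, n_features):
    """Build one PSM row from (distance, angle) values of all residue pairs.

    Returns 3 * n_features values: for each distance threshold, the
    cumulative pair count, then the mean and standard deviation of the
    angles of the pairs in the bin ending at that threshold.
    """
    if n_features < 2:
        raise ValueError("n_features must be at least 2")
    d_max = max(d for d, _ in pairs)
    d_step = d_max / (n_features - 1)  # step size differs per protein
    row = []
    for i in range(n_features):
        # Use d_max itself for the last threshold to avoid float drift.
        upper = d_max if i == n_features - 1 else i * d_step
        lower = (i - 1) * d_step
        count = sum(1 for d, _ in pairs if d <= upper)
        in_bin = [a for d, a in pairs if lower < d <= upper]
        # Assumption: empty bins contribute 0.0 for both statistics.
        mean = statistics.fmean(in_bin) if in_bin else 0.0
        std = statistics.pstdev(in_bin) if in_bin else 0.0
        row.extend([count, mean, std])
    return row


# Hypothetical toy input: (distance in Å, angle in degrees) per residue pair.
toy_pairs = [(2.0, 10.0), (4.0, 30.0), (4.0, 50.0), (8.0, 90.0)]
row = psm_row(toy_pairs, n_features=5)
print(len(row))
print(row[6:9])
```

Because d_step is derived from each protein's own maximum distance, every protein yields the same number of columns regardless of its size, which is what allows the rows to be stacked into one matrix.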

3.3. PSM with cutoff

To show the importance of encoding the angles in a protein representation, we present a variant of the PSM with a cutoff (PSM-C), so that the performance of the PSM representation can be compared directly with that of the CSM. The PSM-C is very similar to the CSM, but it additionally encodes the angles between residue pairs. The feature vector in the PSM-C is computed as in the PSM, except that it encodes only the pairs within a defined threshold distance (the cutoff). Algorithm 2 shows the details of computing the PSM-C.

Algorithm 2. Computation of Protein Structure Matrix with Cutoff (PSM-C)
Inputs: Atom coordinates of the proteins (Proteins). Minimum distance (d_min). Maximum distance (d_max). Distance step (d_step).
Output: Protein structure matrix with cutoff (PSM-C).
Function ComputePSM-C(Proteins, d_min, d_max, d_step)
    Initialization: PSM-C
    for all p ∈ Proteins do
        j ← 0
        Compute D(C_α)
        Compute θ(C_α)
        for d ← d_min to d_max step d_step do
            PSM-C[p][j] ← F(D(C_α) ≤ d)
            PSM-C[p][j + 1] ← A_d(θ(C_α))
            PSM-C[p][j + 2] ← S_d(θ(C_α))
            j ← j + 3
        end for
    end for
    return PSM-C

4. Results and Discussion

We used two data sets. The first is the same as that used by Pires et al. (2011): a gold standard set of mechanistically diverse enzyme superfamilies (Brown et al., 2006). Proteins that do not conform to the adopted configuration for computing angles between residues were excluded. This leaves a data set of 346 protein structures that are available in the Protein Data Bank (Berman et al., 2000) and belong to six superfamilies: amidohydrolase, crotonase, enolase, haloacid dehalogenase, isoprenoid synthase type I, and vicinal oxygen chelate. The second data set was created by Dobson and Doig (2003). It contains 1178 proteins, of which only 1157 conform to the adopted configuration. This data set is divided into 679 enzymes and 478 nonenzymes. We implemented three algorithms that use the CSM, PSM-C, and PSM representations to predict protein function on these data sets.

For both the CSM and PSM-C, the distance range proposed by Pires et al. (2011) was used: from 0 to a maximum distance of 30 Å with a step of 0.2 Å. This range produces 151 feature values. Hence, we also set the number of features for the PSM to 151 so that its results can be compared with those of the other two representations. Note that in the PSM and PSM-C, each feature vector contains 151 × 3 = 453 values. The Weka Toolkit developer version 3.7.6 (Witten and Frank, 2005) was used with 10-fold cross-validation. Three algorithms, naïve Bayes, k-nearest neighbors (KNN), and random forest, were used as classifiers, and prediction performance was evaluated using five measures: accuracy, precision, recall, f-measure, and receiver operating characteristic (ROC) area. Details of these performance measures for multiclass classification can be found in Sokolova and Lapalme (2009).
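
As a quick check of the feature counts quoted above, the following Python sketch enumerates the distance thresholds for the fixed range used by the CSM and PSM-C and for the per-protein range used by the PSM. The function names and the maximum distance of 75.5 Å for the example protein are hypothetical values chosen only for illustration.

```python
def fixed_thresholds(d_min, d_max, d_step):
    """Thresholds shared by all proteins (CSM and PSM-C)."""
    n = int(round((d_max - d_min) / d_step)) + 1
    return [d_min + i * d_step for i in range(n)]


def per_protein_thresholds(protein_d_max, n_features):
    """Thresholds for the PSM: a fixed count, with a step that depends on
    the largest pairwise distance in the protein."""
    d_step = protein_d_max / (n_features - 1)
    return [i * d_step for i in range(n_features)]


grid_c = fixed_thresholds(0.0, 30.0, 0.2)
# 75.5 Å is a made-up maximum pairwise distance for one protein.
grid_p = per_protein_thresholds(75.5, len(grid_c))

print(len(grid_c))      # number of distance thresholds
print(len(grid_c) * 3)  # count, mean angle, and angle std per threshold
print(len(grid_p))      # same number of thresholds for the PSM
```

The sketch yields 151 thresholds and 453 values per feature vector, matching the figures in the text; for the PSM only the step size changes from protein to protein.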

Experiment 1: The superfamily prediction results of the proposed PSM-C and PSM are compared with those of our implementation of the CSM. The classification results for the three representations using naïve Bayes, KNN, and random forest are summarized in Table 1. The results show that both the PSM-C and PSM, which incorporate angles with distance patterns to represent proteins, achieve higher prediction accuracy than the CSM. The results also show that the prediction accuracy for the PSM representation is lower than that for the PSM-C representation. That is, using a maximum cutoff distance for the feature vector characterizes the protein better than using all distances. A possible interpretation is that a cutoff distance encodes local features better than the full range of distances in the protein.

Table 1.

Superfamily Prediction Accuracy Comparison Between the Three Representations Using Different Classification Algorithms

	Prediction accuracy
Representation	Naïve Bayes (%)	KNN (%)	Random forest (%)
PSM-C	88.72	97.4	98.3
PSM	69.65	92.48	90.75
CSM	60.40	88.4	84.7

CSM, cutoff scanning matrix; KNN, k-nearest neighbors; PSM, protein structure matrix; PSM-C, protein structure matrix with a cutoff.

Experiment 2: The same data set is used to predict function at the family level. The PSM-C and PSM again achieve higher accuracy than the CSM. Table 2 summarizes the accuracy.

Table 2.

Family Prediction Accuracy Comparison Using Different Classification Algorithms

	Prediction accuracy
Representation	Naïve Bayes (%)	KNN (%)	Random forest (%)
PSM-C	83.52	90.46	91.04
PSM	76.30	87.86	83.52
CSM	60.40	76.58	72.54

Experiment 3: To distinguish enzymes from nonenzymes, the second data set, created by Dobson and Doig (2003), was used. The PSM-C and PSM achieved accuracies of 79.25% and 76.31%, respectively, using random forest. These results are higher than the accuracy achieved by Borgwardt et al. (2005), who, like the proposed representation, relied on protein structure only; they used nearest neighbors based on Distance-matrix ALIgnment (DALI) and achieved an accuracy of 75.07%. Compared with the CSM, the proposed representations still achieve higher accuracy using KNN and random forest. When naïve Bayes was used, the prediction accuracy of the CSM was higher than that of the PSM but lower than that of the PSM-C. The accuracy values using the different prediction methods are summarized in Table 3.

Table 3.

Enzyme Prediction Accuracy Comparison Using Different Classification Algorithms

	Prediction accuracy
Representation	Naïve Bayes (%)	KNN (%)	Random forest (%)
PSM-C	75.97	74.15	79.25
PSM	70.26	70.52	76.31
CSM	72.60	69.31	71.82

Experiment 4: Pires et al. (2011) used singular value decomposition (SVD) as a postprocessing step to improve the CSM representation. In this experiment, we compare the results of the PSM-C with those of the CSM with and without the SVD step, as shown in Table 4. The SVD step improves the prediction results of the CSM, but the PSM-C still outperforms the CSM with the SVD step. This means that the CSM needs additional computation to approach the prediction accuracy of the PSM-C.

Table 4.

Comparison of Superfamily Prediction Accuracy for PSM-C and CSM With and Without SVD

	Prediction accuracy
Representation	Naïve Bayes (%)	KNN (%)	Random forest (%)
PSM-C	88.72	97.4	98.3
CSM with SVD	81.50	95.95	93.64
CSM without SVD	60.40	88.4	84.7

SVD, singular value decomposition.

Experiment 5: The detailed classification results of the proposed PSM-C and the CSM using naïve Bayes, KNN, and random forest are shown in Tables 5, 6, and 7, respectively. Figures 4 and 5 compare recall and precision, respectively, for each enzyme superfamily. Using naïve Bayes, the PSM-C achieved an average recall of 88.7%, while the CSM achieved 60.4%. For KNN, the PSM-C and CSM achieved an average recall of 97.4% and 88.4%, respectively. Using random forest, the PSM-C achieved an average recall of 98.3%, while the CSM achieved 84.7%. In one case, however, as shown in Figure 5d, the naïve Bayes results for the haloacid dehalogenase class show that the PSM-C achieved a precision of 84.8%, which is lower than the 90.5% achieved by the CSM. Even so, the PSM-C has a higher recall (90.7%), as shown in Figure 4d, and a higher f-measure (87.6%).

FIG. 4.

(a–f) Recall comparison between CSM and PSM with a cutoff (PSM-C) for different enzyme superfamilies.

FIG. 5.

(a–f) Precision comparison between CSM and PSM-C for different enzyme superfamilies.

Table 5.

Superfamily Classification Comparison of CSM and PSM-C Using Naïve Bayes

	CSM				PSM-C
Superfamily	Precision	Recall	f-Measure	ROC	Precision	Recall	f-Measure	ROC
Amidohydrolase	0.669	0.831	0.741	0.844	0.887	0.887	0.887	0.953
Crotonase	0.193	0.611	0.293	0.821	0.6	0.667	0.632	0.981
Enolase	0.786	0.347	0.482	0.847	0.911	0.863	0.886	0.966
Haloacid dehalogenase	0.905	0.442	0.594	0.877	0.848	0.907	0.876	0.986
Isoprenoid synthase type I	0.4	0.857	0.545	0.947	1	1	1	1
Vicinal oxygen chelate	0.926	0.556	0.694	0.908	0.956	0.956	0.956	0.998
Weighted avg.	0.723	0.604	0.61	0.863	0.89	0.887	0.888	0.971

ROC, receiver operating characteristic.
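
The f-measure values in Table 5 can be cross-checked against the precision and recall columns, assuming the standard definition of the f-measure as the harmonic mean of precision and recall. The sketch below does this for the PSM-C columns; the dictionary name is hypothetical, and the values are copied from Table 5. Because the tabulated precision and recall are themselves rounded to three decimals, agreement is expected only to within about 0.001.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported f-measure) for PSM-C with naive Bayes,
# copied from Table 5.
table5_psmc = {
    "Amidohydrolase": (0.887, 0.887, 0.887),
    "Crotonase": (0.6, 0.667, 0.632),
    "Enolase": (0.911, 0.863, 0.886),
    "Haloacid dehalogenase": (0.848, 0.907, 0.876),
    "Isoprenoid synthase type I": (1.0, 1.0, 1.0),
    "Vicinal oxygen chelate": (0.956, 0.956, 0.956),
}

for name, (p, r, reported) in table5_psmc.items():
    computed = f_measure(p, r)
    print(f"{name}: computed {computed:.4f}, reported {reported:.3f}")
```

For haloacid dehalogenase, for example, the rounded inputs give approximately 0.8765, which is consistent with the reported 0.876 once the rounding of the inputs is taken into account.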

Table 6.

Superfamily Classification Comparison of CSM and PSM-C Using KNN

	CSM				PSM-C
Superfamily	Precision	Recall	f-Measure	ROC	Precision	Recall	f-Measure	ROC
Amidohydrolase	0.887	0.952	0.918	0.942	0.984	0.976	0.98	0.984
Crotonase	0.667	0.667	0.667	0.859	0.857	1	0.923	0.997
Enolase	0.856	0.811	0.832	0.879	0.978	0.958	0.968	0.978
Haloacid dehalogenase	0.976	0.93	0.952	0.978	0.976	0.953	0.965	0.972
Isoprenoid synthase type I	0.952	0.952	0.952	0.985	1	1	1	1
Vicinal oxygen chelate	0.907	0.867	0.886	0.93	0.978	1	0.989	0.999
Weighted avg.	0.885	0.884	0.884	0.926	0.975	0.974	0.974	0.984

Table 7.

Superfamily Classification Comparison of CSM and PSM-C Using Random Forest

	CSM				PSM-C
Superfamily	Precision	Recall	f-Measure	ROC	Precision	Recall	f-Measure	ROC
Amidohydrolase	0.858	0.927	0.891	0.965	0.984	0.992	0.988	0.999
Crotonase	0.65	0.722	0.684	0.925	1	0.944	0.971	1
Enolase	0.833	0.789	0.811	0.924	0.979	0.979	0.979	0.992
Haloacid dehalogenase	0.881	0.86	0.871	0.958	0.976	0.953	0.965	0.997
Isoprenoid synthase type I	0.9	0.857	0.878	0.996	1	1	1	1
Vicinal oxygen chelate	0.875	0.778	0.824	0.942	0.978	1	0.989	1
Weighted avg.	0.848	0.847	0.846	0.95	0.983	0.983	0.983	0.997

5. Conclusion

The function of a protein depends highly on its structure. Therefore, a good protein structure representation can lead to effective protein function prediction. In this article, we proposed a new protein structure representation, the PSM, which encodes protein structure based on the 3D patterns of protein residues: it encodes the angle patterns of protein residues along with their distance patterns. This representation significantly improved protein function prediction accuracy. The proposed method requires neither a prior alignment process nor any sequence information. Furthermore, it is independent of the prediction method. Computational results using three classification algorithms and two data sets show that the PSM representation achieves higher prediction accuracy than the CSM representation, which encodes only the distance patterns of protein residues. Our representation achieved a prediction accuracy of 98% in predicting superfamily. On average, it achieved an improvement of about 10% in the prediction accuracy of different protein families using different classification methods.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

References

Almonacid D.E., Yera E.R., Mitchell J.B.O., et al. 2010. Quantitative comparison of catalytic mechanisms and overall reactions in convergently evolved enzymes: implications for classification of enzyme function. PLoS Comput. Biol., 6, e1000700.

Alvarez M.A., and Yan 2012. A new protein graph model for function prediction. Comput. Biol. Chem., 37, 6–10.

Ashburner, Ball C.A., Blake J.A., et al. 2000. Gene ontology: tool for the unification of biology, The Gene Ontology Consortium. Nat. Genet., 25, 25–29.

Berman H.M., Westbrook, Feng, et al. 2000. The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

Boaretoa, Yamagishi M.E., Caticha, et al. 2012. Relationship between global structural parameters and Enzyme Commission hierarchy: implications for function prediction. Comput. Biol. Chem., 40, 15–19.

Borgwardt K.M., Ong C.S., and Schönauer 2005. Protein function prediction via graph kernels. Bioinformatics, 21, i47–i56.

Bray, Doig A.J., and Warwicker 2009. Sequence and structural features of enzymes and their active sites by EC class. J. Mol. Biol., 386, 1423–1436.

Brown S.D., Gerlt J.A., Seffernick J.L., et al. 2006. A gold standard set of mechanistically diverse enzyme superfamilies. Genome Biol., 7, R8.

Cheng, Tegge A.N., and Baldi 2008. Machine learning methods for protein structure prediction. IEEE Rev. Biomed. Eng., 1, 41–49.

Csaba, Birzele, and Zimmer 2008. Protein structure alignment considering phenotypic plasticity. Bioinformatics, 24, i98–i104.

Dobson P.D., and Doig A.J. 2003. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol., 330, 771–783.

Dobson P.D., and Doig A.J. 2005. Predicting enzyme class from protein structure without alignments. J. Mol. Biol., 345, 187–199.

Erdin, Venner, Lisewski A.M., et al. 2013. Function prediction from networks of local evolutionary similarity in protein structure. BMC Bioinform., 14, S6.

Erdin, Ward R.M., Venner, et al. 2010. Evolutionary trace annotation of protein function in the structural proteome. J. Mol. Biol., 396, 1451–1473.

Ewens, and Grant 2005. Statistical methods in bioinformatics: an introduction. In Gail, Krickeberg, Samet, Tsiatis, and Wong, eds. Statistics for Biology and Health, 2nd ed. Springer, New York.

Gao, and Zaki M.J. 2008. PSIST: a scalable approach to indexing protein structures using suffix trees. J. Parallel Distr. Com., 68, 55–63.

Gilbert, Westhead, Nagano, et al. 1999. Motif-based searching in tops protein topology databases. Bioinformatics, 15, 317–326.

Gilbert, Westhead, Viksna, et al. 2001. A computer system to perform structure comparison using tops representations of protein structure. Comput. Chem., 26, 23–30.

Han, and Kamber 2008. Data Mining: Concepts and Techniques. Elsevier, San Francisco, CA.

Jia, Huan, Buhr, et al. 2009. Towards comprehensive structural motif mining for better fold annotation in the “twilight zone” of sequence dissimilarity. BMC Bioinform., 10, S46.

Kihara, Sael, Chikhi, et al. 2011. Molecular surface representation using 3d zernike descriptors for protein shape comparison and docking. Curr. Protein Pept. Sci., 12, 520–530.

Ku, and Hu 2012. Structural alphabet motif discovery and a structural motif database. Comput. Biol. Med., 42, 93–105.

La, Esquivel-Rodríguez, Venkatraman, et al. 2009. 3D-SURFER: software for high-throughput protein surface comparison and analysis. Bioinformatics, 25, 2843–2844.

Larranaga, Calvo, Santana, et al. 2006. Machine learning in bioinformatics. Brief Bioinform., 7, 86–112.

Mewes H.W., Frishman, Güldener, et al. 2002. MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 30, 31–34.

Nisius, Sha, and Gohlke 2012. Structure-based computational analysis of protein binding sites for function and druggability prediction. J. Biotechnol., 159, 123–134.

Pires D.E., Melo-Minardi R.C., Santos M.A., et al. 2011. Cutoff Scanning Matrix (CSM): structural classification and function prediction by protein inter-residue distance patterns. BMC Genomics, 12, S12.

Rahimi, Madadkar-Sobhani, Touserkani, et al. 2013. Efficacy of function specific 3D-motifs in enzyme classification according to their EC-numbers. J. Theor. Biol., 336, 36–43.

Ritchie D.W., Ghoorah A.W., Mavridis, et al. 2012. Fast protein structure alignment using Gaussian overlap scoring of backbone peptide fragment similarity. Bioinformatics, 28, 3274–3281.

Ruepp, Zollner, Maier, et al. 2004. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res., 32, 5539–5545.

Sael, Li, La, et al. 2008a. Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins, 72, 1259–1273.

Sael, Li, La, et al. 2008b. Rapid comparison of properties on protein surface. Proteins, 73, 1–10.

Sayers E.W., Barrett, Benson D.A., et al. 2012. Database resources of the national center for biotechnology information. Nucleic Acids Res., 40, D13–D25.

Sokolova, and Lapalme 2009. A systematic analysis of performance measures for classification tasks. Inform. Process Manag., 45, 427–437.

Veeramalai, and Gilbert 2008. A novel method for comparing topological models of protein structures enhanced with ligand information. Bioinformatics, 24, 2698–2705.

Vuong, Stephens R.M., and Volfovsky 2012. AVIA: an interactive web-server for annotation, visualization and impact analysis of genomic variations. BMC Proc., 6, P37.

Wang Y.C., Wang, Yang, et al. 2011. Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context. BMC Syst. Biol., 5, S6.

Wang, Yin, Lee J.S., et al. 2013. Protein function annotation with Structurally Aligned Local Sites of Activity (SALSAs). BMC Bioinform., 14, S13.

Witten I.H., and Frank 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA.

Xie, and Bourne P.E. 2007. A robust and efficient algorithm for the shape description of protein structures and its application in predicting ligand binding sites. BMC Bioinform., 8, S9.

Zhao, Yang, and Zhou 2010. Structure-based prediction of DNA-binding proteins by structural alignment and a volume-fraction corrected DFIRE-based energy function. Bioinformatics, 26, 1857–1863.

Zhao, Xu, Liang, et al. 2011. Structure-based de novo prediction of zinc-binding sites in proteins of unknown function. Bioinformatics, 27, 1262–1268.