Knowledge-based scoring functions have been widely used for protein structure prediction and for modeling protein–small molecule and protein–nucleic acid interactions, in which one critical step is to find an appropriate representation of protein structures. A key issue is to determine the minimal protein representation, which is important not only for developing scoring functions but also for understanding the physics of protein folding. Despite significant progress in simplifying residues into reduced alphabets, few studies have addressed the optimal number of atom types for proteins. Here, we have investigated the atom typing issue by classifying the 167 heavy atoms of proteins through 11 schemes with 1 to 20 atom types based on their physicochemical and functional environments. For each atom typing scheme, a statistical mechanics-based iterative method was used to extract atomic distance-dependent potentials from protein structures. The atomic distance-dependent pair potentials for the different schemes are illustrated by several typical atom pairs with different physicochemical properties. The derived potentials were also evaluated on a high-resolution test set of 148 diverse proteins for native structure recognition. The success rate as a function of the number of atom types showed a crossover around the scheme of four atom types, which suggests that four atom types may be sufficient when investigating the basic folding mechanism of proteins. However, a close examination of typical potentials revealed that 14 atom types were needed to describe protein interactions at the atomic level. The present study will be beneficial for the development of protein-related scoring functions and the understanding of folding mechanisms.
Appropriate representation of protein structures is important in computational structural biology. It bears not only on the development of scoring functions but also on the understanding of protein folding mechanisms. Coarse-grained modeling is one way to represent protein structures and is often applied to reduce computational cost. A key factor in such a model is to determine the minimal protein representation, residue-based or atom-based, that does not sacrifice essential precision. Simplifying the twenty amino acids into a smaller representative alphabet has been widely studied in protein folding,[1–5] molecular docking,[6] protein structure prediction,[7,8] protein design,[9] protein function,[10] protein classification,[11–13] and protein sequence alignment.[14,15] Baker et al. designed a 57-residue protein, Src SH3, with a reduced alphabet of five amino acid types.[5] Motivated by their work, Wang and Wang obtained an optimal reduction to five residue types, based on minimizing the mismatch between a reduced interaction matrix and the Miyazawa and Jernigan (MJ) matrix; this reduction has the same form as the simplified palette of Baker and coworkers.[1] In coarse-grained molecular simulations, lattice chain models with two or more residue types (HP, HNP, IAGEK, etc.) have provided a good understanding of the folding process of natural proteins.[4,9,16–18] Although residue-based typing in protein modeling, folding, and design has been studied extensively,[1,3,19–23] the optimal atom classification for proteins has received little attention.
In fact, when scoring protein structure predictions and protein–protein, protein–ligand, and protein–nucleic acid interactions, atom-based potentials are often required for an accurate evaluation.[24–28] Over the years, various protein atom typing schemes have been developed to characterize the interactions between atoms.[29–34] However, the optimization of atom typing remains an important issue. Current approaches normally classify protein atoms based on their physical and chemical properties. These properties include atomic number, charge, polarity, hydrophobicity, hydrogen bonding, local chemical environment, protein secondary structure environment, etc.[35–39]
Despite the successes of current atom typing schemes on some systems, their classification methods are somewhat arbitrary and depend on the specific systems studied. Given the 167 heavy atoms of the 20 amino acids, protein atoms could be grouped into anywhere from 1 to 167 types.[36,37] If a scheme has a small number of atom types, the potentials of the scoring function will be simple and fast to evaluate, but their resolution in characterizing the atomic interactions will be sacrificed. Conversely, if a scheme has a large number of atom types, the corresponding potentials will describe the interactions at a better resolution, but they may be slower to evaluate because there are more possible interacting pairs, and they may carry larger errors because of insufficient statistics in the derivation of the scoring functions.[32–35] Therefore, an appropriate typing scheme is needed to achieve a good balance between accuracy and resolution, especially for knowledge-based scoring functions.[27]
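To make the cost of a finer scheme concrete, the following minimal sketch counts the distinct type pairs for a given number of atom types, assuming that the pair potentials are symmetric in the two atom types, so that N types give N(N+1)/2 pairs. The function name is hypothetical and used only for illustration.

```python
def n_pair_types(n_atom_types: int) -> int:
    """Number of unordered atom-type pairs (i, j) with i <= j.

    Assumes the pair potential u_ij(r) equals u_ji(r), so each
    unordered pair needs only one potential curve.
    """
    return n_atom_types * (n_atom_types + 1) // 2


if __name__ == "__main__":
    # 1 type, the C/N/O/S scheme, the 14- and 20-type schemes,
    # and the limit of one type per heavy atom.
    for n in (1, 4, 14, 20, 167):
        print(f"{n:3d} atom types -> {n_pair_types(n):5d} pair potentials")
```

Under this assumption, 4 atom types require 10 potential curves, 14 types require 105, and one type per heavy atom (167 types) would require 14028, which illustrates why the statistics available per pair shrink quickly as the scheme is refined.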
In this work, we have used 11 schemes to categorize protein atoms based on their physical and chemical environments in proteins. The 11 schemes give 1 to 20 atom types, depending on the level of detail of the atomic environment that is taken into account. Using a statistical mechanics-based iterative method[35] and a training set of 1225 proteins,[40] we have derived 11 sets of knowledge-based pair potentials, one for each atom typing scheme. The quality of the different atom typing schemes was assessed by the ability of the resulting scoring functions to discriminate native structures from decoys on a test set of 148 proteins with 500–1600 high-resolution (HR) decoys for each protein.[40] We have also examined the derived knowledge-based potentials of some typical atom pairs to relate the accuracy of the potentials to the dimension of the parameter space. The present work provides a reference for the optimal number of atom types for proteins in the development of scoring functions and will be useful for the study of protein design and structure prediction.
The 11 schemes have been used to classify the 167 heavy atoms of proteins based on their physical, chemical, and functional environment. The details of the atom typing schemes are illustrated in Fig.
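As a minimal sketch of what an atom typing scheme looks like in practice, the code below implements only the two simplest cases: a single type for all heavy atoms, and the four-type element scheme (C, N, O, and S). It assumes atom labels of the form `RES_ATOM` (e.g., `ASP_OD2`) and relies on the standard PDB convention that the first letter of a protein heavy-atom name is its element. The function names are hypothetical, and the intermediate and finer schemes of this work are not reproduced here.

```python
ELEMENT_TYPES = {"C", "N", "O", "S"}


def one_type(label: str) -> str:
    """Coarsest scheme: every heavy atom shares a single type."""
    return "X"


def element_type(label: str) -> str:
    """Four-type scheme: classify a heavy atom by its element.

    `label` is assumed to look like 'ASP_OD2' (residue name,
    underscore, PDB atom name). For the 20 standard amino acids the
    first character of the atom name is the element symbol.
    """
    atom_name = label.split("_", 1)[1]
    element = atom_name[0]
    if element not in ELEMENT_TYPES:
        raise ValueError(f"unexpected heavy atom: {label}")
    return element


if __name__ == "__main__":
    for label in ("ASP_OD2", "ARG_NH1", "LYS_NZ", "MET_SD", "ALA_CA"):
        print(label, one_type(label), element_type(label))
```

A finer scheme would replace `element_type` with a lookup table from each of the 167 labels to its type, which keeps the rest of a scoring pipeline unchanged.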
During our computations, a large training set of experimentally determined protein structures and computationally generated decoys was used to derive the scoring function of pair potentials.[40] Through a statistical mechanics-based iterative method, all the native structures are expected to have the lowest energy scores compared with their respective decoys. The basic idea of the iteration process is described by the following expressions:[35]
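A minimal sketch of one update step of this kind is given here, assuming the form commonly used in iterative knowledge-based methods: the potential of each atom-type pair at each distance bin is shifted in proportion to the difference between the pair distribution function predicted with the current potentials and the one observed in native structures. The step size, the energy unit, the array layout, and all names are assumptions for illustration and are not taken from Ref. [35].

```python
import numpy as np

K_B_T = 1.0  # thermal energy used as the energy unit (assumed)


def update_potentials(u, g_pred, g_obs, step=0.5):
    """One iterative step: u_new = u + step * kT * (g_pred - g_obs).

    u, g_pred and g_obs are arrays of identical shape, e.g.
    (n_types, n_types, n_bins). Where the current potentials
    over-populate a distance bin (g_pred > g_obs), the potential is
    raised; where they under-populate it, the potential is lowered.
    """
    u = np.asarray(u, dtype=float)
    return u + step * K_B_T * (np.asarray(g_pred) - np.asarray(g_obs))


def has_converged(g_pred, g_obs, tol=1e-3):
    """Stop once predicted and observed distributions agree within tol."""
    return float(np.max(np.abs(np.asarray(g_pred) - np.asarray(g_obs)))) < tol


if __name__ == "__main__":
    u0 = np.zeros(3)
    g_pred = np.array([1.2, 1.0, 0.8])
    g_obs = np.array([1.0, 1.0, 1.0])
    print(update_potentials(u0, g_pred, g_obs))
```

In a full procedure, `g_pred` would be recomputed from the native and decoy structures with the updated potentials before each new step; that part is omitted here.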
The statistical energy score of each native structure or non-native/decoy structure at the k-th iterative step can be calculated as follows:
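As a hedged illustration, a pairwise distance-dependent score of this kind can be sketched as a sum of tabulated potentials over all atom pairs within a cutoff. The cutoff of 10 Å, the bin width of 0.2 Å, the array layout, and the function name are assumptions chosen for the example; exclusions of covalently bonded or sequence-neighboring atoms, which a practical scoring function would typically apply, are omitted.

```python
import numpy as np


def energy_score(coords, types, u, r_max=10.0, bin_width=0.2):
    """Sum u[type_a, type_b, bin(r_ab)] over all atom pairs with r < r_max.

    coords : (n_atoms, 3) Cartesian coordinates
    types  : length n_atoms sequence of integer atom-type indices
    u      : (n_types, n_types, n_bins) symmetric table of pair potentials
    """
    coords = np.asarray(coords, dtype=float)
    n_bins = u.shape[2]
    total = 0.0
    for a in range(len(coords)):
        for b in range(a + 1, len(coords)):
            r = float(np.linalg.norm(coords[a] - coords[b]))
            if r >= r_max:
                continue  # pairs beyond the cutoff do not contribute
            k = min(int(r / bin_width), n_bins - 1)
            total += u[types[a], types[b], k]
    return total


if __name__ == "__main__":
    u = np.zeros((2, 2, 50))
    u[0, 1, 5] = u[1, 0, 5] = -1.0  # attractive well in the 1.0-1.2 bin
    coords = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [0.0, 20.0, 0.0]]
    print(energy_score(coords, [0, 1, 0], u))
```

The same function would be applied to a native structure and to each of its decoys, so that their scores can be compared at every iterative step.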
The iterative method is robust and has been widely used in studying protein–protein interactions, protein–ligand interactions, protein–RNA/DNA interactions, and protein structure prediction.[32–35,42,45] The iterative method circumvents the long-standing reference state problem in the development of knowledge-based scoring functions by improving the pair potentials iteratively through comparison of the physics-based pair distribution functions. The scoring function ITScorePP, which was derived by a similar iterative method, has proven effective in past Critical Assessment of PRediction of Interactions (CAPRI) experiments.[46,47]
In the present work, we have used the training set of 1225 proteins and the test set of 148 proteins that were generated by Rajgaria et al.[40] All the proteins, which were selected by Zhang and Skolnick, are nonredundant single-domain proteins with a maximum pairwise sequence similarity of 35%.[48] The length of these proteins ranges from 41 to 200 amino acids. This set also has a uniform distribution of α,
Out of the 1225 proteins, we have randomly chosen 500 proteins as the training set for our iterative computation. The iteration was run 10 times, and the final potentials were the averages over the potentials resulting from the 10 runs. For the sake of computational efficiency, we have selected 400 decoys for each protein in the iteration. Our main purpose is to relate the effectiveness of the statistical interaction potentials to the dimension of the atom typing schemes, so we have focused on comparing the effectiveness of the different schemes in protein structure prediction in order to identify an optimal one. Therefore, as long as the statistics for atom pairs are sufficient, the number of proteins in the training set and the number of decoys for each protein are relatively minor factors in our calculations. Given the interaction pair potentials for the different atom typing schemes, we have closely examined the potential energy curves of several typical atom pairs with different physicochemical properties. All the derived scoring functions were also evaluated in terms of native structure recognition and five other parameters on the test set of 148 proteins with 500–1600 HR decoys for each protein.
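The sampling and averaging protocol can be sketched as follows. The sketch assumes that each of the 10 runs draws its own random subset of 500 proteins, which is one possible reading of the protocol, and it treats the derivation of the potentials as a user-supplied callable. The names `average_over_runs` and `derive` are hypothetical.

```python
import numpy as np


def average_over_runs(derive, protein_ids, n_runs=10, subset_size=500, seed=0):
    """Average the potentials obtained from several independent runs.

    derive      : callable mapping an array of protein ids to an array
                  of potentials (stands in for the iterative derivation)
    protein_ids : ids of the full pool (1225 proteins in this work)
    Each run samples `subset_size` proteins without replacement
    (assumption: a fresh random subset per run).
    """
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_runs):
        subset = rng.choice(protein_ids, size=subset_size, replace=False)
        runs.append(derive(subset))
    return np.mean(runs, axis=0)


if __name__ == "__main__":
    # Toy stand-in for the derivation: returns the subset size in every bin.
    toy_derive = lambda ids: np.full(3, float(len(ids)))
    print(average_over_runs(toy_derive, np.arange(1225)))
```

Averaging over runs reduces the dependence of the final potentials on any particular random choice of training proteins.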
Considering the set of randomly selected 500 nonhomologous proteins, our large training database results in significant statistics of frequencies
Through a statistical mechanics-based iterative method, we have developed 11 sets of atomic distance-dependent pair potentials corresponding to the 11 atom typing schemes. To illustrate how the atom typing schemes affect the derived statistical interaction potentials and to obtain an overall physical picture of the atomic interactions, we have plotted the potential energy curves of several typical atom pairs with different physicochemical properties for each scheme, from 1 to 20 atom types. As in our previous study,[35] we have chosen several representative atom pairs for demonstration in the present work. They stand for electrostatic interactions [O2-]–[N2+] (e.g., ASP_OD2–ARG_NH1) and [O2-]–[N3+] (e.g., ASP_OD2–LYS_NZ) (Figs.
Several common trends can be observed in the knowledge-based pair potentials under the different atom typing schemes. First, as the number of atom types increases, the equilibrium positions of the potentials tend to shift toward shorter distances. Second, the potential wells become deeper as more atom types are distinguished. For clarity, the equilibrium positions and depths of the potential wells of all typical atom pairs are listed in Table
Atom typing is a critical aspect in the development of knowledge-based scoring functions. An important criterion for the quality of an atom typing scheme is therefore the ability of the corresponding scoring function to discriminate native structures from decoys. Accordingly, we have evaluated the performance of the 11 scoring functions, derived from the 11 atom typing schemes through the statistical mechanics-based iterative method, on the high-resolution (HR) decoy set of 148 proteins generated by Rajgaria et al.[40] Table
One common feature can be found in the five assessment parameters (all except the Z-score) as a function of the number of atom types. Namely, the values of the five parameters change rapidly at first and then become relatively stable as the number of atom types increases (Table
From the above assessments of the different atom typing schemes, we have found that the scheme of four atom types performed best among the low-dimensional schemes, while the scheme of 14 atom types performed best among the high-dimensional ones. The four atom types correspond to the natural atom typing scheme of C, N, O, and S. This scheme gives an excellent average Z-score and rank, and its other parameters are not far from the maximum values reached in high dimensions. Therefore, the scheme of four atom types may be used for rough (low-precision) screening in protein structure prediction, as this simple scheme greatly cuts down the computational cost. When accurate prediction is needed, one can turn to the scheme of 14 atom types to evaluate structures at high resolution.
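Three of the assessment quantities referred to above (rank of the native structure, Z-score, and success rate) can be sketched as follows. The definitions are the conventional ones and are assumptions here: lower scores are taken as better, the Z-score is the native score minus the mean decoy score in units of the decoy standard deviation, and a case counts as a success when the native structure ranks first. The function names are hypothetical.

```python
import numpy as np


def native_rank(native_score, decoy_scores):
    """Rank of the native structure (1 = lowest score of all)."""
    return 1 + int(np.sum(np.asarray(decoy_scores, dtype=float) < native_score))


def z_score(native_score, decoy_scores):
    """(native - mean(decoys)) / std(decoys); more negative is better."""
    d = np.asarray(decoy_scores, dtype=float)
    return (native_score - d.mean()) / d.std()


def success_rate(cases):
    """Fraction of (native_score, decoy_scores) cases with native ranked first."""
    hits = sum(native_rank(n, d) == 1 for n, d in cases)
    return hits / len(cases)


if __name__ == "__main__":
    cases = [(-5.0, [-3.0, -4.0]), (-5.0, [-6.0, -4.0])]
    for native, decoys in cases:
        print(native_rank(native, decoys), z_score(native, decoys))
    print(success_rate(cases))
```

Applied to each of the 11 scoring functions on the same test set, such quantities allow the schemes to be compared on an equal footing.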
In this work, we have addressed the protein atom typing problem by categorizing protein heavy atoms according to 11 atom typing schemes. The knowledge-based pair potentials for the 11 schemes were derived using a statistical mechanics-based iterative method. The schemes were evaluated by comparing the derived knowledge-based pair potentials and by the ability of the corresponding scoring functions to discriminate native structures from decoys. It was found that the derived pair potentials started to converge when 14 or more atom types were used, while four types were enough to obtain a satisfactory success rate in native structure recognition. The results suggest that the number of atom types could range from 4 to 14 in practical applications, depending on the systems studied. The scheme of four atom types (i.e., C, N, O, and S) could be used for assessing the overall quality of protein structures, while an accurate description of interactions in protein structures at the atomic level may require a finer scheme of 14 atom types. The present study provides basic guidance for the classification of protein atoms and is expected to benefit the development of scoring functions and the understanding of interaction mechanisms in proteins.