Reweighted ensemble dynamics simulations: Theory, improvement, and application<xref ref-type="fn" rid="cpb142837fn1">*</xref>

Cite this Article

Gong Lin-Chen, Zhou Xin, Ouyang Zhong-Can. Reweighted ensemble dynamics simulations: Theory, improvement, and application* . Chinese Physics B, 2015, 24(6): 060202 Copy to clipboard

Permissions

Reweighted ensemble dynamics simulations: Theory, improvement, and application*

Gong Lin-Chen^a), Zhou Xin^b)†, Ouyang Zhong-Can^a),^c)

a)Center for Advanced Study, Tsinghua University, Beijing 100084, China

b)School of Physics, University of Chinese Academy of Sciences, Beijing 100049, China

c)Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China

^†Corresponding author. E-mail: xzhou@ucas.ac.cn

^*Project supported by the National Natural Science Foundation of China (Grant No. 11175250).

Abstract

Based on multiple parallel short molecular dynamics simulation trajectories, we designed the reweighted ensemble dynamics (RED) method to more efficiently sample complex (biopolymer) systems, and to explore their hierarchical metastable states. Here we further present an improvement to depress statistical errors of the RED and we discuss a few keys in practical application of the RED, provide schemes on selection of basis functions, and determination of the free parameter in the RED. We illustrate the application of the improvements in two toy models and in the solvated alanine dipeptide. The results show the RED enables us to capture the topology of multiple-state transition networks, to detect the diffusion-like dynamical behavior in an entropy-dominated system, and to identify solvent effects in the solvated peptides. The illustrations serve as general applications of the RED in more complex biopolymer systems.

Keyword: 02.70.Ns; 87.15.A–; 82.20.Wt; ensemble dynamics simulations; peptides; enhanced sampling; metastable states

Show Figures

1. Introduction

Exploring for the high-dimensional conformational space of complex molecular systems has long been the focus of computer simulation studies in recent years. Two challenges exist under this topic, i.e., how to thoroughly sample the conformational space, and how to understand the equilibrium and dynamic properties based on lots of simulation data in hand. Although standard Monte-Carlo (MC) and molecular-dynamics (MD) simulation methods have been widely applied and achieved great successes, for many complex systems, such as glassy systems, phase-transition systems, and solvated biopolymers, sufficient sampling is by no means an easy task. The rugged energy landscapes in the systems frequently trap simulation trajectories in a small conformational region, disallowing the thorough exploration of the whole conformational space within limited simulation time. Due to those, lots of various advanced simulation techniques, such as umbrella sampling, ^{[1– 3]} J-walk, ^[4] multicanonical ensembles, ^[5] simulated tempering, ^{[6, 7]} conformational flooding, ^[8] hyperdynamics, ^{[9– 11]} parallel tempering, ^{[12, 13]} Wang– Landau sampling^[14] and its derivatives, ^{[15, 16]} meta-dynamics, ^[17] and multidimensional generalized-ensemble methods, ^[18] have been developed and applied to overcome the difficulties. New hardware^[19] and parallel computational technique^[20] were also designed to extend MD simulations to reach microsecond scale in protein systems. On the other hand, due to the growing power of massively distributed computation, the ensemble dynamics (ED), ^{[21– 23]} which independently runs multiple simulations in distributed computers, provides an alternative way for bridging the gap between simulation power and practical interest.

Besides efficiently sampling, proper analysis and understanding systems from the simulation data are also not a trivial task. The major purpose in data analyses, is to search for a coarse-grained description of the conformational space. In the traditional way, a few (usually only one or two) knowledge-based reaction coordinates (or order parameters) are first defined, then the equilibrium (and dynamic) properties of the system are achieved by mapping sampled configurations in the low-dimensional reaction-coordinate space. However, this method may suffer from the arbitrariness in the selection of reaction coordinates, and give an oversimplified description of the system.^[24] Recently, a systematic metastable-state-network-based view brought prominent improvement.^{[25– 34]} Due to being theoretically transparent and inclusive of nearly all the useful information, the network model has received much attention.

In a previous work, ^[35] we have proposed a method, which was previously denoted as the weighted ensemble dynamics, now renamed as the reweighted ensemble dynamics (RED), to enhance sampling and to form a metastable-state transition network.^[36] The RED is following the spirit of ED, which generated multiple short simulation trajectories from dispersively-chosen initial conformations, then applies a network-based analysis to get equilibrium and dynamics properties from all of these trajectories. In the present paper, we further improve and validate the RED, including: (i) we present a scheme to select basis functions and judge their completeness; (ii) we discuss the unique free parameter in the RED; (iii) we provide an improvement to depress statistical errors. We illustrate the improved RED in a few different systems, give the topology structure of a multiple-state transition network, and predict the diffusion-like dynamical property in entropy-dominated systems. More importantly, we apply the RED to analyze the effects of different aspects (degrees of freedom, interactions, etc.) separately, such as the solvent effects in the solvated dipeptide.

The article is organized as follows. In Section 2, we briefly represent the main formula of the RED, then we discuss a few keys in the RED, such as selection of basis functions, the effects of unique parameters, and the improvement of statistical errors. In Section 3, we illustrate the application of the RED in two-dimensional multiple-well potentials and in solved dipeptides. A short conclusion is given in Section 4. The simulation and analysis details are presented in Appendix A.

2. Theory and method

In the RED, multiple MD (or MC) trajectories, {q_i(t)}, i = 1, … , p, t ∈ [0, τ ], are independently generated from initial conformations, denoted as {q_i(0) = q_i0}, i = 1, … , p, under the same simulation condition, respectively. Each trajectory involves m conformations, {q_i(t = jΔ ) = q_{i, j}}, j = 0, … , m − 1, thus total m × p conformations are sampled. Here q is a simple notation of configurational coordinates. Δ ; is the time interval of the samples, and mΔ ; = τ . We denote the distribution of samples in a single trajectory as the initial distribution of all the trajectories as

2.1. Theory of the RED

Usually, while τ is shorter than the equilibrium time of systems, a single trajectory does not reach the equilibrium, i.e., P_i(q) ≠ P_eq(q), but we can use all of these trajectories to reproduce the equilibrium distribution,

where w_i is the weight of the i-th trajectory, which is due to the deviation of the initial distribution from the equilibrium distribution. A possible selection of {w_i} is,

Thus, we have a linear equation about {w_i},

where

and ∑ _jw_j = p.

Due to the finite size of samples {q_i, j}, equation (1) should be understood as

for any A(q), while 〈 A(q)〉 _i is estimated as Similarly, the ratio of two distributions is understood as

while 〈 … 〉 _α is estimated by average in the corresponding conformation samples. By expanding on a complete set of conformational functions (named as basis functions) {A^μ(q)}, μ = 1, 2, … , we have,

Here g_{μ ν} is the inverse of g^{μ ν} ≡ 〈 A^μ(q) A^ν(q)〉 _init which is estimated by the average in the initial sample.

In practical application, instead of A(q_i, 0) in estimation of Γ _ij, we use the average of A(q) in a short t̃ -length initial segment of trajectory (the first m₀ configurations with m₀ ≪ m or t̃ ≪ τ ), i.e.,

then

Here {Â ^μ(q)} is the orthonormalized basis functions satisfying 〈 Â ^μ(q)Â _ν(q)〉 _init+ = δ _{μ ν} with

Thus, equation (3) can be rewritten as

Here H ≡ G^TG is the variance– covariance matrix of the trajectory-mapped vectors

where G_ij ≡ Γ _ij − δ _ij with i = 1, … , p, and w = (w₁, … , w_p)^T. Equation (6) is the main formula of the RED. The w corresponds to the ground state of matrix H whose number of degenerated ground states gives the number of disconnected metastable states under the timescale of the individual trajectory τ , the existence of a small fraction of transition trajectories among states can slightly split the ground state of H. More details are shown in our previous work.^[35]

2.2. Practical schemes of the RED

The selection of basis functions is a key in the practical application of RED. In our previous works, ^{[35, 36]} we briefly discussed the selection of basis functions, and used some triangle functions as basis functions like Fourier expansion. Here we more generally discuss the key point of the RED.

2.2.1. Selection of basis functions

Although the expansion in the RED required a complete set of basis functions in principle, it is usually not necessary to include too many basis functions in practice. Usually, it is easy to choose the initial configurations so that its distribution P_init+(q) is already local equilibrium, thus only slow processes need to be reweighted by the RED. Therefore Ω (q) in the all-atomic q space can be replaced by Ω (x) = P_eq(x)/[P_init+(x)] in a coarse-grained x space, where x represents the slow (coarse-grained) degrees of freedom, P(x) is the corresponding distribution function in the x space. For example, in proteins, if we disregard the first few nanoseconds of relaxation simulation and reset the zero point of time, it is safe to suppose only some slow degrees of freedom, such as torsion angles of backbone, positions of side-chain atomic groups, be x, thus only functions in the x space should be applied as basis functions. In many cases, systems might reach equilibrium within the τ time (length of individual trajectory) except for only a few collective variables, thus we even apply the RED in a low-dimensional space. In addition, we usually suppose there are many potential-energy basins (or super-basins) separated by high (free) energy barriers in the conformational space; each (super-)basin corresponds to a metastable state wherein a τ -length trajectory is sufficient to reach local equilibrium. Thus, Ω (x) looks like a step-like function, which is almost constant in each metastable state, variations only happening at boundaries between states. We approximately have

where Θ _α(x) is character function of the α metastable state, which is unity if x is inside the state, zero otherwise. The set of character functions of states {Θ _α(x)} provides a natural set of basis functions to expand Ω (x), thus the required number of basis functions is equal to the number of states. Since the boundaries of these states are unknown a priori, we usually use far more basis functions to linearly combine the step-like Ω (x) function. We may use some usual collective variables with clear physical meaning as basis functions. In biological macromolecules, dihedral angles and their analytical functions, pair distances between some key atomic groups, potential energy or its parts, solved area of proteins, radius of gyration, native contact number, number of hydrogen, etc., can be applied as basis functions. The variance– covariance matrix g^{μ ν} will ensure the consistent consideration of different classes of basis functions, and provide the measure of their relative importance by the orthonormalization process. In addition, the number of independent basis functions is also limited by the finite size of init+ , thus more selected basis functions will be discarded after considering the matrix g_{μ ν}, then do not significantly contribute to final results. Another possible scheme about the selection of basis functions is to use local functions in the conformational space. For example, for a set of sampled configurations, {x_i}, i = 1, … , M, one may group them into some subsets by clustering algorithms based on their geometric similarity, and use the character functions of clusters as basis functions. Thus the number of basis functions is the number of clusters, n_cl = M/m. Here M and m are the total number of sampled conformations and the (average) number of conformations in each cluster. The idea is actually already applied in the Markov State Model (MSM) to divide the whole conformational space into metastable states.^{[24, 30– 34]} In this paper, we use physically meaningful collective variables as basis functions, the application of the sample-based discrete cluster functions will appear elsewhere.

2.2.2. Completeness of basis functions

The completeness of selected basis functions can be directly test-based on sampled configurations in the RED. For a set of basis functions {A^μ(x)}, we can estimate the difference between two samples (or the corresponding distributions), such as the initial distribution P_init+(x) and the equilibrium distribution P_eq(x),

The definition gives a measurement about the similarity of samples, and can be generally applied to compare two samples. For example, we had used the formula to optimize coarse-grained models of the all-atomic water.^[37] For assessing the completeness of basis functions in the expansion of P_eq(x)/[P_init+(x)], a reasonable method is to check the convergence of while adding more and more basis functions.

To calculate the sample for equilibrium distribution P_eq(x), thus the trajectory weights {w_i}, are necessary. We propose two methods to bypass this difficulty. First, before reweighting, P_eq(x) is replaced by and we roughly select basis functions. Second, after reweighting, can be calculated with the resulting trajectory weights. This procedure can be used to test the former selection. If the rigorously calculated has reached its saturation region with previously roughly selected basis functions, the results should be satisfiable. However, this posterior examination is restricted to the cases that the trajectory weights can be uniquely determined. In practice, the distribution , has the same local equilibrium characteristics in each metastable region as P_init+ and P_eq, thus is similarly a step function, which is also constant in each meta-stable region, only the constant values are different from those of Ω _{init+ , eq}. Thus we believe the a priori test is already enough, and the following results are all calculated in this way.

Since the statistical errors exist due to the finite sample sizes, S² would not strictly stop growing with the number of basis functions, it just increases slowly compared to the initial growing phase with increasing basis functions. Similarly, when a large error occurs, S² may abnormally increase within the saturation region. We design a scheme to get rid of the statistically unreliable basis functions. We orthonormalize the functions {A^μ(x)}, leading to the explicitly identifiable value and error contribution from the orthonormalized basis functions to S². In the present paper, we implement the Gram– Schmidt orthonormalization method to one-by-one derive out the set of orthonormalized basis functions {Â ^μ(x)}, thus To estimate the error of S², we approximately estimate the error of s_μ, and combine them by

Here, Err( ) denotes the statistical error of the quantity in brackets. For each orthonormalized basis function, Â ^μ, if Err(s_μ) is larger than | s_μ| , i.e., the relative error of s_μ is larger than 1.0, this basis function will not be selected in expansion. One-by-one selection will give out the final basis function set and the estimated error of the completeness measure S².

The scheme introduced here provides us with the knowledge about how many basis functions are enough in the RED. After orthonormalization, the contribution of the basis functions to S² clearly reflect their relative importance in trajectory reweighting. The basis functions that contribute more are more important for portraying the difference between P_init+(x) and P_eq(x), and the ones that contribute quite little can be linearly expanded by other existing basis functions. Furthermore, various kinds of basis functions can be safely added to investigate the related phenomena. For illustration, we will show that the basis functions of dihedral angles and those of distances between heavy atoms can be consistently treated in calculation, and the effect of a solvent on the conformation of biological macromolecules can be studied by additional basis functions characterising the solute– solvent relationship.

2.2.3. The effects of free parameter t̃

There is only one free parameter in the RED, for collecting the init+ samples in Eq. (5). Large may bring systematical errors, and small will reduce the sample number in init+ , leading to a larger statistical uncertainty. is usually applied in the RED with satisfactory results obtained. We also point out that, for the purpose of discovering the states in conformational space, the results are not sensitive to

The variable t̃ actually reflects the timescale of the process that we are interested in. If it were set to the single trajectory length τ , then the trajectory weights solved by Eq. (6) will be equal to each other. In this case, we are focusing on the processes with a timescale much longer than τ , thus all the trajectories are still in the relaxation to local equilibrium, and should have the same weight. Usually, if a system can be divided into several metastable states under the focused timescale, t̃ should be selected comparable to or smaller than the smallest life time of these states. Given that the kinetic states of the system are frequently unknown, t̃ should be selected relatively short. In practice, it is possible to use shorter and shorter t̃ and check the convergence of the obtained free energy surface. Thus, a proper t̃ may be found as a compromise between correctness and statistical reliability (small fluctuation of trajectory weights). Another practical consideration is that a few percent of τ would be the safe choice for t̃ . This is because the transition events with timescales prominently smaller than τ have reached equilibrium in a single trajectory (thus do not need reweighting), we only need to focus on the ones with time-scales similar to or slightly smaller than τ . We usually choose t̃ /τ in the interval of [0.01, 0.1], and the results are satisfactory. In this paper, we will explicitly show that the variable within the range is not sensitive to our results.

2.3. Improvement of the RED

For decreasing statistical errors in the RED, we can generate a few trajectories from each initial configuration, and specify the same weight to the trajectories from the same initial configuration. Suppose there are n_s initial configurations (denoted as q₁₀, q₂₀, … , q_{n_s0} in the following), and m₁, m₂, … , m_{n_s} trajectories initiated respectively from these initial configurations (the total number of trajectories is still p), we have

where w_k is the weight for all the m_k trajectories initiated from q_k0, T̃ is a p × n_s matrix with zero and unity elements. T̃ _jk is unity only if the j-th trajectory starts from the k-th initial configuration. Since the number of equations is more than the number of trajectory weights, we estimate Eq. (11) by minimizing I = | GT̃ w | ². The vector of trajectory weights, w, corresponds to the ground state of matrix H ≡ T̃ ^TG^TGT̃ . For the newly defined H matrix, the number of the degenerated ground states still corresponds to the number of disconnected metastable states under the timescale of the individual trajectory. The existence of a small fraction of transition trajectories can slightly split the ground state of H. We name this method SI1.

Alternatively, we can merge the trajectories with the same initial configuration to a single one, and perform the standard RED analysis. Concretely speaking, we can calculate the average of a certain variable in the initial-segment sample set, init+ , by,

where means the average of A over the initial segments of all the trajectories with the initial configuration q_i0. By doing so, more samples are included in calculation, and effectively, both the initial region and the whole area in the conformational space which the trajectory explores become statistically more reliable. This method is named SI2. We will illustrate the application of SI1 and SI2 for depressing the large statistical noise in the entropy-dominated systems where the differences between trajectories within a metastable state might be in comparison with that in different states.

3. Results

3.1. Toy models

We first illustrate the RED in two simple toy models. The first one is a two-dimension (2D) system with four identical potential wells, denoted as Two-Dimension-Quad-Well (TDQW), see Fig. 1(a). The second model is a 2D system with a Mexican– Hat potential (MHAT) which has two states, see Fig. 1(b). The outer state is wide (larger entropy) and the relaxation time inside the state is comparable to the transition time between the state and the inner state. Simulation details and the applied basis functions are shown in Appendix A.

	Figure Option View Download New Window
	Fig. 1. The 2D potentials. (a) The TDQW potential shown as contour map. The darker regions correspond to potential wells, and the lighter regions correspond to potential barriers. (b) The MHAT potential projected along one dimension. Rotating around the dashed line reproduces the 2D potential.

3.1.1. Multiple-state transition network

In TDQW system, the eigenvalues of the matrix H in Eq. (6) are plotted in Fig. 2(a). As expected, there are four zero eigenvalues at temperature 0.3, implying the four disconnected states. At higher temperatures, the second to fourth eigenvalues deviate from zero to larger values due to the increasing fraction of the interstate transition trajectories. More illustrative perspective can be obtained from Figs. 2(b), 2(c), and 2(d), where the trajectories are mapped into the three-dimensional space by projecting the trajectory-mapped vectors, {Gⁱ} defined in Eq. (7), to the second, third, and fourth eigenvectors of H, denoted as where u_α is the α -th eigenvector of H. At temperature 0.3, the trajectories agglomerate into four disconnected clusters that locate respectively at the four vertexes of a tetrahedron. Inside each cluster are the trajectories isolated in the same potential well. At temperature 0.85, part of the trajectories can climb over the potential barriers to the other states. As a result, the points corresponding to the transition trajectories are found to gather along the straight lines that shoot out of the vertexes, and the ones corresponding to the non-transition trajectories are still locating on the vertexes. In Fig. 2(c), each vertex only connects to two neighboring vertexes, correctly reflecting the topology of TDQW system. At temperature 1.5, it is relatively easier for the trajectories to transfer between different states. The tetrahedron structure is still preserved, and its interior is filled by multiple-transition trajectories which transit among more than two states.

	Figure Option View Download New Window
	Fig. 2. The eigenvalues and projection maps for TDQW potential under different temperatures. (a) The first ten eigenvalues of H matrix calculated at different temperatures. For the three temperatures of 0.3 (b), 0.85 (c), and 1.50 (d), the projection of {Gⁱ} to the second, third, and fourth eigenvectors of H are also plotted.

3.1.2. Entropy-dominated states

In MHAT system, the 70 smallest eigenvalues of H are shown in Fig. 3(a). Different from TDQW, the eigenvalues of H gradually increase from zero at the low temperature T = 0.05, and lots of small eigenvalues exist. The number of the eigenvalues that obviously deviate from 1.0 is always more than the number of physical states in MHAT system, till the temperature is high enough that there is only one zero-eigenvalue apparently different from 1.0. This behavior simply reflects the entropic nature of the outer potential well. At low temperature, a single trajectory cannot reach local equilibrium in the state, consequently a few small eigenvalues exist, as there are many sub-states in the well. At higher temperatures, the trajectories that start from the outer potential well usually climb over the potential barrier before reaching local equilibrium within the state, thus the inter-trajectory difference will not be extinguished, until the two states in MHAT kinetically become one state at high enough temperature. The projections of {Gⁱ} onto the eigenvectors of H are shown in Figs. 3(b), (3c), and 3(d). At both temperatures 0.05 and 0.30, the trajectories are divided into two groups. In one group, the trajectories dispersively locate on the same plane which is perpendicular to the axis. These trajectories are found to be isolated in the outer potential well. In the other group, the trajectories concentrate to nearly one point. They are found to be isolated in the inner potential well. These two groups of trajectories are clearly differentiated by the projections along the second eigenvector of H, i.e., the one indicating the most important contribution to ergodicity broken. At T = 0.05, each trajectory only explores a very limited region in the outer state, leading to the distribution on the perimeter of a circle. At T = 0.30, since each trajectory can explore a larger area, the circle has been filled with points. At T = 0.85, the trajectories are found to locate in a cone. The points in the bottom surface and the vertex of the cone correspond to the non-transition trajectories inside the outer and inner potential wells, respectively, and the other points are transition trajectories between the two states. Once again, the trajectory-projected space clearly provides the topologies of this system.

	Figure Option View Download New Window
	Fig. 3. The eigenvalues and projection maps for MHAT potential under different temperatures. (a) The first seventy eigenvalues of H matrix calculated at different temperatures. For the three temperatures of 0.05 (b), 0.30 (c), and 0.85 (d), the projection of {Gⁱ} to the second, third, and fourth eigenvectors of H are also plotted.

3.1.3. Discussion

In the two 2D models, triangle functions in the 2D space are applied as basis functions, and the details are presented in Appendix A. We did the S² analysis mentioned in Subsection 2.2 at T = 0.85. For both systems, S² initially grows fast with increasing number of basis functions, then the growth slows down and shows a saturation behavior of S² as expected [see Figs. 4(a) and 4(b)]. Once S² has approximately reached the platform region, more basis functions will not change the calculation results anymore (data not shown). Thus, for TDQW and MHAT systems, about 24 basis functions are sufficient.

Figure Option
View Download New Window

Fig. 4. The S² curve, effect of t̃ , and illustration of statistics improvement. The behavior of S² with increasing number of basis functions are shown in panel (a) (TDQW) and panel (b) (MHAT). (c) The difference between the reconstructed equilibrium distribution and the theoretical distribution are shown for MHAT potential at T = 0.85. The horizontal axis denotes the method for reproducing the equilibrium distribution. ‘ eq’ means the results without reweighting (i.e., w_i = 1), numbers mean the selected

‘ SI1’ and ‘ SI2’ label the two methods for statistics improvement, respectively. The trajectory weights calculated with

equal to 0.01 are plotted as the inset. In panels (a), (b), and (c), the statistical errors are also shown as error bars. (d) The first 40 eigenvalues of H for MHAT potential at T = 0.05. The ones calculated with normal data set (stars), multiple trajectory data set (one-fifth initial configurations and five trajectories for each initial configuration) with the standard RED (circles), multiple trajectory data set with the SI2 (squares) are shown.

We also study the effect of free parameter t̃ in the RED with the simulation data of MHAT at T = 0.85. In this case, since there is only one zero eigenvalue of H, the whole free energy surface can be reconstructed. We uniformly divide the whole conformational space into small cells, k = 1, … , N_p (N_p is selected as 1600 here, i.e., 40 bins along each dimension), and the distribution probability in each cell is reconstructed with different We define the function as

to measure the difference between the reconstructed distribution and the theoretical distribution. Here is the distribution in the k-th cell which is reproduced by the RED using is the theoretical distribution in the cell. As shown in Fig. 4(c), indeed decreases with shorter in calculation. For in the [0.01, 0.1] interval, the values are almost the same, and are prominently smaller than the value of the non-weighted distribution of the simulation data. In the non-weighted (all w_i is unity) distribution, the major deviation from theoretical distribution resides in the inner potential well of MHAT, where the distribution is too high for the non-weighted data. In the reweighting process, the trajectories that start from the inner potential well are usually specified lower weights [see Fig. 4(c), inset], thus the over occupation in the inner state can be offset. The other variation in trajectory weights further adjust the intrastate distribution for both the inner and the outer states.

3.1.4. Improvement of statistical errors

We test the statistics improvement methods, SI1 and SI2. In here and below, we always randomly select one-fifth of the initial configurations, then five trajectories are spawned from each configuration to test the improvement. At T = 0.85, we apply both SI1 and SI2 to reconstruct the energy surface of MHAT. From Fig. 4(c), it can be seen that both methods correctly reproduce the energy surface with even smaller deviation from the theoretical one, as compared to the standard RED. At T = 0.05, the eigenvalues of the H matrix are calculated with SI2 method. As shown in Fig. 4(d), no matter whether one trajectory or multiple trajectories are generated from one initial configuration, the eigenvalues calculated by the standard RED are almost the same. However, analyzing by the SI2 method significantly alters the structure of eigenvalues. By generating more trajectories from single initial configuration, and combining these trajectories together in analysis, the statistics of the single trajectory are prominently improved, and the difference between trajectories within the outer state are reduced to a large extent. As a result, some of the eigenvalues increase, in particular for the ones deviating but close to 1.0. In other words, the SI2 method has depressed the statistical noise in the spectral analysis of H, leaving the physically relevant states recognized.

3.2. Solvated alanine dipeptide

The simulated system together with the initial configurations of simulation are shown in Fig. 5, where various notations of the atoms, dihedral angles, and free energy wells of this molecule are introduced. More details of simulations are presented in Appendix B.We use two different classes of basis functions to focus on the effects of different physical variables on the observed metastable states.

	Figure Option View Download New Window
	Fig. 5. Illustration of the alanine dipeptide molecule and the initial conformations of the RED simulation. The alanine dipeptide is shown in the left panel, with some dihedral angles and atoms labeled. The initial conformations projected to the ϕ – ψ plane are shown in the right panel. The background is the free energy surface at T = 600 K, with the free energy wells labeled.

3.2.1. Dihedral angles

With thorough investigation, the two dihedral angles ϕ and ψ are the major coordinates to characterize the conformational motion of the alanine dipeptide molecule. As in the previous study, ^[35] we adopted the first-to-fourth order twodimensional trigonometrical functions of ϕ and ψ as the basis functions. The 10 smallest eigenvalues of H are shown in Fig. 6(a). The smallest zero eigenvalue is intrinsic due to the construction of H. The second eigenvalue corresponds to the kinetic separation between the region with ϕ > 0 (containing and α _L) and the region with ϕ < 0 (containing and α _R). Consequently, by projecting the trajectory-mapped vectors {Gⁱ} to the second eigenvector of H, i.e., through , the trajectories can be clearly classified into the abovementioned two regions. The third eigenvalue of H corresponds to the partially kinetic connection between the and α _R states. Only a fraction of the trajectories in can climb over the free energy barriers to the α _R state, and vice versa. Thus, we may use the to classify the trajectories that are confined in the super-state involving and α _R into three groups. Since the two potential wells are well defined with their internal relaxation time smaller than the transition timescale, we can manipulate the trajectories to predict the transition times and locate the transition state ensemble.^[35] The fourth smallest eigenvalue of H is also smaller than 1.0. Although it no longer indicates a new state in the conformational space and should be taken as the reflection of statistical difference between trajectories, we recently find that, by observing two trajectories that transiently walk into the α _L region can be picked out. One of the two trajectories initially located in the α _L region, then quickly moved into , another just occasionally visited α _L. Actually, at the temperature of 300 K, the α _L region seems not very stable with the applied force field. The other eigenvalues of H are very close to 1.0 without observable meaningful information.

Figure Option
View Download New Window

Fig. 6. RED analysis with the basis functions of dihedral angles and inter-atomic distances. Shown are the results for solvated alanine dipeptide, which is simulated at T = 300 K with the modified force field. (a) The eigenvalues of H matrix calculated with different sets of basis functions, including the full set of inter-atomic distances (squares), 12 inter-atomic distances (upward triangles), first-to-fourth order trigonometrical functions of ϕ and ψ (circles), and first-to-second order trigonometrical functions of ϕ and ψ (stars). (b) The behavior of S² is shown either with the inter-atomic distances (distance-angle) at first, or with the trigonometrical functions of ϕ and ψ at first (angle-distance). The

values calculated with truncated trajectories (80 percent length) versus the ones calculated with full trajectories are shown. The results are obtained either with the first-to-second order trigonometrical functions of ϕ and ψ (c), or with the inter-atomic distances (d). The arrows label the intersection points between the dashed and dotted lines.

3.2.2. Distances between heavy atoms

We also test the effects of distances between atoms in the backbone of dipeptide by using them as basis functions. We compare two aspects, i.e., the eigenvalues of H and the manipulations on for extracting the kinetic information between and α _R. There are 10 non-hydrogen heavy atoms in alanine dipeptide. Except for the directly bonded atoms, 36 pairs of atoms exist, corresponding to 36 interatomic distances. Together with the two distances for the possible intramolecular hydrogen bonds (see Table 1), we select the 38 distances as basis functions. The resulting eigenvalues of H are shown in Fig. 6(a). While the three smallest eigenvalues are almost the same as the results obtained with the functions of dihedral angles, the fourth eigenvalue is prominently lifted to a value closer to 1.0. However, the discrimination of the two α _L-visited trajectories is still possible. We plot the values calculated with the trajectories truncated to 80-percent length (i.e., the 0.2τ ending segment of the trajectories are discarded in analysis) versus the ones calculated with full trajectories^[35] in Fig. 6(d). For comparison, the same figure generated with the first-to-second order two-dimensional trigonometrical functions of torsion angles is shown in Fig. 6(c). The two figures have quite similar structure. The points corresponding to the non-transition trajectories in α _R and are plotted as upward and downward (black) triangles, respectively. The points corresponding to the transition trajectories that start from α _R and are plotted as (red) squares and (blue) circles, respectively. The points along the inclined dashed lines correspond to the trajectories with only one time transition, and the points along the horizontal dotted lines correspond to the trajectories that do not happen to transition within the initial 0.8τ . Thus the intersection between the dotted and dashed lines corresponds to a single-transition trajectory with its transition happening exactly at 0.8τ . These points can be used to predict the times the transitions happen in the single-transition trajectories, consequently the transition state ensemble can be constructed. All the other points correspond to the trajectories with an even number of transitions which all happen within 0.8 τ , or with early transitions occurring within t̃ , or the multiple transition trajectories with transition happening both within 0.8 τ and after 0.8 τ . More details about detecting transition time thus transition conformation ensemble from the truncated-trajectory plots are presented in Ref. [35]. The similarity between Fig. 6(c) and Fig. 6(d) suggests both the torsion angles and the inter-atomic distances can be applied as basis functions.

Table 1. Basis function list.

3.2.3. Solvent molecules

For solvated biological macromolecules, the conformational change is sometimes correlated to the surrounding solvent molecules. In the RED, since the solvent-related properties can be selected as basis functions, it becomes possible to systematically one-by-one check and filter out the important collective variables. We study the dipeptide system with 91 basis functions to search for the solvent-related effects. These basis functions are classified into three groups, which respectively reflect the internal freedoms of alanine dipeptide (denoted as P), the solute-solvent relation (denoted as PW), and the structure of bulk solvent (denoted as W). The details of these basis functions are listed in Table 1. The eigenvalues of H are shown in Fig. 7(a). As can be seen, at T = 300 K, the inclusion of the solvent-related basis functions does not affect the spectral property of H. Particularly, the eigenvalues that are prominently smaller than 1.0 do not change, reflecting the unchanged physical division of the conformational space. Only a few eigenvalues pretty close to 1.0 are slightly lowered due to the added basis functions. The above results suggest that, for the current system, the solvent-related properties have almost reached equilibrium within the timescale of a single trajectory. Consequently, these properties contribute only a little to the inter-trajectory difference, and are unlikely to be strongly coupled with the dynamic process of the solute molecule under the simulated timescale.

Figure Option
View Download New Window

Fig. 7. The spectral properties of H calculated with the basis functions including solute– solvent relation. Shown are the results for the system of solvated alanine dipeptide. The eigenvalues of H calculated with different groups of basis functions are shown in panel (a) (300-K simulation with modified force field) and panel (b) (150-K simulation with standard forcefield). The functions used in calculation are labeled in the legend. The symbols ‘ P’ , ‘ PW’ , and ‘ W’ mean the three sets of basis functions included in calculation, the internal degrees of freedom of dipeptide, the solvent– solute interaction, and the bulk water; see the main text for more details. The behavior of S² with increasing number of basis functions are shown in panel (c) (300 K with modified force field) and panel (d) (150 K with standard force field). The insets of panels (c) and (d) show the part of S² curve that is related to the functions of solvents.

3.2.4. Low temperature

We also apply the RED to study the low-temperature (T = 150 K) properties of the dipeptide. We note that, at this temperature, the physical phenomena are quite complicated; the force field and the simulation procedure that we adopt may not reflect the physical reality. Here, we are only meant to illustrate the application of the RED in the extreme condition, instead of providing physical insight into the system.

The eigenvalues of the H matrix are calculated with the 91 basis functions listed in Table 1, and are plotted in Fig. 7(b). The solvent-related quantities do affect the eigenvalues of H now, leading to even more small eigenvalues. Correspondingly, the S², as shown in Fig. 7(d), almost always increases with the number of basis functions. At such a low temperature, the free energy surface becomes quite rough, and the diffusive dynamics^[39] in the conformational space slows down seriously. As a result, the simulation trajectories may only explore a small region in the conformational space. Some degrees of freedom, like the solvent motion around the solute molecule and the conformational motion perpendicular to the ϕ – ψ plane, are no longer equilibrated in a single trajectory. Thus, the inter-trajectory difference becomes larger, more states are detected automatically, and more basis functions are required to expand the rugged distribution of the samples, leading to the abnormal behavior of the eigenvalues of H and S².

Manipulating the eigenvectors of H can provide further information of the simulation data. The projections of {Gⁱ} to the second and the third eigenvectors of H pick out two completely separated states, and the super-state containing α _R, and [Fig. 8(a), at the current low temperature, the ϕ and ψ angles may not be enough to characterize the conformational motion, thus the notations, α _R, and , are used mainly for the purpose of illustration]. With the aid of the histogram of [see Fig. 8(a), inset], we can roughly classify the trajectories inside the super-state into three groups, i.e., the trajectories in the leftmost two bins, in the rightmost two bins and the remaining ones. The conformational distribution of trajectories in different groups are shown in Fig. 8(c). The two groups of the leftmost or rightmost bins both possess single peak distribution, either in region or α _R region, just like the non-transition trajectories identified in the simulations at higher temperature. In contrast, the trajectories in the remaining group have a significant occupation in the intermediate region between and α _R. Obviously, these trajectories are trapped at the intermediate region due to slow dynamics, and are different from the transition trajectories identified with the similar procedure at 300 K. We plot the calculated with 80 percent trajectory length with the ones with full trajectory length in Fig. 8(c). The behaviors presented in Fig. 6 no longer exist. In terms of kinetics, the transition between the two regions, α _R and , can neither be approximated as Markovian process nor described with the first-order transition process any more. Consequently, with the current simulation data, it is only possible to reconstruct the intrastate equilibrium distribution in the local regions of conformational space, or to study the more localized kinetic behavior.

Figure Option
View Download New Window

Fig. 8. Results for the low-temperature simulation of alanine dipeptide. (a) The projection of trajectories into the space of

and

. The inset is the histogram of

values. The different states, into which the trajectories are roughly classified, are labeled. (b) The distributions along the ψ angle in the groups are shown in panel (a). Shown are the distributions for the trajectories in the leftmost two bins (downward triangles), the rightmost two bins (upward triangles) and the other trajectories (circles). (c) The

values calculated with truncated trajectories (80 percent length) versus the ones calculated with full trajectories. (d) The first 70 eigenvalues of H calculated with normal data set (blue stars), multiple trajectory data set (one-fifth initial configurations and five trajectories for each initial configuration) with the standard RED (red circles), multiple trajectory data set with SI2 method (green squares) are shown.

Due to the numerous local minimums at the low temperature, the α _R and regions are just like two entropicdominated states. We also test the statistics improvement method in this system. With one-fifth initial conformations and five trajectories from each initial configuration, the inter-trajectory difference, thus the statistical noise, is indeed suppressed as shown in Fig. 8(d).

As a summary, in the dipeptide system at T = 300 K with modified potential, we find the redundant basis functions (including functions of both the dihedral angles and the inter-atomic distances) are consistently treated in the RED, and the inclusion of the solvent-related basis functions gives out the expected results. At pretty low temperature of 150 K, the roughness of the free energy surface is reflected in the results of RED. The conformational space can still be roughly divided into parts, however, the kinetics becomes too complicated to be unambiguously characterized.

4. Conclusion

As a systematic method for exploring the conformational space, the RED does not require much a priori knowledge and approximation, thus its results can honestly reflect the thermodynamic and kinetic information of the system. For biological macromolecules, the usually expected hierarchical structure can be extracted out step-by-step. For these complex systems, the structure of the conformational space may be much more complicated, these examples studied here become quite relevant for physically interpreting the results. Along with the applications, we design the S² quantity for checking the completeness of the selected basis functions, and the schemes for statistics improvement. These tactics are also proved to be helpful for portraying the conformational space.

Acknowledgments

Zhou Xin is grateful for the financial support of the Hundred of Talents Program in the Chinese Academy of Sciences. Gong Lin-Chen and Zhou Xin are also grateful for the support of the Max Planck Society and the Korea Ministry of Education, Science and Technology; part of the work was done while they worked at the Asia Pacific Center for Theoretical Physics, Korea under this support.

Reference

View Option

1	Torrie G M and Valleau J P 1977 J. Comput. Phys. 23 187 DOI:10.1016/0021-9991(77)90121-8 [Cited within:1] [JCR: 2.138]
2	Bartels C and Karplus M 1998 J. Phys. Chem. B 102 865 DOI:10.1021/jp972280j [Cited within:1] [JCR: 3.607]
3	Maragakis P, van der Vaart A and Karplus M 2009 J. Phys. Chem. B 113 4664 DOI:10.1021/jp808381s [Cited within:1] [JCR: 3.607]
4	Frantz D D, Freeman D L and Doll J D 1990 J. Chem. Phys. 93 2769 DOI:10.1063/1.458863 [Cited within:1] [JCR: 1.578]
5	Berg B A and Neuhaus T 1991 Phys. Lett. B 267 249 DOI:10.1016/0370-2693(91)91256-U [Cited within:1] [JCR: 0.479]
6	Marinari E and Parisi G 1992 Europhys. Lett. 19 451 DOI:10.1209/0295-5075/19/6/002 [Cited within:1] [JCR: 2.26]
7	Lyubartsev A P, Martsinovski A A, Shevkunov S V and Vorontsov-Velyaminov P N 1992 J. Chem. Phys. 96 1776 DOI:10.1063/1.462133 [Cited within:1] [JCR: 1.578]
8	Grubmüller H 1995 Phys. Rev. E 52 2893 DOI:10.1103/PhysRevE.52.2893 [Cited within:1] [JCR: 2.313]
9	Voter A F 1997 Phys. Rev. Lett. 78 3908 DOI:10.1103/PhysRevLett.78.3908 [Cited within:1] [JCR: 7.943]
10	Miron R A and Fichthorn K A 2004 Phys. Rev. Lett. 93 128301 DOI:10.1103/PhysRevLett.93.128301 [Cited within:1] [JCR: 7.943]
11	Zhou X, Jiang Y, Kremer K, Ziock H and Rasmussen S 2006 Phys. Rev. E 74 035701 DOI:10.1103/PhysRevE.74.035701 [Cited within:1] [JCR: 2.313]
12	Sugita Y and Okamoto Y 2000 Chem. Phys. Lett. 329 261 DOI:10.1016/S0009-2614(00)00999-4 [Cited within:1] [JCR: 2.145]
13	Xu S, Zhou X and Ou-Yang Z C 2012 Commun. Comput. Phys. 12 1293 DOI:10.4208/cicp.120811.200112a [Cited within:1] [JCR: 3.078]
14	Wang F G and Land au D P 2001 Phys. Rev. Lett. 86 2050 DOI:10.1103/PhysRevLett.86.2050 [Cited within:1] [JCR: 7.943]
15	Yan Q and Pablo J J 2003 Phys. Rev. Lett. 90 035701 DOI:10.1103/PhysRevLett.90.035701 [Cited within:1] [JCR: 7.943]
16	Kim J, Straub J E and Keyes T 2006 Phys. Rev. Lett. 97 050601 DOI:10.1103/PhysRevLett.97.050601 [Cited within:1] [JCR: 7.943]
17	Laio A and Parrinello M 2002 Proc. Natl. Acad. Sci. USA 99 12562 DOI:10.1073/pnas.202427399 [Cited within:1]
18	Mitsutake A and Okamoto Y 2009 J. Chem. Phys. 130 214105 DOI:10.1063/1.3127783 [Cited within:1] [JCR: 1.578]
19	Shaw D E, Maragakis P, Lindorff-Larsen K, Piana S, Dror R O, Eastwood M P, Bank J A, Jumper J M, Salmon J K, Shan Y B and Wriggers W 2010 Science 330 341 DOI:10.1126/science.1187409 [Cited within:1]
20	Wu Y L, Xu X H, Yang X J, Zou S and Ren X G 2014 Chin. Phys. B 23 028903 DOI:10.1088/1674-1056/23/2/028903 [Cited within:1] [JCR: 1.148] [CJCR: 1.2429]
21	Voter A F 1998 Phys. Rev. B 57 13985 DOI:10.1103/PhysRevB.57.R13985 [Cited within:1]
22	Shirts M R and Pand e V S 2001 Phys. Rev. Lett. 86 4983 DOI:10.1103/PhysRevLett.86.4983 [Cited within:1] [JCR: 7.943]
23	Huang X H, Bowman G R and Pand e V S 2008 J. Chem. Phys. 128 205106 DOI:10.1063/1.2908251 [Cited within:1] [JCR: 1.578]
24	Noé F and Fischer S 2008 Curr. Opin. Struct. Biol. 18 154 DOI:10.1016/j.sbi.2008.01.008 [Cited within:2]
25	Becker O M and Karplus M 1997 J. Chem. Phys. 106 1495 DOI:10.1063/1.473299 [Cited within:1] [JCR: 1.578]
26	Wales D J 2001 Science 293 2067 DOI:10.1126/science.1062565 [Cited within:1]
27	Krivov S V and Karplus M 2004 Proc. Natl. Acad. Sci. USA 101 14766 DOI:10.1073/pnas.0406234101 [Cited within:1]
28	Rao F and Caflisch A 2004 J. Mol. Biol. 342 299 DOI:10.1016/j.jmb.2004.06.063 [Cited within:1] [JCR: 3.905]
29	Gfeller D, Rios P D L, Caflish A and Rao F 2007 Proc. Natl. Acad. Sci. USA 104 1817 DOI:10.1073/pnas.0608099104 [Cited within:1]
30	Noé F, Horenko I, Schütte C and Smith J C 2007 J. Chem. Phys. 126 155102 DOI:10.1063/1.2714539 [Cited within:1] [JCR: 1.578]
31	Chodera J D, Singhal N, Pand e V S, Dill K A and Swope W C 2007 J. Chem. Phys. 126 155101 DOI:10.1063/1.2714538 [Cited within:1] [JCR: 1.578]
32	Prada-Gracia D, Gomez-Gardenes J, Echenique P and Falo F 2009 PLos Comput. Biol. 5e1000415 DOI:10.1371/journal.pcbi.1000415 [Cited within:1] [JCR: 4.867]
33	Bowman G R, Meng L and Huang X H 2013 J. Chem. Phys. 139 121905 DOI:10.1063/1.4812768 [Cited within:1] [JCR: 1.578]
34	Weber J K, Jack R L and Pand e V S 2013 J. Am. Chem. Soc. 135 5501 DOI:10.1021/ja4002663 [Cited within:2] [JCR: 10.677]
35	Gong L C and Zhou X 2009 Phys. Rev. E 80 026707 DOI:10.1103/PhysRevE.80.026707 [Cited within:7] [JCR: 2.313]
36	Gong L C and Zhou X 2010 J. Phys. Chem. B 114 10266 DOI:10.1021/jp100737g [Cited within:2] [JCR: 3.607]
37	Lu S J and Zhou X 2015 Commun. Theor. Phys. 63 10 DOI:10.1088/0253-6102/63/1/03 [Cited within:1]
38	Drozdov A N, Grossfield A and Pappu R V 2004 J. Am. Chem. Soc. 126 2574 DOI:10.1021/ja039051x [Cited within:1] [JCR: 10.677]
39	Zwanzig R 1988 Proc. Natl. Acad. Sci. USA 85 2029 DOI:10.1073/pnas.85.7.2029 [Cited within:1]
40	Phillips J Cet al. 2005 J. Comput. Chem. 26 1781 DOI:10.1002/(ISSN)1096-987X [Cited within:1] [JCR: 3.835]

1977

2.138

0.0

... Due to those, lots of various advanced simulation techniques, such as umbrella sampling,^{[1#cod#x2013 ...}

1998

3.607

0.0

2009

3.607

0.0