Exploring the relationship between fractal features and bacterial essential genes
Yu Yong-Ming1, Yang Li-Cai1, †, , Zhou Qian2, 3, Zhao Lu-Lu1, Liu Zhi-Ping1
Department of Biomedical Engineering, Shandong University, Jinan 250061, China
Province-Ministry Joint Key Laboratory of Electromagnetic Field and Electrical Apparatus Reliability, Hebei University of Technology, Tianjin 300130, China
Department of Biomedical Engineering, Hebei University of Technology, Tianjin 300130, China

 

† Corresponding author. E-mail: yanglc@sdu.edu.cn

Project supported by the Shandong Provincial Natural Science Foundation, China (Grant No. ZR2014FM022).

Abstract
Abstract

Essential genes are indispensable for the survival of an organism in optimal conditions. Rapid and accurate identifications of new essential genes are of great theoretical and practical significance. Exploring features with predictive power is fundamental for this. Here, we calculate six fractal features from primary gene and protein sequences and then explore their relationship with gene essentiality by statistical analysis and machine learning-based methods. The models are applied to all the currently available identified genes in 27 bacteria from the database of essential genes (DEG). It is found that the fractal features of essential genes generally differ from those of non-essential genes. The fractal features are used to ascertain the parameters of two machine learning classifiers: Naïve Bayes and Random Forest. The area under the curve (AUC) of both classifiers show that each fractal feature is satisfactorily discriminative between essential genes and non-essential genes individually. And, although significant correlations exist among fractal features, gene essentiality can also be reliably predicted by various combinations of them. Thus, the fractal features analyzed in our study can be used not only to construct a good essentiality classifier alone, but also to be significant contributors for computational tools identifying essential genes.

1. Introduction

Essential genes are a minimal set of genes that are considered to be indispensable for the survival of an organism under most favorable conditions.[1] Identifying and predicting gene essentiality for the survival of an organism have both theoretical and practical significance in genome biology and medicine. Because essential genes are indispensable for cell viability, their protein products are attractive drug targets for new antimicrobials. A designed drug that disrupts the function of such a protein will kill the pathogens.[2,3] More broadly, essential genes provide an insight into understanding the origin of life and evolutionary relationships among species, and they are the foundation for revealing the smallest requirements for sustaining a life form, and it is critical for synthetic biology.[4,5]

Today, systematic genome-wide interrogations of essential genes have been conducted by ribonucleic acid (RNA) interference,[6] gene knockouts,[7] and transposon mutagenesis.[8] However, there are several limitations to these experimental methods: they need massive investments in time and resource; based on large-scale experiment, the speed for them to identify essential gene falls far behind that for genome sequence; moreover, the identification results may be flexible, due to different experimental conditions and criteria.[9,10] With a fast-increasing availability of genome sequence and other functional genomics data, a highly efficient computational method becomes an appealing option for circumventing those large-scale experimental methods.

Substantial efforts have been made to explore the features associated with gene essentiality, with the aim of helping design feature-based computational methods. Sequence-based features which quantitatively measure the conservative property of genes have been considered as the most powerful features for gene essentiality identification. For instance, phyletic retention,[11] domain enrichment,[12] and number of paralogs[13] which were introduced to weight the evolutionary conservation of an individual gene or local gene sequence, have been found to be very predictive of essentiality. On the other hand, some studies exist that point out that in bacteria, a higher expression level is connected with slower evolutionary speed of gene sequence.[14] Codon usage bias,[15] leading strand,[16] and fluctuation of gene expression,[17] which is generally used to measure the level of gene expression, have been clearly shown to be associated with essentiality. Moreover, Lin et al., also have suggested that the essential genes with important functions such as translation, transcription, replication, etc. are more crucial and prevailing at leading strand than those non-essential genes from other function subcategories.[18] Hawoong et al. have illustrated that in contrast to essential genes, the fluctuation of non-essential genes expression tend to be more wide.[19] Besides, other sequence-based features such as subcellular location,[20] ORF length,[21] and upstream size[22] each can be used as an index to measure the essentiality of genes as well.

In fact, genes and their protein products in cell are not isolated. They interact with each other and constitute a biological network to accomplish the biological functions of cells. While sequence-base features above extract the information from individual genes, topology-based feature which is introduced to value the properties of protein-protein interaction (PPI) networks is mainly of interest in the information about interacting protein pairs. Based on that, a group of studies have shown that topology-based features such as degree-like features, between-like features, closeness-like features and subgraph-like features, etc. are powerful and prevailing to measure the relationship between essentiality and topological properties of protein in PPI networks.[23] In addition to the features detailed above, some studies also found respectively the biological process annotation,[24] hydrophobicity of protein,[25] and synthetic accessibility,[26] etc. to be important contributors for predicting genes essentiality. Based on the integration of the features detailed above, many computational techniques have been introduced to predict essential genes.[11,19,24,27] However, there are some practical limitations to these computational methods: they are restricted to few specific model organisms, lack easily available powerful features, and perform poorly in classification accuracy. Consequently, exploring features associated with essentiality is still crucial for designing computational tools.

In recent years, some interest has been shown in applying the methods of fractal theory for analyzing gene sequences and protein sequences.[2830] Our previous studies have also found that gene essentiality tends to be more chaotic and less structured and the fractal features, Hurst exponents and information dimension can quantitatively assess those regular complexity.[31,32] Here, while individually assessing the connection to essentiality, we also focus on measuring the relationship between those fractal features and essentiality at integrative context. All the currently available identified essential genes and non-essential genes from the database of essential genes (DEG) are analyzed. Subsequently, statistic checkout analysis is carried out to investigate the correlation and difference in fractal feature between essential genes and non-essential genes. By combining fractal features in different categories, two machine learning algorithms are introduced to calculate the average area under the curve (AUC) to quantitatively weight the predictive utility of fractal features.

2. Material and methods
2.1. Data resource

DEG (http://www.essentialgene.org/) was established in 2004. With the continuous update, DEG can provide all currently available identified essential and non-essential genes and proteins of bacteria archaea and eukaryotes. A total of 27 bacteria (30 records) with their essential proteins and non-essential proteins are collected from DEG 10. The information about analysis objects is listed in Table 1.

Table 1.

Information about the analysis objects. In the table, the superscript “*” indicate that these objects are protein data obtained from the same strain of the same bacterium, identified under different conditions or by different methods. the superscript “Δ” denotes that these objects are protein data obtained from the same bacterium, but from different strains, identified under different conditions or by different methods.

.
2.2. Topological entropy of gene sequences

Topological entropy is a measure of complex regulation of a gene sequence. For a given gene sequence x(s), N corresponds to the length of x(s) and n is defined as a unique integer by the following equation:

Based on deferent values of n, the complexity function Cx is defined as

where Cx represents the number of different n-length sub-words (overlaps allowed) that appear in x(s).

Then for x4n+n−1 the first 4n + n − 1 letters of x, the definition of topological entropy of the finite sequence is

where Htop(x) is the topological entropy of x(s).

2.3. Chaos game representation of gene sequences

Chaos game representation (CGR) is a scale-independent representation method for genomic sequences proposed by Jeffrey in 1990.[33] In the CGR representation of a gene sequence x(s), the four bases A, T, C, and G correspond to the four vertices of a square [0,1] × [0,1]. The first point is plotted halfway between the center of the square and the vertex corresponding to the first base of the sequence x(s), and the i-th point of the plot is placed halfway between the (i − 1)-th point and the vertex corresponding to the i-th base. In this way, the CGR points are generated by an iterated function system defined by the following equation:

where Pi is the coordinates of one point in the CGR square, corresponds to the i-th element of transformed sequences x(s), and P0 = (0, 0); Si corresponds to the vertex coordinates of this transformed element.

2.4. Information dimension of gene CGR representation

Information dimension is an index for characterizing fractal patterns or sets by quantifying their complexity as a ratio of the change in detail to the change in scale. We divide the CGR pattern into ɛ2 square grids of equal size, and then the side length of each grid is 1/ɛ. The numbers of non-empty grids are denoted by Z(ɛ). Dividing the number of points falling in the i-th grid by the total point number in the CGR square yields a probability Pi for the i-th grid. Information function and information dimension for the CGR are respectively defined as

The information function I (ɛ) in a range of log (1/ɛ) has a scaling region, where ɛ is an integer from 2 to 10. The information dimension DI can be obtained from the slope in the scaling region.

2.5. Hurst exponent of gene time series

The Hurst exponent is the measure of the smoothness of fractal time series based on the asymptotic behavior of the rescaled range of the process. Here, we use the method proposed by Milan Randic to convert the CGR gene map into a time series and then, estimate the Hurst exponent of the times series of gene sequence based on rescaled range (R/S) analysis method.[34] It involved the following basic steps.

For a given gene sequence x(s), F is a transformed times series over a total duration N. For a deterministic integer τ, the cumulative total at each point at different times is defined as

where Fi is the value of the time series at time i, μτ is the mean over the whole data set given by

The range R of Γ is given by

The standard deviation (SD) of the values over the whole data set is given as

The rescaled range is given by R/S. The Hurst exponent is estimated by plotting the values of log(R/S) versus log τ, (4 ≤ τ ≤ 0.5N). The slope of the best fitting line gives the estimate of the Hurst exponent.

For a protein sequence, the detailed HP model[35] is used to divide a protein sequence into four sets: non-polar set, negative polar set, uncharged polar set, and positive polar set. Then, each protein sequence could be transformed into a new sequence composed of only four types of characters, like a gene sequence. The way we compute the fractal features of protein sequences is just the same as that we compute the fractal features of gene sequences. In such a way, gene sequence is represented by information dimension of DNA (ID), Hurst exponent of DNA (HD), topological entropy of DNA (TD), and the feature vector of protein sequence contains the information dimension of protein (IP), Hurst exponent of protein (HP) and topological entropy of protein (TP).

2.6. Training and testing set construction and classifier design

For most bacteria, there are much more non-essential genes than essential genes and they are classifications of highly imbalanced datasets. We solve the class imbalance problem through re-sampling, where we first randomly select semi-essential and non-essential genes as two sets, keep all genes of smaller set, and then from other sets randomly pick up the same number of genes as that of the set with fewer genes. These extracted essential and non-essential genes constitute the training sets and accordingly, all the remaining gene data are part of testing set. Naïve Bayes and Random Forest classifier are used to construct classifier model. For each bacterium, we train our classifier by using selected training set and predict the probability of essentiality for each gene in testing set by using the trained classifiers. On the basis of the true gene label and the probability of essentiality, we calculate the AUC of prediction to evaluate the classification performance. Each time, we repeat the random process 100 times, and then obtain the average AUC and its associated SD to assess the accuracy and robustness of our model, which is appropriate for the class unbalance problem.

3. Results
3.1. Assessing predictive utility for essentiality in individual fractal feature

For all 30 analysis objects, these fractal features of all essential and non-essential genes are calculated and then, Student’s T-test analysis is utilized to explore the difference of each fractal feature between essentiality and non-essentiality. Figure 1 shows the results of the mean and significant level of each fractal feature for all essential and non-essential genes in S. sanguinis. For gene and protein sequences, the average information dimensions and topological entropies of essentialities are higher than those of non-essentialities and the results of Hurst exponents shift towards the opposite direction. There are significant differences in mean of fractal feature between essentiality and non-essentiality. It is suitable for most analysis objects. It suggests that essential genes are less edited and have weaker long-range correlation than nonessential genes.

Fig. 1. Mean values and SDs of fractal features of essentiality and non-essentialirty in S. sanguinis. * refers to significant difference existing at the 5% level. NEG: non-essential genes, EG: essential genes, NEP: non-essential proteins, and EP: essential proteins.

In addition to exploring the difference between two kinds of genes, two classifier models are developed to further access the predictive power of individual fractal feature. Figure 2 shows the average AUC score of essentiality prediction in 27 bacteria based on Naïve Bayes and Random Forest classifier respectively. As shown in Fig. 2, the average AUC scores of six fractal features of all 27 bacteria exceed 0.624 and 0.625 for Naïve Bayes and Random Forest respectively, and some bacteria are above 0.7. It is also found that there are above 75% analysis objects whose prediction performance using information dimension outperforms using Hurst exponent and topological entropy, and nevertheless there is no obvious change between using Hurst exponent and using topological entropy.

Fig. 2. Average AUC scores of essentiality prediction in 27 bacteria using Naïve Bayes (left) and Random Forest (right) classifier respectively. The exact values of this figure are provided in Supplement 1.

On the other hand, the trend exists among all fractal features that the average classification accuracies using 100 balanced dataset from Mycobacterium tuberculosis H37Rv, Vibrio cholera N16961, and Salmonella enterica serovar Typhimurium 14028S are generally higher than those of others, improving 20%. Consequently, we propose that these fractal features are highly powerful annotation to make essentiality prediction and a better predictive performance is obtained by information dimension.

3.2. Assessing predictive utility for essentiality in integrative strategies
3.2.1. Correlation among fractal features

Since these fractal features are confirmed to be predictive to essentiality, we are interested in how many fractal features are independent among these fractal features. The Spearman correlation coefficients between pairs of those fractal features of all 27 bacteria are calculated and shown in Fig. 3. There are weak negative correlations between Hurst exponent and information dimension, and between Hurst exponent and topological entropy. Instead, much stronger positive correlation exists between information dimension and topological entropy. It is suitable for gene sequences and protein sequence. It can also be found that the features of gene sequence weakly relate to those of protein sequence and their average correlation coefficients range from −0.25 to 0.25. These results for 27 bacteria remain stable, which probably provides facilitated access for exploring genetic structure.

Fig. 3. Spearman correlation coefficients between pairs of fractal features of all 27 bacteria. Significant correlations generally exist among these fractals at 5% level. The exact values of this figure are provided in Supplement 2.
3.2.2. Prediction of essential gene base on diverse combination of fractal features

For 27 bacteria, Naïve Bayes and Random Forest classifiers are employed to make prediction of essentiality by integrating fractal features in three strategies: fractal features of gene sequences, fractal features of protein sequences, and all fractal features. The average AUC scores of using Naïve Bayes classifier are shown in Fig. 4 and Random Forest classifier is displayed in Fig. 5. From both figures, we find that the average AUC of all bacteria is above 0.62 in either of features integrating strategies and those of some analysis objects are all around 0.7. Specifically, of all the 30 analysis objects, appropriately 83%, 86%, and 93% obtain average AUC scores above 0.6 for the combination of fractal features of gene sequences, of protein sequences, and all of them, respectively.

Fig. 4. Mean values and SDs of AUC scores of genes essentiality prediction in 27 bacteria based on Naïve Bayes classifier trained by different integrations of fractal features. The average AUC scores of 27 bacteria are above 0.62. There are 15 (55.56%) bacteria obtaining better predictive performance with fractal features of gene sequences than that of protein sequences and 26 (96.26%) bacteria having better predictive performance by using all fractal features.
Fig. 5. Mean values and SDs of AUC scores of genes essentiality prediction in 27 bacteria based on Random Forest classifier trained by different integrations of fractal features. The average AUC scores of 27 bacteria are above 0.64. There are 20 (75.07%) bacteria obtaining better predictive performance with fractal features of gene sequences than that of protein sequences and the better predictive performance is obtained by using all fractal features.

On the other hand, there are 15 (55.56%) and 20 (75.07%) bacteria obtaining better predictive performance with fractal features of gene sequences than that of protein sequences, based on Naïve Bayes and Random Forest classifiers respectively. So, we deduce that gene fractal features maybe generally outperform protein sequences for gene essential prediction.

Additionally, for Naïve Bayes classifier, there are 26 (96.26%) bacteria having better predictive performance of all fractal features than those of two other integrating strategies, where the improvement of average AUC score of those bacteria is 0.034 (5.26%). For Random Forest, there are 27 (100%) bacteria of such a kind, and the average improvement of AUC score is 0.021 (3.87%). It suggests that we can construct a classifier with better predictive performance by integrating features of gene and protein sequence.

3.3. Comparative studies

In recent years, several computational methods have been proposed to predict essential genes. In 2009, Acencio and Lemke constructed a decision tree-based meta-classifier based on network topological features, cellular compartments and biological processes to generate various predictors of essential genes.[24] Similarly, Palaniappan and Mukherjee used 52 input genomics, i.e., gene paralogs, sub-cellular location, Codon W features, etc., to learn the parameters of SVM classifier to predict essential genes.[36] In 2014 Ning et al. built an SVM classifier of gene essentiality by using all 158 features including Codon usage, amino acid usage, Codon W features, etc.[37] Tang et al. sought to measure essentiality by using PPI and gene expression data.[38] Here, we use those methods to make essentiality prediction in E. coli and then compare them with our classifier model through AUC score. The comparative result is shown in Fig. 5. As shown in Fig. 5, the AUC score of our model is larger than those of the methods of Tang, Palaniappan and Acencio, and competitive with that of the approach of Ning. Besides the limitation of predictive performance, those methods need a number of features, some of which are not easily obtainable from primary sequence. The fractal features in this paper are intrinsic features and easily extracted from primary gene and protein sequence by computational methods. They are of powerful prediction and our classifier is satisfactorily discriminative between essential genes and non-essential genes.

Fig. 6. Comparative results between our classifier model and other four methods on predicting essential genes in E. coli by AUC score.
4. Discussion and summary

The ability to consistently identify gene essentiality is of great significance to biological knowledge discovery and drug discovery operation, since it is basic requirement for sustaining organism alive. In this study, we assess the relationship between fractal features and essential genes by statistical analysis and machine learning-based methods, in hope of exploring the actual relationship between essentiality and these features, thus helping design computational tools to predict and screen essential genes from the bacterial genome. All currently available identified essential and non-essential genes and proteins in 27 bacteria from DEG database are studied in this paper. We find that essential genes significantly differ from non-essential genes in their fractal features in most bacteria. Because of the class imbalanced problem, we also design balanced training datasets to train Naïve Bayes and Random Forest classifiers with the aim of obtaining reliable results. The good performances of two classifiers imply that these fractal features are of powerful predictions of genes essentiality, either in integrative context or individually, although there are weak positive correlations among some of them.

Some studies have observed that genes essentiality powerfully relate to regulation complexity where the more targets a regulator has, the more important it is.[39] Our study provides facilitated access to quantitatively assess the regulation complexity of gene and protein sequences by fractal features. It is found that essential genes and proteins show higher regulatory complexity and less edited than nonessential genes and proteins. Essential genes are more evolutionarily conserved than non-essential gens and compared with non-essential proteins, essential proteins tend to enrich some function categories, such as transcription, translation and replication. However, they all have complex structures between deterministic and random, like chaotic attractors.

On the other hand, rapid and accurate identification of new essential gens requires learning classifier systems to integrate a subset of gene features and the work of Deng et al.[40] have shown integrations of both intrinsic and context-dependent features of genomic sequences help significantly improve the robustness and accuracy of prediction system. These fractal features in this study are intrinsic features which do not depend on external conditions. More importantly, they can be easily derived from primary genomic sequences, not relying on experimental environment. An additional advantage of our features over other attributes which only make sense to few specific model organisms is that their predictive utility keep reliable for 27 bacteria, which provides significant assist in the gene essentiality prediction in under-studied organisms.

In summary, through exploring the relationship between fractal feature and genes essentiality in different strategies, we come to the conclusion that these fractal features are effective and predictive quantitative indicators for checking bacterial essential genes by themselves and they are also promising contributors for designing computational tools identifying essential genes, either in combination strategy or individually. More broadly, the result obtained from our study will provide valuable information about synthetic biology and facilitate the annotations of microbial genomes and detection of potential drug target.

Reference
1Winzeler E AShoemaker D DAstromoff Aet al. 1999 Science 285 901
2Furney S JAlb M MLopez-Bigas N 2006 BMC Genomics 7 165
3Clatworthy A EPierson EHung D T 2007 Nat. Chem. Biol. 3 541
4Glass J IHutchison C ASmith H OVenter J C 2009 Mol. Syst. Biol. 5 330
5Gibson D GGlass J ILartigue Cet al. 2010 Science 329 52
6Kamath R SFraser A GDong YPoulin GDurbin RGotta MKanapin ALe Bot NMoreno SSohrmann MWelchman D PZipperlen PAhringer J 2003 Nature 421 231
7Giaever GChu A MNi Let al. 2002 Nature 418 387
8Gallagher L ARamage EJacobst M AKaul RBrittnacher MManoil C 2007 P. Natl. Acad. Sci. USA 104 1009
9Tong XCampbell J WBal AZsiG A BKay K AWanner B LGerdes S YOltvai Z N 2004 Biochem. Bioph. Res. Co. 322 347
10Antonia Molina Henares Mde la Torre JGarcia Salamanca AJesus Molina Henares ACarmen Herrera MRamos J LDuque E 2010 Environ. Microbiol. 12 1468
11Gustafson A MSnitkin E SParker S CDelisi CKasif S 2006 BMC Genomics 7 265
12Jehl M AArnold RRattei T 2011 Nucleic Acids Res. 391 D591
13Saha SHeber S and Others 2006Genet. Mol. Res.5224
14Sharp P M 1991 J. Mol. Evol. 33 23
15Li MZheng RZhang HWang JPan Y 2014 Methods 67 325
16Rocha E PDanchin A 2003 Nat. Genet. 34 377
17P A L CPapp B A ZHurst L D2003Nature421496
18Lin YGao FZhang C T 2010 Biochem. Bioph. Res. Co. 396 472
19Hawoong JZoltan N OAlbert-Laszlo B 2003 ComPlexus 1 19
20Han G SYu Z GVo A 2011 Chin. Phys. 20 100504
21Hwang Y CLin C CChang J YMori HJuan H FHuang H C 2009 Mol. Biosyst. 5 1672
22Wang J XPeng WWu F X 2013 Proteom. Clin. Appl. 7 181
23Ning KNg H KSrihari SLeong H WNesvizhskii A I 2010 BMC Bioinformatics 11 505
24Acencio M LLemke N 2009 BMC Bioinformatics 10 290
25Seringhaus MPaccanaro ABorneman ASnyder MGerstein M 2006 Genome Res. 16 1126
26Wunderlich ZMirny L A 2006 Biophys. 91 2304
27Estrada ERodriguez-Velazquez J A 2005 Phys. Rev. 71 56103
28Yu Z GXiao Q JShi LWu Y JAnh V 2010 Chin. Phys. 19 068701
29Yu Z GAnh VMin G ZLong S C 2002 Chin. Phys. 11 1313
30Zhu S MYu Z GVo A 2011 Chin. Phys. 20 100505
31Zhou QYu Y M 2014 Chaos, Solitons and Fractals 69 209
32Zhou QYu Y M 2014 J. Phys. D: Appl. Phys. 47 465401
33Jeffrey H J 1990 Nucleic Acids Res. 18 2163
34Randic M 2008 Chem. Phys. Lett. 456 84
35Yu Z GAnh VLau K S 2004 J. Theor. Biol. 226 341
36Palaniappan KMukherjee S199210th International Conference on Machine Learning and Applications and Workshops (ICMLA)Honolulu, Hawaii, USA18910.1016/10.1109/ICMLA.2011.114
37Ning L WLin HDing HHuang JRao NGuo F B 2014 Genet. Mol. Res 13 4564
38Tang XWang JZhong JPan Y 2014 IEEEACM Trans. Comput. Biol. Bioinform. 11 407
39Yu H YGreenbaum DLu H XZhu X WGerstein M 2004 Trends Genet. 20 227
40Deng JDeng LSu SZhang MLin XWei LMinai A AHassett D JLu L J 2011 Nucleic Acids Res. 39 795