Chinese Physics B, 2009, Vol. 18, Issue (1): 370-376. doi: 10.1088/1674-1056/18/1/060

Chaos game representation (CGR)-walk model for DNA sequences

Gao Jie(高洁)a)b) and Xu Zhen-Yuan(徐振源)a)   

  • a) School of Science, Jiangnan University, Wuxi 214122, China; b) School of Information Technology, Jiangnan University, Wuxi 214122, China
  • Received: 2008-04-24  Revised: 2008-08-27  Online: 2009-01-20  Published: 2009-01-20
  • Supported by:
    Project supported by the National Natural Science Foundation of China (Grant No. 60575038) and the Natural Science Foundation of Jiangnan University, China (Grant No. 20070365).

Abstract: Chaos game representation (CGR) is an iterative mapping technique that processes sequences of units, such as nucleotides in a DNA sequence or amino acids in a protein, to determine the coordinates of their positions in a continuous space. This distribution of positions has two features: it is unique, and the source sequence can be recovered from the coordinates, so that the distance between positions may serve as a measure of similarity between the corresponding sequences. A CGR-walk model for DNA sequences is proposed based on the CGR coordinates. The CGR coordinates are converted into a time series, and a long-memory ARFIMA(p, d, q) model, where ARFIMA stands for autoregressive fractionally integrated moving average, is introduced into DNA sequence analysis. This model is applied to simulate real CGR-walk sequence data of ten genomic sequences. Remarkable long-range correlations are uncovered in the data, and the data are reasonably well fitted by the ARFIMA(p, d, q) model.

Key words: CGR-walk model, DNA sequence, long-memory, ARFIMA(p, d, q) model
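
As an illustration of the iterative mapping and the recoverability property described in the abstract, the following is a minimal sketch of a chaos game representation for nucleotides. The corner assignment (A, C, G, T at the corners of the unit square), the starting point (1/2, 1/2), and the function names `cgr_coordinates` and `recover_sequence` are assumptions made for this example and are not taken from the paper; the conversion of the coordinates into a CGR-walk time series is not specified in the abstract and is not shown.

```python
from fractions import Fraction

# Assumed corner assignment (a common convention, not taken from the paper).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_coordinates(seq, start=(Fraction(1, 2), Fraction(1, 2))):
    """Return one CGR point per nucleotide of seq.

    Each step moves the current point halfway toward the corner
    assigned to the next nucleotide.
    """
    x, y = start
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x = (x + cx) / 2
        y = (y + cy) / 2
        points.append((x, y))
    return points


def recover_sequence(point, length):
    """Read the last `length` nucleotides back from a single CGR point.

    The quadrant containing the point identifies the most recent
    nucleotide; undoing the halving step exposes the previous one.
    """
    x, y = point
    by_corner = {corner: base for base, corner in CORNERS.items()}
    half = Fraction(1, 2)
    bases = []
    for _ in range(length):
        cx = 1 if x > half else 0
        cy = 1 if y > half else 0
        bases.append(by_corner[(cx, cy)])
        x = 2 * x - cx
        y = 2 * y - cy
    return "".join(reversed(bases))


points = cgr_coordinates("ACGT")
print(points[-1])                       # final CGR point of the sequence
print(recover_sequence(points[-1], 4))  # sequence read back from that point
```

Exact rational arithmetic is used here so that the inversion stays exact for any sequence length; with floating-point coordinates only a limited number of trailing nucleotides can be recovered. Under these assumptions, two sequences that share their last k nucleotides end at points whose coordinates differ by less than 2^-k, which is the sense in which distance between positions reflects similarity.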

Classification codes:

  • 87.15.Cc (Folding: thermodynamics, statistical mechanics, models, and pathways)
  • 05.40.Fb (Random walks and Levy flights)
  • 05.45.-a (Nonlinear dynamics and chaos)
  • 87.14.E- (Proteins)
  • 87.14.G- (Nucleic acids)
  • 87.15.A- (Theory, modeling, and computer simulation)
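
The "fractionally integrated" part of the ARFIMA(p, d, q) model named in the abstract refers to the operator (1 - B)^d with a non-integer d, where B is the backshift operator; values 0 < d < 0.5 correspond to stationary long-memory behaviour. The sketch below computes the standard binomial-expansion weights of this operator and applies them to a series. The function names `frac_diff_weights` and `frac_diff` are hypothetical, the filter is simply truncated at the start of the series, and this is not the authors' estimation procedure: fitting p, d, and q to data is not shown.

```python
def frac_diff_weights(d, n):
    """First n coefficients of the expansion of (1 - B)**d.

    Uses the recursion w[0] = 1, w[k] = w[k-1] * (k - 1 - d) / k.
    """
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply (1 - B)**d to series, truncating the filter at the series start."""
    w = frac_diff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]


# For d = 1 the weights reduce to ordinary first differencing.
print(frac_diff_weights(1, 4))
# For fractional d the weights decay slowly instead of cutting off.
print(frac_diff_weights(0.5, 4))
```

The slow decay of the weights for fractional d is what lets the model represent dependence between observations that are far apart, in contrast to integer differencing, where the weights vanish after a few lags.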