Chinese Physics B ›› 2008, Vol. 17 ›› Issue (12): 4396-4400. doi: 10.1088/1674-1056/17/12/011

• GENERAL •

A combined statistical model for multiple motifs search

Gao Li-Feng (高丽锋) [a], Liu Xin (刘鑫) [b], Guan Shan (官山) [c]

  [a] Chinese Academy of Agriculture Science, Beijing 100081, China; [b] Institute of Theoretical Physics, Beijing 100080, China; [c] Physics science and technology Department, Yangzhou University, Yangzhou 225009, China
  • Received: 2008-01-03; Revised: 2008-02-13; Online: 2008-12-20; Published: 2008-12-20
  • Supported by:
    Project supported by the National Science Foundation of China (Grant No. 70671089) and the Key Important Project (No. 10635040).

Abstract: Transcription factor binding sites (TFBS) play key roles in gene expression and regulation. They are short sequence segments with a definite structure that the corresponding transcription factors recognize correctly. From a statistical viewpoint, candidate TFBS should differ markedly from segments assembled at random from nucleotides. This paper proposes a combined statistical model for finding over-represented short sequence segments in different kinds of data sets. When an over-represented short sequence segment is described by a position weight matrix, the nucleotide distribution at most sites of the segment should be far from the background nucleotide distribution. The central idea of this approach is to search for signals of this kind. The algorithm is tested on three data sets: the binding-site data set of the cyclic AMP receptor protein in E. coli; PlantProm DB, a non-redundant collection of proximal promoter sequences from different species; and the collection of intergenic sequences of the whole E. coli genome. Even though the complexity of these three data sets differs considerably, the results show that the model is rather general and sensible.
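The criterion that most sites of a segment should lie far from the background nucleotide distribution can be illustrated with a minimal sketch. This is not the combined model of the paper, whose details are not given in the abstract. It assumes that "far from the background" is measured by the per-position relative entropy (Kullback-Leibler divergence) of a frequency matrix built from aligned segments. The function names, the pseudocount, and the 0.5-bit threshold are hypothetical choices made only for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_frequencies(segments, pseudocount=1.0):
    """Column-wise nucleotide frequencies of equal-length aligned segments.

    The pseudocount is an illustrative assumption, not taken from the paper.
    """
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("all segments must have the same length")
    total = len(segments) + pseudocount * len(ALPHABET)
    matrix = []
    for i in range(length):
        counts = Counter(s[i] for s in segments)
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix


def relative_entropy(column, background):
    """Kullback-Leibler divergence (in bits) of one column from the background."""
    return sum(p * math.log2(p / background[b])
               for b, p in column.items() if p > 0)


def informative_fraction(segments, background, threshold=0.5):
    """Fraction of positions whose divergence exceeds a (hypothetical) threshold.

    Returns the fraction together with the per-position scores.
    """
    matrix = position_frequencies(segments)
    scores = [relative_entropy(col, background) for col in matrix]
    fraction = sum(s > threshold for s in scores) / len(scores)
    return fraction, scores


if __name__ == "__main__":
    uniform = {b: 0.25 for b in ALPHABET}
    conserved = ["TGTGA"] * 8                      # strongly conserved toy segments
    shuffled = ["ACGT", "CGTA", "GTAC", "TACG"]    # every column is balanced
    print(informative_fraction(conserved, uniform)[0])
    print(informative_fraction(shuffled, uniform)[0])
```

In this toy setting, a set of segments would be treated as a motif candidate when the returned fraction is high. A real application would estimate the background distribution from the data set itself, for example from the intergenic sequences, instead of assuming it is uniform.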

Key words: transcription factor binding sites, motif, position weight matrix

Classification codes (PACS):

  • 87.16.Yc (Regulatory genetic and chemical networks)
  • 87.14.E- (Proteins)
  • 87.15.A- (Theory, modeling, and computer simulation)
  • 87.15.B- (Structure of biomolecules)
  • 87.15.Cc (Folding: thermodynamics, statistical mechanics, models, and pathways)