TY - JOUR
T1 - Identification of common motifs in unaligned DNA sequences
T2 - Application to Escherichia coli Lrp regulon
AU - Fraenkel, Yishai M.
AU - Mandel, Yael
AU - Friedberg, Devorah
AU - Margalit, Hanah
PY - 1995/8
Y1 - 1995/8
N2 - We describe a relatively simple method for the identification of common motifs in DNA sequences that are known to share a common function. The input sequences are unaligned and there is no information regarding the position or orientation of the motif. Often such data exists for protein-binding regions, where genetic or molecular information that defines the binding region is available, but the specific recognition site within it is unknown. The method is based on the principle of 'divide and conquer'; we first search for dominant submotifs and then build full-length motifs around them. This method has several useful features: (i) it screens all submotifs so that the results are independent of the sequence order in the data; (ii) it allows the submotifs to contain spacers; (iii) it identifies an existing motif even if the data contains 'noise'; (iv) its running time depends linearly on the total length of the input. The method is demonstrated on two groups of protein-binding sequences: a well-studied group of known CRP-binding sequences, and a relatively newly identified group of genes known to be regulated by Lrp. The Lrp motif that we identify, based on 23 gene sequences, is similar to a previously identified motif based on a smaller data set, and to a consensus sequence of experimentally defined binding sites. Individual Lrp sites are evaluated and compared in regard to their regulation mode.
AB - We describe a relatively simple method for the identification of common motifs in DNA sequences that are known to share a common function. The input sequences are unaligned and there is no information regarding the position or orientation of the motif. Often such data exists for protein-binding regions, where genetic or molecular information that defines the binding region is available, but the specific recognition site within it is unknown. The method is based on the principle of 'divide and conquer'; we first search for dominant submotifs and then build full-length motifs around them. This method has several useful features: (i) it screens all submotifs so that the results are independent of the sequence order in the data; (ii) it allows the submotifs to contain spacers; (iii) it identifies an existing motif even if the data contains 'noise'; (iv) its running time depends linearly on the total length of the input. The method is demonstrated on two groups of protein-binding sequences: a well-studied group of known CRP-binding sequences, and a relatively newly identified group of genes known to be regulated by Lrp. The Lrp motif that we identify, based on 23 gene sequences, is similar to a previously identified motif based on a smaller data set, and to a consensus sequence of experimentally defined binding sites. Individual Lrp sites are evaluated and compared in regard to their regulation mode.
UR - http://www.scopus.com/inward/record.url?scp=0029146491&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/11.4.379
DO - 10.1093/bioinformatics/11.4.379
M3 - Article
C2 - 8521047
AN - SCOPUS:0029146491
SN - 1367-4803
VL - 11
SP - 379
EP - 387
JO - Bioinformatics
JF - Bioinformatics
IS - 4
ER -