Supplementary Materials to the manuscript 'Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffytail test.'
Irina Abnizova, Klaudia Walter, Rene te Boekhorst and Walter R. Gilks
Supplementary introduction and notations
In this section we present evidence that our main result, namely our ability to distinguish regulatory DNA from coding DNA and from noncoding nonregulatory DNA, does not depend strongly on the size of the words examined. It holds for words of length m = 3, 5, 7, 9, 12 with the corresponding numbers of allowed mismatches 0, 1, 2, 3, 4.
We show that when word length is increased, regulatory regions remain fluffy (F > 2), and exons and masked noncoding nonregulatory DNA remain unfluffy, while the coefficient of variation (CV) of spatial cluster size for fluffy (not masked) noncoding nonregulatory DNA remains high (> 1.0). We give detailed examples of the variability of F and CV for the abdominal-A regulatory region, the knirps regulatory region, random DNA from chromosome 3L (3L4) and the internal exon CG10392 (exon 2r4) from chromosome 2R: see the corresponding Supplementary files.
Notation used:
Similarity = (m, mim), e.g. (5,1), where
m = length of the word
mim = number of allowed mismatches
F = coefficient of fluffiness (a measure of the abundance of similar words in comparison with a background model). It is the number of standard deviations by which the largest list of similar words in the original sequence exceeds the mean over shuffled sequences:
$$F = \frac{L_{\max,\mathrm{original}} - \bar{L}_{\max}}{\sigma}$$
where $L_{\max,\mathrm{original}}$ is the number of similar words in the largest list in the original sequence, and $\bar{L}_{\max}$ is computed as
$$\bar{L}_{\max} = \frac{1}{N} \sum_{i=1}^{N} L_{\max,i}$$
where $L_{\max,i}$ is the size of the largest cluster in the ith shuffled sequence, $\sigma$ is the standard deviation of these sizes, and $N$ is the number of randomisations.
If F > 2 we call the sequence fluffy.
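The definition of F can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions that the text does not state: words are taken at every overlapping position, each word counts as similar to itself, similarity is measured by Hamming distance, the background model is a simple shuffle of single letters, and the standard deviation is the sample standard deviation. All function names are hypothetical. The all-against-all comparison is quadratic in sequence length, so the sketch is only practical for short sequences.

```python
import random
import statistics


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def largest_similar_list(seq, m, mim):
    """Size of the largest list of similar words of length m in seq.

    Assumptions: words overlap, a word is similar to itself, and two
    words are similar when they differ in at most mim positions.
    """
    words = [seq[i:i + m] for i in range(len(seq) - m + 1)]
    if not words:
        raise ValueError("sequence is shorter than the word length m")
    return max(sum(hamming(w, v) <= mim for v in words) for w in words)


def fluffiness_from_counts(l_max_original, l_max_shuffled):
    """F = (L_max,original - mean of L_max,i) / SD of L_max,i."""
    mean = statistics.mean(l_max_shuffled)
    sd = statistics.stdev(l_max_shuffled)  # sample SD: an assumption
    if sd == 0:
        raise ValueError("shuffled sequences show no variation; F is undefined")
    return (l_max_original - mean) / sd


def fluffiness(seq, m, mim, n_shuffles=100, seed=0):
    """Estimate F for seq using n_shuffles single-letter shuffles."""
    rng = random.Random(seed)
    letters = list(seq)
    shuffled_sizes = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # assumed background model: letter shuffle
        shuffled_sizes.append(largest_similar_list("".join(letters), m, mim))
    return fluffiness_from_counts(largest_similar_list(seq, m, mim), shuffled_sizes)
```

Under these assumptions, a sequence would be called fluffy when `fluffiness(seq, m, mim)` returns a value greater than 2, for example with the similarity (5,1) given as `m=5, mim=1`.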
CV = coefficient of variation in the size of clusters of adjacent similar words in the MSWL (maximal similar words list); it measures the scatter of these sizes.
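As a rough illustration of CV, the following Python sketch computes the coefficient of variation of a list of cluster sizes as standard deviation divided by mean. The text does not say how adjacency is decided or which form of the standard deviation is used, so the `max_gap` parameter, the grouping rule and the use of the sample standard deviation are assumptions, and both function names are hypothetical.

```python
import statistics


def cluster_sizes(positions, max_gap=1):
    """Group word start positions into clusters of adjacent words.

    Assumption: two words belong to the same cluster when their start
    positions differ by at most max_gap.
    """
    sizes = []
    previous = None
    for p in sorted(positions):
        if previous is not None and p - previous <= max_gap:
            sizes[-1] += 1   # extend the current cluster
        else:
            sizes.append(1)  # start a new cluster
        previous = p
    return sizes


def coefficient_of_variation(sizes):
    """CV = standard deviation / mean of the cluster sizes."""
    mean = statistics.mean(sizes)
    if mean == 0:
        raise ValueError("mean cluster size is zero; CV is undefined")
    return statistics.stdev(sizes) / mean  # sample SD: an assumption
```

For instance, `coefficient_of_variation(cluster_sizes(positions))` applied to the start positions of the words in the MSWL gives one CV value per sequence, which can then be compared with the threshold of 1.0 used above.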
