

Segmentation algorithms to detect compositionally homogeneous domainsThis page is a companion for the paper on comparative testing of DNA segmentation algorithms using benchmark simulations, written by Eran Elhaik (me), Dan Graur and Kresimir Josic. The intention is that this page will host the implementation of the JensenShannon divergence algorithm we describe in the article. For now, we present the file version that we wrote in Matlab, but our hope is to eventually host versions in a variety of languages. In general, we want to make the methods as accessible to the community as possible.
Convert DNA sequence to Xbp fileThis bash script accepts a DNA sequence (no headline). It then count the GC nucleotides per a certain window size (default is 32bp). The output file is in the format of Xbp and is the input file for the IsoPlotter algorithm. Usage information is included in the file and can be viewed using any text editor.CalculateGCWindow.n (Bash) Example: Readme, Input, and Output
AlgorithmsWe tested seven segmentation algorithms from the literature. Below, we describe only the main ideas behind each algorithm. In most cases, the details and the implementation are presented in the original papers. We could not obtain the original code for one of the programs, and reconstructed the program to the best of our ability on the basis of its description. For many several algorithms, it is unclear which parameters we were to use under which circumstances. In such cases, we used the default parameters recommended by the authors. Unless obtained from the authors, algorithms were implemented in Matlab 7.5.
1. Partition sequence file (Xbp format) using the D_{JS} algorithmThis function implements the JensenShannon divergence algorithm (D_{JS}) as described in our paper. In this method, sequences are recursively partitioned by maximizing the difference in GC content between adjacent subsequences, as measured by the JensenShannon divergence statistic (D_{JS}) (Lin 1991). The D_{JS} statistic is calculated over all possible partitioning points and the sequence is partitioned at the position of maximum D_{JS}. The process of segmentation is terminated when the maximal D_{JS} value is smaller than a predetermined threshold of 5.8·10–5. Usage information is included in the file; type 'help D_{JS}Segmentation.m' at the Matlab prompt for more information.D_{JS}Segmentation.m (Matlab)
2. Partition sequence file (Xbp format) using the D_{JS} BIC algorithmD_{JS} BIC. In this method (Li 2001a; Li et al. 2002) sequences are recursively partitioned by maximizing the difference in GC content between adjacent subsequences measured by the JensenShannon divergence statistic (D_{JS}) (Lin 1991). The “segmentation strength” s is a predefined measure of stringency for the segmentation that considers the sequence length and the maximum D_{JS} value (for full description see Supplementary Materials). The process of segmentation continues as long as s is larger than a predetermined threshold s0. We used s0 = 0.D_{JS}BICSegmentation.m (Matlab)
3. Partition sequence file (Xbp format) using the H&M algorithmIn Haiminen and Mannila’s (2007) method, the sequence is partitioned into 100 kb nonoverlapping windows and their GC content is calculated. The algorithm divides the sequence into a predetermined number of domains (k) so that the sum of squares of the Euclidean distance between the GC content of the segments and the sequence GC content is minimized. A serious disadvantage of this algorithm is its quadratic computation time, which precluded the testing of its performance on long sequences (= 100 kb). In addition, determining a realistic value for k, as far as reallife sequences are concerned, is not a straightforward task; however, in our simulation sets, it was simply the number of predetermined domains. We used the segmentation code provided by the authors at http://dx.doi.org/10.1016/j.gene.2007.01.028
4. Partition sequence file (Xbp format) using the Sarment algorithmThis package implements an algorithm that partitions the sequence into k domains using maximal predictive partitioning (MPP) (Guéguen 2005). As in the previous case, the final number of domains (k) has to be determined a priori. Again, in reallife situations, the value of k is unknown; however, in our simulation the value of k is known. We used the segmentation code (version 4) provided by the authors at http://pbil.univlyon1.fr/software/sarment/ (file tar.gz).
5. Partition sequence file (Xbp format) using a slidingwindow algorithmWe used the slidingwindow algorithm as described by Costantini et al. (2006). In this method, the sequence is partitioned into 100 kb nonoverlapping windows. The GC variation within every window is calculated by the GC content standard deviation. The windows are scanned for differences in the standard deviation between every two adjacent windows. Adjacent windows with differences between their standard deviation below a given threshold of 1% were considered concatenated domains (Costantini et al. 2006). We did not observe any cases that would justify using smaller thresholds.CostantiniSegmentation.m (Matlab)
6. Partition sequence file (Xbp format) using the Compositional heterogeneity indexThis method (Nekrutenko and Li 2000) uses a divide andconquer approach (see, Cormen, Leiserson, and Rivest 1990). The sequence is partitioned to n windows of length l, and the CHI measure is calculated for each window. Sequences are then recursively partitioned by maximizing the difference in GC content between adjacent subsequences, measured by the CHI measure
where the GC content of each nucleotide in the window is GCi and the mean GC content of the window is P. The process of segmentation is terminated when the maximized CHI is smaller than a given threshold. In order to estimate the halting threshold parameter, one hundred thousand random sequences, each 1 Mb long, were drawn from a uniform distribution. Each of these sequences was partitioned into two at a random point, and the CHI value for each segment was calculated. We followed the procedure of Cohen et al. (2005) and chose a CHI value corresponding to the upper 5% of the cumulative CHI distribution. We tested a wide range of window sizes (l = 2, 3, 4, 5, and 10). The algorithm appeared to perform best with window sizes of 2. ChiSegmentation.m (Matlab)
7. Partition sequence file (Xbp format) using the IsoFinder algorithmWe unsuccessfully tried to obtain the code for this program (Oliver et al. 2004) from the corresponding author. As an alternative, we reconstructed the IsoFinder algorithm to the best of our understanding of the description in Oliver et al. (2004). In this method, the algorithm uses a slidingpointer that moves from left to right of the sequence. At each point, the mean GC contents to the left and the right of the pointer are compared using the tstatistic. At the maximum tstatistic point (tfilt), the algorithm compares the two subsequences to the left and to the right of the point as follows: both subsequences are divided into windows of size l0 and their window GC contents are compared using Student's ttest to obtain tfilt. The distribution of tfilt values [P(t = tfilt)] was obtained using Monte Carlo simulations (BernaolaGalván et al. 2001). The statistical significance P(t) of the possible partitioning point with tfilt = t is defined as the probability of obtaining the value t or lower values within a random sequence. If the significance of P(t) exceeds a selected threshold P0, the sequence is partitioned at this point into two subsequences and the procedure repeats recursively. We followed Oliver et al. (2004) and used l0 = 3000 bp and P0 = 95%.IsoFinderSegmentation.m (Matlab)
Download all filesTo easily get the most updated versions of the files including sample input and output files, they are available as a downloadable zip file here. Download all files
Matlab compatibilityAll of the Matlab functions here were designed to be compatible with Matlab v7 or Matlab student version (v14). They are not necessarily compatible with older versions of Matlab. That being said, it should be possible to make them compatible as the core functionality does not depend on v7 or v14 features. 

Please post your questions and comments below.
