What it does
This tool characterizes repetitive sequences in a genome based on low-pass shotgun sequencing data.
An all-to-all comparison of sequence reads is performed using megablast, and all hits with similarity above 80% that overlap more than 55% of the longer sequence are recorded (1). The similarity hits are used to construct a graph in which nodes represent sequence reads and edges between nodes correspond to similarity hits (2). Sequences are divided into clusters so that there is a relatively high number of similarity hits between sequences belonging to the same cluster and a relatively low number of similarity hits between clusters (3). Each cluster with a size above the threshold is characterized by a similarity search against databases of known repeats, and its graph layout is calculated (4). Assembly is performed by the cap3 program on the sequences from each cluster. The output consists of an HTML report and a downloadable archive with all analysis results (5).
Graph layouts can be analyzed using the SeqGrapheR program. The most recent version of SeqGrapheR can be downloaded from the Laboratory of Molecular Cytogenetics website.
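The filtering and graph construction described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the input tuple layout, the function name `build_read_graph`, and the toy read IDs are assumptions made for illustration, and only the two thresholds come from the description.

```python
from collections import defaultdict

# Thresholds taken from the description above.
MIN_SIMILARITY = 80.0  # percent similarity of the hit
MIN_OVERLAP = 0.55     # fraction of the longer sequence covered by the hit


def build_read_graph(hits, read_lengths):
    """Return an adjacency map {read: set(neighbours)} from filtered hits.

    hits: iterable of (read_a, read_b, percent_similarity, alignment_length)
          (a hypothetical layout chosen for this sketch)
    read_lengths: dict mapping read ID to read length in nt
    """
    graph = defaultdict(set)
    for read_a, read_b, similarity, aln_len in hits:
        if read_a == read_b:
            continue  # a read matching itself carries no information
        longer = max(read_lengths[read_a], read_lengths[read_b])
        if similarity > MIN_SIMILARITY and aln_len / longer > MIN_OVERLAP:
            # Edges are undirected, so store both directions.
            graph[read_a].add(read_b)
            graph[read_b].add(read_a)
    return dict(graph)


# Toy data with hypothetical read IDs and values.
lengths = {"r1": 100, "r2": 100, "r3": 100}
hits = [
    ("r1", "r2", 95.0, 80),  # kept: similar enough and long enough
    ("r2", "r3", 75.0, 90),  # dropped: similarity too low
    ("r1", "r3", 92.0, 40),  # dropped: overlap too short
]
graph = build_read_graph(hits, lengths)
print(graph)
```

Reads that keep no hit after filtering do not appear in the adjacency map at all, which is one simple way to represent singletons in such a sketch.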
Limitations and performance of the clustering
Currently, the clustering step uses the Louvain method. While this method outperforms the previously used method in terms of computational time, it still requires that the whole graph is loaded into memory. Memory usage is directly proportional to the total number of similarity hits E. The number of similarity hits can be calculated from:
where N is the total number of reads and k is a coefficient which depends on the repetitiveness of the genome. Fewer reads can be used for highly repetitive genomes; conversely, less repetitive genomes allow more sequencing data to be used. Based on previously analyzed data from P. sativum, it is possible to cluster up to 4 million 100 nt reads on a computer with 16 GB of RAM. With this setting, the whole clustering and subsequent analysis needs approximately 8 days to finish. With 500 thousand sequence reads, which is still sufficient for a repeat survey, the calculation finishes in about 6 hours. Also note that a considerable amount of data is generated. For example, clustering of 4 million P. sativum reads yields 50 GiB of uncompressed files. To avoid exhausting the available memory, each clustering run is preceded by a test that estimates the limit on the number of reads. If the total number of sequences exceeds the limit, only a fraction of the reads is used for clustering.
Considering the high demands of the clustering, we strongly recommend that users first run the clustering on a fraction of the available data. If the results are satisfactory, the clustering can then be run on a larger set of reads to use the whole capacity of the Galaxy computer cluster. For this purpose, there is also a tool for selecting a random sample of sequences, which can then be passed to the clustering.
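For readers who prepare a subsample outside Galaxy, the idea of drawing a random sample of reads can be sketched in Python as follows. This is not the Galaxy sampling tool itself; the function names `read_fasta` and `sample_reads` are hypothetical, and reservoir sampling is simply one technique that draws a uniform sample in a single pass without holding all reads in memory.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_reads(records, n, seed=None):
    """Reservoir-sample up to n records in one pass over the input."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < n:
                reservoir[j] = record  # replace with probability n / (i + 1)
    return reservoir


# Toy input with hypothetical read IDs.
fasta_lines = [">r1", "ACGT", ">r2", "GGCC", ">r3", "TTAA", ">r4", "CATG"]
subset = sample_reads(read_fasta(fasta_lines), 2, seed=1)
for read_id, seq in subset:
    print(f">{read_id}\n{seq}")
```

Passing a fixed `seed` makes the subsample reproducible, which helps when comparing a pilot run with a later run on more data.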
Sequence read length
Clustering was tested both with Illumina reads with a read length of 100 nt and with longer 454 Roche reads ranging from 100 to 600 nt. For unbiased clustering, all reads should have the same length. When only reads of various lengths are available, trim the sequences to a uniform length (200 nt recommended) with the Trim sequences tool available in the FASTX-Toolkit for FASTQ data.
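The effect of trimming to a uniform length can be sketched in a few lines of Python. This is only an illustration and not the FASTX-Toolkit tool: the function name `trim_to_uniform_length` is hypothetical, and discarding reads shorter than the target length is an assumption of this sketch (without it the output would not be uniform).

```python
def trim_to_uniform_length(records, length=200):
    """Yield (read_id, sequence) pairs with every sequence exactly `length` nt.

    Reads shorter than `length` are skipped, which is an assumption of this
    sketch rather than documented behaviour of any particular tool.
    """
    for read_id, seq in records:
        if len(seq) >= length:
            yield read_id, seq[:length]


# Toy reads with hypothetical IDs; a short target length keeps the example readable.
reads = [("r1", "ACGTACGTAC"), ("r2", "ACG"), ("r3", "GGGGGCCCCC")]
trimmed = list(trim_to_uniform_length(reads, length=5))
print(trimmed)
```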
The main results are presented as an HTML report containing a table that lists all clusters whose size is above the threshold. Detailed results can be downloaded as a compressed archive, which contains the following directories:
contains the FASTA file and BLAST database files of the sequences used for clustering. If the number of sequences was cut down, the directory contains only the subset that was used.
contigs as assembled by the cap3 program. Each cluster is assembled independently. All contigs are stored as FASTA and ACE files.
results of the similarity search of individual reads against the database of transposable element protein domains.
most of the results are stored here.
The file hitsort_PID90_LCOV55.cls contains lists of sequence reads classified into clusters, in the format:
    >CL1 number_of_reads_in_CL1
    read_id_11 read_id_12 read_id_13 ...
    >CL2 number_of_reads_in_CL2
    read_id_21 read_id_22 read_id_23 ...
    ...
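A cluster listing in this format can be read with a short Python sketch like the one below. The function name `parse_cls` is hypothetical, and the parser assumes only what the format description shows: a token starting with `>` names a cluster, the next token is its read count, and the following whitespace-separated tokens are read IDs. Working on tokens rather than lines means it does not depend on where the line breaks fall.

```python
def parse_cls(text):
    """Parse a cluster listing into {cluster_name: [read_id, ...]}."""
    clusters = {}
    declared = {}
    current = None
    expect_count = False
    for token in text.split():
        if token.startswith(">"):
            current = token[1:]
            clusters[current] = []
            expect_count = True
        elif expect_count:
            declared[current] = int(token)  # size stated in the header
            expect_count = False
        elif current is not None:
            clusters[current].append(token)
    # Cross-check the stated size against the number of IDs actually read.
    for name, reads in clusters.items():
        if declared.get(name) != len(reads):
            raise ValueError(f"cluster {name}: size does not match read count")
    return clusters


# Toy content with hypothetical read IDs.
example = ">CL1 3\nr1 r2 r3\n>CL2 2\nr4 r5\n"
clusters = parse_cls(example)
print(clusters)
```

The size check is a cheap way to notice a truncated download before any further analysis is built on the file.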
The file hitsort_PID90_LCOV55 contains all pairs of reads for which similarity above 80% was detected over at least 55% of the length of the longer sequence. The data are in the following format:
    read_id_a read_id_b bitscore_a_b
    read_id_c read_id_d bitscore_c_d
    read_id_e read_id_f bitscore_e_f
    ...
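As an illustration, the three-column pair list can be loaded and summarized with the Python sketch below. The function names `parse_hits` and `graph_size` are hypothetical. Counting one edge per listed pair assumes that each pair of reads appears only once in the file, which is not stated here. The two numbers returned correspond to the number of reads with at least one hit and to the number of similarity hits E discussed in the section on clustering performance.

```python
def parse_hits(text):
    """Parse 'read_a read_b bitscore' triples into a list of tuples."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected three whitespace-separated fields per hit")
    return [
        (tokens[i], tokens[i + 1], float(tokens[i + 2]))
        for i in range(0, len(tokens), 3)
    ]


def graph_size(hits):
    """Return (number of reads with at least one hit, number of hits)."""
    nodes = set()
    for read_a, read_b, _bitscore in hits:
        nodes.update((read_a, read_b))
    return len(nodes), len(hits)


# Toy content with hypothetical read IDs and bit scores.
example = "r1 r2 150\nr1 r3 120.5\nr4 r5 99\n"
hits = parse_hits(example)
n_reads, n_hits = graph_size(hits)
print(n_reads, n_hits)
```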
clusterConnections.cls contains information about the connections (mutually similar reads) between clusters:
    CLXX CLXY number_connections
    CLXZ CLYY number_connections
    ...
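One plausible way to obtain such counts from cluster memberships and read pairs is sketched below in Python. The tool's actual counting rule is not specified here, so treat this as an assumption: every similarity hit whose two reads belong to different clusters is counted as one connection. The function name `count_cluster_connections` and the toy data are hypothetical.

```python
from collections import Counter


def count_cluster_connections(clusters, hits):
    """Count hits that link reads from two different clusters.

    clusters: dict {cluster_name: [read_id, ...]}
    hits: iterable of (read_a, read_b, bitscore)
    Returns a Counter keyed by a sorted (cluster, cluster) pair.
    """
    read_to_cluster = {
        read: name for name, reads in clusters.items() for read in reads
    }
    connections = Counter()
    for read_a, read_b, _bitscore in hits:
        cluster_a = read_to_cluster.get(read_a)
        cluster_b = read_to_cluster.get(read_b)
        if cluster_a is None or cluster_b is None or cluster_a == cluster_b:
            continue  # unclustered read or a hit inside one cluster
        connections[tuple(sorted((cluster_a, cluster_b)))] += 1
    return connections


# Toy data with hypothetical cluster names and read IDs.
clusters = {"CL1": ["r1", "r2"], "CL2": ["r3"], "CL3": ["r4"]}
hits = [
    ("r1", "r2", 150.0),  # within CL1, not a connection
    ("r2", "r3", 110.0),  # CL1 to CL2
    ("r3", "r1", 105.0),  # CL2 to CL1, same pair of clusters
    ("r4", "r9", 90.0),   # r9 is not in any listed cluster
]
connections = count_cluster_connections(clusters, hits)
for (cluster_a, cluster_b), count in sorted(connections.items()):
    print(cluster_a, cluster_b, count)
```

Sorting each cluster pair before counting makes the result independent of the order in which the two reads of a hit are listed.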
interconnnected.txt contains a list of reads that have connections to clusters other than the one they belong to:
    >CL1 number_of_reads_with_similarities_to_other_clusters
    read_id_21 read_id_34 read_id_56 ...
    >CL2 number_of_reads_with_similarities_to_other_clusters
    read_id_31 read_id_62 read_id_53 ...
    ...
contains detailed information about individual clusters; each cluster with a size above the threshold is listed:
For details, see BMC Bioinformatics 2010, 11:378.