Clustering (version 1.0.0)
Input sequence requirements:
- To obtain optimal results, use only high-quality sequences 80 bp or longer. Avoid including sequences shorter than 55 nt.
- Ideally, trim all sequences to the same length.
- A minimum of 5,000 sequences is required for clustering.
- Make sure that all adapters have been removed from the sequences. The presence of adapters will invalidate the clustering results!
- Check the RepeatExplorer manual for further details.
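
The requirements above can be checked before a job is submitted. The following is a minimal sketch, not part of RepeatExplorer itself. It assumes a plain FASTA input; the function names `read_fasta` and `check_input` are illustrative, and the default thresholds are simply the values from the list above. Adapter contamination is not detected by this sketch.

```python
def read_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The name is taken as the first whitespace-delimited token.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_input(records, min_reads=5000, min_length=55):
    """Return a list of warnings; an empty list means no problem was found."""
    lengths = [len(seq) for _, seq in records]
    warnings = []
    if len(lengths) < min_reads:
        warnings.append(
            f"only {len(lengths)} sequences, at least {min_reads} required")
    too_short = sum(1 for n in lengths if n < min_length)
    if too_short:
        warnings.append(f"{too_short} sequences shorter than {min_length} nt")
    if len(set(lengths)) > 1:
        warnings.append("sequences differ in length; consider trimming")
    return warnings
```

A typical call would be `check_input(list(read_fasta(open("reads.fasta"))))`, where `reads.fasta` is a placeholder file name.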
Check this option if you are using paired reads and the input contains both mates of every pair, with left mates alternating with their right mates.
Sequences are renamed by default. If you want to keep the original sequence names, uncheck this option. For paired reads, the left and right mates must be distinguished by the last character of the sequence name. It is also necessary that all reads are paired and that left mates alternate with their right mates!
If you wish to keep part of each sequence name, enter the number of characters to keep (1-10) instead of zero. Use this setting if you are doing a comparative analysis.
Minimal length (in nucleotides) of similarity hits to be considered significant. It can be used to raise the default threshold, which requires similarity over at least 55% of the longer sequence.
Clusters with a number of reads above the threshold are analyzed in detail. The threshold is set as a percentage of the reads used for clustering.
Use this option if you want to remove a particular sequence from your data, for example a tandem repeat.
This parameter affects the assembly but not the clustering.
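
The pairing rules described for paired reads can be verified on the list of read names before clustering. This is a minimal sketch under one assumption that the text does not state explicitly: that the two mates of a pair have identical names apart from the last character (for example `read1/1` and `read1/2`). The function name `check_interleaved_pairs` is illustrative.

```python
def check_interleaved_pairs(names):
    """Return a list of problems found in an ordered list of read names.

    Assumes that mates share the same name except for the last character
    and that each left mate is immediately followed by its right mate.
    """
    problems = []
    if len(names) % 2 != 0:
        problems.append("odd number of reads: at least one read is unpaired")
    for i in range(0, len(names) - 1, 2):
        left, right = names[i], names[i + 1]
        if left[:-1] != right[:-1]:
            problems.append(
                f"reads {i + 1} and {i + 2} do not look like mates: "
                f"{left!r}, {right!r}")
        elif left == right:
            problems.append(
                f"reads {i + 1} and {i + 2} have identical names: {left!r}")
    return problems
```

An empty result means that the names are consistent with an interleaved, fully paired input; it does not prove that the sequences themselves are correct mates.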

What it does

This tool characterizes repetitive sequences in a genome based on low-pass shotgun sequencing (1). An all-to-all comparison of sequence reads is performed using megablast, and all hits with similarity above 90% and an overlap covering over 55% of the longer sequence are recorded. The similarity hits are used to construct a graph in which nodes represent sequence reads and edges connect reads that share a similarity hit (2). Sequences are divided into clusters so that there is a relatively high number of similarity hits between sequences belonging to the same cluster and a relatively low number of similarity hits between clusters (3). Each cluster with a size above the threshold is characterized by a similarity search against databases of known repeats, and its graph layout is calculated (4). Assembly is performed by the cap3 program on the sequences from each cluster. The output consists of an HTML report and a downloadable archive with all analysis results (5). Graph layouts can be analyzed using the SeqGrapheR program. The most recent version of SeqGrapheR can be downloaded from the Laboratory of Molecular Cytogenetics website.
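
The hit filtering and graph construction described above can be illustrated as follows. This is a simplified sketch and not the tool's actual implementation: the `Hit` record and its field names are hypothetical, the identity and overlap thresholds are passed in explicitly, and the inclusive comparison (`>=`) is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One pairwise similarity hit (hypothetical field layout)."""
    query: str
    subject: str
    identity: float       # percent identity of the alignment
    aln_length: int       # alignment length in nt
    query_length: int     # length of the query read in nt
    subject_length: int   # length of the subject read in nt


def passes(hit, min_identity, min_overlap):
    """True if the hit meets the identity and overlap thresholds.

    The overlap is measured as a fraction of the longer of the two reads.
    """
    longer = max(hit.query_length, hit.subject_length)
    return (hit.query != hit.subject
            and hit.identity >= min_identity
            and hit.aln_length / longer >= min_overlap)


def build_graph(hits, min_identity, min_overlap):
    """Build an undirected graph: nodes are reads, edges are accepted hits."""
    graph = defaultdict(set)
    for hit in hits:
        if passes(hit, min_identity, min_overlap):
            graph[hit.query].add(hit.subject)
            graph[hit.subject].add(hit.query)
    return graph
```

The resulting adjacency structure is the kind of input that a community detection method then divides into clusters.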

[Figure: ./static/images/umbr_programs_icons/drawing.png]

Limitations and performance of the clustering

Currently, the clustering step uses the Louvain method. While this method outperforms the previously used method in terms of computational time, it still requires that the whole graph be loaded into memory. Memory usage is directly proportional to the total number of similarity hits E, which can be calculated from:

E=N^2*k

where N is the total number of reads and k is a coefficient that depends on the repetitiveness of the genome. Fewer reads can be used for highly repetitive genomes; conversely, less repetitive genomes allow more sequencing data to be used. Based on previously analyzed data from P. sativum, it is possible to cluster up to 4 million 100 nt reads on a computer with 16 GB of RAM. With this setting, the whole clustering and subsequent analysis needs approximately 8 days to finish. With 500 thousand sequence reads, which is still sufficient for a repeat survey, the calculation finishes in about 6 hours. Also note that a considerable amount of data is generated. For example, clustering of 4 million P. sativum reads yields 50 GiB of uncompressed files. To prevent exhausting the available memory, each clustering run is preceded by a test that estimates the limit on the number of reads. If the total number of sequences exceeds the limit, only a fraction of the reads is used for clustering.
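
The relation E = N^2 * k can be used to extrapolate from a small test run. The sketch below is only an illustration of that arithmetic and is not the estimation procedure used by the tool: k is estimated from the number of hits observed in a sample, and the largest N that stays within a given hit budget is derived from it. The function names and the `max_hits` budget are hypothetical; the number of hits that fits into a given amount of RAM is not stated here and must be supplied by the user.

```python
import math


def estimate_k(sample_reads, sample_hits):
    """Estimate the coefficient k from a small test run, using k = E / N^2."""
    return sample_hits / sample_reads ** 2


def expected_hits(n_reads, k):
    """Expected number of similarity hits, E = N^2 * k."""
    return n_reads ** 2 * k


def max_reads(k, max_hits):
    """Largest N for which E = N^2 * k does not exceed max_hits."""
    return math.isqrt(math.floor(max_hits / k))
```

Because E grows with the square of N, doubling the number of reads quadruples the expected number of hits, and with it the memory needed to hold the graph.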

Considering the high demands of the clustering, we highly recommend that users first run the clustering on a fraction of the available data. If the results are satisfactory, clustering can then be run on a larger set of reads to utilize the whole capacity of the Galaxy computer cluster. For this purpose, there is also a tool for selecting a random sample of sequences, which can then be passed to clustering.
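
For illustration, random subsampling can be sketched as below. This is not the Galaxy sampling tool mentioned above; the function name `sample_reads` and its parameters are hypothetical. With `paired=True` the records are assumed to be interleaved (left mate followed by right mate) and are sampled as whole pairs so that the pairing is preserved.

```python
import random


def sample_reads(records, n, paired=False, seed=None):
    """Return a random sample of up to n records, preserving input order.

    With paired=True, records are assumed to be interleaved mates and are
    sampled as whole pairs, so n // 2 pairs are returned.
    """
    rng = random.Random(seed)
    step = 2 if paired else 1
    # Group the records into sampling units (single reads or mate pairs).
    units = [records[i:i + step] for i in range(0, len(records), step)]
    k = min(n // step, len(units))
    chosen = sorted(rng.sample(range(len(units)), k))
    return [rec for i in chosen for rec in units[i]]
```

Fixing `seed` makes the sample reproducible, which is useful when the same subset has to be analyzed again with different settings.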

Sequence read length

Clustering was tested both with Illumina reads with a read length of 100 nt and with longer Roche 454 reads ranging from 100 to 600 nt. For unbiased clustering, all reads should have the same length. When only reads of varying lengths are available, trim the sequences to a uniform length (200 nt recommended) with the Trim sequences tool available in the FASTX-Toolkit for FASTQ data.
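
The effect of trimming to a uniform length can be sketched in a few lines. This only illustrates the idea and is not a replacement for the FASTX-Toolkit tool; the function name `trim_to_length` is hypothetical, and discarding reads shorter than the target length is an assumption made here so that all remaining reads end up with exactly the same length.

```python
def trim_to_length(records, length):
    """Trim every read to `length` nt and drop reads shorter than `length`.

    `records` is an iterable of (name, sequence) tuples.
    """
    return [(name, seq[:length]) for name, seq in records
            if len(seq) >= length]
```

For paired reads, note that dropping a short read without also dropping its mate would break the required alternation of left and right mates.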

Output

The main results are presented as an HTML report, which contains a table listing all clusters whose size was above the threshold. Detailed results can be downloaded as a compressed archive. It contains the following directories:

sequences:

contains the FASTA file and BLAST database files of the sequences used for clustering. If the number of sequences was cut down, the directory contains only the subset that was used.

assembly:

contigs as assembled by the cap3 program. Each cluster is assembled independently. All contigs are stored as FASTA and ACE files.

blastx:

results of the similarity search of individual reads against the database of transposable element protein domains

clustering:

most of the results are stored here.

The file hitsort_PID90_LCOV55.cls contains lists of sequence reads classified into clusters, in the format:

>CL1 number_of_reads_in_CL1
read_id_11    read_id_12    read_id_13 ...
>CL2 number_of_reads_in_CL2
read_id_21    read_id_22    read_id_23 ...
...
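
A file in this format can be read with a short parser. The sketch below assumes that read identifiers are separated by whitespace (tabs or spaces) and that each header line carries the cluster name followed by the number of reads; the function name `parse_cls` is illustrative.

```python
def parse_cls(lines):
    """Parse .cls formatted lines into {cluster_name: [read_id, ...]}.

    Raises ValueError if the number of reads listed for a cluster does not
    match the count declared in its header line.
    """
    clusters, declared = {}, {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0]
            declared[current] = int(fields[1])
            clusters[current] = []
        elif current is not None:
            clusters[current].extend(line.split())
    for name, reads in clusters.items():
        if len(reads) != declared[name]:
            raise ValueError(
                f"{name}: header declares {declared[name]} reads, "
                f"found {len(reads)}")
    return clusters
```

With an open file handle `fh`, `parse_cls(fh)` returns a dictionary that maps each cluster name to its list of read identifiers.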

The file hitsort_PID90_LCOV55 contains all pairs of reads for which similarity above 90% was detected over at least 55% of the length of the longer sequence. The data are in the following format:

read_id_a    read_id_b     bitscore_a_b
read_id_c    read_id_d     bitscore_c_d
read_id_e    read_id_f     bitscore_e_f
...
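
As an example of working with this three-column format, the following sketch counts how many recorded similarity hits each read takes part in. It assumes whitespace-separated columns and silently skips lines that do not have exactly three fields; the function name `hit_counts` is illustrative.

```python
from collections import Counter


def hit_counts(lines):
    """Count, for every read, the number of similarity hits it appears in."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        read_a, read_b, _bitscore = fields
        counts[read_a] += 1
        counts[read_b] += 1
    return counts
```

Reads with very high counts belong to the most abundant repeats in the data set.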

clusterConnections.cls contains information about the connections (mutually similar reads) between clusters:

CLXX    CLXY    number_connections
CLXZ    CLYY    number_connections
...

interconnnected.txt contains a list of reads which have connections to clusters other than the one they belong to:

>CL1 number_of_reads_with_similarities_to_other_clusters
read_id_21    read_id_34    read_id_56 ...
>CL2 number_of_reads_with_similarities_to_other_clusters
read_id_31    read_id_62    read_id_53 ...
...

subdirectories clusters/dir_CLXXXX:

contain detailed information about individual clusters. Every cluster with a size above the threshold is included, with the following content:

  • list of all reads in the cluster
  • sequences in fasta format
  • contigs assembled from corresponding reads
  • contigs coverage in pdf format
  • contigs in ACE format
  • summary of RepeatMasker search
  • summary of protein domain search
  • GL file with the cluster graph layout, which can be analyzed using SeqGrapheR
  • information about related (connected) clusters

For details see BMC Bioinformatics 2010, 11:378.