TAREAN (TAndem REpeat ANalyzer) is a computational pipeline for unsupervised identification of satellite repeats from unassembled sequence reads. The pipeline uses low-pass whole-genome sequence reads and performs graph-based clustering. The resulting clusters, representing all types of repeats, are then examined to identify those containing circular structures indicative of tandem repeats. A poster summarizing TAREAN principles and implementation can be found here.
The analysis requires paired-end reads generated by whole-genome shotgun sequencing, provided as a single FASTA-formatted file. Reads should be of uniform length (the optimal range is 100-200 nt), and the number of analyzed reads should represent less than 1x genome equivalent (genome coverage of 0.01-0.50x is recommended). Reads should be quality-filtered (recommended filter: quality score >= 10 over 95% of bases, no Ns allowed), and only complete read pairs should be submitted for analysis. Paired reads must be interlaced in a single FASTA file:
Example of interlaced input format:

>0001_f
CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG
>0001_r
GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT
>0002_f
ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG
>0002_r
TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC
>0003_f
TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
>0003_r
TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
...
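The interlaced layout shown above can be checked before submission with a short script. The following is a minimal sketch, not part of TAREAN: the function names `read_fasta` and `check_interlaced` are hypothetical, and the `_f`/`_r` header suffixes are taken from the example and are only assumed to be the naming convention of your own data.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_interlaced(handle, fwd_suffix="_f", rev_suffix="_r"):
    """Return the number of complete pairs; raise ValueError on a broken pair.

    Assumption: mates share a name and differ only in the suffix,
    as in the example input (0001_f / 0001_r).
    """
    records = list(read_fasta(handle))
    if len(records) % 2 != 0:
        raise ValueError("odd number of reads: at least one pair is incomplete")
    for (h1, _), (h2, _) in zip(records[0::2], records[1::2]):
        if not (h1.endswith(fwd_suffix) and h2.endswith(rev_suffix)):
            raise ValueError(f"unexpected read order or suffixes: {h1}, {h2}")
        if h1[: -len(fwd_suffix)] != h2[: -len(rev_suffix)]:
            raise ValueError(f"mates do not match: {h1}, {h2}")
    return len(records) // 2


# Usage with an in-memory file; replace with open("reads.fasta") for real data.
example = io.StringIO(">0001_f\nACGT\n>0001_r\nTTGA\n>0002_f\nAAAA\n>0002_r\nCCCC\n")
print(check_interlaced(example))  # prints 2
```

The check loads all records into memory, which is acceptable for samples of the size discussed here; a streaming variant would compare records two at a time.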
To prepare a quality-filtered, interlaced input FASTA file from FASTQ files, use the Preprocessing of paired-reads tool.
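For readers who want to see what the recommended filter amounts to, here is a rough sketch. It is not the implementation of the Preprocessing of paired-reads tool: the names `passes_filter` and `interlace_pairs` are hypothetical, Phred+33 quality encoding is assumed, and the output naming simply copies the `_f`/`_r` style of the example input.

```python
def passes_filter(seq, qual, min_q=10, min_fraction=0.95, offset=33):
    """Return True if the read has no Ns and enough high-quality bases.

    Assumption: qualities are Phred scores encoded with the given ASCII offset.
    """
    if not seq or len(seq) != len(qual):
        return False
    if "N" in seq.upper():
        return False
    good = sum(1 for c in qual if ord(c) - offset >= min_q)
    return good / len(qual) >= min_fraction


def interlace_pairs(fwd_records, rev_records):
    """Yield FASTA lines for pairs in which both mates pass the filter.

    Each input is an iterable of (sequence, quality) tuples in matching order.
    A pair is dropped entirely if either mate fails, so only complete
    pairs are written.
    """
    for i, ((s1, q1), (s2, q2)) in enumerate(zip(fwd_records, rev_records), start=1):
        if passes_filter(s1, q1) and passes_filter(s2, q2):
            yield f">{i:04d}_f"
            yield s1
            yield f">{i:04d}_r"
            yield s2


# Usage: the second pair is discarded because its forward read contains an N.
fwd = [("ACGT", "IIII"), ("ANGT", "IIII")]
rev = [("TTTT", "IIII"), ("CCCC", "IIII")]
print("\n".join(interlace_pairs(fwd, rev)))
```

Dropping the whole pair when one mate fails keeps the output consistent with the requirement that only complete read pairs are submitted.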
Sample size defines how many reads are used in the calculation. The default setting of 500,000 reads enables detection of high-copy-number satellites within several hours of computation time. For higher sensitivity, the sample size can be increased. Because sample size affects memory usage, this parameter may be automatically lowered during the run. The maximum sample size that can be processed depends on the repetitiveness of the analyzed genome.
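To relate a sample size to the recommended coverage range, coverage can be estimated as the number of reads times the read length divided by the genome size. The sketch below only illustrates this arithmetic; the function names and the 1 Gbp genome size are hypothetical and not part of the pipeline.

```python
def genome_coverage(n_reads, read_length, genome_size_bp):
    """Estimate genome coverage (x) for a sample of uniform-length reads."""
    return n_reads * read_length / genome_size_bp


def reads_for_coverage(coverage, read_length, genome_size_bp):
    """Estimate how many reads give the requested coverage."""
    return round(coverage * genome_size_bp / read_length)


# Illustrative values only: a hypothetical 1 Gbp genome and 100 nt reads.
genome_size = 1_000_000_000
print(genome_coverage(500_000, 100, genome_size))   # coverage of the default sample
print(reads_for_coverage(0.1, 100, genome_size))    # reads needed for 0.1x
```

With these illustrative numbers the default sample of 500,000 reads corresponds to roughly 0.05x, which lies within the recommended 0.01-0.50x range.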
Perform cluster merging. Families of repetitive elements are frequently split into multiple clusters rather than being represented as a single one. Split clusters are merged based on the presence of broken paired-end reads. This may improve detection of satellites with longer monomers.
Use custom repeat database. This option allows users to compare the identified repeats against their own databases by sequence similarity. The repeat class should be encoded in the FASTA headers of the database entries so that similarity hits can be parsed correctly.
A list of clusters identified as putative satellite repeats, with their genomic abundance and various cluster characteristics. The lengths and consensus sequences of the reconstructed monomers are also provided, accompanied by detailed output from the k-mer-based reconstruction, including sequences and sequence logos of alternative monomer variants.
The output includes an HTML summary with a table listing all analyzed clusters. More detailed information about the clusters is provided in additional files and directories. All results are also provided as a downloadable zip archive. Because read clustering produces thousands of clusters, the search for satellite repeats is limited to the subset of the largest ones, which correspond to the most abundant genomic repeats. The pipeline is set to analyze all clusters representing at least 0.01% of the input reads. Besides the satellite repeats, three other groups of clusters are reported in the output: (1) LTR-retrotransposons, (2) 45S and 5S rDNA, and (3) all remaining clusters passing the size threshold. Because categories 1 and 2 contain sequences with circular graphs, their consensus is calculated in the same way as for the satellite repeats. Additionally, a log file reporting the progress of the computations is provided.
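The 0.01% size threshold translates into a minimum read count per cluster that depends on the sample size. The sketch below shows this arithmetic. It is an illustration only: the function names and cluster sizes are hypothetical, and how the pipeline itself rounds at the boundary is an assumption here.

```python
import math
from fractions import Fraction


def min_cluster_size(n_reads, threshold_percent="0.01"):
    """Smallest cluster size (in reads) that reaches the percentage threshold.

    Fractions avoid floating-point rounding; rounding up is an assumption.
    """
    return math.ceil(n_reads * Fraction(threshold_percent) / 100)


def clusters_to_analyze(cluster_sizes, n_reads):
    """Return names of clusters meeting the threshold, largest first."""
    cutoff = min_cluster_size(n_reads)
    kept = [name for name, size in cluster_sizes.items() if size >= cutoff]
    return sorted(kept, key=lambda name: cluster_sizes[name], reverse=True)


# Hypothetical cluster sizes for a sample of 500,000 reads.
sizes = {"CL1": 12000, "CL2": 480, "CL3": 49}
print(min_cluster_size(500_000))            # prints 50
print(clusters_to_analyze(sizes, 500_000))  # prints ['CL1', 'CL2']
```

For the default sample of 500,000 reads, a cluster therefore needs about 50 reads to be analyzed.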