Preprocessing of fastq paired-reads (version 1.0.0)
See below for how to set the quality cut-off correctly
Percentage of bases in a sequence that must have quality equal to or higher than the cut-off value
Maximum number of Ns in sequence

What it does

This tool is designed for memory-efficient preprocessing of two fastq files. The output of this tool can be used as input for RepeatExplorer clustering. Input files can be gzip-compressed (.gz extension). Reads are filtered based on quality, the presence of N bases, and the presence of adapters. The two input fastq files are processed in parallel, and only complete pairs are kept. Because the input files are processed in chunks, both files must contain complete pairs with the reads in the same order. All reads that pass the quality filter are written to the output files. If sampling is specified, only a sample of the sequences is returned. Cutadapt is run with these options:

--anywhere='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
--anywhere='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
--anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
--anywhere='ATCTCGTATGCCGTCTTCTGCTTG'
--anywhere='CAAGCAGAAGACGGCATACGAGAT'
--anywhere='GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC'
--error-rate=0.05
--times=1 --overlap=15 --discard
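
The option list above can be assembled programmatically. The following is a minimal sketch, not the tool's actual implementation: the function name cutadapt_args and the file names are hypothetical, and the --output option is an assumption added here so that the command is complete. The adapter sequences and the remaining options are copied from the list above.

```python
import shlex

# Adapter sequences copied verbatim from the option list above.
ADAPTERS = [
    "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
    "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "ATCTCGTATGCCGTCTTCTGCTTG",
    "CAAGCAGAAGACGGCATACGAGAT",
    "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC",
]


def cutadapt_args(input_path, output_path):
    """Build a cutadapt argument list (hypothetical helper, not part of the tool)."""
    args = ["cutadapt"]
    # One --anywhere option per adapter: the adapter may occur at either end.
    args += [f"--anywhere={seq}" for seq in ADAPTERS]
    # --discard drops every read in which an adapter was found.
    args += ["--error-rate=0.05", "--times=1", "--overlap=15", "--discard"]
    # Assumption: output goes to a file given with --output.
    args += ["--output", output_path, input_path]
    return args


# Hypothetical file names, used only to show the resulting command line.
command = cutadapt_args("reads_1.fastq", "reads_1.clean.fastq")
print(shlex.join(command))
```

Because --discard removes whole reads rather than trimming them, a read can lose its mate at this step, which is why the pairs are checked again after Cutadapt filtering.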

Order of fastq file processing

  1. Trimming (optional)
  2. Filter by quality
  3. Discard single reads, keep complete pairs
  4. Cutadapt filtering
  5. Discard single reads, keep complete pairs
  6. Sampling (optional)
  7. Interlacing two fasta files
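
The pair handling in the steps above can be illustrated with a small sketch. It is a simplification under stated assumptions: reads are held in memory as (name, sequence, quality) tuples, both inputs are in the same order, and a single hypothetical predicate `keep` stands in for the quality and adapter filters. The names keep_complete_pairs and interlace_as_fasta are illustrative and do not come from the tool.

```python
from typing import Callable, Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def keep_complete_pairs(
    reads1: Iterable[Read],
    reads2: Iterable[Read],
    keep: Callable[[Read], bool],
) -> Iterator[Tuple[Read, Read]]:
    """Yield a pair only when both mates pass the filter.

    Relies on both inputs listing the mates in the same order.
    """
    for r1, r2 in zip(reads1, reads2):
        if keep(r1) and keep(r2):
            yield r1, r2


def interlace_as_fasta(pairs: Iterable[Tuple[Read, Read]]) -> Iterator[str]:
    """Emit FASTA records with the two mates of each pair adjacent."""
    for r1, r2 in pairs:
        for name, seq, _qual in (r1, r2):
            yield f">{name}\n{seq}"


# Toy data: the second pair is dropped because one mate contains N bases.
reads1 = [("a/1", "ACGT", "IIII"), ("b/1", "ANNN", "IIII")]
reads2 = [("a/2", "TTTT", "IIII"), ("b/2", "CCCC", "IIII")]
no_n = lambda read: "N" not in read[1]

for record in interlace_as_fasta(keep_complete_pairs(reads1, reads2, no_n)):
    print(record)
```

Walking both files in lockstep is what makes chunked processing possible: no read names need to be indexed, but a missing or reordered read in either file would silently mispair everything after it.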

Setting the quality cut-off

To set the quality cut-off correctly, you need to know how quality is encoded in your fastq file. The default filtering cut-off, which is suitable for Sanger and Illumina 1.8 encoding, is shown below:

 Default filtering cut-off
           |
           |
           V
 SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
 ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
 ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
 .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
 LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
 |                         |    |        |                              |                     |
33                        59   64       73                            104                   126
 0........................26...31.......40
                          -5....0........9.............................40
                                0........9.............................40
                                   3.....9.............................40
 0.2......................26...31........41

S - Sanger        Phred+33,  raw reads typically (0, 40)
X - Solexa        Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
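
To make the encodings concrete, the sketch below decodes a quality string with a given offset and applies the three filter settings described at the top of this page (cut-off, percentage of bases, maximum number of Ns). It is an illustration, not the tool's code: the function and parameter names are hypothetical, and guess_offset is only a rough heuristic based on the chart above.

```python
def phred_scores(quality, offset):
    """Convert a fastq quality string to integer scores (offset 33 or 64)."""
    return [ord(ch) - offset for ch in quality]


def passes_filter(seq, quality, cutoff, percent, max_n, offset=33):
    """Return True if enough bases reach the cut-off and Ns are within the limit."""
    scores = phred_scores(quality, offset)
    if not scores:
        return False  # an empty read cannot pass
    good = sum(score >= cutoff for score in scores)
    enough_quality = 100.0 * good / len(scores) >= percent
    return enough_quality and seq.upper().count("N") <= max_n


def guess_offset(quality_strings):
    """Heuristic guess of the offset from the lowest character observed.

    Characters below ';' (ASCII 59) do not occur in the +64 encodings shown
    in the chart, so their presence indicates +33. The opposite case is
    ambiguous: high-quality +33 reads may also lack low characters, so
    returning 64 is only an assumption.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


# '+' is ASCII 43, which is quality 10 with offset 33.
print(phred_scores("!+5I", 33))
print(passes_filter("ACGT", "II!!", cutoff=10, percent=95, max_n=0))
```

Using the wrong offset shifts every score by 31, so a cut-off that is reasonable for Phred+33 data would pass almost every base of Phred+64 data. Checking the encoding first avoids a filter that silently does nothing.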