Preprocessing of fastq reads (version 1.0.0)
See below for how to set the quality cut-off correctly
Percentage of bases in a sequence that must have quality equal to or higher than the cut-off value
Maximum number of Ns in sequence

What it does

This tool preprocesses fastq files. Input files can be gzip-compressed (.gz extension). Reads are filtered based on quality, the presence of N bases, and adapters. All reads that pass the quality filter will be written to the output files. If sampling is specified, only a sample of the sequences will be returned.
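
The quality and N filters described above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: the function name `passes_filter` and all default values (cut-off 10, 95 percent, zero Ns, offset 33) are assumptions chosen for the example.

```python
def passes_filter(seq, qual, cutoff=10, min_percent=95.0, max_n=0, offset=33):
    """Return True if a read passes the N filter and the quality filter.

    All defaults are illustrative assumptions, not the tool's settings.
    """
    # Reject reads that contain more N bases than allowed.
    if seq.upper().count("N") > max_n:
        return False
    # An empty quality string cannot satisfy the percentage criterion.
    if not qual:
        return False
    # Count bases whose decoded quality reaches the cut-off.
    good = sum(1 for ch in qual if ord(ch) - offset >= cutoff)
    return 100.0 * good / len(qual) >= min_percent
```

For example, `passes_filter("ACGT", "II!!", min_percent=50)` accepts the read because two of its four bases reach the cut-off, while the same read is rejected at the assumed default of 95 percent.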

Cutadapt is run with these options:

--anywhere='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
--anywhere='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
--anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
--anywhere='ATCTCGTATGCCGTCTTCTGCTTG'
--anywhere='CAAGCAGAAGACGGCATACGAGAT'
--anywhere='GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC'
--error-rate=0.05
--times=1 --overlap=15 --discard
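
With these options, an adapter may match at either end of a read (--anywhere), up to 5% mismatches are tolerated, a match must overlap the read by at least 15 bases, and any read in which an adapter is found is discarded rather than trimmed. As a rough sketch of how such a command line could be assembled, the Python snippet below builds the argument list without running it. The file names and the helper `build_cutadapt_command` are hypothetical, and the way the tool itself invokes Cutadapt is not shown in this help text.

```python
import shlex

# Adapter sequences copied from the option list above.
ADAPTERS = [
    "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
    "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "ATCTCGTATGCCGTCTTCTGCTTG",
    "CAAGCAGAAGACGGCATACGAGAT",
    "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC",
]


def build_cutadapt_command(input_path, output_path):
    """Return the Cutadapt argument list (hypothetical helper, paths assumed)."""
    cmd = ["cutadapt"]
    cmd += [f"--anywhere={adapter}" for adapter in ADAPTERS]
    cmd += ["--error-rate=0.05", "--times=1", "--overlap=15", "--discard"]
    cmd += ["-o", output_path, input_path]
    return cmd


# Print a shell-quoted version of the command for inspection.
print(shlex.join(build_cutadapt_command("reads.fastq", "filtered.fastq")))
```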

Order of fastq file processing

  1. Trimming (optional)
  2. Filter by quality
  3. Cutadapt filtering
  4. Sampling (optional)
  5. Interlacing two fasta files
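
The final interlacing step can be illustrated with a short Python sketch that alternates records from two paired inputs. The function name `interlace` and the choice to raise an error when the two inputs differ in length are assumptions made for this example; the help text does not say how the tool handles unequal inputs.

```python
from itertools import zip_longest


def interlace(left_records, right_records):
    """Yield records alternately from two paired inputs: left, right, left, ..."""
    missing = object()  # sentinel marking an exhausted input
    for left, right in zip_longest(left_records, right_records, fillvalue=missing):
        if left is missing or right is missing:
            # Assumed behaviour: unequal inputs are treated as an error.
            raise ValueError("paired inputs contain different numbers of reads")
        yield left
        yield right
```

Here a record can be any object, for instance a (name, sequence) tuple, so the same function works regardless of how the files are parsed.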

Setting the quality cut-off

To set the quality cut-off correctly, you need to know how quality is encoded in your fastq file. The default filtering cut-off, which is suitable for Sanger and Illumina 1.8+ encoding, is shown below:

 Default filtering cut-off
           |
           |
           V
 SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
 ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
 ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
 .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
 LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
 |                         |    |        |                              |                     |
33                        59   64       73                            104                   126
 0........................26...31.......40
                          -5....0........9.............................40
                                0........9.............................40
                                   3.....9.............................40
 0.2......................26...31........41

S - Sanger        Phred+33,  raw reads typically (0, 40)
X - Solexa        Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
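
To make the encodings above concrete, the following Python sketch decodes a quality string and makes a rough guess at the offset from the range of characters observed. The function names are hypothetical, and the thresholds are read off the diagram: characters below ASCII 59 occur only in Phred+33 data, and characters above ASCII 74 occur only in the +64 encodings. The guess is a heuristic and can be inconclusive for small samples.

```python
def phred_scores(qual, offset=33):
    """Decode a quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in qual]


def guess_offset(qual_strings):
    """Guess the quality offset (33 or 64); return None if ambiguous."""
    codes = [ord(ch) for qual in qual_strings for ch in qual]
    if not codes:
        return None
    if min(codes) < 59:
        # Characters this low do not occur in the +64 encodings.
        return 33
    if max(codes) > 74:
        # Characters this high exceed the Phred+33 range shown above.
        return 64
    # All characters fall in the overlap of both ranges.
    return None
```

For instance, `phred_scores("!+I")` gives `[0, 10, 40]` with the default offset of 33, and `guess_offset(["hhhh"])` returns 64 because `h` (ASCII 104) lies outside the Phred+33 range.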