Published Histories | jirka | Example history #3

Galaxy History 'Example history #3'

Annotation: Repeat analysis in pea (Pisum sativum) WGS data. This example shows how to process and use paired-end Illumina reads.

Dataset | Annotation
1: pea paired forward REDUCED DATASET from ERR063464_1
239.4 MB
format: fastq, database: ?
@ERR063464.1 E201_0095:6:1:2672:996#0/1
NCACCAAGCACGACTTTAATTACCATGCCTAAAAACAACTAGACAAAATTTGGAGATTATCAAAAAAAGTCCCATTCAATTTGGATTAGGGATGATTAAA
+
####################################################################################################
@ERR063464.2 E201_0095:6:1:2814:995#0/1
NTTTTCTTAAATCAAACTTGTAAACAAACTTAACTATACTTGACTTAAACTTTCAAAAAGACAAAAAGAACTAACTCATTCAGACCATTTTAGGCCTTTG
Pisum sativum, WGS, Illumina paired-end, reduced dataset (1 million reads from ERR063464_1). FORWARD reads, their names end with ".../1".
2: pea paired reverse REDUCED DATASET from ERR063464_2
239.4 MB
format: fastq, database: ?
@ERR063464.1 E201_0095:6:1:2672:996#0/2
TNCNGGACAATTTCGGGGCCATATTTGTGATCTACATTGAATGCCGGTTACAACGATATATGATTTGTTCGATTTGGTTGAAAATTGTTCATATCGAATT
+
7#7#8=85>=@B?BE-@DDEF@E<8CD<CC8GGDGG>GEIIIIDIGDA,B2::+4:<=7?BB3--:<45;B92BB=<=3@####################
@ERR063464.2 E201_0095:6:1:2814:995#0/2
CNANACCCAAATATTAAGAAGTTTTCAAATAAAAACTCATAAAAGTCAGAGATCACAGGTAAGGGGGTTGGTTACATAGAGGGACGGGGTCAGCACCCAC
Pisum sativum, WGS, Illumina paired-end, reduced dataset (1 million reads from ERR063464_2). REVERSE reads, their names end with ".../2".
3: FASTQ Groomer on data 1
239.4 MB
format: fastqsanger, database: ?
Info: Groomed 1000000 sanger reads into sanger reads.
Based upon quality and sequence, the input data is valid for: sanger
Input ASCII range: '#'(35) - 'I'(73)
Input decimal range: 2 - 40
@ERR063464.1 E201_0095:6:1:2672:996#0/1
NCACCAAGCACGACTTTAATTACCATGCCTAAAAACAACTAGACAAAATTTGGAGATTATCAAAAAAAGTCCCATTCAATTTGGATTAGGGATGATTAAA
+
####################################################################################################
@ERR063464.2 E201_0095:6:1:2814:995#0/1
NTTTTCTTAAATCAAACTTGTAAACAAACTTAACTATACTTGACTTAAACTTTCAAAAAGACAAAAAGAACTAACTCATTCAGACCATTTTAGGCCTTTG
Grooming FORWARD reads with the quality score type set to "Sanger" (set this option carefully, according to the quality encoding of your input data). Any fastq data must be groomed before it can be used with other tools.
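The "Input ASCII range" and "Input decimal range" lines in the Groomer report are linked by the Sanger (Phred+33) offset. The Python sketch below illustrates that relationship on a fragment of a quality line from this history. It is not the FASTQ Groomer's own code, and the function names are hypothetical.

```python
SANGER_OFFSET = 33  # Sanger / Phred+33 encoding: score = ASCII code - 33


def phred33_scores(quality_line):
    """Decode one Sanger-encoded quality string into integer scores."""
    return [ord(ch) - SANGER_OFFSET for ch in quality_line]


def ascii_range(quality_lines):
    """Return the lowest and highest ASCII codes seen in the quality lines."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    return min(codes), max(codes)


# Start of the quality line of the first REVERSE read shown in this history.
example = ["7#7#8=85>=@B?BE-@DDEF@E<8CD<CC8GGDGG>GEIIIIDIGDA"]
low, high = ascii_range(example)
print(f"ASCII range: {chr(low)!r}({low}) - {chr(high)!r}({high})")
print(f"Decimal range: {low - SANGER_OFFSET} - {high - SANGER_OFFSET}")
```

For this fragment the lowest character is '#' and the highest is 'I', which correspond to the scores 2 and 40 listed in the report.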
4: FASTQ Groomer on data 2
239.4 MB
format: fastqsanger, database: ?
Info: Groomed 1000000 sanger reads into sanger reads.
Based upon quality and sequence, the input data is valid for: sanger
Input ASCII range: '#'(35) - 'I'(73)
Input decimal range: 2 - 40
@ERR063464.1 E201_0095:6:1:2672:996#0/2
TNCNGGACAATTTCGGGGCCATATTTGTGATCTACATTGAATGCCGGTTACAACGATATATGATTTGTTCGATTTGGTTGAAAATTGTTCATATCGAATT
+
7#7#8=85>=@B?BE-@DDEF@E<8CD<CC8GGDGG>GEIIIIDIGDA,B2::+4:<=7?BB3--:<45;B92BB=<=3@####################
@ERR063464.2 E201_0095:6:1:2814:995#0/2
CNANACCCAAATATTAAGAAGTTTTCAAATAAAAACTCATAAAAGTCAGAGATCACAGGTAAGGGGGTTGGTTACATAGAGGGACGGGGTCAGCACCCAC
Grooming REVERSE reads.
5: Filter by quality on data 3
174.3 MB
format: fastqsanger, database: ?
Info: Quality cut-off: 20
Minimum percentage: 90
Input: 1000000 reads.
Output: 728326 reads.
discarded 271674 (27%) low-quality reads.
@ERR063464.8 E201_0095:6:1:3696:999#0/1
NAATTCAAACCTTTCGATTCTTGAAATTGACATGCGGTTTAGGTAGATCTTGCAACATAGAACACACTGCAACTTGAATCATGCGTTTTCGTTAAGAATT
+
#----77777CCC@@C@@@C@@@@@@C@@C@@@@@58888<<<<<@@@@@@C@@@@C@CC@@@@@@@@@@@C@@@C@@@@@@@@@<::<<::8:::<:<7
@ERR063464.50 E201_0095:6:1:10920:992#0/1
NGTGAATTCCTCATTTGTACCTGAACATTCATCTCTTCGAAAGTCCCAATGTAAAAATCCAAACATGATCATGTGATTAATGGTTGGAAAGGTCTTGATG
[FORWARD] Keeping only reads with a quality score of at least 20 over at least 90% of their length.
6: Filter by quality on data 4
142.2 MB
format: fastqsanger, database: ?
Info: Quality cut-off: 20
Minimum percentage: 90
Input: 1000000 reads.
Output: 594021 reads.
discarded 405979 (40%) low-quality reads.
@ERR063464.7 E201_0095:6:1:3604:993#0/2
ANCNTTCTTTCCCCTGCCTGAGTCTATCCTTGATATGTTCATCTTAATCGATGACGGATATTCCCATCCTTGGTATTCTACCTAGTAAAAAGGTAGTTGT
+
:#:#<BBBB?IIIIB-GGGGGGDG>GGGGGIIIIIIIIHIIIIIHIIHIIGIBIIIFHFFBGEGEIBIEDCEE>EECEEEDBBFEBD8CE2A?3??9=?9
@ERR063464.9 E201_0095:6:1:3964:1000#0/2
GNANAAAGATCCTTAATGAGAAGAAGGCAATCAACATACTCCAAAGCGCTCTTAGTATGGACGAGTTCTTCCGTATATCTCAATGTAAATCGGCACAGGA
[REVERSE] Keeping only reads with a quality score of at least 20 over at least 90% of their length.
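The filtering rule applied in items 5 and 6 (quality cut-off 20, minimum percentage 90) can be sketched in Python as follows. This is an illustration only, not the tool's implementation: it assumes Sanger (Phred+33) encoding and treats both thresholds as inclusive, which the report does not state explicitly. The function name is hypothetical.

```python
def passes_quality_filter(quality_line, cutoff=20, min_percent=90, offset=33):
    """Return True if enough bases in the read reach the quality cut-off.

    Assumption: a base counts as good when its score is >= cutoff, and the
    read is kept when the share of good bases is >= min_percent.
    """
    if not quality_line:
        return False  # an empty read cannot satisfy the filter
    scores = [ord(ch) - offset for ch in quality_line]
    good = sum(1 for score in scores if score >= cutoff)
    # Integer comparison avoids floating-point rounding at the boundary.
    return good * 100 >= min_percent * len(scores)


# A read whose quality line is all '#' (score 2) fails the filter.
print(passes_quality_filter("#" * 100))
# A read with 90 high-quality bases out of 100 is kept.
print(passes_quality_filter("I" * 90 + "#" * 10))
```

The first example mirrors read ERR063464.1 in the FORWARD file, whose quality line consists entirely of '#' and which is absent from the filtered output.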
7: FASTQ interlacer pairs from data 5 and data 6
242.6 MB
format: fastqsanger, database: ?
Info: There were 309121 single reads.
Interlaced 506613 pairs of sequences.
@ERR063464.50 E201_0095:6:1:10920:992#0/1
NGTGAATTCCTCATTTGTACCTGAACATTCATCTCTTCGAAAGTCCCAATGTAAAAATCCAAACATGATCATGTGATTAATGGTTGGAAAGGTCTTGATG
+
#,0-)45545@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@:::::@@@@@<<5<:5757788997
@ERR063464.50 E201_0095:6:1:10920:992#0/2
TNTNAAAGTAAATTCATGTGATAATGCAAGACATTATAGGTCCTTTTTGACCAAAGGCATTAAAAGGCAAAAAAGTCCAACTTCAAGTGCCCATAACTCT
Using quality-filtered FORWARD and REVERSE reads as input, FASTQ interlacer finds complete pairs of paired-end reads and produces this file.
8: FASTQ interlacer singles from data 5 and data 6
74.0 MB
format: fastqsanger, database: ?
Info: There were 309121 single reads.
Interlaced 506613 pairs of sequences.
@ERR063464.8 E201_0095:6:1:3696:999#0/1
NAATTCAAACCTTTCGATTCTTGAAATTGACATGCGGTTTAGGTAGATCTTGCAACATAGAACACACTGCAACTTGAATCATGCGTTTTCGTTAAGAATT
+
#----77777CCC@@C@@@C@@@@@@C@@C@@@@@58888<<<<<@@@@@@C@@@@C@CC@@@@@@@@@@@C@@@C@@@@@@@@@<::<<::8:::<:<7
@ERR063464.134 E201_0095:6:1:8034:1005#0/1
NCATAATCATCATATACATATACATTCATATAATTTCATATTTGCATATTCGTATATCTGATTAAATTGTTGGTTTGAGATTCTCATTCCCTGTATGGTG
Singles are produced when only one read of a pair passed quality filtering. This file is not used in the following steps and can be deleted.
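The separation into pairs and singles can be sketched in Python as below. This is a simplified illustration, not the FASTQ interlacer's code: it assumes that mates share a read name differing only in the trailing "/1" or "/2", and it represents each read as a hypothetical (name, sequence, quality) tuple. The final lines check that the read counts reported in items 5 to 8 are consistent with each other.

```python
def pair_key(name):
    """Strip the trailing /1 or /2 so both mates map to the same key."""
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2]
    return name


def interlace(forward, reverse):
    """Split reads into interlaced complete pairs and leftover singles."""
    reverse_by_key = {pair_key(rec[0]): rec for rec in reverse}
    pairs, singles, paired_keys = [], [], set()
    for rec in forward:
        key = pair_key(rec[0])
        mate = reverse_by_key.get(key)
        if mate is None:
            singles.append(rec)        # forward read whose mate was filtered out
        else:
            pairs.extend([rec, mate])  # forward read followed by its mate
            paired_keys.add(key)
    # Reverse reads whose forward mate was filtered out are singles too.
    singles.extend(rec for rec in reverse if pair_key(rec[0]) not in paired_keys)
    return pairs, singles


# Consistency check using the counts reported in this history.
forward_kept, reverse_kept, n_pairs = 728326, 594021, 506613
n_singles = forward_kept + reverse_kept - 2 * n_pairs
print(n_singles)  # 309121, the number of single reads reported
```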
9: FASTQ to FASTA on data 7
1,013,226 sequences
format: fasta, database: ?
Info: 1013226 FASTQ reads were converted to FASTA.
>ERR063464.50 E201_0095:6:1:10920:992#0/1
NGTGAATTCCTCATTTGTACCTGAACATTCATCTCTTCGAAAGTCCCAATGTAAAAATCCAAACATGATCATGTGATTAATGGTTGGAAAGGTCTTGATG
>ERR063464.50 E201_0095:6:1:10920:992#0/2
TNTNAAAGTAAATTCATGTGATAATGCAAGACATTATAGGTCCTTTTTGACCAAAGGCATTAAAAGGCAAAAAAGTCCAACTTCAAGTGCCCATAACTCT
>ERR063464.71 E201_0095:6:1:13978:997#0/1
NTAACTCTAAAACTTGCTCTCGCCCTGATCTAGAATTAATGCCTAATTTACACTGTCCAGTTAAAACCTCAAACTCTCGCTCTATTGATTTTAACTTCTT
Sequences of interlaced pairs are converted to FASTA format.
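As a rough illustration of what this conversion does, the Python sketch below turns four-line FASTQ records into two-line FASTA records by keeping the read name and sequence and dropping the quality information. It assumes that every record occupies exactly four lines (no wrapped sequences); the function name is hypothetical and this is not the Galaxy tool's code.

```python
def fastq_to_fasta(lines):
    """Convert four-line FASTQ records to FASTA header and sequence lines."""
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    fasta = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, _quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        fasta.append(">" + header[1:])  # '@name' becomes '>name'
        fasta.append(sequence)          # the quality line is discarded
    return fasta


record = [
    "@ERR063464.50 E201_0095:6:1:10920:992#0/1",
    "NGTGAATTCC",  # sequence shortened for the example
    "+",
    "#,0-)45545",
]
print("\n".join(fastq_to_fasta(record)))
```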
10: Random selection output (from-FASTQ to FASTA on data 7, with number of sequences 300000)
300,000 sequences
format: fasta, database: ?
>ERR063464.71 E201_0095:6:1:13978:997#0/1
NTAACTCTAAAACTTGCTCTCGCCCTGATCTAGAATTAATGCCTAATTTACACTGTCCAGTTAAAACCTCAAACTCTCGCTCTATTGATTTTAACTTCTT
>ERR063464.71 E201_0095:6:1:13978:997#0/2
ANTNATTTCTTTTGCAAGTCCTACATTCTTTTTCCCTTGCAATTTACCTTTCCCTTTTTAGCATTTAGATATTTTTCGCATAATAGTTTCTACACCGGAA
>ERR063464.106 E201_0095:6:1:4449:1002#0/1
NTATTACCATCATCATGGTTATGTCATTCTCAAGGGGTTCATTGTTCATGATCAGTTTCCTTTGGATTAGGGTTTTGACCTCTGGTCAACCCTAATCAAT
Selecting 300,000 reads for clustering analysis. The option "All sequence reads are paired" must be checked in the input form so that complete pairs of reads are selected (here, 150,000 pairs).
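Selecting complete pairs rather than individual reads can be sketched in Python as follows. The sketch assumes that the input is interlaced, so that records 2k and 2k+1 are the two mates of pair k, and it samples pair indices instead of read indices. The function name and the `seed` parameter are hypothetical, and the actual selection method of the Galaxy tool may differ.

```python
import random


def sample_pairs(records, n_reads, seed=None):
    """Randomly select n_reads records from interlaced input, keeping mates together."""
    if n_reads % 2 != 0 or len(records) % 2 != 0:
        raise ValueError("paired input and an even number of reads are required")
    total_pairs = len(records) // 2
    rng = random.Random(seed)
    # Sample pair indices without replacement; sorting preserves input order.
    chosen = sorted(rng.sample(range(total_pairs), n_reads // 2))
    selected = []
    for pair_index in chosen:
        selected.extend(records[2 * pair_index: 2 * pair_index + 2])
    return selected


# Toy input: 10 pairs named r0/1, r0/2, r1/1, r1/2, ...
records = [f"r{i}/{mate}" for i in range(10) for mate in (1, 2)]
picked = sample_pairs(records, 6, seed=1)
print(len(picked))  # 6 reads, i.e. 3 complete pairs
```

With 300,000 requested reads, the same logic would draw 150,000 pair indices from the 506,613 interlaced pairs.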
11: Archive with clustering results from Random selection output (from-FASTQ to FASTA on data 7, with number of sequences 300000)
1,469,219 lines
format: zip, database: ?
binary/unknown file
Items 11-14 are the output of the clustering analysis (see the Manual for further explanation). The clustering analysis was run with the option "All sequence reads are paired".
12: Contigs from Random selection output (from-FASTQ to FASTA on data 7, with number of sequences 300000) based on clustering
21,335 sequences
format: fasta, database: ?
>CL1Contig1 (157-1.8-276)
CTCACCACATCAATCAAAACCCTACAAAACAAACATAGGTCAGACCTAAATGCATCTCAC
AAGGTGAGAACAAACCCAATCATCAATGATGAGTATCCACCTGAAACAACAAAACAAAGT
TAGTTTATGTACAAGACTCAAAACCTAACTAAATGAA
>CL1Contig2 (186-3.1-581)
ATTGGAGTGTTTGAATGTGTGATGTGCAAAAGATTGGGTTCTATTTTGGTCATTTGATTA
13: Log file from (from Random selection output (from-FASTQ to FASTA on data 7, with number of sequences 300000)
7,053 lines
format: txt, database: ?
True
This is clustering pipeline
GRAPH BASED CLUSTERING
**********************************************************************
Data preparation started:
14: HTML summary of graph based clustering of Random selection output (from-FASTQ to FASTA on data 7, with number of sequences 300000)
115.6 KB
format: html, database: ?
HTML file