LoRDEC is a program that corrects sequencing errors in long reads produced by third-generation sequencing technologies with a high error rate; it is intended especially for PacBio reads. It uses a hybrid strategy, meaning that it works with two sets of reads: a reference read set, whose error rate is assumed to be low, and the PacBio read set, which is corrected using the reference set. Typically, the reference set contains Illumina reads.
Usually, errors in PacBio reads include many insertions and deletions, and comparatively fewer substitutions. LoRDEC can correct errors of all these types.
After correction, a larger portion of each PacBio read is usable for detecting regions of similarity with other sequences, for aligning the reads to the contigs of an assembly, etc.
The input read sets are in FASTA or FASTQ format. The reference read set can be compressed (more precisely, gzipped).
The output is the set of corrected reads, also in FASTA format. In these corrected sequences, uppercase symbols denote correct nucleotides, while lowercase symbols denote nucleotides left uncorrected.
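The uppercase/lowercase convention in the output can be used to measure how much of a read is marked as correct. The following is a minimal Python sketch, not part of LoRDEC; the function names `uppercase_runs` and `uppercase_fraction` and the example read are hypothetical and chosen for illustration only.

```python
import re


def uppercase_runs(read):
    """Return (start, end, sequence) for each maximal run of uppercase bases.

    Positions are 0-based and the end position is exclusive.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"[A-Z]+", read)]


def uppercase_fraction(read):
    """Return the fraction of bases written in uppercase (0.0 for an empty read)."""
    if not read:
        return 0.0
    return sum(1 for base in read if base.isupper()) / len(read)


# Hypothetical read: lowercase stretches were left uncorrected.
example = "acgTTGACCAgtACGTa"
for start, end, seq in uppercase_runs(example):
    print(start, end, seq)
print(round(uppercase_fraction(example), 3))
```

Each reported run is a contiguous stretch that could be kept for downstream alignment, while the lowercase stretches between runs are the parts left uncorrected.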
The correction program also needs two parameters when it is called, the k-mer size (-k) and the abundance threshold (-s), which makes five pieces of information altogether (see its usage below).
LoRDEC contains several programs:
lordec-correct [--trials <number of target k-mers>] [--branch <maximum number of branches to explore>] [--errorrate <maximum error rate>] [--threads <number of threads>] -2 <FASTA/Q files> -k <k-mer size> -s <abundance threshold> -i <PacBio FASTA file> -o <output file corrected reads>
Typical command: lordec-correct -2 illumina.fasta -k 19 -s 3 -i pacbio.fasta -o pacbio-corrected.fasta
lordec-stats -2 <Short read FASTA/Q file> -k <k-mer size> -s <solid k-mer threshold> -i <PacBio FASTA/Q file> -S <output stat file> [-T <number of threads>]
lordec-trim -i <corrected reads file> -o <trimmed reads file>
lordec-trim-split -i <corrected reads file> -o <trimmed reads file>
lordec-build-SR-graph [-T <number of threads>] -2 <FASTA file> -k <k-mer size> -s <solid k-mer threshold> -g <out graph file>
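To give an intuition for the -k and -s options in the usage lines above, the sketch below counts k-mers in a small set of short reads and keeps those that reach a threshold. It assumes that a "solid" k-mer is one occurring at least s times in the short reads, which is the usual meaning of the term but is not spelled out here. It is a toy illustration in Python, not LoRDEC's implementation: the function names and reads are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) over all reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # A read shorter than k yields an empty range and contributes nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, s):
    """Return the set of k-mers seen at least s times (assumed meaning of 'solid')."""
    return {kmer for kmer, n in count_kmers(reads, k).items() if n >= s}


# Hypothetical short reads, far smaller than a real Illumina data set.
short_reads = ["ACGTAC", "ACGTAG", "CGTACC"]
print(sorted(solid_kmers(short_reads, k=4, s=2)))
```

Raising s discards more low-abundance k-mers, which are more likely to contain sequencing errors, at the cost of also discarding genuine k-mers from poorly covered regions.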