pandaseq − PAired-eND Assembler for DNA sequences


pandaseq −f forward.fastq −r reverse.fastq [ −6 ] [ −A algorithm ] [ −a ] [ −B ] [ −C module ] [ −d flags ] [ −D penalty ] [ −F ] [ −g log.txt ] [ −G log.txt.bz2 ] [ −i index.fastq ] [ −k kmers ] [ −l minlen ] [ −L maxlen ] [ −N ] [ −o minoverlap ] [ −O maxoverlap ] [ −p forwardprimer ] [ −q reverseprimer ] [ −t threshold ] [ −T threads ] [ { −u | −U } unpaired.txt ] [ −w output.fasta ] [ −W output.fasta.bz2 ]


PANDASEQ assembles paired-end Illumina reads into sequences, trying to correct for errors and uncalled bases. The assembler reads two files in FASTQ format with quality information. If amplification primers were used (e.g., to isolate a variable region of the 16S gene, or the constant regions around zinc finger binding residues), they can be removed from the sequence during assembly. The final sequence will correct any uncalled bases in the overlapping region using the complementary strand. When mismatches occur in the overlapping region, the base with the better quality score is chosen.

The algorithm is as follows:


Find the positions where the forward and reverse primers match best above the threshold and discard the ends of the sequence, including the primer.


Pick and overlap to maximise the probability of the forward and reverse reads having come from a single piece of DNA.


Identify the masking of the end of the read with the quality score B or # as done by CASAVA and adjust the probabilities in this region.


Construct an assembled sequence between the primers and calculate the quality.


Check for various constraints, including quality, length, uncalled bases, and user-supplied modules.



Input files will have their quality encoded as PHRED + 64 instead of the PHRED + 33. PHRED + 64 was used originally in the CASAVA pipeline from version 1.3 through 1.7. In CASAVA 1.8, the score is encoded as PHRED + 33, the default.

−A algorithm

Set the algorithm used for assembly. Currently there are:

Uses the formula described in the original paper (Masella 2012), optionally with an error estimation (ε) provided.


Uses the formula by FastqJoin in the ea-utils package (Aronesty 2013). No parameters are tunable.


Uses the formula by FLASH paper (Magoc 2011). No parameters are tunable.


Uses the formula described in the PEAR paper (Zhang 2013), optionally with the probability of a random base (q) provided.


Uses the formula by RDP paper (Cole 2013). No parameters are tunable.


Uses the formula in the Sitch software, developed by Austin Richardson. No parameters are tunable.


Uses the scoring algorithm from UPARSE/USEARCH (Edgar 2013), optionally with an error estimation (ε) provided.


Strip the primers after assembly, rather than before. Stripping the primers first saves time, but if the overlap region is very large compared to the read, the read may have sequence from the other primer (i.e., the forward read ends with reverse primer, and/or the reverse read ends with forward primer). If the primers are stripped first, the reads will fail to assemble. This option attempts assembly first, then tries to strip the primers, so the heavily overlapping case will assemble. You should only need this if the region of interest is smaller than the whole read. It is undesirable, unless necessary, as it slows assembly down.


Allow input sequences to lack a barcode/tag. Normally, Illumina sequences have barcodes attached to the sequence. This allows the barcode to be missing. The tool panda-checkid(1) will determine if an Illumina identifier is undestood by PANDAseq and if the tag is included.

−C module

Loads an optional validation module to verify sequences are valid before emitting them. See below for more information. You may repeat this option to use multiple validation modules.

−d flags

Set debugging/output flags to provide more details about what PANDAseq is doing. To enable a flag, capitalise it; to disable, include a uncapitalise it. Provide information about the building of a sequence. Show excruciating detail about reconstruction. Show some optional statistics. Show information about building the k-mer table. Provide error about the file parsing. Show every mismatch. The default is BFSrk.

−D penalty

Sometimes, with repetitive sequence, the primer aligns further down the sequence. To avoid this, a primer penalty can be applied. For each base further down the sequence, penalty is subtracted from the proability that the primer aligns to this location. By default, the value is zero, and if used, the value should be rather small; 0.01 seesm to be sufficient in most cases.

−f forward.fastq

The location of the forward reads in FASTQ format. The file may be plain FASTQ, or compressed with gzip(1) or bzip2(1). File compression is automatically detected.


Normally, output will be as a FASTA even though per-base quality information is available. To retain this quality information, this option will output the sequence and the quality information in FASTQ format with quality scores encoded as PHRED + 33 (even if the input scores are PHRED + 64). The meaning of the quality score is conceptually different from the input quality scores for the overlap region, but this may not matter depending on your downstream application. If you intend to use this information for further quality filtering, especially by a program expecting Illumina reads, you are not using this data correctly.

−g log.txt

Log all output to a plain text file, log.txt, instead of standard error.

−i index.fastq

If the index/barcode reads are in a separate FASTQ file, read them an apply them to the input reads.

−G log.txt.bz2

Log all output to a bzip2(1) compressed text file, log.txt.bz2, instead of standard error.


This option is ignored. It used to indicate that input files specified by -f and -r are compressed by bzip2(1). This is automatically detected now.

−k kmers

Sets the number of sequence locations for a particular k-mer. When attempting to align the sequences, the assembler will store the location of every k-mer in a table. If the same k-mer is present multiple times, only the first ones will be stored until the table is full; when this occurs, an FML error is emitted. If the sequences are highly repetitive, lost positions can prevent good alignments; this can be alleviated by increasing this amount. This should be small (no more than 10; the default is 2), or the k-mer table will be extremely large, using a large amount of RAM per thread. Try increasing the value until FML errors go away.

−l minlen

Sets the minimum length for a sequence, after primers are removed. By default, all sequences are kept. With this option, sequences shorter than desired can be discarded.

−L maxlen

Sets maximum length for a sequence, after primers are removed. By default, all sequences are kept. With this option, sequences longer than desired can be discarded.


Eliminate all sequences with uncalled nucleotides in the output. Otherwise, during assembly, uncalled bases (Ns) from unpaired regions may be emitted.

−o minoverlap

Sets the minimum overlap between forward and reverse reads. By default, this is at least one nucleotide of overlap. Raising this number does not generally increase the quality of the output as alignments with small overlaps tend to score poorly and are discarded anyway.

−O maxoverlap

Sets the maximum overlap between forward and reverse reads. By default, this is the read length. In highly overlapping sequences (i.e., those where the end of one read precede the start of the other), this parameter should be set to the sum of the input reads, or a value larger than that.

−p forwardprimer

Strip out primers from the start of the sequence. If the data contains a forward primer (e.g., a conserved region to amplify a 16S variable region), specifying it here will cause the primer to be located in the read and the primer, and any sequence before it, will be discarded. It is also possible to specify a number and the same number of leading bases will be stripped from the sequence. It may be useful to user a number if the sequence has many uncalled bases in the primer region, preventing a nucleotide primer from matching.

−q reverseprimer

Strip out primers from the end of the sequence. The primer is specified as it appears in the reverse read (i.e., it is a reverse complement of what it would be in the alignment).

−r reverse.fastq

FASTQ file containing the reverse reads. See -f for more information.

−t threshold

The score, between zero and one, that a sequence must meet to be kept in the output. Any alignments lower than this will be discarded as low quality. Increasing this number will not necessarily prevent uncalled bases (Ns) from appearing in the final sequence. It is also used as the threshold to match primers, if primers are supplied. The default value is 0.6.

−T threads

The number of threads to spawn. This will only be avilable if PANDAseq was compiled with pthreads(7). In most cases, PANDAseq is IO-bound, not CPU-bound; therefore, adding more CPU capacity would have no effect. Try monitoring a running copy of PANDAseq with top(1); watch the CPU% for the PANDAseq process and the overall system CPU waiting time (%wa in the banner at the top). If waiting time is low and CPU% is very high, then multi-threading may increase speed. If the CPU waiting time is high, threading will simply not help.

Note that using multiple threads prevents sequences from being output in the same order as the original file. If you a filtering reads downstream, consider using the filter validation module as matching them up may be difficult.

−[U|u] unpaired.txt

Write sequences for which the optimal alignment cannot be computed to a file as concatenated pairs. For downstream processing or to stare at wistfully. If -U is used, the quality scores will be included.

−w output.fasta

Write all assembled sequences to a FASTA (or FASTQ) file, output.fasta, instead of standard output.

−W output.fasta.bz2

Write all assembled sequences to a bzip2(1) compressed FASTA (or FASTQ) file, output.fasta, instead of standard output.


At the end of reconstruction, several statistics are output on lines beginning with STAT.


The number of reads in the input files.


The number of sequences where there exists no overlap with a probability above the threshold.


The number of sequences where the reads are unsatisfactory (too short to assemble).


The number of sequences where the fast hashing algorithm could not figure out the optimal overlap, and so every possible overlap had to be considered. Nothing is necessarily wrong with these sequences; they just take longer to assemble. Very repetitive patterns can cause PANDAseq to spend more time investigating overlaps that are likely wrong, resulting the processing time of the file to be quite long if there are many sequences in this category. If they are a significant percentage of the input data, try increasing the size of the k-mer table, using the -k option; this will cause PANDAseq to use more memory, but it may be faster.


The number of sequences where the forward primer could not be aligned. This is only done when -p is supplied and a nucleotide sequence.


The number of sequences where the reverse primer could not be aligned. This is only done when -q is supplied and a nucleotide sequence.


The number of sequences where the quality score of the reconstruction is below the threshold. This says nothing about the quality scores of the individual bases in the forward and reverse reads.


The number of sequences containing uncalled/degenerate/N bases in the final reconstruction (it is immaterial if there are uncalled bases in the reads.) This is only done when -N is provided.


The number of sequences where the final reconstructed sequence is too short. This is only done when -l is provided.


The number of sequences where the final reconstructed sequence is too long. This is only done when -L is provided.


The number of sequences output.


The number of sequences assembled for each possible overlapping length. The first number is the number of sequences with only one overlapping base, the second with two overlapping bases, and so on.


During output, the assembler may output any of the following errors.

The name of the input read did not follow the known Illumina standard formats. Older versions of CASAVA produce sequences with IDs that look like HWUSI-EAS1661_9323_FC619KG:7:1:1190:15190#ATCACG/1, where the fields are instrument:lane:tile:x:y#tag/direction. Newer version of CASAVA produce IDs that look like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA, where the fields are instrument:run:flowcell:lane:tile:x:y direction:filtered:flags:tag. If your sequence headers do not look like either of these, either Illumina has created yet-another header format or, more likely, your sequence headers have been manipulated by some upstream processing, possibly at your sequencing centre. PANDAseq needs the original Illumina probabilities; not ones manipulated by other programs. We’re very picky about that. Sometimes, for mysterious reasons, the sequences lack the barcoding tag. The -B option will cause the lack of barcode to be ignored. This will obviously invalidate the use of validation modules that depend on the barcode.


An invalid letter was found in a nucleotide read. Likely caused by incorrect or corrupt input files.


The an unexpected character or end of the input file was detected. Likely caused by incorrect or corrupt input files.


The end of the input file was detected before it was expected. Likely caused by incorrect or corrupt input files.


The k-mer table is too small to hold a read of the size requested. This is a bug or platform-dependent behaviour. Please file a ticket either way.


The sequence is discarded because the quality is too low given the supplied threshold.


The reconstruction parameters do not produce a valid sequence. Instead, they produce a negative-length sequence. This read pair is discarded.


A FASTQ record has no sequence data. Likely caused by incorrect or corrupt input files.


The input file was not found or could not be read.


The forward primer could not be matched to the forward read. Either the primer is incorrect or the read is low quality or the sequence provided is not the correct original molecule.


Quality information is missing from the FASTQ file. This data is required to reconstruct the sequence.


The reverse primer could not be matched to the reverse read. See NOFP.


Sequences from FASTQ files are not pairing correctly given their sequence names. Likely, the files are mismatched.


An out of memory condition has occurred. Given the memory available, assembly of this sequence is not possible. As Illumina sequencing gets longer, the amount of memory needed can be adjusted. Please file a ticket.


The read length is too long for this version of PANDAseq. PANDAseq needs to be recompiled with a longer allowable seqeuence length; this length is kept short to improve performance.


The nth command line argument that generated this output, for posterity.


The best overlap parameter for a sequence.


The parameters of a reconstructed base.


A mismatch has been identified in the reconstruction.


Information about a module.


An overlap possibility, with probability, as been identified.


The proposed reconstruction parameters.


The version of PANDAseq that generated this output, for posterity.


Some information about the assembly process. See above.


A k-mer has been identified in the forward read.


A duplicate k-mer has been identified in the forward read and discarded. This might cause failure to assemble a sequence if repeated too often. See the -k option to correct this.


A k-mer has been identified in the reverse read.


Something truly unexpected has happened. This probably involves an validation module.


This will assemble a data from a run in lane 7:

pandaseq -f s_7_1.fastq.bz2 -r s_7_2.fastq.bz2 > s_7.fasta

This will assemble data from lane 7, stripping conserved regions around the prokaryotic 16S V3 region and store the results in s_7.fasta.bz2 and store the logging output s_7.log.bz2.

pandaseq -f s_7_1.fastq.bz2 -r s_7_2.fastq.bz2 -p CCTACGGGAGGCAGCAG -q ATTACCGCGGCTGCTGG -G s_7.log.bz2 | bzip2 > s_7.fasta.bz2


Validation modules are capable of making decisions about whether or not to keep output sequences. For example, one could write a module to check secondary structure of a RNA, or that a coding sequence contains no stop codons. To create a module, please see pandaxs(1). Invoking a module can be done using the -C option on the command line. As many modules as desired may be added. The path to the module may be followed by a colon (on Windows, a semicolon) and arguments. For example, the following will include all sequences after HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA in the input file:

pandaseq -f s_7_1.fastq.bz2 -r s_7_2.fastq.bz2 -C "after:HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA" > s_7.fasta


There are some included modules:

Assemble only the sequences after (and including) the sequence specified. This is done in file order.


Assemble only the sequences before (and excluding) the sequence specified. This is done in file order.


This can be used to only include sequences with perfect overlap regions. You shouldn’t want to do it. The whole point is to fix sequences which are probably good. Moreover, assuming that the sequencer is right in the overlap region and in the non-overlapping regions requires an unsound leap in statistics. My dislike has been appropriately embodied in the name of this validation module.


Sometimes, an assembled sequence can have zero length. Some downstream applications do not like this, so this filter allows removing any such sequences.


Output only the sequences whose identifiers match those in the file specified, one per line. If the file is missing, sequences are read from standard input.


Check the PHRED score of every base in the output sequence and make sure it is at least value. The threshold is based on the sequence as a whole, but this is based on the individual base scores, as they would be seen with the -F option.


Check that the number of “bits saved” (Cole, et al. 2013) is above the provided value.


Sometimes, libraries are constructed with a mix of primer sequences. The allows separating the primer mix. To do this given primers, run PANDAseq twice: once with the each primer given to -p or -q and the other primer given as the primer option to this module. The direction must be specified as either f or p for the forward primer or r or q for the reverse primer. This module will reject and reads that match the primer, eliminating them from the output.


Produces a histogram of the number of overlaps that were examined for each of the sequences that assembled. This does not indicate the number of overlaps that were examined for discarded sequences.


Perform the false-positive test described in section 2.2 of Zhang 2013.


Only include sequences in the output with one of the tags specified. This can be used to demultiplex sequences. This will not work well with -B option.


pandaseq-checkid(1), pandaxs(1), gzip(1), bzip2(1).

Andre P Masella, Andrea K Bartram, Jakub M Truszkowski, Daniel G Brown and Josh D Neufeld. PANDAseq: paired-end assembler for Illumina sequences. BMC Bioinformatics 2012, 13:31.

E. Aronesty. Comparison of Sequencing Utility Programs TOBioiJ (2013); doi:10.2174/1875036201307010001

J. Zhang, K. Kobert, T. Flouri, and A. Stamatakis. PEAR: A fast and accurate Illumina Paired-End reAd mergeR Bioinformatics 2013 : btt593v1-btt593.

J. R. Cole, Q. Wang, J. A. Fish, B. Chai, D. M. McGarrell, Y. Sun, C. T. Brown, A. Porras-Alfaro, C. R. Kuske, and J. M. Tiedje. Ribosomal Database Project: data and tools for high throughput rRNA analysis Nucl. Acids Res. Database issue: first published online 27 Nov 2013; doi: 10.1093/nar/gkt1244

T. Magoc, and S. Salzberg. FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27:21 (2011), 2957-63.a

R. C. Edgar. personal communication