Blue

Blue is a fast, accurate short-read error-correction tool based on k-mer consensus and context. It corrects both Illumina and 454-like data, and accepts sequence data files in both FASTQ and FASTA formats. Blue is made available under the General Public License and comes with absolutely no warranty. Blue is written in C# and runs natively on Windows, and with Mono on Linux.

An article describing Blue, "Blue: correcting sequencing errors using consensus and context", was published on June 11, 2014.

Download

Blue has been improved a number of times since the last official release in 2015. Blue v2 is now publicly available on GitHub as part of the WorkingDogs family of bioinformatics tools, and Blue and the related tools can be downloaded from there.

The significant difference between Blue v1 and Blue v2 is a change in intent, moving from reducing the number of errors in a set of reads to producing only correct reads. Blue v1 tried to correct whatever errors could be safely corrected, and passed the resulting 'improved' reads on to other tools. The 'good' parameter in Blue v1 caused corrected reads with too few 'good' kMers after correction to be discarded, ensuring that the most erroneous and uncorrectable reads were dropped rather than being passed on to the next tool in a pipeline.
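The v1-style filtering described above can be illustrated with a minimal sketch. It is not Blue's implementation: the function names, the `min_depth` cut-off for calling a kMer 'good', and the interpretation of the 'good' threshold as a fraction of the kMers in a read are all assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def good_fraction(read, counts, k, min_depth):
    """Fraction of the read's k-mers seen at least min_depth times.

    min_depth is a hypothetical cut-off: a k-mer at or above it is
    treated as 'good' (likely error-free).
    """
    total = len(read) - k + 1
    if total <= 0:
        return 0.0
    good = sum(1 for km in kmers(read, k) if counts[km] >= min_depth)
    return good / total


def keep_read(read, counts, k, min_depth, good_threshold):
    """Keep a corrected read only if enough of its k-mers are 'good'.

    good_threshold is assumed here to be a fraction between 0 and 1.
    """
    return good_fraction(read, counts, k, min_depth) >= good_threshold


if __name__ == "__main__":
    reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
    counts = build_kmer_counts(reads, k=4)
    for read in sorted(set(reads)):
        print(read, keep_read(read, counts, k=4, min_depth=3, good_threshold=0.8))
```

In this toy data set the read carrying a single substitution is left with mostly low-depth kMers and is dropped, while the repeated read is kept.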

Blue v2 takes this further and tries to ensure that all reads are either completely corrected or discarded. It uses trimming to remove any uncorrectable tails of reads, and drops any reads that are too short or that still appear to contain errors after correction and trimming.

Blue is quite averse to 'rewriting' reads. Incremental kMer-based algorithms can almost always infer the correct following base for any given kMer, but undetected or uncorrected errors in a read can result in chimeric 'corrected' reads whose start comes from the read being corrected and whose tail comes from elsewhere in the genome. Blue avoids such 'rewriting' by tracking the number of corrections being made to a read, and will abandon a correction attempt if too many changes are being made too close together. Blue will also abandon a read if it reaches a point where there are no valid 'next kMers'.

Blue v1 would pass on these abandoned but partially corrected reads, after checking that each read had sufficient good kMers. Blue v2 will trim such a read back to the last good kMer, and discard it if it is now too short (as defined by the repurposed 'good' parameter).

The overall result of this change, and numerous others, is much improved accuracy (in relative terms). On the E. coli DH10B dataset referred to in the paper, Blue now has 99.98% of the corrected (and possibly trimmed) reads aligning with zero changes against the reference sequence, up from 99.90% previously (with '-good' used in both cases).
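The trim-or-discard behaviour and the guard against closely spaced corrections described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Blue's code: the `is_good` predicate, the `min_length` cut-off, and the `window` and `max_in_window` parameters are hypothetical stand-ins, since the exact criteria Blue uses are not given here.

```python
def trim_to_last_good_kmer(read, is_good, k):
    """Keep the prefix of the read that ends at the last good k-mer.

    Scans k-mers left to right and stops at the first one that is not
    good. Returns an empty string if even the first k-mer is not good.
    is_good is a hypothetical predicate supplied by the caller.
    """
    last_good_end = 0
    for i in range(len(read) - k + 1):
        if not is_good(read[i:i + k]):
            break
        last_good_end = i + k
    return read[:last_good_end]


def trim_or_discard(read, is_good, k, min_length):
    """Return the trimmed read, or None if it is now too short.

    min_length is an assumed stand-in for the repurposed 'good' parameter.
    """
    trimmed = trim_to_last_good_kmer(read, is_good, k)
    return trimmed if len(trimmed) >= min_length else None


def too_many_close_corrections(positions, window, max_in_window):
    """True if more than max_in_window corrections fall within any
    stretch of `window` consecutive bases (hypothetical criterion)."""
    positions = sorted(positions)
    start = 0
    for end in range(len(positions)):
        # Shrink the window from the left until it spans fewer than `window` bases.
        while positions[end] - positions[start] >= window:
            start += 1
        if end - start + 1 > max_in_window:
            return True
    return False


if __name__ == "__main__":
    good = {"ACGT", "CGTA", "GTAC", "TACG"}
    print(trim_or_discard("ACGTACGTTT", good.__contains__, k=4, min_length=6))
    print(too_many_close_corrections([3, 5, 6, 40], window=10, max_in_window=2))
```

Returning `None` for a discarded read keeps the 'completely corrected or discarded' rule explicit for the caller, which never sees a partially trusted read.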

Previous versions

Version 1.1.3

Linux – source and compiled code
Windows – Visual Studio 2012 solutions including compiled code
PDFs: Documentation; Changes 1.1.3

Version 1.1.2

Linux – source and compiled code
Windows – Visual Studio 2012 solutions including compiled code
PDFs: Documentation; Changes 1.1.2

Version 1.1.0

Linux – source and compiled code
Windows – Visual Studio 2012 solutions including compiled code
PDFs: Documentation; Changes 1.1.0

Version 1.0.1

Linux – source and compiled code
Windows – Visual Studio 2012 solutions including compiled code
Sample histogram from tiling ERA000206 TXT | XLSX
Documentation (PDF); Changes 1.0.1

Version 1.0.0

Linux – source and compiled code
Windows – Visual Studio 2012 solutions including compiled code
Sample histogram from tiling ERA000206 TXT | XLSX
Documentation (PDF)

Contact

For questions, comments and bug reports please contact Paul Greenfield.