Extensive comments and review about the recent bake-off of de novo genome assemblers "GAGE"

During this week’s Genomics seminar at the Genome Cafe in the Biostats department, Steven Salzberg gave a talk on his team’s newly published paper: GAGE: A critical evaluation of genome assemblies and assembly algorithms. I worked on a few assembly projects during my time at Winter Genomics, but that was not the main reason why I was immediately immersed in his talk. I think that it was due to his bold comments, since comparing genome assemblers is a, hmm…, delicate issue. I really like the confidence he has in his work and the way he projects it when he talks. It might be too preachy for some, but I like it. Plus it helps that I completely agree on two key points that differentiate GAGE from its competitors: dnGASP and Assemblathon.

First, GAGE uses real data sets instead of simulated ones. I know that some might argue that a given data set can have specific properties that are not general or that it’s biased toward a certain assembler. It also feels a bit funny, because I started out assembling simulated data too. It certainly had its uses as I learnt a lot. But once you encounter a real data set you learn how complicated things can be, and it can be quite messy as no one gives you perfectly clean data. I haven’t read much about GAGE’s competitors, but regardless of how they simulate their data, I completely agree with Steven that GAGE has the advantage by using four real data sets. Plus, they were quite sensible when choosing the four data sets, as they are Illumina data (the most common) with frequently used read sizes and library types. Note that even the bacterial genomes have more than one replicon.

The second key point is that the GAGE team made all the data and assembly recipes publicly available through their official site (which has a great summary in the form of a FAQ explaining the project and key differences). They have certainly made an effort to guarantee the reproducibility of their results, which is hard to do and hasn’t been done before. It’s a sad feeling that it took so long for someone to focus on reproducibility. So it feels wrong that they have to stress how unique this feature is in their paper, but they definitely had to. Hm… can anyone reproduce the human genome assembly? I’m not talking about someone reading the paper and doing it on their institution’s computers, but someone from the author team. I hope the changelog is saved at least in some kind of repository.

Another important difference between GAGE and, say, Assemblathon is that for GAGE an in-house team ran the assemblers instead of asking the authors of each program to fine-tune their results. If you had asked me a year ago, I would surely have supported the idea of asking the authors to run their programs. After all, even if you read all the documentation, it’s the authors who know the best tricks on how to use their assemblers (or should be very good users). Yet, I can see the point that in reality it’s not the authors who run their code for each application. It’s a person or team of bioinformaticians (or a biologist struggling to death with UNIX) who has read the manual & papers (hopefully) for a few tools and decided which one is their favorite. During this process they probably ran a few of the assemblers with a small parameter scan and compared the results. The GAGE pipeline is very similar and hence feels much more real. They obviously did this process in a more rigorous way and made sure the conditions allowed comparing the assemblers.

One of the steps common to all of their recipes was to run Quake: quality-aware detection and correction of sequencing errors. I didn’t know about this specific tool before, but I did know about the idea. Basically, you plot the distribution of k-mer multiplicities from your data and do something about the k-mers that are possible errors (those that are unique or have very low multiplicity compared to the expected value); most commonly you try to correct them and, if you can’t, you discard them. That’s a very broad explanation and I’m sure that interested readers will download the original paper.
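
To make the k-mer idea a bit more concrete, here is a minimal Python sketch of the counting step. It is not Quake’s algorithm (Quake also uses the quality values and actually corrects reads); it just counts k-mers, builds the multiplicity histogram and flags the low-multiplicity ones. The function names, the toy reads and the cutoff are all my own assumptions, and unlike real tools it doesn’t merge a k-mer with its reverse complement.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each k-mer appears across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def multiplicity_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers that have it."""
    return Counter(counts.values())


def low_multiplicity_kmers(counts, cutoff):
    """Return the k-mers seen fewer than `cutoff` times (possible errors)."""
    return {kmer for kmer, n in counts.items() if n < cutoff}


# Toy data: the third read differs from the others in its last base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, k=4)
hist = multiplicity_histogram(counts)
suspects = low_multiplicity_kmers(counts, cutoff=2)
print(sorted(hist.items()))
print(suspects)
```

With real data you would pick the cutoff by looking at the histogram (the error k-mers pile up at multiplicity 1 or 2, well below the coverage peak) instead of hard-coding it like I did here.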

Anyhow, the point is that they cleaned the data sets prior to using any assembler. I couldn’t agree more with the sentence:

High-quality data can produce dramatic differences in the results

Running some kind of preprocessing cleaning tool should help, but you can’t do miracles with crappy data. 

This post is getting huge, so I’ll jump to some points I’d like to highlight though it’ll still be very long.

First, I’m amazed by the simple concept that is “N50 corrected”. It does look complicated to calculate, but the idea of splitting contigs wherever an error (at least a 5 bp indel) is found before calculating the N50 size is just great (they have Sanger-sequenced reference genomes for 3 of the 4 data sets, which is how the errors are detected). It’s simple and very effective. By using this statistic and comparing it to the original N50 size you can clearly tell aggressive assemblers that don’t mind adding errors from highly conservative ones. Then, comparing “N50 corrected” vs the number of errors (as in figure 6) is VERY informative. I just love that figure!
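
Here is a rough Python sketch of the idea, just to show how little is needed once you know where the errors are. It assumes you already have the error positions for each contig (in the paper those come from comparing against the reference, which is the hard part and is not shown here), and it computes N50 relative to the total assembled length. The function names and the toy contigs are hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def split_at_errors(contig_length, error_positions):
    """Break one contig at each error position and return the piece lengths."""
    pieces = []
    start = 0
    for pos in sorted(set(error_positions)):
        if start < pos < contig_length:
            pieces.append(pos - start)
            start = pos
    pieces.append(contig_length - start)
    return pieces


def corrected_n50(contigs):
    """N50 after splitting every contig at its errors.

    `contigs` is a list of (length, error_positions) pairs.
    """
    pieces = []
    for length, errors in contigs:
        pieces.extend(split_at_errors(length, errors))
    return n50(pieces)


# Toy assembly: the longest contig has one error right in the middle.
contigs = [(100, [50]), (60, []), (40, [])]
raw = n50([length for length, _ in contigs])
corrected = corrected_n50(contigs)
print(raw, corrected)  # 100 50
```

In this toy case a single error in the longest contig cuts the N50 in half, which is exactly the kind of drop that gives away an aggressive assembler.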

The result is a bit frustrating because the winner is ALLPATHS-LG. Don’t get me wrong, I think that they are doing great work (they introduced new statistics for comparing assemblies, they are exploiting library preparation more than the rest, etc.) but it’s simply hard to come by a data set that meets ALLPATHS-LG’s requirements.

Second, no matter which assembler you use, your result is going to contain lots of errors. There is no way around it, it’s a fact! Hm… unless you want tiny contigs (very conservative assemblers) which aren’t really useful. I think that it’ll be important to stress this to consumers of the technology instead of fueling their wild dreams of high-quality finished (not draft) de novo assemblies.

Third, de novo genome assembly (especially for large genomes) is still a complicated endeavor. You shouldn’t take it lightly!

As you might have noticed, I found this paper (and the talk) to be very stimulating and interesting. And as I forgot to do so during the talk, thank you Salzberg et al. for taking a huge step in the right direction. It’ll surely help those working in the field who at some point have asked themselves:

  • What will an assembly based on short reads look like?
  • Which assembly software will produce the best results?
  • What parameters should be used when running the software?

A couple more notes:

  • It’s sad, but Velvet didn’t perform as well as I hoped. I had been convinced for a while that it is one of the best assemblers out there. Plus I’m surprised that it couldn’t run on a 256 GB machine for the bumble bee data set.
  • Take a look at the E-size statistic in the methods section of the paper. It’s interesting that it correlates well with N50 size. At times, I haven’t been too eager to select a best assembly based on N50 size, as it might not be one of the longest assemblies and I felt like I was wasting data. But it is a very reasonable summary statistic for such a hard problem.
  • I’m still curious about whether gap-closers like IMAGE: Iterative Mapping and Assembly for Gap Elimination (I don’t know why they didn’t include the name in their paper title >.<) correctly increase scaffold/contig length by re-assembling local border paired-reads.
  • Check the websites of the competitors. GAGE looks more complete to me and I’m quite surprised dnGASP doesn’t include links to the other sites! The GAGE twitter might be a bit too much (plus it seems abandoned).
  • Here is a news commentary which talks about the three competitors.
  • I found it funny when Salzberg declared a winner in the talk, but it surely takes some guts to do so and I agree that it had to be done after such a rigorous comparison (or “bake-off” ^_^).
  • From the advice section in Genomics 2011 (check my long post about it), “Published is better than perfect”. Results from a bake-off like GAGE are going to change quickly since new updates are released quickly, but they are very helpful!!!
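
Going back to the E-size note above: as I understand the methods section, E-size is the expected length of the contig that contains a randomly chosen base, i.e. the sum of the squared contig lengths divided by the genome size. Here is a tiny Python sketch under that assumption. The function name is mine, and defaulting the genome size to the sum of the contig lengths is a simplification for when you don’t have a better estimate.

```python
def e_size(lengths, genome_size=None):
    """Expected length of the contig containing a randomly chosen base.

    Assumes E = sum(L_c ** 2) / G. When no genome size is given, G is
    approximated by the sum of the contig lengths.
    """
    if genome_size is None:
        genome_size = sum(lengths)
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(length * length for length in lengths) / genome_size


# Same toy contigs before and after breaking the longest one in half.
before = e_size([100, 60, 40])
after = e_size([50, 50, 60, 40])
print(before, after)  # 76.0 51.0
```

Unlike N50, every contig contributes to the value, which is probably why it feels less like wasting data.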