Genome Graphic Generator

Summary

This is a write-up of a tool I made to represent personal genomes in a single, nice-looking graphic. It can turn any sequence into a graphic, but the focus is on representing whole genomes and the minute differences between them that make them unique. You can read the explanation below, skip to the examples, or go straight to the tool itself, which is now an installable Galaxy tool.

Rationale

The cost of DNA sequencing has been plummeting in the last few years. More than plummeting. It dropped through the floor, and then said a quick hello to Moore's Law as it flew straight past even that exponential trend. The best way to emphasize this is with a chart from the NHGRI's page at genome.gov/sequencingcosts.

This looks like a logarithmic trend, but note that the chart is already on a logarithmic scale. On that scale, Moore's Law is a straight line, and before 2007 the cost per genome was keeping pace with it. At that point a genome would cost you 10 million dollars. But then the price went off a cliff, and within four short years it was under 10 thousand dollars:

This means that my longtime desire to have my own genome sequenced is quickly becoming possible. And when I think about the moment I finally get my genome, I realize that what would really be nice is some way to visualize the whole thing at once. Some image that I can say is me. That poses a problem with two competing requirements: how do you generate a graphic that both summarizes a whole genome and retains the rare differences between it and everyone else's? With 3 billion nucleotides, obviously not all of them can be shown in a graphic that will fit on one screen. But if you squash it down to a manageable size, averaging thousands of bases into each pixel, you will inevitably get essentially the same result for every human you run into. That's because the differences that make you unique are so rare that they're swamped by the bases common to everyone.
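To put rough numbers on the averaging problem, here is a small back-of-the-envelope sketch. It is not part of the tool. The screen size and the base-to-gray mapping (GRAY) are assumptions made up purely for illustration.

```python
GENOME_BASES = 3_000_000_000      # approximate genome length, from the text
WIDTH, HEIGHT = 1920, 1080        # hypothetical screen size (assumption)

bases_per_pixel = GENOME_BASES / (WIDTH * HEIGHT)
print(f"{bases_per_pixel:.0f} bases per pixel")

# Hypothetical mapping from base to gray level, just for this illustration.
GRAY = {"A": 0, "C": 85, "G": 170, "T": 255}

def pixel_value(window):
    """Average the gray levels of all bases in the window into one 8-bit value."""
    return sum(GRAY[base] for base in window) // len(window)

reference = "ACGT" * 362          # 1448 bases, about one pixel's worth
variant = "C" + reference[1:]     # same window with a single A->C change

print(pixel_value(reference), pixel_value(variant))
```

Under these assumptions the two windows differ by one base yet average to the same pixel value, which is exactly the swamping effect described above.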

That's why this is a difficult problem to solve. It's especially difficult to solve it in a simple way that requires no more than a single file with your genome sequence. I've taken a crack at it, and here I'm showing what I've come up with. My approach takes into account every single nucleotide, while still emphasizing the small amount of variation in every genome. Simply put, a change in any single nucleotide will produce a wildly different image. So your graphic will be uniquely yours and represent you.

The Result

Here is the product of my approach. Since I don't yet have my genome, this was produced from the standard reference genome, hg19.

I use a recursive algorithm that partitions the image into eighths, colors them, then partitions each eighth, colors those sub-partitions, then mixes them with the first level, and so on. This prevents the random noise you would get by breaking it into tiny bits and coloring each one independently. Instead, there are larger regions which maintain some continuity even though their constituent parts diverge.
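The recursion described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code, and several details are my assumptions for the sketch: colors are derived from a SHA-256 digest of the whole sequence plus each region's position, each region is split as a 4x2 or 2x4 grid, every level is mixed with a fixed 50/50 weight, and the image is a square whose side is a power of two. The function names (region_color, mix, paint, render) are hypothetical.

```python
import hashlib

def region_color(digest, path):
    """Derive an RGB color from the genome digest plus the region's position path."""
    h = hashlib.sha256(digest + bytes(path)).digest()
    return (h[0], h[1], h[2])

def mix(parent, child, weight):
    """Blend a region's own color with its parent's; weight is the child's share."""
    return tuple(round((1 - weight) * p + weight * c) for p, c in zip(parent, child))

def paint(image, digest, x, y, w, h, depth, path=(), parent=None, weight=0.5):
    """Split a region into eighths, color each one, and recurse into each eighth."""
    cols, rows = (4, 2) if w >= h else (2, 4)   # assumed layout, keeps cells squarish
    cw, ch = w // cols, h // rows
    for i in range(8):
        cx, cy = x + (i % cols) * cw, y + (i // cols) * ch
        color = region_color(digest, path + (i,))
        if parent is not None:
            color = mix(parent, color, weight)   # mix this level with the one above
        if depth > 1 and cw * ch >= 8:
            paint(image, digest, cx, cy, cw, ch, depth - 1, path + (i,), color, weight)
        else:
            # Deepest level reached: fill the cell's pixels with the mixed color.
            for row in image[cy:cy + ch]:
                row[cx:cx + cw] = [color] * cw

def render(sequence, size=64, depth=4):
    """Render a square image (size assumed to be a power of two) as rows of RGB tuples."""
    digest = hashlib.sha256(sequence.encode("ascii")).digest()
    image = [[(0, 0, 0)] * size for _ in range(size)]
    paint(image, digest, 0, 0, size, size, depth)
    return image

reference = render("ACGT" * 250)
variant = render("ACGT" * 249 + "ACGA")          # one base changed
changed = sum(p != q for a, b in zip(reference, variant) for p, q in zip(a, b))
print(changed, "of", 64 * 64, "pixels changed")
```

Because every color in this sketch depends on a digest of the full sequence, a single changed base alters the whole image, while the mixing with parent colors keeps neighboring cells related instead of looking like independent noise.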

Features

Just to show how much the image changes with a different genome, here are two different genomes side-by-side. The first is the same as above, and the second has one chromosome replaced with a version that differs by only a few tiny mutations.

And because of the recursive nature of the algorithm, the image can be generated at smaller or larger sizes and maintain the same level of detail. Here is the same genome at a number of different sizes. Note that this is not simply zoomed out. That would lose too much sharpness by averaging pixels. These are generated specially at each size, with only the necessary pixels added.
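One way to think about generating the image at different sizes, assuming every level splits each region into eighths as described earlier, is that the target size only decides how many levels are needed before the regions reach single pixels. The helper below is a hypothetical sketch of that arithmetic, not the tool's code.

```python
def levels_needed(width, height):
    """Count how many eightfold splits it takes to get at least one region per pixel."""
    levels, regions = 0, 1
    while regions < width * height:
        regions *= 8
        levels += 1
    return levels

for width, height in [(64, 64), (512, 512), (1920, 1080)]:
    print(width, height, levels_needed(width, height))
```

Under this assumption a larger image just runs the same recursion a few levels deeper, which is why the coarse pattern stays the same while fine detail is added.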

Early Experiments

Because of the way the algorithm generates multiple layers which have to be mixed, an important tuning factor is the weight given to each layer. I tried several different algorithms to determine the weighting, and the result I've shown is what I thought worked the best. But here I'll show some other options I tried, some of which I think would be better in certain contexts. Feedback is definitely appreciated, since I know I don't have the most artistic eye.

This version emphasizes the small details, and I think it really shows the vast amount of information I'm attempting to summarize.

This one is at the other extreme, but it does make the pattern very clear and identifiable. It might work best for small icons.
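The trade-off between these versions comes down to how much weight each layer gets when the layers are mixed. The sketch below is only an illustration of that idea with a geometric weighting scheme I am assuming for the example. The names layer_weights and blend, and the ratio parameter, are hypothetical and not taken from the tool.

```python
def layer_weights(num_layers, ratio):
    """Geometric weights from coarsest to finest layer, normalized to sum to 1.

    ratio < 1 favors the coarse layers, ratio > 1 favors the fine details.
    """
    raw = [ratio ** i for i in range(num_layers)]
    total = sum(raw)
    return [r / total for r in raw]

def blend(layer_colors, weights):
    """Weighted average of one pixel's RGB color across all layers."""
    return tuple(
        round(sum(w * color[k] for w, color in zip(weights, layer_colors)))
        for k in range(3)
    )

# One pixel's color in three layers, coarsest first (made-up values).
layers = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]

print(blend(layers, layer_weights(3, 0.5)))   # coarse layer dominates
print(blend(layers, layer_weights(3, 1.0)))   # all layers equal
print(blend(layers, layer_weights(3, 2.0)))   # fine layer dominates
```

With a ratio below 1 the large regions dominate, which suits a clear pattern for small icons. With a ratio above 1 the smallest partitions dominate and the fine detail stands out.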