Radio Freethinker

Vancouver's Number 1 Skeptical Podcast and Radio Show

A Mountain of Data can mean Anything

Posted by Jenna Capyk on December 20, 2011

Although we hear about DNA sequencing technology in everything from forensic television shows to descriptions of new scientific discoveries, the implications of this technology, and what it requires of scientists, are not always well understood. The power and limitations of all of the sequencing going on in labs around the globe were recently brought to my attention through one of my research projects, and I’d like to provide a breakdown of the data overflow that we, as a scientific community, have arguably been reluctant to acknowledge.

What is genome sequencing?
As most of us know, all of the “blueprints” to build a human being are contained in our DNA. DNA is a long, string-like molecule made of four different components or bases, which are usually abbreviated A, C, G, and T. You can think of this like a chain built of different coloured links that can occur in any order. The order, or sequence, of these different pieces holds the coded information. When we talk about genome sequencing, we’re talking about determining the sequence (written out as a string of As, Cs, Gs, and Ts) of all of the DNA in a human cell (or a plant, bacterial, squid, or any other kind of cell). Just to give you an idea of scale, the E. coli genome would have about 4.6 million letters if you wrote them out, while the human genome corresponds to a list of about 3.2 billion As, Cs, Gs, and Ts.
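To make the scale concrete, here is a minimal sketch that treats a sequence as a plain string of letters and estimates how much space the two genomes mentioned above would take as text. The toy sequence is made up for illustration, and the one-byte-per-letter figure is an assumption about plain text storage, not a description of how sequencing labs actually store data.

```python
from collections import Counter

# A made-up toy sequence, only for illustration (not real genomic data).
toy_sequence = "ATGCGTACGTTAGC"

# Count how often each of the four bases occurs.
counts = Counter(toy_sequence)
print(dict(counts))

# Approximate genome lengths quoted in the text above.
GENOME_LENGTHS = {"E. coli": 4_600_000, "human": 3_200_000_000}

def size_in_megabytes(n_letters):
    """Size of a sequence stored as plain text, assuming one byte per letter."""
    return n_letters / 1_000_000

for name, n_letters in GENOME_LENGTHS.items():
    print(f"{name}: {n_letters:,} letters, about {size_in_megabytes(n_letters):,.1f} MB as text")
```

Under that assumption, a single human genome written out as letters is on the order of gigabytes, before you compare it with anything else.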

What does this technology allow us to do scientifically? (Bioinformatics)
Some of the applications of DNA sequencing are obvious. As we all know, the sequence of an individual’s DNA is unique to that person (unless they have an identical twin), and your DNA sequence can be used to specifically identify you. Because our DNA holds the information for our biological makeup, it also holds information about any congenital diseases, and perhaps about risk factors for other conditions that also depend on the environment. As genome sequencing becomes more and more accessible to the average individual, there are privacy issues and other policy-related questions that we have to think about.

Genome sequences also represent an entirely different treasure trove of information to be mined by scientists. Evolution proceeds by random DNA mutation events that eventually lead to speciation and the amazing variety of beings we know in our world today. This means that by comparing the DNA of different species, we can trace their evolutionary trajectory and classify how closely related they are at the genetic level. We’ve all heard that we are very closely “related” to the chimpanzee, for example. Genome sequencing technology allows us to evaluate these types of relationships for all biological entities. This type of analysis can be valuable in a lot of research contexts.
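The simplest version of this kind of comparison is to line two sequences up and count the positions where the letters match. The sketch below does that for three hypothetical species. The sequences are invented toy data, and the code assumes they are already aligned and of equal length; real analyses first have to align the sequences and usually use more sophisticated measures of relatedness.

```python
from itertools import combinations

# Hypothetical, already-aligned toy sequences (not real genomic data).
sequences = {
    "species_A": "ACGTACGTACGTACGTACGT",
    "species_B": "ACGTACGTACGTACGAACGT",
    "species_C": "ACGTTCGTACCTACGAACGA",
}

def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Compare every pair of species.
for (name1, seq1), (name2, seq2) in combinations(sequences.items(), 2):
    print(f"{name1} vs {name2}: {percent_identity(seq1, seq2):.0f}% identical")
```

In this toy data, species A and B differ at one position out of twenty, so they come out as more closely “related” to each other than either is to species C.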

The problems with too much data
This all sounds really great. Lots of data, lots of information: awesome. Right? The problem is that when I say lots of data, I mean LOTS of data. If you’ll remember, the E. coli genome alone has several million base pairs. As more and more genomes are sequenced (mostly bacterial at this point), the total is multiplying rapidly. We’re talking about information for billions upon billions of “letters” in DNA sequences. This amount of data can be difficult to process. The problem is basically being able to see the “signal” through the “noise”. With this much data, it is hard to look at all of it at once. How do we pick out what is significant? What is normal? What is connected?

Obviously analysis like this can’t be done manually, so researchers have developed tools to look at this bioinformatic data. Many of these tools have been adopted as standard tools in this field. On the one hand, adopting standardized methods allows datasets to be processed more quickly: it’s like having a known routine that you perform to get out your answer. It also allows different scientists to compare results produced in different laboratories. If everyone were doing their own thing in isolation it would be hard to advance any science.

The problem with standardized tools, however, is that the challenges that arise in biological research tend not to be “standard” in nature. A lot of the time, different methods are required to approach each unique problem. Perhaps more important than this, many of the most popular biological data processing tools were designed at a time when less data was available. As more and more sequence information becomes available these programs remain powerful tools, but greater care needs to be taken with how they are applied. When people get too comfortable using a standard set of tools in a pre-determined way, they might not take a close enough look at how well those tools are doing the job. This automatism can lead to major misunderstandings about the biological world. If we are going to go to all the trouble of generating so much data, we should do our best to listen to what the data can tell us.

An example: a recent project
As a more concrete example, I’ve been working for the past four and a half years on my PhD in biochemistry, and most of this time has been spent studying a specific type of protein. Reading the literature, I had a specific understanding of these proteins. This was, in fact, the same understanding held by most scientists who worked with them: they only occurred in bacteria, they only performed a specific reaction, and most of them had a second subunit. The bottom line is that there was a certain “stereotype” for what these proteins were assumed to be, and everything else was thought to be an outlier. I like to think of this like an alien looking at one person on the street and concluding that all people have an umbrella, red hair, and are kind of fat. In reality, of course, this is only a small subset of people.

For years people had been using the standard tools to look at the sequences for these proteins in those vast databases. How this works: you plug in a sequence and the program spits out a bunch of things that look like it. Because there was so much data, the program was spitting out things that looked a LOT like it, so people thought that it was an accurate representation of the typical protein. To go back to the alien example, the alien performing the same type of search would mean taking the red-haired, umbrella-laden fat man and asking his flying saucer computer to search for things that looked like that. If the alien asked for the top 50 similar things in a specific park, he might see a lot of variety in the types of people the program spit out. As you increase the number of people you search (the amount of data), however, the search results are going to be more and more similar. Search for 50 “things” that look like the red-haired fat umbrella man in the whole world and you’re probably going to come up with a room of near doppelgangers.
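The effect described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the tool or the data from my project: the sequences are randomly generated, the “search” is just a count of matching positions, and the function names, sequence length, and mutation rates are all made up for illustration.

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Copy seq, swapping each base for a different one with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def identity(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def top_hits_identity(db_size, top_n=50, length=60, seed=0):
    """Mean identity between a query and its top_n most similar database entries."""
    rng = random.Random(seed)
    query = "".join(rng.choice(BASES) for _ in range(length))
    # Every entry diverges from the query by its own random amount (0 to 60%),
    # so the underlying diversity is the same whatever the database size.
    database = [mutate(query, rng.uniform(0.0, 0.6), rng) for _ in range(db_size)]
    scores = sorted((identity(query, s) for s in database), reverse=True)
    best = scores[:top_n]
    return sum(best) / len(best)

small = top_hits_identity(db_size=100)
large = top_hits_identity(db_size=5000)
print(f"top 50 of 100 sequences:  mean identity {small:.2f}")
print(f"top 50 of 5000 sequences: mean identity {large:.2f}")
```

The family being searched is equally diverse in both runs. Only the size of the database changes, yet the top 50 hits from the larger database look far more like the query, which is exactly the room of near doppelgangers.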

Moving from fat men back to proteins, I started to challenge the idea that all of these proteins were really so similar and went about collecting all of the known sequences of these proteins to analyze them. This isn’t easy because, like I said, it’s a LOT of data. Even for this specific type of protein there are thousands of sequences. When all was said and done, however, a very different picture emerged. The “stereotype” profile of the protein that people held previously turned out to describe only a very small subset of the actual group. Much like assuming that all people look like the pudgy umbrella man, using this profile to describe the whole family of proteins turned out to be very wrong.

This example shows how estimating diversity in a population is more and more difficult as you get more data. This is mainly true because it’s hard to look at all the data at once. It also shows that using the wrong tools can give you the wrong information, but that it might be hard to KNOW that it’s wrong without having an idea of the right information. That is to say that with enough data and insufficient tools, the results of an experiment can tell you almost anything.

What needs to be done
So what can we do about this? It seems as though I’ve been arguing that the problem is too much data. Does this mean we should stop generating data? No. The situation in bioinformatics that I’ve described can be a bit counterintuitive. Consider that in most instances we criticize studies for not having ENOUGH data. More data has the potential to give us more information IF (and it’s a big “if”) we take care to analyze it properly. In the same way that conspiracy theories and some superstitions spring from seeing patterns in noise, in the “clumpiness” of data, a mountain of data can be made to mean anything. As with anything you want information from, you have to approach it analytically, not automatically, and use the right tools for the situation.
