The human genome contains roughly three billion base pairs per copy, or about six billion in a typical cell, which carries two copies. The bases are represented by four letters: A, T, C and G, which stand for the chemical names adenine, thymine, cytosine and guanine. Sequences of these bases make up genes. Genes tell cells how to make the corresponding proteins, and the proteins perform the functions of the cell. Mutations in the original genes can alter these proteins and cause disease. Much of the medical discovery built on this knowledge depends on computing in DNA and human genome research.
So, what is the role of computing in DNA and human genome research?

The structure of DNA showing with detail the structure of the four bases, adenine, cytosine, guanine and thymine, and the location of the major and minor groove. (Photo credit: Wikipedia)
The human genome contains between 20,000 and 25,000 genes. A working draft of the first human genome sequence was announced in 2000 and published in 2001, and the sequence was declared essentially complete in 2003. The public sector, funded mainly by governments, and the private sector competed to be the first to sequence the human genome. In the years that have followed, the genomes of many more individuals have been sequenced. Companies are improving the technology so that the price of sequencing a genome is approaching $1,000.
The sheer volume of genomic information being generated makes data storage the first concern for genomic research, and computers are required to solve this massive storage problem. For scientific research to be most effective, however, the data must also be available to other scientists studying a wide range of problems. The prominent peer-reviewed journal Science has recently mandated that all data associated with the articles it publishes be stored permanently and made available for the public to review and use in further research. This addresses two problems. First, the integrity of published work can only be upheld when other scientists are able to review and repeat the original results. In a recent scandal at Duke University, many published articles were retracted and clinical trials were canceled because of fraudulent results. Full access to the original data would have made the fraud apparent much earlier. Second, further research depends on data from previous research being available so that it can be expanded upon.
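To give a rough sense of the storage problem, the sketch below packs DNA letters into two bits each, since four letters need only two bits. It is a minimal illustration, not a real genomic file format: the function names are made up for this example, and it assumes the sequence contains only A, C, G and T.

```python
# Hypothetical 2-bit packing of a DNA string (illustration only).
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"  # index in this string = 2-bit code


def pack(dna):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        chunk = dna[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | ENCODE[base]
        # Left-align a short final chunk so decoding reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def packed_size_bytes(n_bases):
    """Bytes needed to store n_bases at two bits per base."""
    return (n_bases + 3) // 4


if __name__ == "__main__":
    print(unpack(pack("ACGTAC"), 6))
    print(packed_size_bytes(3_000_000_000), "bytes for one genome copy")
```

At two bits per base, one copy of the genome (about three billion bases) fits in roughly 750 megabytes. Real sequencing projects store far more than this, because the raw output of a sequencing machine includes many overlapping reads and quality information for each base.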
Is data storage the only use of computing in DNA and human genome research?
Data storage is only one major use of computing in DNA and human genome research. Another area in which computers have made an enormous contribution is in determining which regions of the genome are genes that code for proteins and which regions do not code for proteins. The non-coding regions of the DNA can do many different things. They may be involved in regulating how much of a specific gene is made into protein, or whether it is made into protein at all. Using computer algorithms, scientists have been able to predict what particular sections of DNA are used for. This helps identify genes that may be important for scientists to focus their research on.
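One of the simplest ideas behind computational gene finding is to look for open reading frames (ORFs): stretches that begin with the start codon ATG and run, three letters at a time, to a stop codon. The sketch below is a toy version of that idea. It scans a single strand only, the function name and the `min_codons` cutoff are assumptions made for illustration, and real gene-prediction programs use far more evidence than this.

```python
# Toy ORF scanner (illustration only, single strand, no introns).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ORFs on the given strand.

    An ORF here is ATG followed in-frame by a stop codon, with at least
    `min_codons` codons (counting ATG) before the stop. `end` is exclusive
    and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START:
                j = i
                # Walk codon by codon until a stop codon or the end.
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOPS:
                    j += 3
                found_stop = j + 3 <= len(dna)
                if found_stop and (j - i) // 3 >= min_codons:
                    orfs.append((i, j + 3))
                    i = j + 3  # continue scanning after this ORF
                    continue
            i += 3
    return sorted(orfs)


if __name__ == "__main__":
    seq = "CCATGAAATTTGGGTAACC"
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```

In this made-up sequence the scanner reports one candidate, ATGAAATTTGGGTAA, starting at position 2. A candidate ORF is only a suggestion that a region might code for protein, which is why such predictions are used to guide laboratory work and not to replace it.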
In a similar manner, there are genes that belong to the same “family”. These genes are similar to one another, and the proteins they encode might perform similar functions in the cell. Researchers studying a particular function or disease may have found one gene that is involved in the process they are studying. Using the tools of bioinformatics, the field that applies computing to biological data, they can search for other genes with similar sequences. The genes identified this way have a high likelihood of being involved in the same biological function as the original gene.
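The idea of finding similar genes by computer can be sketched with a very simple similarity score. The example below compares sequences by the short "words" (k-mers) they share. This is a stand-in chosen for brevity, not the method real search tools use (those rely on sequence alignment), and the gene names and sequences are invented for illustration.

```python
# Hypothetical k-mer similarity ranking (a stand-in for alignment-based search).

def kmers(seq, k=3):
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=3):
    """Jaccard similarity of two sequences' k-mer sets (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def rank_candidates(query, candidates, k=3):
    """Sort candidate genes from most to least similar to the query."""
    scored = [(name, kmer_similarity(query, seq, k))
              for name, seq in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    known_gene = "ATGCGT"
    database = {"geneB": "ATGCGA", "geneC": "TTTTTT"}  # made-up entries
    for name, score in rank_candidates(known_gene, database):
        print(name, round(score, 2))
```

Here the made-up geneB shares most of its three-letter words with the query and ranks first, while geneC shares none. A researcher would treat the top-ranked genes as candidates worth testing in the laboratory.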

JGI Illumina Genome Sequencers (Photo credit: Lawrence Berkeley National Laboratory)
The human genome is largely the same as the genomes of animals such as chimpanzees. The genomes of individual humans are even more similar to one another. Within the human population, however, there are individual base pairs that differ from person to person. These are called single nucleotide polymorphisms (SNPs). Some SNPs are silent mutations: even when they are located within a region of the DNA that codes for a protein, they may result in exactly the same protein.
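A silent mutation is possible because the genetic code is redundant: several three-letter codons can stand for the same amino acid. The sketch below translates a short coding sequence with the standard genetic code and checks whether a single-base change alters the protein. The function names and the example sequence are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding sequence into one-letter amino acid codes."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore any incomplete final codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def classify_snp(coding_seq, position, new_base):
    """Classify a single-base change as silent, missense or nonsense."""
    coding_seq = coding_seq.upper()
    mutated = coding_seq[:position] + new_base.upper() + coding_seq[position + 1:]
    codon_index = position // 3
    before = translate(coding_seq)[codon_index]
    after = translate(mutated)[codon_index]
    if before == after:
        return "silent"
    return "nonsense" if after == "*" else "missense"


if __name__ == "__main__":
    gene = "ATGGAAGGT"                # made-up sequence: Met-Glu-Gly
    print(translate(gene))
    print(classify_snp(gene, 5, "G"))  # GAA -> GAG, both glutamate
    print(classify_snp(gene, 3, "C"))  # GAA -> CAA, glutamate -> glutamine
```

Changing GAA to GAG leaves the amino acid unchanged, so that SNP is silent, while changing the first letter of the same codon produces a different amino acid and therefore a different protein.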
When a SNP results in a protein that differs from one person to another, there is the potential for a different phenotype between people. A phenotype is an observable characteristic. One or, more likely, many SNPs together may make an individual more likely to develop a certain disease. This could make a person more susceptible to diabetes, heart disease or any of a large number of less common diseases. Scientists can compare populations of people that have, for example, diabetes with those that don’t. Using complicated computer algorithms, potential differences in the DNA between the two populations can be identified. The problem with a disease such as diabetes is that there are likely many causes and many genes involved, and there is a wide range of naturally occurring variation in the genomes of different individuals. As computer models become more sophisticated and more individual humans have their genomes sequenced, scientists can more accurately identify the genes involved in diseases such as diabetes.
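The comparison between people with and without a disease can be sketched, for a single SNP, as a test of whether one version of the base is more common in one group. The example below uses a standard chi-square test on a 2x2 table of counts. The counts are invented, the function name is hypothetical, and a real study would test a very large number of SNPs and correct for doing so many tests.

```python
import math


def allele_chi_square(case_risk, case_other, control_risk, control_other):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi_square_statistic, p_value).
    """
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a row or column is empty: nothing to compare
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Made-up counts: the "risk" version appears 60 times in patients
    # and 40 times in people without the disease.
    chi2, p = allele_chi_square(60, 40, 40, 60)
    print(round(chi2, 2), round(p, 4))
```

With these invented counts the statistic is 8.0 and the p-value is below 0.005, which would suggest that the difference between the groups is unlikely to be chance alone. For a disease with many contributing genes, each individual SNP usually shows a much weaker signal, which is why large populations and careful statistics are needed.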
Microarray analysis is another area of human genome research that relies heavily on computer analysis. Not all genes are expressed all the time. Thus, just because a certain version of a gene is found in someone’s genome does not mean that the gene is always turned on. Furthermore, genes can be expressed at different levels, so a person’s cells can have more of one protein than another. One way to determine gene expression levels is through microarray analysis.
A gene is first transcribed into a strand of mRNA, which stands for messenger RNA. The mRNA is made in the nucleus, where the DNA is found, and is then exported to the cytoplasm, where it is translated into protein. The more mRNA for a gene in a cell, the more of its protein can be produced. A microarray is a small chip containing DNA from the many different genes a scientist is interested in studying. The mRNA from the cell, typically after being converted into a labeled DNA copy, is added to the chip, and the material corresponding to a gene binds, or sticks, to that gene’s DNA. This allows scientists to determine how much of a gene is actually being expressed. It also leaves scientists with a huge amount of data to process. In practice, the statistical significance of the expression levels of particular genes can only be determined with computers. Computers are also used to make meaningful comparisons between patients. Some genes may be expressed at abnormally high levels in cancer patients or in biopsies of cancerous tissue. Computers are used for this analysis because the human mind is not capable of processing such large volumes of data.
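A common first step in comparing expression between samples is the log2 fold change: how many times a gene's signal doubles (or halves) between two conditions. The sketch below computes it for a few invented genes and flags those that change at least twofold. The gene names, intensities, pseudocount and threshold are all assumptions for illustration, and fold change alone does not establish statistical significance.

```python
import math
from statistics import mean


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 of the ratio of mean intensities (tumor over normal).

    The pseudocount avoids dividing by zero for unexpressed genes.
    """
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))


def flag_genes(expression, threshold=1.0):
    """Return genes whose |log2 fold change| is at least `threshold`.

    `expression` maps gene name -> (tumor intensities, normal intensities).
    A threshold of 1.0 corresponds to a twofold change in either direction.
    """
    flagged = {}
    for gene, (tumor, normal) in expression.items():
        lfc = log2_fold_change(tumor, normal)
        if abs(lfc) >= threshold:
            flagged[gene] = lfc
    return flagged


if __name__ == "__main__":
    # Made-up chip intensities for three hypothetical genes.
    chip_data = {
        "GENE_A": ([799, 799, 799], [199, 199, 199]),
        "GENE_B": ([99, 101], [100, 100]),
        "GENE_C": ([49, 49], [199, 199]),
    }
    for gene, lfc in flag_genes(chip_data).items():
        print(gene, round(lfc, 2))
```

In this invented data GENE_A is about four times higher in the tumor samples, GENE_C is about four times lower, and GENE_B is unchanged. A real chip measures many thousands of genes at once, so deciding which changes are meaningful requires statistical tests across replicate samples.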
As human genomic research continues, computers will become increasingly important. Recently the field of epigenetics has blossomed. Epigenetics is the study of modifications to DNA that provide an additional code on top of the DNA code. Computers are the only practical way we can truly begin to decode the complexities involved in our genetic code.