Institute for Systems Biology
DNA Sequencing

General Technique

DNA is a polymer composed of four building blocks (typically called nucleotides or bases) named A, C, G, and T according to their biochemical composition. DNA sequencing is the method by which the order of nucleotides in a segment of DNA is determined. Depending on the technique used, the length of a "sequence read" can vary from about 20 to more than 1,000 bases. If the region to be sequenced exceeds the length of a typical sequencing read, then a set of overlapping sequence reads must be obtained and aligned in order to reconstruct the complete sequence of the long region. For example, human chromosome 14, which was reported to be 87,410,661 bases in length (see the representative publication below), required the alignment of hundreds of thousands of individual sequence reads.
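The reconstruction from overlapping reads can be illustrated with a toy example. The following is a minimal sketch of a greedy overlap merge, assuming error-free reads from a single strand and a hypothetical minimum overlap of three bases; production assemblers handle sequencing errors, repeats, and both strands, and the function names here are illustrative only.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap.

    Returns the remaining contigs (a single contig if everything merged).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping reads taken from the made-up region ATGGCGTACGTTAGC
reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

With real data the pairwise comparison shown here is far too slow, and repeated sequence can make the greedy choice wrong, which is part of why assembling a whole chromosome takes so many reads and so much editing.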

DNA sequencing mimics the basic process used to copy DNA in a cell during chromosomal replication, except that the procedure is done in a tube or microtiter plate using a minimal set of components. Most DNA sequencing techniques require four components: a "template" (i.e., a biological sample of the DNA whose sequence is to be determined); a "primer" (i.e., a short oligonucleotide that is complementary to a region of the template and capable of being extended); a DNA polymerase enzyme that successively adds building blocks to the primer, as directed by the template strand; and the four building blocks themselves. The technique must also include a method for detecting the order in which building blocks are added to the primer. Using the detection method of choice, the sequence of the DNA strand complementary to the template is thereby determined.
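The roles of template, primer, and polymerase can be mimicked in a few lines. This is a minimal sketch, assuming both strands are written 5' to 3', the primer matches the template perfectly, and extension runs to the end of the template; the sequences and function names are made up for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extend_primer(template: str, primer: str) -> str:
    """Return the strand built by extending `primer` along `template`.

    Both arguments and the result are written 5' to 3'.
    """
    site = reverse_complement(primer)  # template region the primer anneals to
    start = template.find(site)
    if start == -1:
        raise ValueError("primer is not complementary to the template")
    new_strand = primer
    # The polymerase walks toward the 5' end of the template, adding the
    # complementary building block at each step.
    for base in reversed(template[:start]):
        new_strand += COMPLEMENT[base]
    return new_strand


template = "GATTACAGGCCTTAACG"
primer = "CGTTAA"  # complementary to the last six bases of the template
print(extend_primer(template, primer))
```

The extended strand is the reverse complement of the template region upstream of and including the primer site, which is why sequencing yields the strand complementary to the template.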

Most large-scale DNA sequencing facilities use fluorescent dyes to label and detect the four bases, and capillary electrophoresis to separate DNA molecules on the basis of size so that the base located at each position in the sequence can be identified. More specifically, a small percentage of the molecules of each building block added to the sequencing reaction is chemically modified and labeled with a distinguishable dye. When a modified building block is randomly added to the DNA strand being extended from the primer, replication "terminates", so the sequencing reaction ends up containing a mixture of molecules of varying sizes. Because the end of each terminated molecule contains a dye-labeled base, the sequence of the strand complementary to the template can be determined so long as there is a method of separating molecules that differ in length by one building block. That is, the base at position 317 of the strand complementary to the template must be recognized separately from the bases at positions 316 and 318.
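The logic of reading a sequence from terminated molecules can be sketched as follows. This is an idealized illustration, not a model of the chemistry: it assumes that, across the many molecules in a reaction, termination happens at every position at least once, so it simply enumerates one terminated product per length. The function names are hypothetical.

```python
def terminated_fragments(new_strand: str, primer_len: int) -> list[tuple[int, str]]:
    """One terminated product per possible stopping point.

    Each product is (length in bases, dye-labeled terminal base).
    """
    return [
        (length, new_strand[length - 1])
        for length in range(primer_len + 1, len(new_strand) + 1)
    ]


def read_by_size(fragments) -> str:
    """Order fragments by length (as electrophoresis does) and read the labels."""
    return "".join(base for _, base in sorted(fragments))


strand = "CGTTAAGGCCTG"  # six-base primer followed by six newly added bases
fragments = terminated_fragments(strand, primer_len=6)
print(fragments)
print(read_by_size(reversed(fragments)))  # input order does not matter
```

Sorting by length stands in for the size separation: if two fragment lengths could not be distinguished, the labels at those positions could not be placed in order.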

[Image: sequencing lanes separated by electrophoresis]

The image above shows a set of sequencing lanes, where electrophoresis is used to separate molecules differing by one base. Laser detection is used to identify the base at each position. The sequence is "read" from the bottom up, using a key where "A" is green, "C" is blue, "G" is yellow, and "T" is red. For this portion of the read, lane 1 has sequence TGGCCGCCG; lane 2 reads TAGTATCA; lane 3 reads CCAGNTTTA, and so forth. Using software provided by the manufacturers of sequencing machines, the signal/noise ratios of the dyes are determined for each position so that the proper base can be "called". The order of the bases is displayed in a "chromatogram" or "trace" file.
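Base calling from dye signals can be illustrated with a simple rule. The sketch below uses the color key given above and calls "N" when no dye clearly dominates; the intensity values and the `min_ratio` threshold are hypothetical, and vendor base-calling software uses considerably more elaborate signal processing.

```python
DYE_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}


def call_base(signal: dict[str, float], min_ratio: float = 2.0) -> str:
    """Call the base whose dye is strongest, or 'N' if the call is ambiguous.

    `min_ratio` (an assumed threshold) is how many times stronger the top
    dye must be than the runner-up.
    """
    ranked = sorted(signal.items(), key=lambda item: item[1], reverse=True)
    (top_dye, top), (_, second) = ranked[0], ranked[1]
    if top <= 0 or top < min_ratio * second:
        return "N"
    return DYE_TO_BASE[top_dye]


def call_trace(peaks: list[dict[str, float]]) -> str:
    """Call one base per peak, in order."""
    return "".join(call_base(peak) for peak in peaks)


peaks = [
    {"red": 90, "green": 5, "blue": 3, "yellow": 4},   # clear red peak
    {"red": 4, "green": 6, "blue": 5, "yellow": 70},   # clear yellow peak
    {"red": 40, "green": 38, "blue": 2, "yellow": 3},  # red and green overlap
]
print(call_trace(peaks))
```

An "N" in a read, as in lane 3 above, marks a position where the software could not choose a single base.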


Over the years, instrumentation for DNA sequencing has improved dramatically in terms of read length and throughput. At the Institute for Systems Biology (ISB), we are using the Applied Biosystems 3730 XL Capillary Sequencer, which can run plates of 96 or 384 samples in a couple of hours and produce read lengths of more than 700 bases.
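For a rough sense of scale, the plate sizes and read length quoted above translate into bases per run as follows. This is back-of-the-envelope arithmetic only, assuming every sample yields a read of exactly 700 bases (the text says only "more than 700"); the function names are illustrative.

```python
import math


def bases_per_run(samples: int, read_length: int = 700) -> int:
    """Raw bases produced by one plate if every sample yields a full read."""
    return samples * read_length


def runs_needed(total_reads: int, samples_per_run: int) -> int:
    """Number of plates required to collect `total_reads` reads."""
    return math.ceil(total_reads / samples_per_run)


print(bases_per_run(96))    # 67200
print(bases_per_run(384))   # 268800
print(runs_needed(100_000, 384))
```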

Purpose/use/application of the technique:

DNA sequencing is typically used to assist researchers with one or more of the following applications:

1) De novo genome sequence determination. The paradigm case of this application is the Human Genome Project, whose aim was to delineate the order of bases in each of the 24 human chromosomes for a reference genome. Genome sequences have also been produced for chimpanzee, rhesus macaque, dog, mouse, rat, chicken, frog, pufferfish, rice, fruit fly, roundworm, yeast, several fungi, and multiple species of bacteria and viruses. Determination of long stretches of accurate and contiguous genomic sequence requires extensive collections of sequence reads, and subsequent assembly and editing of the reads to reconstruct the genome sequence. A genome can also be "sampled" by obtaining sequence data from the ends of BAC or fosmid clones or from whole genome shotgun data at low coverage.

2) Gene expression profiling. To determine which genes are transcribed in a given set of tissues, mRNA is extracted and copied into DNA (cDNA) using reverse transcriptase. The resulting cDNAs are cloned, and the clone ends from a cDNA library are sequenced to generate expressed sequence tags (ESTs), which provide an expression profile for the tissue from which the mRNA was extracted. Because some genes are transcribed at high levels, cDNA libraries are often subjected to subtraction procedures to remove the common transcripts so that the EST data will reveal a larger number of unique transcripts. Genes transcribed at very low levels may not be detected in an EST dataset.

3) Detection of sequence variation. Once a reference sequence has been obtained for a region of interest (e.g., a gene believed to be involved in a disease), variations of the sequence as found in different individuals or closely related species can be identified by selectively resequencing a small portion of known sequence. Variations occur as single nucleotide polymorphisms (SNPs), size differences (insertions/deletions), copy number differences (duplications), and rearrangements (inversions, translocations). Once a variation has been discovered by resequencing, its prevalence in a population can be determined by high-throughput genotyping procedures.

4) Environmental sequencing. Rather than focusing on an individually isolated genome, environmental sequencing is aimed at sampling populations of genomes as they are found, for example, in bodies of water or different types of soil. In this way, a qualitative analysis of sequence and gene diversity can be obtained from organisms that cannot be cultured using conventional techniques.
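The notion of "coverage" in application 1 can be made concrete with a short calculation. The sketch below uses the standard idealized model in which reads land uniformly at random, so the expected fraction of the genome hit by at least one read is 1 - e^(-c) at coverage c. The 700-base read length is an assumption borrowed from the instrument description above, and real genomes deviate from this model because of cloning bias and repeats.

```python
import math


def coverage(n_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_length


def expected_fraction_covered(c: float) -> float:
    """Expected fraction of bases hit by at least one read (idealized model)."""
    return 1.0 - math.exp(-c)


def reads_for_coverage(c: float, read_length: int, genome_length: int) -> int:
    """Reads needed to reach average coverage `c`."""
    return math.ceil(c * genome_length / read_length)


CHR14_LENGTH = 87_410_661  # reported length of human chromosome 14

for c in (0.1, 1.0, 5.0):
    n = reads_for_coverage(c, 700, CHR14_LENGTH)
    print(f"{c:>4}x: {n:>7} reads, ~{expected_fraction_covered(c):.1%} of bases covered")
```

Under these assumptions even one-fold coverage leaves roughly a third of the bases unsequenced, which is why finished genomes need many-fold coverage while low-coverage data only "samples" a genome.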
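The expression profile described in application 2 amounts to counting how often each gene appears among the ESTs. A minimal sketch, assuming each EST has already been assigned to a gene by some upstream matching step; the gene names are placeholders.

```python
from collections import Counter


def expression_profile(est_gene_hits: list[str]) -> dict[str, float]:
    """Fraction of ESTs assigned to each gene, most abundant first."""
    counts = Counter(est_gene_hits)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: n / total for gene, n in counts.most_common()}


# One entry per sequenced clone end, labeled with its (placeholder) gene
hits = ["geneA", "geneA", "geneB", "geneA"]
profile = expression_profile(hits)
print(profile)
print("geneC" in profile)  # a rare transcript that was never sampled
```

A gene with no ESTs simply does not appear in the profile, which is the sense in which genes transcribed at very low levels may go undetected.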
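For application 3, the simplest kind of variation to detect is a single-base substitution. The sketch below assumes the resequenced read has already been aligned to the reference with no insertions or deletions, so the two strings have equal length; positions are reported 1-based, uncalled bases ("N") are skipped, and the sequences are invented.

```python
def find_snps(reference: str, sample: str, offset: int = 0) -> list[tuple[int, str, str]]:
    """List (position, reference base, sample base) for each substitution.

    `offset` is the 0-based start of this segment within a longer reference.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (offset + i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base and "N" not in (ref_base, sample_base)
    ]


reference = "ACGTACGT"
individual = "ACGTTCGT"
print(find_snps(reference, individual))
```

Insertions, deletions, duplications, and rearrangements cannot be found by position-by-position comparison and require a real alignment step.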

Example(s) of projects at ISB that use this technique:

The Multi-megabase Sequencing Center (MSC) at the ISB was one of the 20 partners in the International Human Genome Sequencing Consortium. Our group at the ISB sequenced about 1 percent of the genome, primarily on chromosomes 14 and 15. The MSC has participated in other sequencing consortia, most notably those for frog and pufferfish. The ISB DNA sequencing facility is assisting several researchers with sequence variation discovery. For example, Alan Aderem's group is probing for variations in Toll-like Receptor genes that might explain susceptibility to infectious diseases. Peter Small's group is studying isolates of tuberculosis that are prevalent in different geographical regions. Leroy Hood's group is sequencing the Laminin Receptor gene in various strains of mice for the purpose of understanding its role in prion protein metabolism.

Ongoing area of technology development:

From its inception, DNA sequencing technology has continually undergone advancements, because the data is so valuable to researchers. Over the years, extensive personnel and financial resources have been amassed and expended by academic labs and commercial ventures for the purpose of:

  • increasing read length and accuracy
  • increasing throughput
  • decreasing cost

Parallel processing and automation of sample handling procedures have enabled significant gains in throughput and cost savings over the past 10 years, but sequencing is still expensive and difficult for smaller groups to support.

For obtaining long sequence reads, which are desired for de novo genomic sequencing, electrophoresis-based technologies remain the method of choice. For high-throughput sequence sampling projects, in which read length and accuracy are less important, non-electrophoresis-based technologies are emerging. For example, a company called 454 Life Sciences (http://www.454.com/) performs "sequencing by synthesis" on tiny beads to which a template is attached. The addition of each base is detected as it occurs, rather than after the fact, thereby eliminating the need to separate molecules according to size.
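The idea of detecting each addition as it occurs can be sketched with a simple flow model. The following assumes, for illustration only, that one type of nucleotide is offered at a time in a fixed repeating order and that the signal from each flow equals the number of bases incorporated; the flow order and function names are assumptions, not details taken from the text above.

```python
from itertools import cycle


def simulate_flows(new_strand: str, flow_order: str = "TACG") -> list[tuple[str, int]]:
    """Record (nucleotide offered, bases incorporated) for each flow."""
    unknown = set(new_strand) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    for nucleotide in cycle(flow_order):
        if pos >= len(new_strand):
            break
        run = 0
        # Every consecutive matching base is added during the same flow.
        while pos < len(new_strand) and new_strand[pos] == nucleotide:
            run += 1
            pos += 1
        flows.append((nucleotide, run))
    return flows


def decode_flows(flows: list[tuple[str, int]]) -> str:
    """Rebuild the synthesized strand from the per-flow signals."""
    return "".join(nucleotide * run for nucleotide, run in flows)


flows = simulate_flows("TTAGC")
print(flows)
print(decode_flows(flows))
```

Because the order of the flows is known, the sequence follows directly from which flows gave a signal, with no size separation step.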

For a recent review of DNA sequencing technology development, see the following feature article in Genomics and Proteomics: "DNA sequencing: a race toward the $1000 genome" by Emma Hitt.

The ISB is not currently developing sequencing technologies. Instead we are partnering with academic collaborators and companies who are better equipped to perform the required technology development.

Representative publication(s):

Heilig R, Eckenberg R, Petit JL, Fonknechten N, Da Silva C, Cattolico L, Levy M, Barbe V, De Berardinis V, Ureta-Vidal A, Pelletier E, Vico V, Anthouard V, Rowen L, Madan A, Qin S, Sun H, Du H, Pepin K, Artiguenave F, Robert C, Cruaud C, Bruls T, Jaillon O, Friedlander L, Samson G, Brottier P, Cure S, Segurens B, Aniere F, Samain S, Crespeau H, Abbasi N, Aiach N, Boscus D, Dickhoff R, Dors M, Dubois I, Friedman C, Gouyvenoux M, James R, Madan A, Mairey-Estrada B, Mangenot S, Martins N, Menard M, Oztas S, Ratcliffe A, Shaffer T, Trask B, Vacherie B, Bellemere C, Belser C, Besnard-Gonnet M, Bartol-Mavel D, Boutard M, Briez-Silla S, Combette S, Dufosse-Laurent V, Ferron C, Lechaplais C, Louesse C, Muselet D, Magdelenat G, Pateau E, Petit E, Sirvain-Trukniewicz P, Trybou A, Vega-Czarny N, Bataille E, Bluet E, Bordelais I, Dubois M, Dumont C, Guerin T, Haffray S, Hammadi R, Muanga J, Pellouin V, Robert D, Wunderle E, Gauguet G, Roy A, Sainte-Marthe L, Verdier J, Verdier-Discala C, Hillier L, Fulton L, McPherson J, Matsuda F, Wilson R, Scarpelli C, Gyapay G, Wincker P, Saurin W, Quetier F, Waterston R, Hood L, Weissenbach J. (2003) The DNA sequence and analysis of human chromosome 14. Nature 421: 601-607. Epub Jan 1 2003.

© 2009, Institute for Systems Biology, All Rights Reserved