- Looking for biomarkers in different types of tissues requires comparing massive amounts of gene expression data.
- Before ‘digital transcriptome’ data sets can be compared, they must be normalized, that is, adjusted to a common standard of measurement.
- ISB researchers developed new algorithmic methods that outperform existing methods for normalizing gene expression data from different samples.
By Dr. Martin Shelton
Identifying genes that are expressed at one level in disease and at another level in healthy tissue (so-called differentially expressed genes) is one of the first steps in biomarker identification. RNA-seq, the application of next-generation sequencing to transcribed RNA, allows researchers to measure the expression levels of thousands of genes across multiple samples.
However, before gene expression data from different samples can be compared to identify differentially expressed genes, the data must be re-scaled, or normalized, to the same standard of measurement to account for differences in sequencing depth between samples. In a paper (Optimal Scaling of Digital Transcriptomes) published today in PLOS ONE, researchers at ISB compare several methods currently used to normalize gene expression data and describe new algorithms developed at ISB that outperform commonly used normalization methods.
“The two most common normalization methods either trust a single gene to be ‘constant’ across all samples, or assume that all cells have the same total amount of transcribed RNA,” explained Dr. Gustavo Glusman, lead author of the paper. “Both assumptions are often wrong. Our methods can normalize digital transcriptomes without relying on either.” The new methods yield robustly normalized expression values, a prerequisite for identifying differentially expressed and tissue-specific genes as potential biomarkers.
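To make the two common approaches in the quote concrete, here is a minimal Python sketch of each: scaling every sample to the same total, and scaling relative to a single reference gene. It is an illustration only and does not reproduce the algorithms in the ISB paper. The gene names, the read counts, and the function names `scale_by_total` and `scale_by_reference` are all hypothetical.

```python
# Illustrative sketch only: toy counts and hypothetical gene names.
# This is NOT the method described in the ISB paper.

def scale_by_total(counts, target=1_000_000):
    """Rescale one sample so its values sum to `target`.

    Assumes every sample has the same total amount of transcribed RNA.
    """
    total = sum(counts.values())
    return {gene: value * target / total for gene, value in counts.items()}


def scale_by_reference(counts, reference_gene):
    """Rescale one sample so the chosen reference gene equals 1.0.

    Assumes the reference gene is expressed at a constant level everywhere.
    """
    ref = counts[reference_gene]
    return {gene: value / ref for gene, value in counts.items()}


# Two toy samples with identical biology; sample_b was sequenced twice as deep.
sample_a = {"GENE_A": 100, "GENE_B": 300, "REF": 600}
sample_b = {"GENE_A": 200, "GENE_B": 600, "REF": 1200}

total_a = scale_by_total(sample_a)
total_b = scale_by_total(sample_b)
ref_a = scale_by_reference(sample_a, "REF")
ref_b = scale_by_reference(sample_b, "REF")
```

In this toy case sequencing depth is the only difference between the samples, so both approaches bring them into agreement. They would give misleading values if the reference gene itself changed between samples, or if one sample genuinely contained more total RNA, which are the two assumptions the quote calls into question.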