A modification to an “ace” gene prediction program now enables scientists to predict where gene transcription begins (the transcription-start site) and where the first splice occurs, thereby defining the first exon of the gene.
While genomics researchers 15 years ago paid little attention to parts of the genome outside the coding regions, they have since discovered some strange functions in untranslated regions (UTRs) that have provoked second and third thoughts.
For instance, it was recently discovered that huntingtin, a gene associated with Huntington’s disease, has a second protein segment encoded upstream of the main one. This protein, encoded in the so-called untranslated region, is involved in regulating the gene.
Running the modified TWINSCAN on both the human and fruit fly genomes, Brent and colleagues predicted about 25,000 transcription-start sites, compared with a known 6,000.
“In the human genome, we found many extra exons on genes that were already known, or in some cases, spliced UTRs on genes that weren’t even known to exist before,” Brent said.
The system takes advantage of the scarcity of the CG sequence across most of the genome, looking for so-called CpG “islands,” stretches where CG is unusually frequent and which are known to be more common near transcription-start sites. It also has a knack for recognizing sequences that indicate splice sites.
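To make the CpG idea concrete, here is a minimal Python sketch of a simple sliding-window check for CpG-rich stretches. It is not how TWINSCAN or N-SCAN works internally, since the article does not describe that model. The function names `cpg_stats` and `cpg_windows` are hypothetical, and the thresholds (a 200-base window, at least 50 percent G+C, and an observed-to-expected CG ratio of at least 0.6) are commonly cited conventions adopted here as assumptions.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")  # occurrences of C immediately followed by G
    gc_fraction = (c + g) / n
    # Expected CG count if C and G were placed independently of each other.
    expected = c * g / n
    ratio = cg / expected if expected > 0 else 0.0
    return gc_fraction, ratio


def cpg_windows(seq, window=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Return (start, end) pairs for windows passing both assumed thresholds."""
    hits = []
    # Sequences shorter than one window yield an empty range and no hits.
    for start in range(0, len(seq) - window + 1, step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc >= min_gc and ratio >= min_ratio:
            hits.append((start, start + window))
    return hits


if __name__ == "__main__":
    # Toy sequence: a CG-rich block flanked by AT-rich sequence.
    toy = "AT" * 100 + "CG" * 100 + "AT" * 100
    print(cpg_windows(toy))
```

In a real gene finder, a signal like this would be only one of many lines of evidence weighed together; the sketch shows just the counting step behind the idea of an “island.”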
Over the past two years, TWINSCAN has been finding and predicting genes in numerous genomes that other gene prediction systems have missed. The addition of N-SCAN to the handy system — it scans two genomes simultaneously, with potential to scan three or more — strengthens it for predicting both coding and non-coding DNA.
“Like any multiple choice question, if you can learn something about one of the choices, it helps you with the other one,” Brent said. “By making this integrated model that looks for both kinds of exons in both parts of the gene, we’re able to convert a blind guessing game to a multiple choice question – is it a UTR exon or a protein-coding exon? These kinds of questions are easier to answer now.”